Architecture¶
5-Spot is built as a Kubernetes controller using the kube-rs framework.
High-Level Architecture¶
flowchart TB
subgraph Kubernetes["Kubernetes Cluster"]
API[Kubernetes API Server]
subgraph CRDs["5-Spot CRDs"]
SM[ScheduledMachine]
end
subgraph CAPI["Cluster API Resources"]
Machine[Machine]
Bootstrap[Bootstrap Config<br/>e.g., K0sWorkerConfig]
Infra[Infrastructure<br/>e.g., RemoteMachine]
end
Node[Node]
end
subgraph Operator["5-Spot Controller"]
Controller[Controller Loop]
Reconciler[Reconciler]
Scheduler[Schedule Evaluator]
Creator[Resource Creator]
end
subgraph External["External"]
PhysicalMachine[Physical Machine]
end
Controller -->|watch| API
API --> SM
Reconciler --> Scheduler
Reconciler --> Creator
Creator -->|create/delete| Bootstrap
Creator -->|create/delete| Infra
Creator -->|create/delete| Machine
Machine --> Node
Node -.->|joins| PhysicalMachine
SM -->|owns| Bootstrap
SM -->|owns| Infra
SM -->|owns| Machine
Component Details¶
Controller¶
The main entry point that:
- Watches for ScheduledMachine resource changes
- Manages reconciliation queue
- Handles multi-instance distribution via consistent hashing
- Provides health and metrics endpoints
Reconciler¶
Implements the reconciliation loop:
- Fetch current ScheduledMachine state
- Check for kill switch or disabled schedule
- Evaluate schedule against current time (in configured timezone)
- Determine desired state (active/inactive)
- Create or delete CAPI resources as needed
- Update status, conditions, and references
Schedule Evaluator¶
Evaluates time-based schedules:
- Parses cron expressions (when specified)
- Parses day ranges (e.g.,
mon-fri, with wrap-around support) - Parses hour ranges (e.g.,
9-17, with wrap-around support) - Handles timezone conversions using IANA timezone database
- Determines if current time is within schedule
Resource Creator¶
Creates and manages CAPI resources from inline specs:
- Creates Bootstrap resource from
bootstrapSpec - Creates Infrastructure resource from
infrastructureSpec - Creates CAPI Machine with references to both
- Sets owner references for automatic garbage collection
- Tracks created resource references in status
Resource Creation Flow¶
flowchart TD
SM[ScheduledMachine] --> |contains| BS[bootstrapSpec<br/>inline config]
SM --> |contains| IS[infrastructureSpec<br/>inline config]
subgraph Creation["When Schedule Active"]
BS --> |creates| BR[Bootstrap Resource<br/>e.g., K0sWorkerConfig]
IS --> |creates| IR[Infrastructure Resource<br/>e.g., RemoteMachine]
BR --> |referenced by| M[CAPI Machine]
IR --> |referenced by| M
end
M --> |provisions| N[Node]
subgraph Status["Status Updated"]
SM -.-> |bootstrapRef| BR
SM -.-> |infrastructureRef| IR
SM -.-> |machineRef| M
SM -.-> |nodeRef| N
end
Reconciliation Flow¶
sequenceDiagram
participant API as Kubernetes API
participant Ctrl as Controller
participant Rec as Reconciler
participant Sched as Schedule Evaluator
participant Res as Resource Creator
API->>Ctrl: ScheduledMachine Event
Ctrl->>Rec: Reconcile Request
Rec->>API: Get ScheduledMachine
alt killSwitch = true
Rec->>Res: Delete all resources immediately
Rec->>API: Update status (Terminated)
else schedule.enabled = false
Rec->>API: Update status (Disabled)
else
Rec->>Sched: Evaluate Schedule
Sched-->>Rec: inSchedule: true/false
alt inSchedule = true
Rec->>Res: Ensure resources exist
Res->>API: Create Bootstrap from bootstrapSpec
Res->>API: Create Infrastructure from infrastructureSpec
Res->>API: Create Machine with refs
Rec->>API: Update status (Active)
else inSchedule = false
Rec->>Res: Remove resources
Res->>API: Delete Machine
Res->>API: Delete Bootstrap
Res->>API: Delete Infrastructure
Rec->>API: Update status (Inactive)
end
end
Rec->>API: Requeue after interval
Multi-Instance Support¶
5-Spot supports running multiple instances for high availability:
- Consistent Hashing: Resources are distributed based on name hash
- Instance ID: Each instance has a unique ID (0 to N-1)
- No Overlap: Each resource is managed by exactly one instance
- Environment Variables:
OPERATOR_INSTANCE_IDandOPERATOR_INSTANCE_COUNT
flowchart LR
subgraph Resources
R1[SM: worker-a]
R2[SM: worker-b]
R3[SM: worker-c]
R4[SM: worker-d]
R5[SM: worker-e]
R6[SM: worker-f]
end
subgraph Instances
I0[Instance 0]
I1[Instance 1]
I2[Instance 2]
end
R1 -->|hash % 3 = 0| I0
R2 -->|hash % 3 = 1| I1
R3 -->|hash % 3 = 2| I2
R4 -->|hash % 3 = 0| I0
R5 -->|hash % 3 = 1| I1
R6 -->|hash % 3 = 2| I2
Owner References & Garbage Collection¶
5-Spot uses Kubernetes owner references for automatic cleanup:
flowchart TD
SM[ScheduledMachine<br/>owner]
SM -->|ownerRef| B[Bootstrap Resource]
SM -->|ownerRef| I[Infrastructure Resource]
SM -->|ownerRef| M[CAPI Machine]
subgraph GC["Garbage Collection"]
D[Delete ScheduledMachine]
D --> DB[Bootstrap deleted]
D --> DI[Infrastructure deleted]
D --> DM[Machine deleted]
end
When a ScheduledMachine is deleted, Kubernetes automatically garbage collects all owned resources.
Data Flow¶
flowchart LR
subgraph Input
CR[ScheduledMachine CR]
Time[Current Time]
TZ[Timezone]
end
subgraph Processing
SE[Schedule Evaluation]
RC[Reconciliation]
end
subgraph Output
Resources[CAPI Resources]
Status[Status Update]
Events[Kubernetes Events]
Metrics[Prometheus Metrics]
end
CR --> SE
Time --> SE
TZ --> SE
SE --> RC
RC --> Resources
RC --> Status
RC --> Events
RC --> Metrics
Error Handling¶
| Error Type | Handling | Requeue |
|---|---|---|
| Transient API errors | Automatic retry | 30s with backoff |
| Schedule parse errors | Status updated with error | No requeue |
| Resource creation failures | Retry with backoff | Up to 5m max |
| Permanent errors | Manual intervention required | No automatic retry |
Observability¶
Health Endpoints¶
/health- Liveness probe (port 8081)/ready- Readiness probe (port 8081)
Metrics¶
/metrics- Prometheus metrics (port 8080)- Reconciliation duration, success/failure counts
- Resource counts by phase
Events¶
Kubernetes events are emitted for: - Machine creation/deletion - Schedule activation/deactivation - Errors and warnings