Architecture Flows

FINOS CALM

Auto-generated

Rendered from docs/architecture/calm/architecture.json by the CALM CLI (calm template). Do not edit this file by hand — edit the architecture JSON or the Handlebars template at docs/architecture/calm/templates/mermaid/flows.md.hbs and regenerate with make calm-diagrams.

Each business flow defined in the CALM architecture is rendered below as its own Mermaid flowchart TD — one diagram per flow, linking the transitions in sequence order.

Schedule Activation (machine enters window)

When the current time enters a ScheduledMachine's active window, 5-Spot creates the bootstrap + infrastructure resources and a CAPI Machine so that CAPI provisions the physical node and joins it to the workload cluster.

flowchart TD
    t1["1. Operator applies a ScheduledMachine CR declaring schedule, bootstrap spec, infrastructure spec, and cluster name."]
    t2["2. Controller watch observes the CR; scheduler evaluates daysOfWeek/hoursOfDay against current time in the configured timezone and finds it in window."]
    t3["3. Controller creates K0sWorkerConfig (bootstrap), RemoteMachine (infra), and a CAPI Machine referencing both; phase -> Scheduled."]
    t4["4. CAPI core reconciles the Machine and hands off to bootstrap + infra providers."]
    t5["5. Infrastructure provider SSHes to the physical node, installs k0s with bootstrap data, and joins the workload cluster; phase -> Active."]
    t1 --> t2 --> t3 --> t4 --> t5

Source: flow flow-schedule-activation in architecture.json.
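
The window check in step 2 can be sketched as follows. This is an illustration only, not the 5-Spot scheduler: the function name `in_window` and the encoding of the two fields (integers, Monday=0 for days, 0 to 23 for hours) are assumptions, and the real `daysOfWeek`/`hoursOfDay` schema may differ, for example by using ranges or day names.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Collection


def in_window(now: datetime, days_of_week: Collection[int],
              hours_of_day: Collection[int], tz: tzinfo) -> bool:
    """Return True when `now`, viewed in `tz`, falls inside the active window.

    Assumed encoding (not taken from the 5-Spot schema):
    days_of_week uses Monday=0 .. Sunday=6, hours_of_day uses 0 .. 23.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    # Convert first, then compare: both the weekday and the hour depend on
    # the configured timezone, not on UTC.
    local = now.astimezone(tz)
    return local.weekday() in days_of_week and local.hour in hours_of_day


if __name__ == "__main__":
    # Friday 2024-01-05 23:30 UTC
    now = datetime(2024, 1, 5, 23, 30, tzinfo=timezone.utc)
    weekdays, evening = {0, 1, 2, 3, 4}, set(range(18, 24))
    # UTC-5: Friday 18:30 local, inside the window -> True
    print(in_window(now, weekdays, evening, timezone(timedelta(hours=-5))))
    # UTC+9: Saturday 08:30 local, outside the window -> False
    print(in_window(now, weekdays, evening, timezone(timedelta(hours=9))))
```

The example uses fixed UTC offsets to stay self-contained. With a named timezone, `zoneinfo.ZoneInfo("America/New_York")` (Python 3.9+) can be passed as `tz` so that daylight-saving transitions are handled by the timezone database.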

Schedule Deactivation (machine exits window or kill switch)

When the window closes (or killSwitch is set), the controller cordons the node, evicts pods within nodeDrainTimeout, then deletes the CAPI Machine so the infra provider tears it down.

flowchart TD
    t1["1. Scheduler tick detects out-of-window (or killSwitch=true); phase -> Removing; gracefulShutdownTimeout starts."]
    t2["2. Controller cordons the node (patch spec.unschedulable=true) and creates pod eviction requests, respecting nodeDrainTimeout."]
    t3["3. Controller deletes the CAPI Machine (and associated bootstrap/infra objects if owned); phase -> Inactive / UnScheduled."]
    t4["4. CAPI core propagates deletion to providers; physical node is removed from the workload cluster."]
    t1 --> t2 --> t3 --> t4

Source: flow flow-schedule-deactivation in architecture.json.
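
Step 2 of this flow (cordon, then evict within a deadline) can be sketched as below. The cordon patch and the `policy/v1` `Eviction` body follow the standard Kubernetes API shapes; everything else is hypothetical: the `drain` helper, its injected `evict` callable, and the retry interval are illustrative names, and the flow does not state what the controller does with pods that are still pending when `nodeDrainTimeout` expires.

```python
import time
from typing import Callable, Iterable, List, Tuple

# Merge-patch body that cordons a Node (step 2: spec.unschedulable=true).
CORDON_PATCH = {"spec": {"unschedulable": True}}


def eviction_body(namespace: str, pod: str) -> dict:
    """Build the Eviction object POSTed to the pod's `eviction` subresource."""
    return {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod, "namespace": namespace},
    }


def drain(pods: Iterable[Tuple[str, str]],
          evict: Callable[[dict], bool],
          timeout_s: float,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep,
          retry_s: float = 1.0) -> List[Tuple[str, str]]:
    """Evict (namespace, pod) pairs until all succeed or the deadline passes.

    `evict` is a hypothetical callable that submits the eviction and returns
    True on success, False when the request was refused (for example by a
    PodDisruptionBudget). Returns the pods that were still pending.
    """
    deadline = clock() + timeout_s
    pending = list(pods)
    while pending:
        pending = [p for p in pending if not evict(eviction_body(*p))]
        if not pending or clock() >= deadline:
            break
        sleep(retry_s)  # back off before retrying refused evictions
    return pending
```

Injecting `clock` and `sleep` keeps the deadline logic testable without waiting in real time; a caller would pass the value of `nodeDrainTimeout` in seconds as `timeout_s`.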

Emergency Reclaim (process-match eject)

A matched process on a workload node triggers an ASAP eject that skips graceful drain. The 7-step ordering contract is load-bearing: spec.schedule.enabled=false (step 5) must be written BEFORE the annotation clear (step 6) so a crash between them is replay-safe. Full narrative in docs/src/concepts/emergency-reclaim.md.

flowchart TD
    t1["1. Agent detects a process matching spec.killIfCommands via /proc scan and PATCHes the reclaim annotation triple (5spot.finos.org/reclaim-requested=#quot;true#quot;, /reclaim-reason, /reclaim-requested-at) onto its own Node via the kubelet node-scoped token. Field manager: 5spot-reclaim-agent."]
    t2["2. Controller Node watch fires. check_emergency_reclaim reads the annotation, transitions status.phase = EmergencyRemove, and emits Event Reason: EmergencyReclaim with the annotation's reason string."]
    t3["3. Best-effort kubectl drain --grace-period=0 --force --disable-eviction against the workload cluster, bounded by EMERGENCY_DRAIN_TIMEOUT_SECS=60. Failure is log-and-continue: the eject has already been committed to."]
    t4["4. Controller deletes the CAPI Machine on the management cluster, with no PDB respect and no graceful shutdown. CAPI propagates deletion to providers and the physical node leaves the workload cluster."]
    t5["5. Controller PATCHes ScheduledMachine spec.schedule.enabled=false (the loop-breaker: without this, the next schedule window re-adds the node and the agent re-fires forever) and emits Event Reason: EmergencyReclaimDisabledSchedule. Load-bearing: must run BEFORE step 6."]
    t6["6. Controller PATCHes the Node to null all three reclaim annotations. Best-effort: a crash between steps 5 and 6 is replay-safe because the agent re-annotates on the next poll cycle if the matched process is still running."]
    t7["7. Controller transitions ScheduledMachine status.phase = Disabled. Re-enable is a manual operator action (kubectl patch ... spec.schedule.enabled=true)."]
    t1 --> t2 --> t3 --> t4 --> t5 --> t6 --> t7

Source: flow flow-emergency-reclaim in architecture.json.
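
The reason the step 5 / step 6 order matters can be shown with a toy model. This is a simplified simulation, not the controller's code: the `World` class, the `reconcile` function, and the `Crash` exception are hypothetical, the full annotation keys are read from the shorthand in step 1, and steps 2 to 4 are collapsed into two assignments. It models only the controller side and leaves out the agent's re-annotation.

```python
from dataclasses import dataclass, field

# Full keys expanded from the shorthand in step 1 (an assumption).
RECLAIM_KEYS = (
    "5spot.finos.org/reclaim-requested",
    "5spot.finos.org/reclaim-reason",
    "5spot.finos.org/reclaim-requested-at",
)


class Crash(Exception):
    """Simulated controller crash right after a given step."""


@dataclass
class World:
    annotations: dict = field(default_factory=dict)
    schedule_enabled: bool = True
    machine_exists: bool = True
    phase: str = "Active"


def reconcile(w: World,
              order=("disable_schedule", "clear_annotations"),
              crash_after=None) -> None:
    """Run one reclaim pass; the annotation is the only trigger."""
    if w.annotations.get(RECLAIM_KEYS[0]) != "true":
        return  # no reclaim requested, nothing to do
    w.phase = "EmergencyRemove"   # step 2
    w.machine_exists = False      # steps 3-4 (drain + Machine delete)
    for step in order:            # steps 5 and 6, in the given order
        if step == "disable_schedule":
            w.schedule_enabled = False
        else:
            for key in RECLAIM_KEYS:
                w.annotations.pop(key, None)
        if crash_after == step:
            raise Crash(step)
    w.phase = "Disabled"          # step 7


def requested() -> World:
    return World(annotations={RECLAIM_KEYS[0]: "true",
                              RECLAIM_KEYS[1]: "process-match",
                              RECLAIM_KEYS[2]: "2024-01-05T23:30:00Z"})


def run_with_crash(order, crash_after) -> World:
    """Crash once after `crash_after`, then replay reconcile from the top."""
    w = requested()
    try:
        reconcile(w, order=order, crash_after=crash_after)
    except Crash:
        pass
    reconcile(w, order=order)  # controller restarts and reconciles again
    return w
```

With the documented order, `run_with_crash(("disable_schedule", "clear_annotations"), "disable_schedule")` ends with the schedule disabled, the annotations cleared, and the phase `Disabled`, because the surviving annotation re-triggers the pass. With the order reversed and a crash after the annotation clear, the trigger is gone while `schedule_enabled` is still `True`, which is the state the loop-breaker in step 5 exists to prevent.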