Emergency Reclaim (Process-Match Kill Switch)¶

Status: Shipped — trigger contract, node-side agent (both rung 1 /proc poll and rung 2 netlink proc connector), CRD field, controller-side EmergencyRemove dispatch, and opt-in DaemonSet are all live. See the roadmap for the full design and the per-phase ship dates.

Why emergency reclaim exists¶

ScheduledMachine is built for time-based scheduling: a workstation joins the cluster at 9 AM, leaves at 5 PM, next morning it joins again. That model assumes the human owner of the machine does not need it back right now — they'll get it back at the next schedule boundary.

Real-world usage breaks that assumption. When a developer whose workstation doubles as a cluster node opens IntelliJ or fires up a JVM, the schedule is irrelevant: they want the box back immediately. Waiting for the next schedule boundary — or even a graceful drain (tens of seconds to minutes) — is unacceptable while the user is staring at a spinning JVM startup and the cluster is eating their cores.

Emergency reclaim is the nuclear eject path: a node-side agent watches for a user-declared set of process names / argv patterns, and on first match, rips the node out of the cluster as fast as physically possible.

Two kill switches, one family¶

5-Spot ships two emergency-remove mechanisms. They share the same "skip graceful drain, eject now" semantics but differ in who pulls the trigger:

Mechanism	Trigger	Phase	Side effect	Reset
`spec.killSwitch: true`	Operator edits the `ScheduledMachine`	`Terminated`	None — machine stays terminated until operator resets	Operator sets `killSwitch: false`
`spec.killIfCommands: [...]` (this page)	Agent on the node sees a matching process	`EmergencyRemove`	Controller flips `spec.enabled=false` so the machine does not auto-rejoin at the next schedule boundary	Operator (or the user whose workstation it is) sets `spec.enabled: true` — see User-facing re-enable flow below

Both are opt-in: killSwitch defaults to false, and killIfCommands is absent by default (no agent installed on the node).

The trigger chain¶

flowchart LR A["User launches matching process (e.g. java)"] B["/proc/<pid>/comm + /proc/<pid>/cmdline"] C["5spot-reclaim-agent (DaemonSet pod)"] D["PATCH Node metadata.annotations"] E["5-Spot Controller (watch on Node)"] F["Phase::EmergencyRemove on ScheduledMachine"] G["kubectl drain --grace-period=0 --force --disable-eviction"] H["Delete CAPI Machine (no graceful shutdown)"] I["PATCH ScheduledMachine spec.enabled=false"] J["Clear reclaim annotations from Node"] A --> B --> C --> D --> E --> F --> G --> H --> I --> J style A fill:#f9d5a7,stroke:#e17b4b,stroke-width:2px,color:#000 style C fill:#c8d8e8,stroke:#4f6d91,stroke-width:2px,color:#000 style E fill:#c8d8e8,stroke:#4f6d91,stroke-width:2px,color:#000 style F fill:#ffc9b5,stroke:#e17b4b,stroke-width:2px,color:#000 style I fill:#ffc9b5,stroke:#e17b4b,stroke-width:2px,color:#000

Each hop's responsibility stays narrow: the agent only writes annotations (node-scoped RBAC, no broad API access); the controller owns all cluster-mutating work (drain, Machine delete, schedule flip).

Lifecycle — where `EmergencyRemove` fits¶

The new phase plugs into the existing ScheduledMachine state machine as a branch off any "alive" phase:

stateDiagram-v2 [*] --> Pending: Resource created Pending --> Active: Schedule active & resources created Pending --> Inactive: Outside schedule window Pending --> Disabled: spec.enabled = false Active --> ShuttingDown: Schedule ends Active --> Terminated: killSwitch = true Active --> EmergencyRemove: killIfCommands match on node Active --> Error: Provisioning error Pending --> EmergencyRemove: killIfCommands match on node ShuttingDown --> EmergencyRemove: killIfCommands match on node ShuttingDown --> Inactive: Grace period complete ShuttingDown --> Error: Shutdown error Inactive --> Active: Schedule window starts Inactive --> Disabled: spec.enabled = false Disabled --> Pending: spec.enabled = true Terminated --> Pending: killSwitch = false EmergencyRemove --> Disabled: Eject complete (spec.enabled auto-flipped to false) Error --> Pending: Recovery / Retry note right of EmergencyRemove Non-graceful eject in progress gracefulShutdownTimeout ignored nodeDrainTimeout ignored end note note right of Disabled Exit path from EmergencyRemove. User sets spec.enabled=true to return the node to service. end note

Why the exit is `Disabled`, not `Pending`¶

This is the piece that makes emergency reclaim useful instead of infuriating. When the eject completes, the controller sets spec.enabled=false on the owning ScheduledMachine. Without that flip:

Eject succeeds, node leaves cluster.
Next schedule window opens (e.g. 9 AM the next morning).
Controller sees "in-schedule, no machine exists" and re-creates the Machine → node rejoins.
User's JVM is still running.
Agent sees match → annotates → controller ejects again.
Loop repeats every schedule window forever.

By flipping enabled=false the controller makes the user's explicit re-enable the trigger to return the node to service — which is the behavior a human actually wants.

Sequence of operations¶

sequenceDiagram autonumber participant User as User / JVM participant Proc as /proc participant Agent as 5spot-reclaim-agent (DaemonSet pod) participant NodeObj as Node object (k8s API) participant Ctrl as 5-Spot Controller participant SM as ScheduledMachine participant Machine as CAPI Machine User->>Proc: exec("java ...") Agent->>Proc: poll /proc/*/comm + cmdline (every poll_interval_ms, default 250ms) Proc-->>Agent: java matches match_commands Note over Agent: First match wins — no retry loop, no wait for "reclaim complete" Agent->>NodeObj: PATCH metadata.annotations 5spot.finos.org/reclaim-requested=true 5spot.finos.org/reclaim-reason=... 5spot.finos.org/reclaim-requested-at=... Agent->>Agent: Exit 0 (idempotent) Ctrl->>NodeObj: Watch event: Node updated Ctrl->>Ctrl: node_reclaim_request(node) = Some(ReclaimRequest) Ctrl->>SM: status.phase = EmergencyRemove Emit Event (reason: EmergencyReclaim) Ctrl->>NodeObj: kubectl drain --grace-period=0 --force --disable-eviction Ctrl->>Machine: Delete (immediate) Machine-->>Ctrl: Deletion confirmed Ctrl->>SM: PATCH spec.enabled = false Emit Event (reason: EmergencyReclaimDisabledSchedule) Ctrl->>NodeObj: Clear reclaim annotations (merge-patch with null values) Ctrl->>SM: status.phase = Disabled Note over Ctrl,SM: Node stays out of cluster. User manually re-enables when ready.

Why this ordering matters:

Annotations are cleared last, not first. If the user reboots or the agent restarts mid-eject, the annotation is still there, so the controller keeps processing — the operation is idempotent.
spec.enabled=false is set before annotation clear. This is the load-bearing step. If we cleared the annotation first and then crashed, the next reconcile would see no annotation, not flip enabled, and fall back to normal scheduling.
Machine deletion is immediate (--grace-period=0). No PDB respect, no gracefulShutdownTimeout wait, no nodeDrainTimeout — the whole point is that the user needs their box back now.

Opt-in installation¶

The reclaim agent is not installed on every node. Both gates below must be satisfied for the agent to land on a node:

flowchart TD A["ScheduledMachine.spec.killIfCommands is non-empty?"] B["Controller stamps node label 5spot.finos.org/reclaim-agent=enabled"] C["Controller projects per-node ConfigMap reclaim-agent-<node> with the killIfCommands list"] D["DaemonSet nodeSelector matches the label"] E["Agent pod lands on node watches its per-node ConfigMap via the kube API and hot-reloads"] F["No agent on this node (zero footprint)"] A -->|yes| B A -->|no / empty| F B --> C C --> D D --> E style A fill:#c8d8e8,stroke:#4f6d91,stroke-width:2px,color:#000 style E fill:#a7d5a7,stroke:#4f6d91,stroke-width:2px,color:#000 style F fill:#d0d0d0,stroke:#666,stroke-width:1px,color:#000

Why two gates:

Node label (5spot.finos.org/reclaim-agent=enabled) — gates scheduling. An operator cannot accidentally install the agent on nodes that have not opted in; the DaemonSet simply will not schedule.
Per-node ConfigMap — gates behavior. Even if an agent pod is running, a config with both match lists empty returns None every scan and never annotates. This is the secondary safety net: "everyone has the agent installed with empty match list" is made impossible by construction because the operator has to type the patterns into spec.killIfCommands to get anything to happen.

For regulated environments, this double-gating keeps the node-side surface proportional to declared intent — and the binary with elevated privileges (hostPID: true, reads every pid in /proc) never lands on nodes that have not declared they want it.

MVP limitation

The 2026-04-20 MVP ships the DaemonSet + RBAC but not the controller-side label-stamp and per-node ConfigMap projection. Operators who want to try the agent today must label nodes and create the ConfigMap manually. See the roadmap for the remaining work.

Configuring `killIfCommands`¶

Declare the match patterns on the ScheduledMachine spec:

apiVersion: 5spot.finos.org/v1beta1
kind: ScheduledMachine
metadata:
  name: dev-workstation-01
spec:
  enabled: true

  # Required reference to a spot-schedule provider (ADR 0009). The day/hour
  # window — e.g. "join after hours, Mon-Fri 19:00-23:00 Eastern" — lives on the
  # referenced TimeBasedSpotSchedule "after-hours" in this namespace.
  schedule:
    apiVersion: spotschedules.5spot.finos.org/v1alpha1
    kind: TimeBasedSpotSchedule
    name: after-hours

  clusterName: dev-cluster

  # Emergency reclaim: pull this node out the moment the user opens
  # any of these processes. Matched against /proc/*/comm (exact basename)
  # and /proc/*/cmdline (substring).
  killIfCommands:
    - java
    - idea
    - steam

  bootstrapSpec:
    apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
    kind: K0sWorkerConfig
    spec:
      version: v1.30.0+k0s.0

  infrastructureSpec:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: RemoteMachine
    spec:
      address: 192.168.1.100
      port: 22
      user: admin
      useSudo: true

Matching semantics¶

Source	Kind	Example match
`/proc/<pid>/comm`	Exact basename, case-sensitive, max 15 chars (kernel limit)	`java` matches a process whose comm is literally `java`. Does not match `Java` or `java-wrapper`.
`/proc/<pid>/cmdline`	Substring of the NUL-separated argv joined with spaces	`idea` matches `/opt/idea/bin/idea.sh -Xmx2g`. Substring is case-sensitive.

A config with both match lists empty (or killIfCommands: []) is inert — the agent returns None every scan and never annotates. Absent is equivalent to empty.

Poll interval¶

The agent polls /proc every poll_interval_ms (default 250 ms — see DEFAULT_POLL_INTERVAL_MS in src/reclaim_agent.rs). This is the worst-case detection latency on rung 1 (/proc poll).

Detector — rung 1 (`/proc` poll) vs rung 2 (netlink)¶

The agent ships two detection back-ends. Both produce identical Match values, both go through the same Node-PATCH path; the only difference is the event source.

Mode	What it does	Latency	Idle CPU	Linux only?	Extra capability
`poll` (rung 1)	Walks `/proc` every `poll_interval_ms`	up to one poll interval (250 ms default)	~0 (one syscall per pid per tick)	No — works wherever `/proc` exists	None
`netlink` (rung 2)	Subscribes to the kernel's proc connector (`PROC_EVENT_EXEC`); kernel pushes one message per `exec(2)`	<10 ms (kernel-pushed)	sleeps until the kernel wakes it	Yes — Linux only	`CAP_NET_ADMIN`

Selection:

--detector=auto       # default. Linux → netlink, anywhere else → poll.
--detector=netlink    # force netlink. Errors loudly on non-Linux or
                      # without CAP_NET_ADMIN.
--detector=poll       # force the rung-1 poll loop on any platform.

Override at deploy time via RECLAIM_DETECTOR={auto|netlink|poll} on the DaemonSet pod.

Tradeoff. netlink is lower latency and lower idle CPU in steady state. Under heavy-exec workloads (make -j32, compile farms) it can be more expensive than poll, because rung 2 sees every short-lived process whereas rung 1 only sees those that survive to the next tick. If the operator's fleet runs exec storms, pin --detector=poll explicitly; otherwise keep the auto default.

The DaemonSet manifest grants CAP_NET_ADMIN once at the pod level (see deploy/node-agent/daemonset.yaml), so an operator switching --detector=poll → netlink (or vice versa) does not need to re-roll the pod's security context. The cap is unused on poll; the kernel never sees a netlink syscall in that mode.

Kernel and capability requirements¶

Rung 1 (poll) needs essentially nothing — /proc exists on every Linux kernel and the agent already has UID 0 + host /proc mount for other reasons. It also runs unmodified in a non-Linux container if you ever need to (the Subscriber constructor reports "unsupported" and the bin falls back automatically when --detector=auto is in effect).

Rung 2 (netlink) has three runtime prerequisites:

Requirement	Provided by	What happens if missing
Kernel built with `CONFIG_PROC_EVENTS=y`	Distro kernel build (Debian, Ubuntu, RHEL/CentOS, Alpine, Talos, Bottlerocket, Chainguard host kernels — all default `y`)	The netlink socket opens cleanly but the kernel never pushes events. Agent sits idle, never matches. No userspace error to detect this. Diagnose by `zgrep CONFIG_PROC_EVENTS /proc/config.gz` on the host; mitigate by `--detector=poll`.
`CAP_NET_ADMIN` on the agent container	`deploy/node-agent/daemonset.yaml` `securityContext.capabilities.add: [NET_ADMIN]`	`Subscriber::new()` fails at startup with `EPERM`; the bin exits non-zero and `kubelet` restarts it (crash loop until the cap is granted). `--detector=poll` works without it.
Linux kernel ≥ 2.6.15	Effectively guaranteed on any cluster you'd actually deploy 5-Spot to (released 2006)	Same as missing `CONFIG_PROC_EVENTS` — silent.

The cap chain is straightforward:

deploy/node-agent/daemonset.yaml
  └─ container.securityContext.capabilities
       ├─ drop: [ALL]                 # default-deny
       └─ add: [NET_ADMIN]            # rung 2 only — rung 1 doesn't use it

This is the only added capability. UID 0 + host /proc + hostPID were already required for rung 1 (the agent has to read every process's /proc/<pid>/{comm,cmdline} under hidepid=2). See .trivyignore for the architectural-necessity rationale on each Trivy / Semgrep finding the configuration trips.

CAP_NET_ADMIN is broader than just netlink-connector access — it also grants ability to edit routes, iptables, etc. on the host network namespace. The risk is bounded by:

Opt-in nodeSelector — the cap is granted only on Nodes labelled 5spot.finos.org/reclaim-agent: enabled, which the controller stamps only when a ScheduledMachine on that Node has a non-empty spec.killIfCommands. A cluster that never uses killIfCommands never has the cap on any pod.
Single-purpose binary — the agent has no shell, no exec of children, no remote-control surface. Compromising it requires swapping the binary on disk, at which point an attacker could grant any cap they wanted anyway.
Operator opt-out — pin --detector=poll (env RECLAIM_DETECTOR=poll) and the cap is granted-but-unused. Operators with PSA restricted profiles or capability allowlists can deploy in this mode and never need the cap.

See the threat model for the full STRIDE table on the reclaim agent.

User-facing re-enable flow¶

When the user is ready to return the node to scheduled service:

kubectl patch scheduledmachine/dev-workstation-01 \
  --type merge \
  -p '{"spec":{"enabled":true}}'

What happens next depends on whether the matched process is still running:

stateDiagram-v2 [*] --> Disabled: Emergency reclaim complete Disabled --> Pending: User sets spec.enabled=true state matched_check <<choice>> Pending --> matched_check: Re-evaluate schedule + re-assess reclaim state matched_check --> Active: In schedule window AND matched process gone matched_check --> Inactive: Outside schedule window AND matched process gone matched_check --> EmergencyRemove: Matched process still running → agent re-fires EmergencyRemove --> Disabled: Eject re-runs, enabled auto-flipped false again note right of matched_check If user re-enables while matched process still runs, controller ejects them again. Intended feedback: "quit your JVM first, then re-enable." end note

This is intentional. Re-enabling with the matched process still running is the controller's way of telling the user: "Your JVM is still eating the box. Quit it, then re-enable the machine (spec.enabled=true)."

Permanently opting out¶

If the user wants to permanently remove the node from the emergency-reclaim path (e.g. they changed their mind about using the workstation as a cluster node), clear spec.killIfCommands:

kubectl patch scheduledmachine/dev-workstation-01 \
  --type merge \
  -p '{"spec":{"killIfCommands":null}}'

This triggers the controller to strip the node label and delete the per-node ConfigMap, which tears the DaemonSet pod off the node. Time-based scheduling resumes as normal.

Observing the emergency-reclaim path¶

Kubernetes Events¶

The controller emits two distinct Events on the ScheduledMachine:

Reason	Emitted at	Meaning
`EmergencyReclaim`	Start of `EmergencyRemove` phase	Agent annotation observed; entering non-graceful eject. Event message includes the reason string from the annotation (e.g. `process-match: java`).
`EmergencyReclaimDisabledSchedule`	After successful eject, before annotation clear	`spec.enabled` has been flipped to `false`. Event message tells the operator how to re-enable.

View them with:

kubectl describe scheduledmachine/<name> | grep -A 20 Events

Status condition¶

A condition is written on the ScheduledMachine status so that kubectl get scheduledmachine -o yaml surfaces why the schedule is disabled:

status:
  phase: Disabled
  conditions:
    - type: Scheduled
      status: "False"
      reason: EmergencyReclaimDisabledSchedule
      message: "Machine auto-disabled (spec.enabled=false) by emergency reclaim (process-match: java). Re-enable with kubectl patch when ready."
      lastTransitionTime: "2026-04-20T21:45:00Z"

Without this condition, a human reading the object a week later would have no way to tell why enabled is false — whether the operator set it manually, whether there was an emergency reclaim, or whether it has always been off.

Node annotations (transient)¶

While the eject is in flight, the Node object carries the three reclaim annotations. They are cleared as the last step of the handler, so steady-state shows no trace on the Node. If you catch one mid-eject:

kubectl get node <node-name> -o jsonpath='{.metadata.annotations}' | jq
# {
#   "5spot.finos.org/reclaim-requested": "true",
#   "5spot.finos.org/reclaim-reason": "process-match: java",
#   "5spot.finos.org/reclaim-requested-at": "2026-04-20T21:45:00Z"
# }

The annotation value "5spot.finos.org/reclaim-requested" is checked as a strict literal "true" — any other value ("false", "0", empty, missing) is treated as not-requested, which avoids partial-write foot-guns.

Agent logs¶

Each DaemonSet pod logs at info level on match and annotation write. From the host:

kubectl logs -n 5spot-system -l app=5spot-reclaim-agent --tail=50 | jq

What is not part of emergency reclaim¶

These are out of scope — recorded here because operators often ask:

Killing the matched process. 5-Spot never touches the user's process. The user started it on purpose; killing their JVM to "help them get their node back" is the opposite of the contract. Agent sees, signals, exits.
"Reclaim only while the process is running" semantics. Match is a trigger, not a live predicate. If the user kill -9s their JVM after the annotation is written but before the drain completes, the node still gets reclaimed. Partial-eject-and-rollback semantics would make the state machine significantly more complex for no real-world benefit.
Cross-node reclaim. Each agent is scoped to its own node. A user launching Java on node A does not reclaim node B.
Auto-resume when the process exits. See Open Question 6 in the roadmap. Explicit re-enable by the user is the MVP choice; revisit only if operator feedback demands it.

Host-identity verification¶

Before the agent PATCHes any Node with reclaim annotations, it cross-checks the host's /etc/machine-id against the target Node's status.nodeInfo.machineID. Both are populated from the same source (systemd-machine-id-setup on host boot; kubelet on Node registration), so they agree on a healthy node — and a mismatch is a strong signal that the agent is about to write to the wrong Node.

Why¶

This closes the modified-DaemonSet → impersonate-victim-Node path:

Without the check, the agent trusts whatever value NODE_NAME carries (sourced from the downward API: spec.nodeName).
An attacker with update daemonsets in 5spot-system can override NODE_NAME to a hard-coded value (e.g. victim-node).
The agent runs on host A, sees a process match, and PATCHes victim-node instead of host A — triggering Phase::EmergencyRemove on a node it has no business reclaiming.

The cross-check makes the impersonation visible: the spoofed Node's status.nodeInfo.machineID belongs to a different host, so the comparison fails and the agent refuses to write.

The exploit precondition (update daemonsets) is cluster-admin in most clusters, so this is defence-in-depth — but the check is cheap and removes a category of impersonation. Filed as Phase 4 of the 2026-04-25 security audit roadmap.

Modes¶

Mode	Flag / env	Behaviour
Strict (default)	`--skip-host-id-check=false` / `SKIP_HOST_ID_CHECK=false`	Read `/etc/machine-id` at startup; fail to start if missing or empty. Before each PATCH, fetch the target Node and refuse if `status.nodeInfo.machineID` does not match.
Bypass	`--skip-host-id-check=true` / `SKIP_HOST_ID_CHECK=true`	Trust `NODE_NAME` blindly (pre-Phase-4 behaviour). Use only when `/etc/machine-id` is genuinely unavailable: containers without the host file mounted, dev sandboxes, kubelet variants that do not populate `status.nodeInfo.machineID`.

How `/etc/machine-id` is exposed¶

The DaemonSet mounts the host file as a single read-only file (not the whole /etc):

volumeMounts:
  - name: host-machine-id
    mountPath: /host/etc/machine-id
    readOnly: true
volumes:
  - name: host-machine-id
    hostPath:
      path: /etc/machine-id
      type: File

type: File makes kubelet refuse to schedule the pod if the host file is missing — fail-fast is preferable to silently degrading to skip-check mode. The agent reads the path from MACHINE_ID_PATH (default /host/etc/machine-id when deployed via the manifest).

What it does NOT defend against¶

A compromise that lets an attacker modify both the DaemonSet and the host's /etc/machine-id (would require root on the node).
Kubelet bugs or out-of-band manipulation of Node.status.nodeInfo.machineID. The kubelet writes this field; if the cluster trusts a misbehaving kubelet, identity verification has no anchor.

Both gaps are out of scope for a node-side agent; remediating them requires a stronger node-attestation primitive (TPM-backed identity, SPIFFE SVIDs, etc.) — not in this controller's scope. A phased design for closing both gaps (TPM-quoted machine-id → SPIRE tpm_devid attestation → controller-side SVID verification) is tracked in the 5spot-strong-node-attestation roadmap.

Machine Lifecycle — full phase state machine including EmergencyRemove
ScheduledMachine — CRD reference including killIfCommands and killSwitch
Troubleshooting — debugging the kill switch
Roadmap — Emergency Node Reclaim by Process Match — design rationale, phase breakdown, open questions