Emergency Reclaim (Process-Match Kill Switch)¶
Status: Shipped — trigger contract, node-side agent (both rung 1 /proc poll and rung 2 netlink proc connector), CRD field, controller-side EmergencyRemove dispatch, and opt-in DaemonSet are all live. See the roadmap for the full design and the per-phase ship dates.
Why emergency reclaim exists¶
ScheduledMachine is built for time-based scheduling: a workstation joins the cluster at 9 AM, leaves at 5 PM, next morning it joins again. That model assumes the human owner of the machine does not need it back right now — they'll get it back at the next schedule boundary.
Real-world usage breaks that assumption. When a developer whose workstation doubles as a cluster node opens IntelliJ or fires up a JVM, the schedule is irrelevant: they want the box back immediately. Waiting for the next schedule boundary — or even a graceful drain (tens of seconds to minutes) — is unacceptable while the user is staring at a spinning JVM startup and the cluster is eating their cores.
Emergency reclaim is the nuclear eject path: a node-side agent watches for a user-declared set of process names / argv patterns, and on first match, rips the node out of the cluster as fast as physically possible.
Two kill switches, one family¶
5-Spot ships two emergency-remove mechanisms. They share the same "skip graceful drain, eject now" semantics but differ in who pulls the trigger:
| Mechanism | Trigger | Phase | Side effect | Reset |
|---|---|---|---|---|
spec.killSwitch: true |
Operator edits the ScheduledMachine |
Terminated |
None — machine stays terminated until operator resets | Operator sets killSwitch: false |
spec.killIfCommands: [...] (this page) |
Agent on the node sees a matching process | EmergencyRemove |
Controller flips spec.schedule.enabled=false so the machine does not auto-rejoin at the next schedule boundary |
Operator (or the user whose workstation it is) sets schedule.enabled: true — see User-facing re-enable flow below |
Both are opt-in: killSwitch defaults to false, and killIfCommands is absent by default (no agent installed on the node).
The trigger chain¶
Each hop's responsibility stays narrow: the agent only writes annotations (node-scoped RBAC, no broad API access); the controller owns all cluster-mutating work (drain, Machine delete, schedule flip).
Lifecycle — where EmergencyRemove fits¶
The new phase plugs into the existing ScheduledMachine state machine as a branch off any "alive" phase:
Why the exit is Disabled, not Pending¶
This is the piece that makes emergency reclaim useful instead of infuriating. When the eject completes, the controller sets spec.schedule.enabled=false on the owning ScheduledMachine. Without that flip:
- Eject succeeds, node leaves cluster.
- Next schedule window opens (e.g. 9 AM the next morning).
- Controller sees "in-schedule, no machine exists" and re-creates the Machine → node rejoins.
- User's JVM is still running.
- Agent sees match → annotates → controller ejects again.
- Loop repeats every schedule window forever.
By flipping enabled=false the controller makes the user's explicit re-enable the trigger to return the node to service — which is the behavior a human actually wants.
Sequence of operations¶
Why this ordering matters:
- Annotations are cleared last, not first. If the user reboots or the agent restarts mid-eject, the annotation is still there, so the controller keeps processing — the operation is idempotent.
schedule.enabled=falseis set before annotation clear. This is the load-bearing step. If we cleared the annotation first and then crashed, the next reconcile would see no annotation, not flipenabled, and fall back to normal scheduling.- Machine deletion is immediate (
--grace-period=0). No PDB respect, nogracefulShutdownTimeoutwait, nonodeDrainTimeout— the whole point is that the user needs their box back now.
Opt-in installation¶
The reclaim agent is not installed on every node. Both gates below must be satisfied for the agent to land on a node:
Why two gates:
- Node label (
5spot.finos.org/reclaim-agent=enabled) — gates scheduling. An operator cannot accidentally install the agent on nodes that have not opted in; the DaemonSet simply will not schedule. - Per-node ConfigMap — gates behavior. Even if an agent pod is running, a config with both match lists empty returns
Noneevery scan and never annotates. This is the secondary safety net: "everyone has the agent installed with empty match list" is made impossible by construction because the operator has to type the patterns intospec.killIfCommandsto get anything to happen.
For regulated environments, this double-gating keeps the node-side surface proportional to declared intent — and the binary with elevated privileges (hostPID: true, reads every pid in /proc) never lands on nodes that have not declared they want it.
MVP limitation
The 2026-04-20 MVP ships the DaemonSet + RBAC but not the controller-side label-stamp and per-node ConfigMap projection. Operators who want to try the agent today must label nodes and create the ConfigMap manually. See the roadmap for the remaining work.
Configuring killIfCommands¶
Declare the match patterns on the ScheduledMachine spec:
apiVersion: 5spot.finos.org/v1alpha1
kind: ScheduledMachine
metadata:
name: dev-workstation-01
spec:
schedule:
daysOfWeek: [mon-fri]
hoursOfDay: [19-23] # join after hours
timezone: America/New_York
enabled: true
clusterName: dev-cluster
# Emergency reclaim: pull this node out the moment the user opens
# any of these processes. Matched against /proc/*/comm (exact basename)
# and /proc/*/cmdline (substring).
killIfCommands:
- java
- idea
- steam
bootstrapSpec:
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: K0sWorkerConfig
spec:
version: v1.30.0+k0s.0
infrastructureSpec:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: RemoteMachine
spec:
address: 192.168.1.100
port: 22
user: admin
useSudo: true
Matching semantics¶
| Source | Kind | Example match |
|---|---|---|
/proc/<pid>/comm |
Exact basename, case-sensitive, max 15 chars (kernel limit) | java matches a process whose comm is literally java. Does not match Java or java-wrapper. |
/proc/<pid>/cmdline |
Substring of the NUL-separated argv joined with spaces | idea matches /opt/idea/bin/idea.sh -Xmx2g. Substring is case-sensitive. |
A config with both match lists empty (or killIfCommands: []) is inert — the agent returns None every scan and never annotates. Absent is equivalent to empty.
Poll interval¶
The agent polls /proc every poll_interval_ms (default 250 ms — see DEFAULT_POLL_INTERVAL_MS in src/reclaim_agent.rs). This is the worst-case detection latency on rung 1 (/proc poll).
Detector — rung 1 (/proc poll) vs rung 2 (netlink)¶
The agent ships two detection back-ends. Both produce identical Match values, both go through the same Node-PATCH path; the only difference is the event source.
| Mode | What it does | Latency | Idle CPU | Linux only? | Extra capability |
|---|---|---|---|---|---|
poll (rung 1) |
Walks /proc every poll_interval_ms |
up to one poll interval (250 ms default) | ~0 (one syscall per pid per tick) | No — works wherever /proc exists |
None |
netlink (rung 2) |
Subscribes to the kernel's proc connector (PROC_EVENT_EXEC); kernel pushes one message per exec(2) |
<10 ms (kernel-pushed) | sleeps until the kernel wakes it | Yes — Linux only | CAP_NET_ADMIN |
Selection:
--detector=auto # default. Linux → netlink, anywhere else → poll.
--detector=netlink # force netlink. Errors loudly on non-Linux or
# without CAP_NET_ADMIN.
--detector=poll # force the rung-1 poll loop on any platform.
Override at deploy time via RECLAIM_DETECTOR={auto|netlink|poll} on the DaemonSet pod.
Tradeoff. netlink is lower latency and lower idle CPU in steady state. Under heavy-exec workloads (make -j32, compile farms) it can be more expensive than poll, because rung 2 sees every short-lived process whereas rung 1 only sees those that survive to the next tick. If the operator's fleet runs exec storms, pin --detector=poll explicitly; otherwise keep the auto default.
The DaemonSet manifest grants CAP_NET_ADMIN once at the pod level (see deploy/node-agent/daemonset.yaml), so an operator switching --detector=poll → netlink (or vice versa) does not need to re-roll the pod's security context. The cap is unused on poll; the kernel never sees a netlink syscall in that mode.
Kernel and capability requirements¶
Rung 1 (poll) needs essentially nothing — /proc exists on every Linux kernel and the agent already has UID 0 + host /proc mount for other reasons. It also runs unmodified in a non-Linux container if you ever need to (the Subscriber constructor reports "unsupported" and the bin falls back automatically when --detector=auto is in effect).
Rung 2 (netlink) has three runtime prerequisites:
| Requirement | Provided by | What happens if missing |
|---|---|---|
Kernel built with CONFIG_PROC_EVENTS=y |
Distro kernel build (Debian, Ubuntu, RHEL/CentOS, Alpine, Talos, Bottlerocket, Chainguard host kernels — all default y) |
The netlink socket opens cleanly but the kernel never pushes events. Agent sits idle, never matches. No userspace error to detect this. Diagnose by zgrep CONFIG_PROC_EVENTS /proc/config.gz on the host; mitigate by --detector=poll. |
CAP_NET_ADMIN on the agent container |
deploy/node-agent/daemonset.yaml securityContext.capabilities.add: [NET_ADMIN] |
Subscriber::new() fails at startup with EPERM; the bin exits non-zero and kubelet restarts it (crash loop until the cap is granted). --detector=poll works without it. |
| Linux kernel ≥ 2.6.15 | Effectively guaranteed on any cluster you'd actually deploy 5-Spot to (released 2006) | Same as missing CONFIG_PROC_EVENTS — silent. |
The cap chain is straightforward:
deploy/node-agent/daemonset.yaml
└─ container.securityContext.capabilities
├─ drop: [ALL] # default-deny
└─ add: [NET_ADMIN] # rung 2 only — rung 1 doesn't use it
This is the only added capability. UID 0 + host /proc + hostPID were already required for rung 1 (the agent has to read every process's /proc/<pid>/{comm,cmdline} under hidepid=2). See .trivyignore for the architectural-necessity rationale on each Trivy / Semgrep finding the configuration trips.
CAP_NET_ADMIN is broader than just netlink-connector access — it also grants ability to edit routes, iptables, etc. on the host network namespace. The risk is bounded by:
- Opt-in nodeSelector — the cap is granted only on Nodes labelled
5spot.finos.org/reclaim-agent: enabled, which the controller stamps only when aScheduledMachineon that Node has a non-emptyspec.killIfCommands. A cluster that never useskillIfCommandsnever has the cap on any pod. - Single-purpose binary — the agent has no shell, no exec of children, no remote-control surface. Compromising it requires swapping the binary on disk, at which point an attacker could grant any cap they wanted anyway.
- Operator opt-out — pin
--detector=poll(envRECLAIM_DETECTOR=poll) and the cap is granted-but-unused. Operators with PSArestrictedprofiles or capability allowlists can deploy in this mode and never need the cap.
See the threat model for the full STRIDE table on the reclaim agent.
User-facing re-enable flow¶
When the user is ready to return the node to scheduled service:
kubectl patch scheduledmachine/dev-workstation-01 \
--type merge \
-p '{"spec":{"schedule":{"enabled":true}}}'
What happens next depends on whether the matched process is still running:
This is intentional. Re-enabling with the matched process still running is the controller's way of telling the user: "Your JVM is still eating the box. Quit it, then re-enable the schedule."
Permanently opting out¶
If the user wants to permanently remove the node from the emergency-reclaim path (e.g. they changed their mind about using the workstation as a cluster node), clear spec.killIfCommands:
kubectl patch scheduledmachine/dev-workstation-01 \
--type merge \
-p '{"spec":{"killIfCommands":null}}'
This triggers the controller to strip the node label and delete the per-node ConfigMap, which tears the DaemonSet pod off the node. Time-based scheduling resumes as normal.
Observing the emergency-reclaim path¶
Kubernetes Events¶
The controller emits two distinct Events on the ScheduledMachine:
| Reason | Emitted at | Meaning |
|---|---|---|
EmergencyReclaim |
Start of EmergencyRemove phase |
Agent annotation observed; entering non-graceful eject. Event message includes the reason string from the annotation (e.g. process-match: java). |
EmergencyReclaimDisabledSchedule |
After successful eject, before annotation clear | spec.schedule.enabled has been flipped to false. Event message tells the operator how to re-enable. |
View them with:
Status condition¶
A condition is written on the ScheduledMachine status so that kubectl get scheduledmachine -o yaml surfaces why the schedule is disabled:
status:
phase: Disabled
conditions:
- type: Scheduled
status: "False"
reason: EmergencyReclaimDisabledSchedule
message: "Schedule auto-disabled by emergency reclaim (process-match: java). Re-enable with kubectl patch when ready."
lastTransitionTime: "2026-04-20T21:45:00Z"
Without this condition, a human reading the object a week later would have no way to tell why enabled is false — whether the operator set it manually, whether there was an emergency reclaim, or whether it has always been off.
Node annotations (transient)¶
While the eject is in flight, the Node object carries the three reclaim annotations. They are cleared as the last step of the handler, so steady-state shows no trace on the Node. If you catch one mid-eject:
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}' | jq
# {
# "5spot.finos.org/reclaim-requested": "true",
# "5spot.finos.org/reclaim-reason": "process-match: java",
# "5spot.finos.org/reclaim-requested-at": "2026-04-20T21:45:00Z"
# }
The annotation value "5spot.finos.org/reclaim-requested" is checked as a strict literal "true" — any other value ("false", "0", empty, missing) is treated as not-requested, which avoids partial-write foot-guns.
Agent logs¶
Each DaemonSet pod logs at info level on match and annotation write. From the host:
What is not part of emergency reclaim¶
These are out of scope — recorded here because operators often ask:
- Killing the matched process. 5-Spot never touches the user's process. The user started it on purpose; killing their JVM to "help them get their node back" is the opposite of the contract. Agent sees, signals, exits.
- "Reclaim only while the process is running" semantics. Match is a trigger, not a live predicate. If the user
kill -9s their JVM after the annotation is written but before the drain completes, the node still gets reclaimed. Partial-eject-and-rollback semantics would make the state machine significantly more complex for no real-world benefit. - Cross-node reclaim. Each agent is scoped to its own node. A user launching Java on node A does not reclaim node B.
- Auto-resume when the process exits. See Open Question 6 in the roadmap. Explicit re-enable by the user is the MVP choice; revisit only if operator feedback demands it.
Host-identity verification¶
Before the agent PATCHes any Node with reclaim annotations, it
cross-checks the host's /etc/machine-id against the target Node's
status.nodeInfo.machineID. Both are populated from the same source
(systemd-machine-id-setup on host boot; kubelet on Node registration),
so they agree on a healthy node — and a mismatch is a strong signal
that the agent is about to write to the wrong Node.
Why¶
This closes the modified-DaemonSet → impersonate-victim-Node path:
- Without the check, the agent trusts whatever value
NODE_NAMEcarries (sourced from the downward API:spec.nodeName). - An attacker with
update daemonsetsin5spot-systemcan overrideNODE_NAMEto a hard-coded value (e.g.victim-node). - The agent runs on host A, sees a process match, and PATCHes
victim-nodeinstead of host A — triggeringPhase::EmergencyRemoveon a node it has no business reclaiming.
The cross-check makes the impersonation visible: the spoofed Node's
status.nodeInfo.machineID belongs to a different host, so the
comparison fails and the agent refuses to write.
The exploit precondition (update daemonsets) is cluster-admin in most
clusters, so this is defence-in-depth — but the check is cheap and
removes a category of impersonation. Filed as Phase 4 of the
2026-04-25 security audit roadmap.
Modes¶
| Mode | Flag / env | Behaviour |
|---|---|---|
| Strict (default) | --skip-host-id-check=false / SKIP_HOST_ID_CHECK=false |
Read /etc/machine-id at startup; fail to start if missing or empty. Before each PATCH, fetch the target Node and refuse if status.nodeInfo.machineID does not match. |
| Bypass | --skip-host-id-check=true / SKIP_HOST_ID_CHECK=true |
Trust NODE_NAME blindly (pre-Phase-4 behaviour). Use only when /etc/machine-id is genuinely unavailable: containers without the host file mounted, dev sandboxes, kubelet variants that do not populate status.nodeInfo.machineID. |
How /etc/machine-id is exposed¶
The DaemonSet mounts the host file as a single read-only file (not the
whole /etc):
volumeMounts:
- name: host-machine-id
mountPath: /host/etc/machine-id
readOnly: true
volumes:
- name: host-machine-id
hostPath:
path: /etc/machine-id
type: File
type: File makes kubelet refuse to schedule the pod if the host file
is missing — fail-fast is preferable to silently degrading to skip-check
mode. The agent reads the path from MACHINE_ID_PATH (default
/host/etc/machine-id when deployed via the manifest).
What it does NOT defend against¶
- A compromise that lets an attacker modify both the DaemonSet and
the host's
/etc/machine-id(would require root on the node). - Kubelet bugs or out-of-band manipulation of
Node.status.nodeInfo.machineID. The kubelet writes this field; if the cluster trusts a misbehaving kubelet, identity verification has no anchor.
Both gaps are out of scope for a node-side agent; remediating them
requires a stronger node-attestation primitive (TPM-backed identity,
SPIFFE SVIDs, etc.) — not in this controller's scope. A phased design
for closing both gaps (TPM-quoted machine-id → SPIRE tpm_devid
attestation → controller-side SVID verification) is tracked in the
5spot-strong-node-attestation roadmap.
Related¶
- Machine Lifecycle — full phase state machine including
EmergencyRemove - ScheduledMachine — CRD reference including
killIfCommandsandkillSwitch - Troubleshooting — debugging the kill switch
- Roadmap — Emergency Node Reclaim by Process Match — design rationale, phase breakdown, open questions