Troubleshooting¶
Common issues and solutions for 5-Spot.
Diagnostic Commands¶
Check Operator Status¶
# Operator pods
kubectl get pods -n 5spot-system
# Operator logs (JSON — pipe through jq for readability)
kubectl logs -n 5spot-system -l app=5spot-controller --tail=100 | jq .
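# Plain-text logs (for quick reads without jq). RUST_LOG_FORMAT=text has to be set
# in the controller's own environment; prefixing it to kubectl does not change the output.
kubectl logs -n 5spot-system -l app=5spot-controller --tail=100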
# Detailed pod info
kubectl describe pod -n 5spot-system -l app=5spot-controller
Filter Logs by Correlation ID¶
Every reconciliation carries a unique reconcile_id field. Use it to isolate all log lines for a single reconciliation attempt:
# Stream logs and filter by resource name, showing reconcile_id
kubectl logs -n 5spot-system -l app=5spot-controller -f | \
jq -c 'select(.fields.resource == "<machine-name>")'
# Trace a specific reconciliation end-to-end
kubectl logs -n 5spot-system -l app=5spot-controller | \
jq -c 'select(.fields.reconcile_id == "<id-from-a-previous-log-line>")'
# Find all Error-phase transitions
kubectl logs -n 5spot-system -l app=5spot-controller | \
jq -c 'select(.fields.to_phase == "Error")'
Check ScheduledMachines¶
# List all ScheduledMachines
kubectl get scheduledmachines -A
# Detailed status
kubectl describe scheduledmachine <name>
# Get status as JSON
kubectl get scheduledmachine <name> -o jsonpath='{.status}'
Check CAPI Machines¶
Common Issues¶
Machine Stuck in Pending¶
Symptoms:
- Machine stays in Pending phase
- No Machine resource created
Possible Causes:
- Schedule not matching current time
- Operator not running
- RBAC permissions
Solution:
- Verify schedule matches current time and timezone
- Check controller logs for errors
- Ensure RBAC is correctly configured
Machine Not Removing¶
Symptoms:
- Machine stays in Active after schedule window
- Grace period seems to never complete
Possible Causes:
- Pods not draining
- PodDisruptionBudget blocking eviction
PDB-blocked evictions (HTTP 429) now surface as a CapiError in the reconciler and will cause the machine to enter the Error phase. Check for blocking PDBs:
kubectl get pdb -A
# Look for PDBs with maxUnavailable: 0 or minAvailable matching current replicas
kubectl get pdb -A -o json | jq '.items[] | {name:.metadata.name, ns:.metadata.namespace, disruptions:.status.disruptionsAllowed}'
- Long grace period
Solution:
- Check for pods that can't be evicted; look for warn log lines with "Pod eviction blocked by PDB (HTTP 429)"
- Review PDB settings — temporarily scale up or relax minAvailable to allow drain
- Consider using killSwitch: true for immediate removal (bypasses drain)
Schedule Not Evaluating¶
Symptoms:
- Machine doesn't activate during schedule window
- No status changes
Possible Causes:
- Schedule disabled
- Timezone mismatch
- Multi-instance: wrong instance handling resource
Solution:
- Ensure enabled: true
- Verify timezone is correct
- Check controller instance distribution
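When a timezone mismatch is suspected, it can help to print the current weekday and time as the schedule's timezone sees them. The following is a minimal sketch that uses only `date` and the `TZ` environment variable; the function name `now_in_tz` is hypothetical and not part of 5-Spot.

```shell
# Hypothetical helper (not part of 5-Spot): print the current weekday and time
# in a given IANA timezone, for comparison against the schedule's days and hours.
now_in_tz() {
  # LC_ALL=C keeps the weekday abbreviation in English (Mon, Tue, ...).
  LC_ALL=C TZ="$1" date '+%a %H:%M'
}
```

Compare the result of `now_in_tz America/New_York` with the configured days and hours. On many systems an unknown zone name silently falls back to UTC, so double-check the spelling of the timezone as well.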
CAPI Integration Errors¶
Symptoms:
- Error events on ScheduledMachine
- CAPI Machine not being created
Possible Causes:
- Invalid bootstrapRef or infrastructureRef
- CAPI provider not ready
Solution:
- Verify references point to existing resources
- Check CAPI provider health
- Review CAPI controller logs
Reconciliation Retrying with Increasing Delay¶
Symptoms:
- Repeated error events on a ScheduledMachine
- Logs show retry_count climbing and backoff_secs growing (30 → 60 → 120 → 240 → 300)
Cause: The controller uses bounded exponential back-off. Each consecutive failure doubles the retry delay up to 300 s (5 min). The counter resets after a successful reconciliation.
# Watch the retry_count and backoff_secs fields
kubectl logs -n 5spot-system -l app=5spot-controller -f | \
jq -c 'select(.fields.resource == "<machine-name>") | {retry: .fields.retry_count, backoff: .fields.backoff_secs, error: .fields.error}'
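The delay sequence described above can be reproduced with a few lines of shell, which is useful when estimating how long a resource has been failing from its retry_count. This is a sketch of the documented behavior only (start at 30 s, double on each failure, cap at 300 s), not the controller's code. The function name `backoff_secs` is hypothetical, and treating retry 0 as the first 30 s delay is an assumption.

```shell
# Hypothetical helper: delay in seconds after N consecutive failures,
# following the documented sequence 30 -> 60 -> 120 -> 240 -> 300.
backoff_secs() {
  retries=$1
  delay=30   # first delay, taken from the documented sequence
  cap=300    # documented maximum (5 min)
  i=0
  while [ "$i" -lt "$retries" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge "$cap" ]; then
      delay=$cap
      break   # already at the cap; further doubling changes nothing
    fi
    i=$((i + 1))
  done
  echo "$delay"
}
```

Calling `backoff_secs` with 0 through 4 walks the sequence shown in the symptoms; any larger value stays at the cap.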
Solution:
- Check the underlying error causing repeated failures (CAPI, schedule, validation)
- Once the root cause is fixed, the next successful reconciliation resets the counter
- If the resource is stuck at max backoff (300 s), fix the underlying issue and patch the resource to trigger an immediate reconcile:
kubectl annotate scheduledmachine <name> 5spot.finos.org/force-reconcile="$(date -u +%s)" --overwrite
Orphan resources after finalizer timeout¶
Symptom:
- fivespot_finalizer_cleanup_timeouts_total increments above zero.
- A kubectl describe scheduledmachine <name> shows a Warning event with reason FinalizerCleanupTimedOut.
- The ScheduledMachine has been deleted (gone from kubectl get) but a CAPI Machine, bootstrap resource, or infrastructure resource may still exist in the namespace.
Root cause.
handle_deletion wraps CAPI cleanup in a hard timeout
(FINALIZER_CLEANUP_TIMEOUT_SECS, default 600s / 10 minutes) so a hung
eviction cannot stall namespace deletion. By default
(--force-finalizer-on-timeout=true, env FORCE_FINALIZER_ON_TIMEOUT=true)
the controller force-removes its finalizer when the timeout fires —
unblocking namespace deletion at the cost of potentially leaving CAPI
resources without a managing ScheduledMachine. The most common trigger
is a misconfigured Pod Disruption Budget (e.g. minAvailable: 999)
on a workload the controller is trying to evict during node drain.
Runbook.
- Find the orphaned CAPI Machine:
# Machines from this namespace whose owning ScheduledMachine no longer exists.
kubectl get machines.cluster.x-k8s.io -n <ns> -o json \
  | jq -r '.items[] | select(.metadata.ownerReferences[]?.kind == "ScheduledMachine") | .metadata.name'
for m in $(kubectl get machines.cluster.x-k8s.io -n <ns> -o name); do
  owner=$(kubectl get $m -n <ns> -o jsonpath='{.metadata.ownerReferences[?(@.kind=="ScheduledMachine")].name}')
  if [ -n "$owner" ] && ! kubectl get scheduledmachine "$owner" -n <ns> >/dev/null 2>&1; then
    echo "ORPHAN: $m (was owned by $owner)"
  fi
done
- Identify the bootstrap and infrastructure resources the orphan Machine references:
- Delete the orphan Machine first; CAPI cascades into the bootstrap / infra refs via ownerReferences:
If the Machine itself is stuck terminating (drain still blocked), inspect Pods + PDBs on the underlying Node and remove the offending PDB before retrying.
- Verify nothing is left behind:
Prevention.
- Alert on rate(fivespot_finalizer_cleanup_timeouts_total[5m]) > 0.
- Validate Pod Disruption Budgets at admission (CEL ValidatingAdmissionPolicy): reject minAvailable values that exceed the workload's replica count.
- For environments where stalled SMs are preferable to potential orphans (e.g. a sweep job is in place), set --force-finalizer-on-timeout=false (env FORCE_FINALIZER_ON_TIMEOUT=false). The metric and Warning event fire in both modes; the only difference is whether the finalizer is removed on timeout. Strict mode requires an external sweep to garbage-collect SMs whose drain is permanently blocked, otherwise namespace deletion stalls indefinitely.
Emergency Reclaim (Kill Switch)¶
See Emergency Reclaim for the full lifecycle. This section covers the diagnostic angles most operators hit in the field.
ScheduledMachine stuck in EmergencyRemove¶
Symptoms:
- kubectl get scheduledmachine shows PHASE=EmergencyRemove and does not move to Disabled.
- The node still appears in the cluster.
Diagnosis:
# Is the reclaim annotation still on the Node? (expected during eject, cleared at end)
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}' | jq \
'with_entries(select(.key | startswith("5spot.finos.org/reclaim")))'
# Controller logs for the emergency-remove handler
kubectl logs -n 5spot-system -l app=5spot-controller --tail=200 | \
jq -c 'select(.fields.phase == "EmergencyRemove")'
# Events on the ScheduledMachine
kubectl describe scheduledmachine/<name> | grep -A 5 Events
Common causes:
- Drain is blocked by non-evictable pods. The handler uses --force --disable-eviction, so this should be rare; if it happens, a pod is probably stuck in Terminating waiting on a finalizer of its own.
- CAPI Machine deletion is blocked. Check kubectl describe machine/<machine-name> for a finalizer that has not been cleared.
- Controller crashed mid-handler. On restart the annotation is still there (cleared last), so the handler will retry from the top; the operation is idempotent.
Node keeps getting ejected every schedule window¶
Symptom: The ScheduledMachine cycles Disabled → Pending → Active → EmergencyRemove → Disabled → ... at every schedule boundary.
Cause: The matched process is still running, the user re-enabled the schedule without quitting it first, and the agent correctly re-fired on the next poll.
Confirm:
# Check what the agent matched on
kubectl logs -n 5spot-system -l app=5spot-reclaim-agent --tail=50 | jq -c 'select(.fields.matched_pattern)'
# Check the condition reason on the ScheduledMachine
kubectl get scheduledmachine/<name> -o jsonpath='{.status.conditions}' | jq \
'.[] | select(.reason == "EmergencyReclaimDisabledSchedule")'
Solution: Quit the matched process on the node, then re-enable:
If the user does not want this node in the reclaim path at all, clear killIfCommands:
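Both steps can be expressed as JSON merge patches. The following is a minimal sketch, assuming the field paths spec.schedule.enabled and spec.killIfCommands referenced elsewhere on this page; the wrapper function names are hypothetical and nothing here ships with 5-Spot.

```shell
# Hypothetical wrappers (not shipped with 5-Spot). Arguments: <name> <namespace>.

# Re-enable the schedule once the matched process has been stopped.
# Assumes the field path spec.schedule.enabled.
reenable_schedule() {
  kubectl patch scheduledmachine "$1" -n "$2" --type=merge \
    -p '{"spec":{"schedule":{"enabled":true}}}'
}

# Take the node out of the reclaim path by removing killIfCommands.
# Assumes the field path spec.killIfCommands; a null value in a JSON
# merge patch deletes the field.
clear_kill_if_commands() {
  kubectl patch scheduledmachine "$1" -n "$2" --type=merge \
    -p '{"spec":{"killIfCommands":null}}'
}
```

Run `reenable_schedule <name> <namespace>` only after confirming on the node that the matched process has exited; otherwise the agent fires again on the next poll.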
Reclaim agent never fires on a known-matching process¶
Symptoms: User has a matching process running, but the Node never gets annotated.
Checklist:
- Is the agent pod actually running on the node?
If no pod lands on the target node, the 5spot.finos.org/reclaim-agent=enabled label is probably missing. Check the node labels:
- Is the per-node ConfigMap present and readable? The agent no longer mounts its config from a file; it watches the per-node ConfigMap named reclaim-agent-<NODE_NAME> in 5spot-system via the kube API and hot-reloads on every change. Check the ConfigMap directly:
Missing ConfigMap → agent idles (no proc scanning) until one appears. Empty match_commands + empty match_argv_substrings = agent is armed but inert (never matches) by design. The agent logs "configmap applied — rearming scanner" at INFO on every observed change; tail the pod logs to confirm it sees yours:
- Is the agent reading real /proc?
Expect many numeric directory names. If you only see 1 and self, the pod's hostPID: true mount is broken; re-check the DaemonSet template.
- Match is case-sensitive. match_commands = ["Java"] does not match a java process. Lowercase the pattern to match the typical JVM binary name.
- The agent only reads /proc/<pid>/comm (exact basename) and /proc/<pid>/cmdline (substring). A process whose comm is java-wrapper but argv starts with /opt/jdk/bin/java ... matches on cmdline (substring), not on comm (exact).
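The two matching rules in the checklist above (exact on comm, substring on cmdline, both case-sensitive) can be tried out by hand when a pattern is not behaving as expected. The sketch below is an illustration of those rules in plain shell, not the agent's implementation; all function names are hypothetical.

```shell
# Illustration only, not the agent's code.
# In both functions, $1 is the value read from /proc and $2 is the pattern.

# comm rule: exact, case-sensitive comparison against the process name.
comm_matches() {
  [ "$1" = "$2" ]
}

# cmdline rule: case-sensitive substring test against the full command line.
cmdline_matches() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# List PIDs whose comm equals the given name (requires a readable /proc).
pids_with_comm() {
  for dir in /proc/[0-9]*; do
    [ -r "$dir/comm" ] || continue
    if comm_matches "$(cat "$dir/comm" 2>/dev/null)" "$1"; then
      echo "${dir#/proc/}"
    fi
  done
}
```

With these rules, a comm of java-wrapper does not satisfy the pattern java, while a command line of `/opt/jdk/bin/java -jar app.jar` does, which mirrors the java-wrapper case in the checklist.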
Reclaim agent crash-loops with EPERM on netlink socket¶
Symptom: kubectl logs -n 5spot-system <reclaim-agent-pod> shows
the agent exiting at startup with an error like:
or Subscriber::new() failed: netlink i/o: EPERM.
Cause: --detector=netlink is in effect (the default on Linux),
but the pod's container does not have CAP_NET_ADMIN. The kernel
refuses to bind an AF_NETLINK socket to the CN_IDX_PROC
multicast group without it.
Fix (one of):
- Confirm the cap is in the manifest: deploy/node-agent/daemonset.yaml should grant NET_ADMIN under the container's securityContext.capabilities.add. If a downstream patch removed it, restore:
kubectl get ds -n 5spot-system 5spot-reclaim-agent \
-o jsonpath='{.spec.template.spec.containers[0].securityContext.capabilities}'
# Expect: {"add":["NET_ADMIN"],"drop":["ALL"]}
- Force poll mode if granting CAP_NET_ADMIN is not acceptable in your environment. The agent's rung 1 (/proc poll) detects the same processes at higher latency (≤250 ms vs <10 ms) and zero added capability:
See Detector — rung 1 vs rung 2 for the tradeoff.
- Pod Security Admission rejects pods that add NET_ADMIN under the restricted profile; it does not strip the capability. The DaemonSet's pod-level securityContext already drops runAsNonRoot=false and other restricted-profile bumps; if PSA is enforcing restricted on 5spot-system, relax the namespace's enforce label or add an exemption. Note that the baseline profile also disallows NET_ADMIN and hostPID, so the privileged level (or an exemption) is what the agent needs. The reclaim-agent's privileges are architecturally bounded by its opt-in nodeSelector.
Reclaim agent runs but never observes any events¶
Symptom: Agent pod is Running, no errors in logs, host has
exec activity (e.g. user is launching processes), but kubectl
describe node <name> never shows the reclaim annotations even with
a known-matching killIfCommands list.
Possible cause: kernel built without CONFIG_PROC_EVENTS. This
is rare but possible on heavily-stripped distro kernels (some
embedded / hardened images). The netlink socket opens cleanly but
the kernel never pushes events.
Diagnose:
# Inspect the running kernel's config (if /proc/config.gz is present)
zgrep CONFIG_PROC_EVENTS /proc/config.gz
# Or grep the booted kernel's config file
grep CONFIG_PROC_EVENTS /boot/config-$(uname -r)
# Expect: CONFIG_PROC_EVENTS=y
Fix: switch to --detector=poll (or RECLAIM_DETECTOR=poll).
The rung-1 poll does not depend on the kernel feature.
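One way to apply the environment-variable form of this fix is to set it on the DaemonSet and wait for the rollout. This is a minimal sketch, assuming the DaemonSet is named 5spot-reclaim-agent in 5spot-system (the name used in the capability check earlier on this page); the function name is hypothetical.

```shell
# Hypothetical wrapper (not shipped with 5-Spot): switch the reclaim agent to
# the /proc poll detector by setting RECLAIM_DETECTOR=poll on its DaemonSet,
# then wait until the restarted pods are ready.
# Assumes the DaemonSet is named 5spot-reclaim-agent in 5spot-system.
use_poll_detector() {
  kubectl set env daemonset/5spot-reclaim-agent -n 5spot-system RECLAIM_DETECTOR=poll &&
    kubectl rollout status daemonset/5spot-reclaim-agent -n 5spot-system
}
```

If the DaemonSet is managed from a manifest or a GitOps pipeline, make the same change there as well, since a later apply may revert an in-cluster edit.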
Rapid re-reclaim loop — RapidReReclaim Warning Event¶
Symptom: kubectl get events shows multiple RapidReReclaim
warnings on the same ScheduledMachine within minutes; the
controller log carries the corresponding warning lines; the
fivespot_rapid_re_reclaims_total{namespace, name} metric is
incrementing.
Meaning: The controller has observed ≥3 emergency-reclaim
events for this SM within 10 minutes
(RAPID_RE_RECLAIM_THRESHOLD = 3,
RAPID_RE_RECLAIM_WINDOW_SECS = 600). Almost always: a user is
re-enabling the schedule (spec.schedule.enabled=true) before
they have stopped the conflicting process the agent is matching
on. The agent fires again the moment the Node rejoins, and the
loop repeats.
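The threshold rule above can be checked by hand against event timestamps pulled from kubectl get events. The sketch below mirrors the documented constants (3 events, 600 s) but is not the controller's code; the function name is hypothetical, and counting an event that is exactly 600 s old as inside the window is an assumption.

```shell
# Hypothetical helper: succeed when at least 3 of the given event timestamps
# (Unix seconds) fall within the 600 s before NOW.
# Usage: is_rapid_re_reclaim NOW TS1 TS2 ...
is_rapid_re_reclaim() {
  now=$1
  shift
  threshold=3   # RAPID_RE_RECLAIM_THRESHOLD
  window=600    # RAPID_RE_RECLAIM_WINDOW_SECS
  count=0
  for ts in "$@"; do
    age=$((now - ts))
    # Count only events that are not in the future and not older than the window.
    if [ "$age" -ge 0 ] && [ "$age" -le "$window" ]; then
      count=$((count + 1))
    fi
  done
  [ "$count" -ge "$threshold" ]
}
```

Pass the current time as the first argument (for example the output of `date +%s`), followed by the event times converted to Unix seconds.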
Diagnose:
# 1. Find the offending SM and pull its emergency-reclaim history
kubectl get events --field-selector reason=EmergencyReclaim \
--sort-by='.lastTimestamp' \
-o jsonpath='{range .items[*]}{.lastTimestamp} {.involvedObject.namespace}/{.involvedObject.name} {.message}{"\n"}{end}'
# 2. Check what `killIfCommands` patterns are configured
kubectl get scheduledmachine <name> -o jsonpath='{.spec.killIfCommands}'
# 3. Walk the agent log to see what was matched on the most recent fire
kubectl logs -n 5spot-system -l app=5spot-reclaim-agent --tail=100 \
| jq -c 'select(.fields.matched_pattern)'
Fix paths (operator-side):
- Confirm with the user that the conflicting process is gone before they re-enable. The whole point of the agent is the user's workstation got their attention; ps aux | grep <pattern> on the host (or kubectl debug node/<node-name> -it --image=alpine if no host shell) is the canonical confirmation.
- Adjust killIfCommands if the pattern is over-matching (e.g. java matches every JVM, including the user's IDE that may be acceptable). Switch to the more specific match_argv_substrings form via direct ConfigMap edit, or narrow the basename.
- Suppress the loop temporarily by clearing the spec:
This tears down the per-node ConfigMap + label, which evicts the agent pod from the Node, which means no further reclaim fires even if the user starts a matching process.
Why no auto-resume: the controller deliberately does not flip
spec.schedule.enabled back to true automatically when the
matching process exits. Doing so would invite races between the
Node rejoining and the agent restarting, and would silently mask
the user behaviour the warning is trying to surface. Re-enable is
explicit by design.
EmergencyReclaim event fires but schedule is not disabled¶
Symptom: The EmergencyReclaim event is on the ScheduledMachine, but spec.schedule.enabled is still true.
This indicates the controller crashed between the drain/delete steps and the enabled=false PATCH. The Node annotation is cleared after the PATCH, so the controller will see the annotation on the next reconcile and retry. If it does not, check:
# Is the EmergencyReclaimDisabledSchedule event present?
kubectl get events --field-selector reason=EmergencyReclaimDisabledSchedule \
--sort-by='.lastTimestamp'
# If yes, but spec.schedule.enabled is still true, the PATCH may have lost a race
# with a user edit. Check the generation on the ScheduledMachine:
kubectl get scheduledmachine/<name> -o jsonpath='{.metadata.generation} {.status.observedGeneration}'
Node Taints¶
Taints not appearing on Node¶
spec.nodeTaints is declared on a ScheduledMachine, the machine is Active,
but the Node does not have the expected taints. Walk the NodeTainted
condition on the CR first — it tells you exactly which layer is failing.
kubectl get scheduledmachine <name> \
-o jsonpath='{.status.conditions[?(@.type=="NodeTainted")]}{"\n"}'
Four failure reasons, each with its own fix:
reason=NoNodeYet (status=Unknown)
CAPI populated status.nodeRef but the Node object is not yet in the API
server. This is usually a few seconds after Machine creation. The Node watch
will re-enqueue us automatically — no action needed. If stuck for > 1 min,
check that CAPI's Machine actually materialised the Node:
kubectl get machine <name>-machine -o jsonpath='{.status.nodeRef}{"\n"}'
kubectl get nodes <node-name>
reason=NodeNotReady (status=False)
Node exists but Ready != True. Kubelet hasn't registered, networking is
degraded, or CNI is failing. Look at the Node's own conditions first:
Fix the underlying Node problem; the controller will re-reconcile on the next
Node Ready transition.
reason=TaintOwnershipConflict (status=False)
An admin taint exists with the same (key, effect) tuple as a declared
spec.nodeTaints entry. The controller refuses to overwrite admin-owned
taints. Inspect the current state:
kubectl get node <node-name> -o jsonpath='{.spec.taints}{"\n"}'
kubectl get node <node-name> \
-o jsonpath='{.metadata.annotations.5spot\.finos\.org/applied-taints}{"\n"}'
Resolve by either removing the admin taint (kubectl taint nodes <node> key:effect-)
or changing the spec.nodeTaints entry so the (key, effect) no longer
collides. Note: the annotation 5spot.finos.org/applied-taints lists the
keys we own; any taint not in that list belongs to the admin.
reason=PatchFailed (status=False)
A non-404, non-conflict API error on the Node PATCH. Check controller logs for
the exact kube error (RBAC rejection, API server unreachable, etc.):
The controller retries with exponential backoff on transient failures. For
RBAC issues, confirm the controller ClusterRole grants patch on nodes.
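The RBAC check can be done without reading the ClusterRole by asking the API server directly. This is a minimal sketch using kubectl auth can-i with impersonation; the controller's service account name is not stated on this page, so it is left as a parameter, and the function name is hypothetical.

```shell
# Hypothetical helper: ask the API server whether a service account may patch Nodes.
# Arguments: <service-account-name> [namespace, defaults to 5spot-system].
# The service account name is an input because this page does not state it.
can_patch_nodes() {
  kubectl auth can-i patch nodes \
    --as="system:serviceaccount:${2:-5spot-system}:$1"
}
```

The command prints yes or no. The caller needs permission to impersonate service accounts for the `--as` flag to work.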
Error Messages¶
"Resource not owned by this instance"¶
Cause: Multi-instance deployment where this resource is assigned to a different instance.
Solution: This is expected behavior. Each instance handles a subset of resources.
"Failed to evaluate schedule"¶
Cause: Invalid schedule configuration.
Solution: Check schedule syntax:
- Days: mon-fri, not monday-friday
- Hours: 9-17, not 9:00-17:00
- Timezone: Valid IANA name like America/New_York
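The day and hour forms listed above can be sanity-checked before a resource is applied. The sketch below only covers the simple range forms shown in the list; the three-letter abbreviations for the other weekdays and the 0 to 23 hour bound are assumptions, the function names are hypothetical, and the controller may accept syntax these checks reject.

```shell
# Hypothetical pre-flight checks for the simple range forms shown above.
# They do not cover any other syntax the controller may accept.

# Accepts three-letter day ranges such as mon-fri (abbreviations assumed).
valid_days_range() {
  printf '%s\n' "$1" |
    grep -Eq '^(mon|tue|wed|thu|fri|sat|sun)-(mon|tue|wed|thu|fri|sat|sun)$'
}

# Accepts whole-hour ranges such as 9-17 (bounds 0 to 23 assumed).
valid_hours_range() {
  printf '%s\n' "$1" |
    grep -Eq '^([01]?[0-9]|2[0-3])-([01]?[0-9]|2[0-3])$'
}
```

Both functions return a non-zero status for the rejected forms in the list, such as monday-friday and 9:00-17:00.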
"Machine creation failed"¶
Cause: CAPI couldn't create the machine.
Solution:
1. Check CAPI logs: kubectl logs -n capi-system -l control-plane=controller-manager
2. Verify infrastructure provider is configured
3. Check bootstrap template validity
Getting Help¶
Collect Debug Information¶
# Operator version
kubectl get deployment -n 5spot-system 5spot-controller -o jsonpath='{.spec.template.spec.containers[0].image}'
# Full controller logs
kubectl logs -n 5spot-system -l app=5spot-controller --all-containers > controller-logs.txt
# ScheduledMachine YAML
kubectl get scheduledmachine <name> -o yaml > scheduledmachine.yaml
# Events
kubectl get events -A --sort-by='.lastTimestamp' > events.txt
Filing Issues¶
When filing a GitHub issue, include:
- 5-Spot version
- Kubernetes version
- CAPI version
- Operator logs (sensitive data redacted)
- ScheduledMachine YAML
- Expected vs actual behavior
Related¶
- Configuration - Operator configuration
- Monitoring - Metrics and health checks
- Machine Lifecycle - Understanding phases