Troubleshooting¶

Common issues and solutions for 5-Spot.

Diagnostic Commands¶

Check Operator Status¶

# Operator pods
kubectl get pods -n 5spot-system

# Operator logs (JSON — pipe through jq for readability)
kubectl logs -n 5spot-system -l app=5spot-controller --tail=100 | jq .

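# Plain-text logs (for quick reads without jq). RUST_LOG_FORMAT is read by the
# controller process, so set it on the Deployment (this restarts the pod);
# prefixing the kubectl command with it has no effect.
kubectl set env -n 5spot-system deployment/5spot-controller RUST_LOG_FORMAT=text
kubectl logs -n 5spot-system -l app=5spot-controller --tail=100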

# Detailed pod info
kubectl describe pod -n 5spot-system -l app=5spot-controller

Filter Logs by Correlation ID¶

Every reconciliation carries a unique reconcile_id field. Use it to isolate all log lines for a single reconciliation attempt:

# Stream logs and filter by resource name, showing reconcile_id
kubectl logs -n 5spot-system -l app=5spot-controller -f | \
  jq -c 'select(.fields.resource == "<machine-name>")'

# Trace a specific reconciliation end-to-end
# (--tail=-1 is needed: with a label selector, kubectl logs returns only the last 10 lines per pod by default)
kubectl logs -n 5spot-system -l app=5spot-controller --tail=-1 | \
  jq -c 'select(.fields.reconcile_id == "<id-from-a-previous-log-line>")'

# Find all Error-phase transitions
kubectl logs -n 5spot-system -l app=5spot-controller --tail=-1 | \
  jq -c 'select(.fields.to_phase == "Error")'

Check ScheduledMachines¶

# List all ScheduledMachines
kubectl get scheduledmachines -A

# Detailed status
kubectl describe scheduledmachine <name>

# Get status as JSON
kubectl get scheduledmachine <name> -o jsonpath='{.status}'

Check CAPI Machines¶

# List CAPI machines
kubectl get machines -A

# Describe machine
kubectl describe machine <name>

Common Issues¶

Machine Stuck in Pending¶

Symptoms:

- Machine stays in Pending phase
- No Machine resource created

Possible Causes:

Schedule not matching current time

# Check current time vs schedule
kubectl get scheduledmachine <name> -o jsonpath='{.spec.schedule}'
date -u  # Compare with UTC

Operator not running
```
kubectl get pods -n 5spot-system
```

RBAC permissions

kubectl auth can-i create machines --as=system:serviceaccount:5spot-system:5spot-controller

Solution:

- Verify schedule matches current time and timezone
- Check controller logs for errors
- Ensure RBAC is correctly configured
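When comparing the clock against a schedule, it can help to print the current time in the same shape the schedule uses. The helper below is a sketch and not part of 5-Spot: `now_for_schedule` is a hypothetical name, and it assumes the lowercase three-letter day names and integer 24-hour values shown under "Failed to evaluate schedule".

```shell
# Hypothetical helper (not shipped with 5-Spot): print "<day> <hour>" for a
# timezone, e.g. "wed 14", for comparison with a mon-fri / 9-17 style schedule.
# Usage: now_for_schedule [IANA-timezone]   (defaults to UTC)
now_for_schedule() {
  zone=${1:-UTC}
  # LC_ALL=C keeps the day abbreviation in English regardless of locale.
  out=$(TZ="$zone" LC_ALL=C date '+%a %H')
  day=$(echo "${out% *}" | tr 'A-Z' 'a-z')
  hour=${out#* }
  # Strip one leading zero so "09" prints as "9" ("00" becomes "0").
  hour=${hour#0}
  echo "$day $hour"
}
```

Run `now_for_schedule America/New_York` and compare the result with the schedule by eye. On many systems a misspelled zone name silently falls back to UTC, so double-check the spelling.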

Machine Not Removing¶

Symptoms:

- Machine stays in Active after schedule window
- Grace period seems to never complete

Possible Causes:

Pods not draining

# Pods still scheduled on the node backing this machine (all namespaces)
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>

PodDisruptionBudget blocking eviction

PDB-blocked evictions (HTTP 429) now surface as a CapiError in the reconciler and will cause the machine to enter the Error phase. Check for blocking PDBs:

kubectl get pdb -A
# Look for PDBs with maxUnavailable: 0 or minAvailable matching current replicas
kubectl get pdb -A -o json | jq '.items[] | {name:.metadata.name, ns:.metadata.namespace, disruptions:.status.disruptionsAllowed}'

Long grace period

kubectl get scheduledmachine <name> -o jsonpath='{.spec.gracefulShutdownTimeout}'

Solution:

- Check for pods that can't be evicted; look for warn log lines with "Pod eviction blocked by PDB (HTTP 429)"
- Review PDB settings; temporarily scale up or relax minAvailable to allow drain
- Consider using killSwitch: true for immediate removal (bypasses drain)
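Whether a PDB blocks drain comes down to simple arithmetic: for a minAvailable budget, Kubernetes allows the number of currently healthy pods minus minAvailable disruptions, floored at zero. The sketch below reproduces that calculation. `disruptions_allowed` is a hypothetical helper name, and it handles only integer minAvailable values, not percentages or maxUnavailable.

```shell
# Hypothetical helper: how many evictions a minAvailable PDB permits right now.
# Usage: disruptions_allowed <current-healthy-pods> <minAvailable>
disruptions_allowed() {
  allowed=$(( $1 - $2 ))
  # A budget can never allow a negative number of disruptions.
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}
```

`disruptions_allowed 3 3` prints 0, which is the "minAvailable matching current replicas" case: every eviction is refused with HTTP 429 until the workload is scaled up or minAvailable is lowered.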

Schedule Not Evaluating¶

Symptoms:

- Machine doesn't activate during schedule window
- No status changes

Possible Causes:

Machine disabled

kubectl get scheduledmachine <name> -o jsonpath='{.spec.enabled}'

Timezone mismatch (on the referenced spot-schedule provider)

# spec.schedule references a provider (e.g. TimeBasedSpotSchedule); check its timezone
kubectl get scheduledmachine <name> -o jsonpath='{.spec.schedule.name}'
TZ=<timezone> date  # Check time in that timezone

Multi-instance: wrong instance handling resource

# Check which instance should handle this resource
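# --prefix tags each line with its source pod; --tail=-1 overrides the 10-line default that applies with -l
kubectl logs -n 5spot-system -l app=5spot-controller --tail=-1 --prefix | grep <resource-name>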

Solution:

- Ensure enabled: true
- Verify timezone is correct
- Check controller instance distribution

CAPI Integration Errors¶

Symptoms:

- Error events on ScheduledMachine
- CAPI Machine not being created

Possible Causes:

Invalid bootstrapRef or infrastructureRef

kubectl get scheduledmachine <name> -o jsonpath='{.spec.bootstrapRef}'
kubectl get <kind> <name> -n <namespace>  # Verify reference exists

CAPI provider not ready

kubectl get pods -n capi-system
kubectl get pods -n capi-kubeadm-bootstrap-system

Solution:

- Verify references point to existing resources
- Check CAPI provider health
- Review CAPI controller logs

Reconciliation Retrying with Increasing Delay¶

Symptoms:

- Repeated error events on a ScheduledMachine
- Logs show retry_count climbing and backoff_secs growing (30 → 60 → 120 → 240 → 300)

Cause: The controller uses bounded exponential back-off. Each consecutive failure doubles the retry delay up to 300 s (5 min). The counter resets after a successful reconciliation.

# Watch the retry_count and backoff_secs fields
kubectl logs -n 5spot-system -l app=5spot-controller -f | \
  jq -c 'select(.fields.resource == "<machine-name>") | {retry: .fields.retry_count, backoff: .fields.backoff_secs, error: .fields.error}'
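To judge how long a resource will wait before its next attempt, the documented schedule can be reproduced offline. This is a minimal sketch, not the controller's code: `backoff_secs` is a hypothetical helper, and the 30 s base delay and zero-based retry count are inferred from the 30 → 60 → 120 → 240 → 300 sequence above.

```shell
# Hypothetical helper: delay in seconds before the next retry.
# Usage: backoff_secs <retry-count>   (0 = first retry; assumed zero-based)
backoff_secs() {
  retry=$1
  delay=30      # assumed base delay, inferred from the documented sequence
  max=300       # documented cap (5 min)
  i=0
  # Double once per previous failure, stopping early once the cap is reached.
  while [ "$i" -lt "$retry" ] && [ "$delay" -lt "$max" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$max" ]; then
    delay=$max
  fi
  echo "$delay"
}
```

Under these assumptions `backoff_secs 3` prints 240, and any count of 4 or more prints 300.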

Solution:

- Check the underlying error causing repeated failures (CAPI, schedule, validation)
- Once the root cause is fixed, the next successful reconciliation resets the counter
- If the resource is stuck at max backoff (300 s), fix the underlying issue and patch the resource to trigger an immediate reconcile:

kubectl annotate scheduledmachine <name> 5spot.finos.org/force-reconcile="$(date -u +%s)" --overwrite

Orphan resources after finalizer timeout¶

Symptom:

- fivespot_finalizer_cleanup_timeouts_total increments above zero.
- A kubectl describe scheduledmachine <name> shows a Warning event with reason FinalizerCleanupTimedOut.
- The ScheduledMachine has been deleted (gone from kubectl get) but a CAPI Machine, bootstrap resource, or infrastructure resource may still exist in the namespace.

Root cause. handle_deletion wraps CAPI cleanup in a hard timeout (FINALIZER_CLEANUP_TIMEOUT_SECS, default 600s / 10 minutes) so a hung eviction cannot stall namespace deletion. By default (--force-finalizer-on-timeout=true, env FORCE_FINALIZER_ON_TIMEOUT=true) the controller force-removes its finalizer when the timeout fires — unblocking namespace deletion at the cost of potentially leaving CAPI resources without a managing ScheduledMachine. The most common trigger is a misconfigured Pod Disruption Budget (e.g. minAvailable: 999) on a workload the controller is trying to evict during node drain.

Runbook.

Find the orphaned CAPI Machine:

# Step 1: list Machines in this namespace that declare a ScheduledMachine owner.
kubectl get machines.cluster.x-k8s.io -n <ns> -o json \
  | jq -r '.items[] | select(.metadata.ownerReferences[]?.kind == "ScheduledMachine")
           | .metadata.name'
# Step 2: report the Machines whose owning ScheduledMachine no longer exists.
for m in $(kubectl get machines.cluster.x-k8s.io -n <ns> -o name); do
  owner=$(kubectl get "$m" -n <ns> -o jsonpath='{.metadata.ownerReferences[?(@.kind=="ScheduledMachine")].name}')
  if [ -n "$owner" ] && ! kubectl get scheduledmachine "$owner" -n <ns> >/dev/null 2>&1; then
    echo "ORPHAN: $m (was owned by $owner)"
  fi
done

Identify the bootstrap and infrastructure resources the orphan Machine references:

kubectl get machine.cluster.x-k8s.io <orphan-name> -n <ns> -o jsonpath='{.spec.bootstrap.configRef}'
kubectl get machine.cluster.x-k8s.io <orphan-name> -n <ns> -o jsonpath='{.spec.infrastructureRef}'

Delete the orphan Machine first; CAPI cascades into the bootstrap / infra refs via ownerReferences:
```
kubectl delete machine.cluster.x-k8s.io <orphan-name> -n <ns>
```
If the Machine itself is stuck terminating (drain still blocked), inspect Pods + PDBs on the underlying Node and remove the offending PDB before retrying.

Verify nothing is left behind:

kubectl get machines.cluster.x-k8s.io,k0sworkerconfigs.k0smotron.io,remotemachines.k0smotron.io -n <ns>

Prevention.

Alert on rate(fivespot_finalizer_cleanup_timeouts_total[5m]) > 0.
Validate Pod Disruption Budgets at admission (CEL ValidatingAdmissionPolicy) — reject minAvailable values that exceed the workload's replica count.
For environments where stalled SMs are preferable to potential orphans (e.g. a sweep job is in place), set --force-finalizer-on-timeout=false (env FORCE_FINALIZER_ON_TIMEOUT=false). The metric and Warning event fire in both modes; the only difference is whether the finalizer is removed on timeout. Strict mode requires an external sweep to garbage-collect SMs whose drain is permanently blocked, otherwise namespace deletion stalls indefinitely.

Emergency Reclaim (Kill Switch)¶

See Emergency Reclaim for the full lifecycle. This section covers the diagnostic angles most operators hit in the field.

ScheduledMachine stuck in `EmergencyRemove`¶

Symptoms:

kubectl get scheduledmachine shows PHASE=EmergencyRemove and does not move to Disabled.
The node still appears in the cluster.

Diagnosis:

# Is the reclaim annotation still on the Node? (expected during eject, cleared at end)
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}' | jq \
  'with_entries(select(.key | startswith("5spot.finos.org/reclaim")))'

# Controller logs for the emergency-remove handler
kubectl logs -n 5spot-system -l app=5spot-controller --tail=200 | \
  jq -c 'select(.fields.phase == "EmergencyRemove")'

# Events on the ScheduledMachine
kubectl describe scheduledmachine/<name> | grep -A 5 Events

Common causes:

Drain is blocked by non-evictable pods. The handler uses --force --disable-eviction, so this should be rare — if it happens, a pod is probably stuck in Terminating waiting on a finalizer of its own.
CAPI Machine deletion is blocked. Check kubectl describe machine/<machine-name> for a finalizer that has not been cleared.
Controller crashed mid-handler. On restart the annotation is still there (cleared last), so the handler will retry from the top — the operation is idempotent.

Node keeps getting ejected every schedule window¶

Symptom: The ScheduledMachine cycles Disabled → Pending → Active → EmergencyRemove → Disabled → ... at every schedule boundary.

Cause: The user re-enabled the schedule without first quitting the matched process, so the process is still running and the agent correctly fired again on its next poll.

Confirm:

# Check what the agent matched on
kubectl logs -n 5spot-system -l app=5spot-reclaim-agent --tail=50 | jq -c 'select(.fields.matched_pattern)'

# Check the condition reason on the ScheduledMachine
kubectl get scheduledmachine/<name> -o jsonpath='{.status.conditions}' | jq \
  '.[] | select(.reason == "EmergencyReclaimDisabledSchedule")'

Solution: Quit the matched process on the node, then re-enable:

kubectl patch scheduledmachine/<name> --type merge \
  -p '{"spec":{"enabled":true}}'

If the user does not want this node in the reclaim path at all, clear killIfCommands:

kubectl patch scheduledmachine/<name> --type merge \
  -p '{"spec":{"killIfCommands":null}}'

Reclaim agent never fires on a known-matching process¶

Symptoms: User has a matching process running, but the Node never gets annotated.

Checklist:

Is the agent pod actually running on the node?
```
kubectl get pods -n 5spot-system -l app=5spot-reclaim-agent -o wide
```
If no pod lands on the target node, the 5spot.finos.org/reclaim-agent=enabled label is probably missing. Check the node labels:
```
kubectl get node <node-name> --show-labels | grep reclaim-agent
```
Is the per-node ConfigMap present and readable? The agent no longer mounts its config from a file — it watches the per-node ConfigMap named reclaim-agent-<NODE_NAME> in 5spot-system via the kube API and hot-reloads on every change. Check the ConfigMap directly:
```
kubectl get cm -n 5spot-system reclaim-agent-<node-name> -o jsonpath='{.data.reclaim\.toml}'
```
Missing ConfigMap → agent idles (no proc scanning) until one appears. Empty match_commands + empty match_argv_substrings = agent is armed but inert (never matches) by design. The agent logs configmap applied — rearming scanner at INFO on every observed change; tail the pod logs to confirm it sees yours:
```
kubectl logs -n 5spot-system <agent-pod> | grep configmap
```
Is the agent reading real /proc?
```
kubectl exec -n 5spot-system <agent-pod> -- ls /host/proc | head
```
Expect many numeric directory names. If you only see 1 and self, the agent is looking at its own PID namespace rather than the host's: re-check hostPID: true and the /host/proc mount in the DaemonSet template.
Match is case-sensitive. match_commands = ["Java"] does not match a java process. Lowercase the pattern to match the typical JVM binary name.
The agent only reads /proc/<pid>/comm (exact basename) and /proc/<pid>/cmdline (substring). A process whose comm is java-wrapper but argv starts with /opt/jdk/bin/java ... matches on cmdline (substring), not on comm (exact).
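The two matching rules in the last checklist items can be sketched in shell. This illustrates the documented semantics only and is not the agent's implementation; `comm_matches` and `cmdline_matches` are hypothetical names.

```shell
# Sketch of the documented matching rules; not the agent's actual code.

# Exact, case-sensitive comparison, as applied to /proc/<pid>/comm.
# Usage: comm_matches <comm> <pattern>
comm_matches() {
  [ "$1" = "$2" ]
}

# Case-sensitive substring test, as applied to /proc/<pid>/cmdline.
# Usage: cmdline_matches <cmdline> <substring>
cmdline_matches() {
  case "$1" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# To try it against a live process (cmdline is NUL-separated on Linux):
#   comm_matches "$(cat /proc/<pid>/comm)" java
#   cmdline_matches "$(tr '\0' ' ' < /proc/<pid>/cmdline)" /opt/jdk/bin/java
```

With these rules, `comm_matches java-wrapper java` fails while `cmdline_matches "/opt/jdk/bin/java -jar app.jar" java` succeeds, which is the java-wrapper case described above.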

Reclaim agent crash-loops with `EPERM` on netlink socket¶

Symptom: kubectl logs -n 5spot-system <reclaim-agent-pod> shows the agent exiting at startup with an error like:

netlink i/o: Operation not permitted (os error 1)

or Subscriber::new() failed: netlink i/o: EPERM.

Cause: --detector=netlink is in effect (the default on Linux), but the pod's container does not have CAP_NET_ADMIN. The kernel refuses to bind an AF_NETLINK socket to the CN_IDX_PROC multicast group without it.

Fix (one of):

Confirm the cap is in the manifest — deploy/node-agent/daemonset.yaml should grant NET_ADMIN under the container's securityContext.capabilities.add. If a downstream patch removed it, restore:

kubectl get ds -n 5spot-system 5spot-reclaim-agent \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext.capabilities}'
# Expect: {"add":["NET_ADMIN"],"drop":["ALL"]}

Force poll mode if granting CAP_NET_ADMIN is not acceptable in your environment. The agent's rung 1 (/proc poll) detects the same processes at higher latency (≤250 ms vs <10 ms) and zero added capability:

kubectl set env -n 5spot-system ds/5spot-reclaim-agent \
  RECLAIM_DETECTOR=poll

See Detector — rung 1 vs rung 2 for the tradeoff.

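Pod Security Admission can block the agent. PSA rejects non-compliant pods rather than stripping capabilities from them, and NET_ADMIN is outside the capability sets allowed by both the restricted and baseline profiles (baseline also disallows hostPID). The DaemonSet already departs from the restricted profile in other ways (for example runAsNonRoot=false), so if PSA is enforcing restricted or baseline on 5spot-system, set the namespace's enforce level to privileged or add an exemption. The reclaim-agent's privileges are architecturally bounded by its opt-in nodeSelector.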
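The capability check from the first fix can be scripted so it yields a pass/fail result. This is a rough sketch: `has_net_admin` is a hypothetical helper, and it assumes the compact JSON that kubectl's jsonpath output prints, as in the "Expect" comment above.

```shell
# Hypothetical helper: succeed only when NET_ADMIN is in the "add" array.
# Usage:
#   kubectl get ds -n 5spot-system 5spot-reclaim-agent \
#     -o jsonpath='{.spec.template.spec.containers[0].securityContext.capabilities}' \
#     | has_net_admin
has_net_admin() {
  # Extract the contents of the "add" array, then look for the capability.
  # A NET_ADMIN entry that appears only under "drop" does not count.
  sed -n 's/.*"add":\[\([^]]*\)].*/\1/p' | grep -q '"NET_ADMIN"'
}
```

If the check fails, either restore the capability in the DaemonSet or switch the agent to RECLAIM_DETECTOR=poll.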

Reclaim agent runs but never observes any events¶

Symptom: Agent pod is Running, no errors in logs, host has exec activity (e.g. user is launching processes), but kubectl describe node <name> never shows the reclaim annotations even with a known-matching killIfCommands list.

Possible cause: kernel built without CONFIG_PROC_EVENTS. This is rare but possible on heavily-stripped distro kernels (some embedded / hardened images). The netlink socket opens cleanly but the kernel never pushes events.

Diagnose:

# Inspect the running kernel's config (if /proc/config.gz is present)
zgrep CONFIG_PROC_EVENTS /proc/config.gz
# Or grep the booted kernel's config file
grep CONFIG_PROC_EVENTS /boot/config-$(uname -r)
# Expect: CONFIG_PROC_EVENTS=y

Fix: switch to --detector=poll (or RECLAIM_DETECTOR=poll). The rung-1 poll does not depend on the kernel feature.

Rapid re-reclaim loop — `RapidReReclaim` Warning Event¶

Symptom: kubectl get events shows multiple RapidReReclaim warnings on the same ScheduledMachine within minutes; the controller log carries the corresponding warning lines; the fivespot_rapid_re_reclaims_total{namespace, name} metric is incrementing.

Meaning: The controller has observed ≥3 emergency-reclaim events for this SM within 10 minutes (RAPID_RE_RECLAIM_THRESHOLD = 3, RAPID_RE_RECLAIM_WINDOW_SECS = 600). Almost always: a user is re-enabling the machine (spec.enabled=true) before they have stopped the conflicting process the agent is matching on. The agent fires again the moment the Node rejoins, and the loop repeats.
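The threshold logic can be checked by hand against a list of event timestamps. The sketch below mirrors the two documented constants. `is_rapid_re_reclaim` is a hypothetical helper, and treating the window boundary as inclusive is an assumption, not something the source states.

```shell
# Constants as documented above.
RAPID_RE_RECLAIM_THRESHOLD=3
RAPID_RE_RECLAIM_WINDOW_SECS=600

# Hypothetical helper: succeed when enough events fall inside the window.
# Usage: is_rapid_re_reclaim <now-epoch> <event-epoch>...
is_rapid_re_reclaim() {
  now=$1
  shift
  count=0
  for ts in "$@"; do
    age=$((now - ts))
    # Count events no older than the window (inclusive boundary is assumed).
    if [ "$age" -ge 0 ] && [ "$age" -le "$RAPID_RE_RECLAIM_WINDOW_SECS" ]; then
      count=$((count + 1))
    fi
  done
  [ "$count" -ge "$RAPID_RE_RECLAIM_THRESHOLD" ]
}
```

Convert event timestamps to epoch seconds first (for example with `date -d <timestamp> +%s` on GNU systems), then pass them after the current time.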

Diagnose:

# 1. Find the offending SM and pull its emergency-reclaim history
kubectl get events --field-selector reason=EmergencyReclaim \
  --sort-by='.lastTimestamp' \
  -o jsonpath='{range .items[*]}{.lastTimestamp} {.involvedObject.namespace}/{.involvedObject.name} {.message}{"\n"}{end}'

# 2. Check what `killIfCommands` patterns are configured
kubectl get scheduledmachine <name> -o jsonpath='{.spec.killIfCommands}'

# 3. Walk the agent log to see what was matched on the most recent fire
kubectl logs -n 5spot-system -l app=5spot-reclaim-agent --tail=100 \
  | jq -c 'select(.fields.matched_pattern)'

Fix paths (operator-side):

Confirm with the user that the conflicting process is gone before they re-enable. The agent fired because the user's own work on the workstation took priority, so check rather than assume: ps aux | grep <pattern> on the host (or kubectl debug node/<node-name> -it --image=alpine if no host shell) is the canonical confirmation.
Adjust killIfCommands if the pattern is over-matching (e.g. java matches every JVM, including the user's IDE that may be acceptable). Switch to the more specific match_argv_substrings form via direct ConfigMap edit, or narrow the basename.
Suppress the loop temporarily by clearing the spec:
```
kubectl patch scheduledmachine <name> --type=merge \
  -p '{"spec":{"killIfCommands":null}}'
```
This tears down the per-node ConfigMap + label, which evicts the agent pod from the Node, which means no further reclaim fires even if the user starts a matching process.

Why no auto-resume: the controller deliberately does not flip spec.enabled back to true automatically when the matching process exits. Doing so would invite races between the Node rejoining and the agent restarting, and would silently mask the user behaviour the warning is trying to surface. Re-enable is explicit by design.

`EmergencyReclaim` event fires but schedule is not disabled¶

Symptom: The EmergencyReclaim event is on the ScheduledMachine, but spec.enabled is still true.

This indicates the controller crashed between the drain/delete steps and the enabled=false PATCH. The Node annotation is cleared after the PATCH, so the controller will see the annotation on the next reconcile and retry. If it does not, check:

# Is the EmergencyReclaimDisabledSchedule event present?
kubectl get events --field-selector reason=EmergencyReclaimDisabledSchedule \
  --sort-by='.lastTimestamp'

# If yes, but spec.enabled is still true, the PATCH may have lost a race
# with a user edit. Check the generation on the ScheduledMachine:
kubectl get scheduledmachine/<name> -o jsonpath='{.metadata.generation} {.status.observedGeneration}'
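By the usual Kubernetes convention, status.observedGeneration trailing metadata.generation means the controller has not yet processed the latest spec. The sketch below turns the two numbers printed by the command above into a verdict; `generation_lag` is a hypothetical helper name.

```shell
# Hypothetical helper: interpret "<generation> <observedGeneration>" on stdin.
# Usage:
#   kubectl get scheduledmachine/<name> \
#     -o jsonpath='{.metadata.generation} {.status.observedGeneration}' | generation_lag
generation_lag() {
  # jsonpath output has no trailing newline, so read may return non-zero
  # even though both variables were assigned.
  read -r gen observed || true
  gen=${gen:-0}
  observed=${observed:-0}   # a missing status field is treated as 0
  if [ "$gen" -gt "$observed" ]; then
    echo "controller is $((gen - observed)) generation(s) behind"
  else
    echo "controller has observed the latest spec"
  fi
}
```

A lag that persists across several reconcile intervals suggests the controller is not processing the resource at all, while a brief lag right after a user edit is consistent with the lost-race case described above.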

Node Taints¶

Taints not appearing on Node¶

spec.nodeTaints is declared on a ScheduledMachine, the machine is Active, but the Node does not have the expected taints. Walk the NodeTainted condition on the CR first — it tells you exactly which layer is failing.

kubectl get scheduledmachine <name> \
  -o jsonpath='{.status.conditions[?(@.type=="NodeTainted")]}{"\n"}'

Four failure reasons, each with its own fix:

reason=NoNodeYet (status=Unknown) CAPI populated status.nodeRef but the Node object is not yet in the API server. This is usually a few seconds after Machine creation. The Node watch will re-enqueue us automatically — no action needed. If stuck for > 1 min, check that CAPI's Machine actually materialised the Node:

kubectl get machine <name>-machine -o jsonpath='{.status.nodeRef}{"\n"}'
kubectl get nodes <node-name>

reason=NodeNotReady (status=False) Node exists but Ready != True. Kubelet hasn't registered, networking is degraded, or CNI is failing. Look at the Node's own conditions first:

kubectl describe node <node-name> | sed -n '/Conditions:/,/Addresses:/p'

Fix the underlying Node problem; the controller will re-reconcile on the next Node Ready transition.

reason=TaintOwnershipConflict (status=False) An admin taint exists with the same (key, effect) tuple as a declared spec.nodeTaints entry. The controller refuses to overwrite admin-owned taints. Inspect the current state:

kubectl get node <node-name> -o jsonpath='{.spec.taints}{"\n"}'
kubectl get node <node-name> \
  -o jsonpath='{.metadata.annotations.5spot\.finos\.org/applied-taints}{"\n"}'

Resolve by either removing the admin taint (kubectl taint nodes <node> key:effect-) or changing the spec.nodeTaints entry so the (key, effect) no longer collides. Note: the annotation 5spot.finos.org/applied-taints lists the keys we own; any taint not in that list belongs to the admin.
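The ownership rule can be sketched as a set comparison. This is an illustration only: `taint_conflicts` is a hypothetical helper, and it assumes every taint is written as a key:effect token and that the owned set is supplied in the same form. The real format of the 5spot.finos.org/applied-taints annotation may differ.

```shell
# Hypothetical helper: print declared taints that collide with admin-owned ones.
# Usage: taint_conflicts "<declared>" "<on-node>" "<owned>"
# Each argument is a space-separated list of key:effect tokens (assumed format).
taint_conflicts() {
  for declared in $1; do
    for present in $2; do
      [ "$declared" = "$present" ] || continue
      case " $3 " in
        *" $declared "*) ;;          # controller already owns it: no conflict
        *) echo "$declared" ;;       # same (key, effect) but admin-owned
      esac
    done
  done
}
```

For example, `taint_conflicts "spot:NoSchedule gpu:NoExecute" "spot:NoSchedule gpu:NoExecute" "gpu:NoExecute"` prints spot:NoSchedule: the Node carries that taint but the controller does not own it.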

reason=PatchFailed (status=False) A non-404, non-conflict API error on the Node PATCH. Check controller logs for the exact kube error (RBAC rejection, API server unreachable, etc.):

kubectl logs -n 5spot-system -l app=5spot-controller --tail=-1 \
  | grep -E "node_taint|appliedNodeTaints"

The controller retries with exponential backoff on transient failures. For RBAC issues, confirm the controller ClusterRole grants patch on nodes.

Error Messages¶

"Resource not owned by this instance"¶

Cause: Multi-instance deployment where this resource is assigned to a different instance.

Solution: This is expected behavior. Each instance handles a subset of resources.

"Failed to evaluate schedule"¶

Cause: Invalid schedule configuration.

Solution: Check schedule syntax:

- Days: mon-fri, not monday-friday
- Hours: 9-17, not 9:00-17:00
- Timezone: Valid IANA name like America/New_York
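A quick pre-flight check can catch the two most common syntax slips before the controller does. The patterns below are a sketch that covers only the forms shown above (a single three-letter day or day range, and an integer hour range). `valid_days` and `valid_hours` are hypothetical helpers, and the real schedule grammar may accept more than this.

```shell
# Hypothetical validators for the day and hour forms shown above.

# Accepts "mon" or "mon-fri"; rejects full names such as "monday-friday".
valid_days() {
  echo "$1" | grep -Eq '^(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?$'
}

# Accepts "9-17" (integer hours 0-23); rejects clock times such as "9:00-17:00".
valid_hours() {
  echo "$1" | grep -Eq '^([0-9]|1[0-9]|2[0-3])-([0-9]|1[0-9]|2[0-3])$'
}
```

Both functions report through their exit status, so they compose with `&&` or `if`, for example `valid_days mon-fri && valid_hours 9-17 && echo ok`.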

"Machine creation failed"¶

Cause: CAPI couldn't create the machine.

Solution:

1. Check CAPI logs: kubectl logs -n capi-system -l control-plane=controller-manager
2. Verify infrastructure provider is configured
3. Check bootstrap template validity

Getting Help¶

Collect Debug Information¶

# Operator version
kubectl get deployment -n 5spot-system 5spot-controller -o jsonpath='{.spec.template.spec.containers[0].image}'

# Full controller logs (--tail=-1 overrides the 10-line default that applies with -l)
kubectl logs -n 5spot-system -l app=5spot-controller --all-containers --tail=-1 > controller-logs.txt

# ScheduledMachine YAML
kubectl get scheduledmachine <name> -o yaml > scheduledmachine.yaml

# Events
kubectl get events -A --sort-by='.lastTimestamp' > events.txt

Filing Issues¶

When filing a GitHub issue, include:

5-Spot version
Kubernetes version
CAPI version
Operator logs (sensitive data redacted)
ScheduledMachine YAML
Expected vs actual behavior

Related¶

Configuration - Operator configuration
Monitoring - Metrics and health checks
Machine Lifecycle - Understanding phases
