Monitoring¶

5-Spot exposes health endpoints, Prometheus metrics, structured JSON logs, and Kubernetes Events for monitoring.

Health Endpoints¶

Liveness Probe¶

GET /health
Port: 8081 (default)

Returns 200 OK if the controller is alive.

Readiness Probe¶

GET /ready
Port: 8081 (default)

Returns 200 OK if the controller is ready to accept work.

Kubernetes Configuration¶

livenessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
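
For a quick manual check outside the kubelet, the same two endpoints can be polled from a script. The following is a minimal sketch, not part of 5-Spot: the `http_status` and `check_probes` helpers and the host name in the usage note are assumptions for illustration; only the paths and the 8081 default port come from the section above.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def check_probes(host, port=8081, fetch=http_status):
    """Map each probe path to True when it answers 200 OK.

    `fetch` is injectable so the logic can be exercised without a live pod.
    """
    return {
        path: fetch(f"http://{host}:{port}{path}") == 200
        for path in ("/health", "/ready")
    }
```

With a port-forward to the controller pod in place, `check_probes("localhost")` would return a dictionary keyed by `/health` and `/ready`.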

Prometheus Metrics¶

Endpoint¶

GET /metrics
Port: 8080 (default)

Available Metrics¶

All metrics use the fivespot_ prefix. The full list lives in src/metrics.rs; the tables below are the operator-facing summary.

Reconciler¶

Metric	Type	Labels	Description
`fivespot_reconciliations_total`	Counter	`phase`, `result`	Reconciliation attempts
`fivespot_reconciliation_duration_seconds`	Histogram	`phase`	Reconciliation latency (s)
`fivespot_machines_active`	Gauge	—	Machines currently in `Active` phase
`fivespot_machines_by_phase`	Gauge	`phase`	Machines per lifecycle phase
`fivespot_schedule_evaluations_total`	Counter	`result`	Schedule evaluations by outcome
`fivespot_kill_switch_activations_total`	Counter	—	Kill-switch activations
`fivespot_controller_info`	Gauge	`version`, `instance_id`	Always 1; carries label metadata
`fivespot_is_leader`	Gauge	—	1 if this instance holds the leader lease
`fivespot_errors_total`	Counter	`error_type`	Errors by type
`fivespot_finalizer_cleanup_timeouts_total`	Counter	—	Finalizer cleanup timeouts (force-removed; possible orphans)
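
To spot-check these series without a Prometheus server, the `/metrics` payload can be parsed directly. The sketch below handles a simplified subset of the Prometheus text exposition format (no escaped quotes or `}` inside label values). The `SAMPLE` payload and its values are invented for illustration, and `parse_metrics` is a hypothetical helper, not something 5-Spot ships.

```python
import re

# name, optional {labels}, value, optional integer timestamp
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical scrape output; real values will differ.
SAMPLE = """\
# HELP fivespot_machines_by_phase Machines per lifecycle phase
# TYPE fivespot_machines_by_phase gauge
fivespot_machines_by_phase{phase="Active"} 3
fivespot_machines_by_phase{phase="Inactive"} 5
fivespot_is_leader 1
process_cpu_seconds_total 12.5
"""


def parse_metrics(text, prefix="fivespot_"):
    """Return (name, labels, value) tuples for samples whose name has prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or not match.group("name").startswith(prefix):
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

Calling `parse_metrics(SAMPLE)` keeps the three `fivespot_` samples and drops the unrelated process metric.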

Spot-schedule providers¶

Emitted for ScheduledMachines — spec.schedule is always a spot-schedule provider reference (ADR 0009). Labels are bounded by namespace × provider kind — never the (unbounded) provider or machine name.

Metric	Type	Labels	Description
`fivespot_spot_schedule_resolutions_total`	Counter	`namespace`, `kind`, `result`	Provider resolutions; `result={active|inactive|unresolved}`
`fivespot_spot_schedule_resolution_errors_total`	Counter	`namespace`, `kind`, `reason`	Unresolved resolutions by `reason` (`ProviderCRDNotInstalled`, `ProviderNotFound`, `StatusActiveMissing`, `ProviderNotReady`) — the hold-last-state signal to alert on
`fivespot_spot_schedule_transitions_total`	Counter	`namespace`, `kind`	Provider `active`⇄`inactive` transitions; a high rate is the flapping signal

Suggested alerts (threat-model D5 flapping + hold-last-state visibility):

# Provider flapping — many active⇄inactive transitions churn machine
# create/delete. Tune the threshold to your machine provisioning cost.
- alert: SpotScheduleProviderFlapping
  expr: sum by (namespace, kind) (rate(fivespot_spot_schedule_transitions_total[15m])) > 0.2
  for: 15m
  labels: { severity: warning }
  annotations:
    summary: "Spot-schedule provider {{ $labels.kind }} is flapping in {{ $labels.namespace }}"

# A provider has been unresolvable for a while — referencing machines are
# holding last-known state (or fail-inactive if never resolved).
- alert: SpotScheduleProviderUnresolved
  expr: sum by (namespace, kind, reason) (rate(fivespot_spot_schedule_resolution_errors_total[10m])) > 0
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "Spot-schedule provider {{ $labels.kind }} unresolved ({{ $labels.reason }}) in {{ $labels.namespace }}"
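
When tuning these thresholds, keep in mind that PromQL's `rate()` returns a per-second average over the range window. The sketch below converts between a per-second threshold and an event count per window; both helper names are illustrative.

```python
def events_per_window(rate_per_second, window_seconds):
    """Number of events implied by a sustained per-second rate over a window."""
    return rate_per_second * window_seconds


def rate_for_events(events, window_seconds):
    """Per-second rate threshold equivalent to `events` occurrences per window."""
    return events / window_seconds


# The flapping alert above: threshold 0.2 over a 15-minute window.
FLAPPING_EVENTS = events_per_window(0.2, 15 * 60)

# A hypothetical stricter target: three transitions per 15 minutes.
STRICT_THRESHOLD = rate_for_events(3, 15 * 60)
```

Read this way, the `> 0.2` threshold corresponds to roughly 180 transitions per (namespace, kind) within 15 minutes, while a target of three transitions in the same window would be a threshold of about 0.0033.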

Node drain & eviction¶

Metric	Type	Labels	Description
`fivespot_node_drains_total`	Counter	`result`	Node drain attempts
`fivespot_pod_evictions_total`	Counter	`result`	Pod eviction attempts during drain

Emergency reclaim (process-match)¶

Metric	Type	Labels	Description
`fivespot_emergency_drain_duration_seconds`	Histogram	`outcome`	Wall-clock duration of emergency-reclaim drains. `outcome={success|timeout|error}`
`fivespot_emergency_reclaims_total`	Counter	`namespace`, `name`	Emergency-reclaim events fired per ScheduledMachine
`fivespot_rapid_re_reclaims_total`	Counter	`namespace`, `name`	`RapidReReclaim` warnings emitted per ScheduledMachine (loop-protection — see Emergency reclaim concept)

fivespot_emergency_drain_duration_seconds buckets are sized for the 60 s EMERGENCY_DRAIN_TIMEOUT_SECS ceiling: [0.5, 1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 45.0, 60.0, 90.0] seconds. The outcome label lets dashboards compute success-only P95 and timeout-rate side by side without mixing them in the same query.
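
To see how these bucket boundaries shape a percentile estimate, here is a sketch of the linear interpolation that PromQL's `histogram_quantile` applies to classic histograms. The bucket bounds are the ones listed above; `EXAMPLE_COUNTS` is invented data, and the function is a simplified illustration rather than the Prometheus implementation.

```python
import math

# Finite bucket upper bounds quoted above; Prometheus adds the +Inf bucket.
BUCKETS = [0.5, 1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 45.0, 60.0, 90.0]

# Hypothetical cumulative counts for 100 drains (last entry is +Inf).
EXAMPLE_COUNTS = [0, 5, 20, 50, 80, 90, 94, 98, 99, 100, 100, 100]


def histogram_quantile(q, cumulative_counts, bounds=BUCKETS):
    """Estimate quantile q (0 < q <= 1) from cumulative bucket counts."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    uppers = list(bounds) + [math.inf]
    if len(cumulative_counts) != len(uppers):
        raise ValueError("expected one count per bound plus the +Inf bucket")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = next(idx for idx, cum in enumerate(cumulative_counts) if cum >= rank)
    if math.isinf(uppers[i]):
        return uppers[-2]  # rank falls in +Inf: report the last finite bound
    lower = uppers[i - 1] if i > 0 else 0.0
    below = cumulative_counts[i - 1] if i > 0 else 0
    in_bucket = cumulative_counts[i] - below
    # Assume observations are spread evenly inside the bucket.
    return lower + (uppers[i] - lower) * (rank - below) / in_bucket
```

With the example counts, the P95 rank lands in the 20 s to 30 s bucket, so the estimate is interpolated between those two bounds. The estimate can never exceed the last finite bound (90 s), which is why a bucket above the 60 s ceiling is useful for spotting overruns.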

Kata config delivery (node agent)¶

Exposed by each 5spot-kata-config-agent DaemonSet pod on its own :8080/metrics (the pod template carries prometheus.io/scrape annotations), not by the controller:

Metric	Type	Labels	Description
`fivespot_kata_config_writes_total`	Counter	—	Kata drop-in files written to the host (rollouts and drift corrections)
`fivespot_kata_config_deletes_total`	Counter	—	Drop-in files removed from the host (GitOps tear-down)
`fivespot_kata_config_drift_corrected_total`	Counter	—	Out-of-band edits rewritten without a service restart. Sustained non-zero rate ⇒ something on the node keeps editing the file
`fivespot_kata_config_restarts_total`	Counter	—	Host k0s-service restarts issued via `nsenter`. Expect exactly one per distinct config change per node; more ⇒ restart loop
`fivespot_kata_config_sync_errors_total`	Counter	—	Failed reconcile ticks (API fetch, host I/O, annotation PATCH, restart)
`fivespot_kata_config_last_sync_timestamp_seconds`	Gauge	—	Unix time of the last successful reconcile tick. Alert when `time() - this` exceeds a few poll intervals (default poll: 30 s)

See the Kata config delivery concept for the architecture and the restart-loop guard these metrics observe.
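
The staleness rule suggested for `fivespot_kata_config_last_sync_timestamp_seconds` can be sketched as a small predicate. The 30 s poll interval is the documented default; the tolerance of three missed polls is an assumption standing in for "a few poll intervals", and `sync_is_stale` is a hypothetical helper.

```python
import time


def sync_is_stale(last_sync_ts, now=None, poll_interval=30.0, max_missed=3):
    """True when the last successful sync is older than max_missed polls.

    Roughly equivalent to the PromQL expression
    time() - fivespot_kata_config_last_sync_timestamp_seconds > 90
    when the defaults are used.
    """
    now = time.time() if now is None else now
    return (now - last_sync_ts) > poll_interval * max_missed
```

Passing `now` explicitly keeps the check deterministic in tests; in normal use it defaults to the current Unix time.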

CapitalMarketsSchedule provider¶

Exposed by the spot-schedule-capital-markets provider controller on its own :8080/metrics (ADR 0006, Phase 5) — not by the main 5-Spot controller. Labels are bounded by namespace × provider object name (provider objects are operator-authored exchange calendars, a small set).

Metric	Type	Labels	Description
`fivespot_capital_markets_active`	Gauge	`namespace`, `name`	Current active state of each `CapitalMarketsSchedule` (1 = market open, 0 = closed)
`fivespot_capital_markets_transitions_total`	Counter	`namespace`, `name`	Active⇄closed transitions; a high rate would indicate a misconfigured calendar

# Provider has not published an active state recently (controller down / RBAC).
- alert: CapitalMarketsScheduleStale
  expr: absent(fivespot_capital_markets_active)
  for: 15m
  labels: { severity: warning }
  annotations:
    summary: "No CapitalMarketsSchedule provider metrics — is the provider running?"

Labels¶

Common labels across metrics:

Label	Description
`phase`	Machine lifecycle phase
`result`	Operation result (`success`, `failure`, `error`)
`outcome`	Outcome label on emergency-drain histogram (`success`, `timeout`, `error`)
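`namespace`	Resource namespace (per-SM emergency-reclaim, spot-schedule, and CapitalMarketsSchedule metrics)
`name`	Resource name (per-SM emergency-reclaim and CapitalMarketsSchedule metrics)
`kind`	Spot-schedule provider kind (spot-schedule metrics)
`reason`	Unresolved-resolution reason on `fivespot_spot_schedule_resolution_errors_total`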
`error_type`	Error category for `fivespot_errors_total`
`version` / `instance_id`	Controller info labels (carried on `fivespot_controller_info`)

ServiceMonitor (Prometheus Operator)¶

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: 5spot-controller
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: 5spot-controller
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - 5spot-system

Grafana Dashboard¶

Example queries for a Grafana dashboard:

Operator health (leader presence)¶

sum(fivespot_is_leader)

A value of 0 across all replicas is a paging condition — no instance holds the lease, no reconciles are running.

Machines by phase¶

sum by (phase) (fivespot_machines_by_phase)

Reconciliation rate¶

rate(fivespot_reconciliations_total[5m])

Reconciliation latency (P99)¶

histogram_quantile(0.99, rate(fivespot_reconciliation_duration_seconds_bucket[5m]))

Reconciliation failure rate¶

rate(fivespot_reconciliations_total{result="failure"}[5m])

Emergency-reclaim drain — success P95¶

histogram_quantile(
  0.95,
  rate(fivespot_emergency_drain_duration_seconds_bucket{outcome="success"}[10m])
)

Operator SLO: P95 should sit well below 30 s on a healthy fleet. A growing P95 signals workloads with bad terminationGracePeriodSeconds defaults or PodDisruptionBudgets that are clipping the drain.

Emergency-reclaim drain — timeout rate¶

sum(rate(fivespot_emergency_drain_duration_seconds_count{outcome="timeout"}[10m]))

Any non-zero value means the 60 s EMERGENCY_DRAIN_TIMEOUT_SECS ceiling is biting. Cross-reference with fivespot_pod_evictions_total{result="failure"} to identify the workloads that wouldn't evict.

Top emergency-reclaim offenders (per-SM rate)¶

topk(10, sum by (namespace, name) (rate(fivespot_emergency_reclaims_total[1h])))

The 10 ScheduledMachines emergency-reclaimed most often in the last hour. An SM that consistently shows up here is a candidate for killIfCommands review: the user's workload may not be a true "got my box back" emergency.

`RapidReReclaim` warnings¶

sum by (namespace, name) (rate(fivespot_rapid_re_reclaims_total[1h]))

Any non-zero rate is operator-actionable: a user is re-enabling an SM whose conflicting process is still running. Trigger an alert and follow the "Rapid re-reclaim loop" runbook in troubleshooting.

Structured Logging¶

Log Format¶

Logs are emitted as structured JSON by default (controlled by RUST_LOG_FORMAT). Every log line carries standard fields including a reconcile_id correlation field that is unique per reconciliation attempt:

{
  "timestamp": "2026-04-09T00:00:00.123456Z",
  "level": "INFO",
  "fields": {
    "message": "Starting reconciliation",
    "reconcile_id": "deadbeef0001-17f3e2a1b",
    "resource": "my-machine",
    "namespace": "production"
  },
  "target": "five_spot::reconcilers::scheduled_machine",
  "span": { "name": "reconcile" }
}

Correlation IDs¶

The reconcile_id field ties together every log line produced during a single reconciliation. Use it to trace a full reconciliation end-to-end in your log aggregation platform:

# Follow all log lines for a specific reconciliation (jq)
kubectl logs -n 5spot-system -l app=5spot-controller | \
  jq -c 'select(.fields.reconcile_id == "deadbeef0001-17f3e2a1b")'

# Find all reconciliations for a specific resource
kubectl logs -n 5spot-system -l app=5spot-controller | \
  jq -c 'select(.fields.resource == "my-machine")'

# Find all error-phase transitions
kubectl logs -n 5spot-system -l app=5spot-controller | \
  jq -c 'select(.fields.to_phase == "Error")'
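
Where `jq` is unavailable, or the stream contains occasional non-JSON lines, the same filters can be expressed in a few lines of Python. This is a sketch under the assumption that every structured line has the `fields` object shown in the log format above; `filter_logs` is a hypothetical helper.

```python
import json


def filter_logs(lines, **wanted):
    """Yield parsed log records whose `fields` match every keyword given.

    Lines that are not JSON objects are skipped instead of aborting the scan.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text line mixed into the stream
        if not isinstance(record, dict):
            continue
        fields = record.get("fields")
        if not isinstance(fields, dict):
            continue
        if all(fields.get(key) == value for key, value in wanted.items()):
            yield record
```

For example, piping `kubectl logs` into a script that calls `filter_logs(sys.stdin, resource="my-machine", to_phase="Error")` would combine the second and third `jq` filters above.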

Phase Transition Logs¶

Every phase transition logs both the before (from_phase) and after (to_phase) values:

{
  "level": "INFO",
  "fields": {
    "message": "Phase transition",
    "from_phase": "Pending",
    "to_phase": "Active",
    "reconcile_id": "deadbeef0001-17f3e2a1b",
    "resource": "my-machine",
    "namespace": "production"
  }
}

Error Back-off Log Fields¶

When a reconciliation fails, the error policy emits an error-level log line with two additional fields:

Field	Type	Description
`retry_count`	u32	How many consecutive failures have occurred for this resource
`backoff_secs`	u64	Requeue delay chosen for this retry (30 s → 60 → 120 → 240 → 300 s cap)

{
  "level": "ERROR",
  "fields": {
    "message": "Reconciliation error — requeuing with exponential back-off",
    "error": "CAPI operation failed: ...",
    "retry_count": 3,
    "backoff_secs": 240,
    "resource": "my-machine",
    "namespace": "production"
  }
}

The retry count resets to 0 after a successful reconciliation, so a resource that recovers starts fresh on the next failure.
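
The documented delay sequence (30 s, 60, 120, 240, then the 300 s cap) matches a doubling back-off. The sketch below reproduces it under the assumption that `retry_count` starts at 0 for the first delay, which agrees with the sample log line above (`retry_count` 3 paired with `backoff_secs` 240). The constant and function names are illustrative; the controller's own code may be structured differently.

```python
BASE_SECS = 30   # first requeue delay from the table above
CAP_SECS = 300   # documented upper bound


def backoff_secs(retry_count):
    """Requeue delay for a retry count, doubling from BASE_SECS up to CAP_SECS."""
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    return min(BASE_SECS * 2 ** retry_count, CAP_SECS)


# Delays for retry counts 0 through 5.
SCHEDULE = [backoff_secs(n) for n in range(6)]
```

Because of the cap, every retry count from 4 onward yields the same 300 s delay until a successful reconciliation resets the count.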

Log Levels¶

Level	Use
`error`	Unrecoverable failures — always investigate
`warn`	Recoverable issues (PDB-blocked eviction, event publish failure)
`info`	Phase transitions, reconciliation start/end
`debug`	Per-pod decisions, API call details
`trace`	Internal state, schedule evaluation

Set via RUST_LOG:

RUST_LOG=info,kube=warn,hyper=warn  # Production default
RUST_LOG=debug                       # Verbose (--verbose flag)

Kubernetes Events¶

5-Spot publishes a Kubernetes Event for every phase transition, visible via:

kubectl describe scheduledmachine <name>
# or
kubectl get events --field-selector involvedObject.kind=ScheduledMachine

Event types and reasons:

Type	Reason	Trigger
Normal	`MachineCreated`	Transition to Active — CAPI resources provisioned
Normal	`ScheduleActive`	Machine entered schedule window
Normal	`ScheduleInactive`	Machine exited schedule window
Normal	`GracePeriodActive`	Graceful shutdown countdown started
Normal	`NodeDraining` / `NodeDrained`	Node drain start / completion
Normal	`MachineDeleted`	Transition to Inactive — CAPI resources removed
Normal	`ScheduleDisabled`	Schedule disabled, machine deactivated
Warning	`ReconcileFailed`	Unrecoverable error — machine in Error phase
Warning	`KillSwitchActivated`	Emergency kill switch triggered
Warning	`EmergencyReclaim`	Reclaim-agent process-match fired; emergency-remove flow started
Warning	`EmergencyReclaimDisabledSchedule`	Step 5 of the flow: `spec.enabled=false` patched (load-bearing — breaks the eject→re-add→re-eject loop)
Warning	`RapidReReclaim`	≥3 reclaims for the same SM within 10 min — the user is re-enabling without first stopping the conflicting process. See troubleshooting

Events are written to the events.k8s.io/v1 API and are immutable once created, providing an auditable state-change trail (SOX §404 / NIST AU-2).
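
The `RapidReReclaim` condition in the table above (three or more reclaims for the same SM within 10 minutes) is a sliding-window count. The sketch below shows one way such a check could work. It is a hypothetical illustration: this page does not describe how the controller tracks reclaim timestamps, and the class name and parameters are invented.

```python
from collections import deque


class ReclaimWindow:
    """Track reclaim timestamps for one SM and flag rapid re-reclaims."""

    def __init__(self, threshold=3, window_secs=600):
        self.threshold = threshold
        self.window_secs = window_secs
        self._events = deque()

    def record(self, ts):
        """Record a reclaim at ts (seconds, non-decreasing).

        Returns True when at least `threshold` reclaims fall inside the window
        ending at ts.
        """
        self._events.append(ts)
        # Drop reclaims that are older than the window.
        while ts - self._events[0] > self.window_secs:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

Three calls such as `record(0)`, `record(200)`, `record(400)` on one instance would return `False`, `False`, `True`, since the third reclaim falls within 600 seconds of the first.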

Alerting Examples¶

Prometheus Alerting Rules¶

groups:
  - name: 5spot
    rules:
      - alert: FiveSpotNoLeader
        expr: sum(fivespot_is_leader) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No 5-Spot controller instance holds the leader lease"

      - alert: FiveSpotHighFailureRate
        expr: rate(fivespot_reconciliations_total{result="failure"}[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High reconciliation failure rate"

      - alert: FiveSpotSlowReconciliation
        expr: histogram_quantile(0.99, rate(fivespot_reconciliation_duration_seconds_bucket[5m])) > 30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow reconciliation detected (P99 > 30 s)"

      - alert: FiveSpotEmergencyDrainTimeoutRising
        expr: |
          sum(rate(fivespot_emergency_drain_duration_seconds_count{outcome="timeout"}[10m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Emergency-reclaim drains hitting the 60 s timeout ceiling"
          description: |
            One or more emergency-reclaim drains failed to evict all pods within
            EMERGENCY_DRAIN_TIMEOUT_SECS (60 s). Cross-reference with
            fivespot_pod_evictions_total{result="failure"} to identify the
            offending workloads.

      - alert: FiveSpotRapidReReclaim
        expr: |
          sum by (namespace, name) (rate(fivespot_rapid_re_reclaims_total[15m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ScheduledMachine {{ $labels.namespace }}/{{ $labels.name }} is in a rapid re-reclaim loop"
          description: |
            ≥3 emergency-reclaim events fired within 10 minutes for the same SM —
            the user is re-enabling the schedule without first stopping the
            conflicting process. See troubleshooting.md "Rapid re-reclaim loop"
            runbook.

      - alert: FiveSpotFinalizerCleanupTimeouts
        expr: rate(fivespot_finalizer_cleanup_timeouts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Finalizers being force-removed; possible orphan CAPI resources"

Configuration - Operator configuration
Troubleshooting - Common issues
Multi-Instance - High availability