Configuration¶

5-Spot can be configured through environment variables and command-line arguments.

Environment Variables¶

Variable	Default	Description
`OPERATOR_INSTANCE_ID`	`0`	Instance ID for multi-instance deployments
`OPERATOR_INSTANCE_COUNT`	`1`	Total number of controller instances
`METRICS_PORT`	`8080`	Port for Prometheus metrics endpoint
`HEALTH_PORT`	`8081`	Port for health check endpoints
`RUST_LOG`	`info`	Log level (`trace`, `debug`, `info`, `warn`, `error`)
`RUST_LOG_FORMAT`	`json`	Log format: `json` (production/SIEM) or `text` (local dev)
`POD_NAME`	(injected)	Pod name injected via `fieldRef` (downward API); used as the leader-election holder identity and Kubernetes Event reporter
`ENABLE_LEADER_ELECTION`	`false`	Enable Kubernetes Lease-based leader election for multi-replica HA
`LEASE_NAME`	`5spot-leader`	Name of the Kubernetes `Lease` resource used for leader election
`POD_NAMESPACE`	`5spot-system`	Namespace in which to create the leader election `Lease` (injected via `fieldRef`)
`LEASE_DURATION_SECONDS`	`15`	How long the Lease is considered valid; a new leader is elected if not renewed in time
`LEASE_RENEW_DEADLINE_SECONDS`	`10`	The leader must renew the Lease within this many seconds; grace = duration − deadline
`LEASE_RETRY_PERIOD_SECONDS`	`2`	Documented for ops parity; not a direct LeaseManager parameter
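
The grace window mentioned for `LEASE_RENEW_DEADLINE_SECONDS` follows directly from the two lease settings. As a minimal sketch, the helper below computes it from the documented defaults. `lease_grace` is a hypothetical name and not part of 5-Spot, and rejecting a deadline that is not shorter than the duration is an assumption rather than documented controller behaviour.

```shell
# Hypothetical helper, not shipped with 5-Spot.
# Prints the grace window (duration - deadline) in seconds.
# Assumption: a renew deadline that is not shorter than the lease
# duration is treated as a misconfiguration.
lease_grace() (
  duration="${1:-${LEASE_DURATION_SECONDS:-15}}"
  deadline="${2:-${LEASE_RENEW_DEADLINE_SECONDS:-10}}"
  if [ "$deadline" -ge "$duration" ]; then
    echo "renew deadline (${deadline}s) must be shorter than lease duration (${duration}s)" >&2
    exit 1
  fi
  echo $((duration - deadline))
)

lease_grace 15 10   # prints 5
```

With the defaults, the leader has a 5-second margin between the renew deadline and the point at which the Lease is considered expired.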

Command-Line Arguments¶

5spot-controller [OPTIONS]

Options:
  --instance-id <ID>                  Instance ID (default: 0)
  --instance-count <COUNT>            Total instances (default: 1)
  --metrics-port <PORT>               Metrics port (default: 8080)
  --health-port <PORT>                Health port (default: 8081)
  --log-format <FORMAT>               Log format: json or text (default: json) [env: RUST_LOG_FORMAT]
  --enable-leader-election            Enable leader election [env: ENABLE_LEADER_ELECTION]
  --lease-name <NAME>                 Lease resource name (default: 5spot-leader) [env: LEASE_NAME]
  --lease-namespace <NS>              Lease namespace (default: 5spot-system) [env: POD_NAMESPACE]
  --lease-duration-secs <SECS>        Lease validity duration (default: 15) [env: LEASE_DURATION_SECONDS]
  --lease-renew-deadline-secs <SECS>  Renew deadline (default: 10) [env: LEASE_RENEW_DEADLINE_SECONDS]
  -v, --verbose                       Enable verbose logging
  -h, --help                          Print help
  -V, --version                       Print version

Log Format¶

The default json format is designed for SIEM ingestion and log aggregation. Switch to text for human-readable output during local development:

# Local development
RUST_LOG=debug RUST_LOG_FORMAT=text cargo run

# Production (default — structured JSON)
RUST_LOG=info RUST_LOG_FORMAT=json ./5spot

Leader Election¶

When deploying multiple replicas for high availability, enable leader election so only one instance reconciles resources at a time:

# Multi-replica HA deployment
ENABLE_LEADER_ELECTION=true \
LEASE_DURATION_SECONDS=15 \
LEASE_RENEW_DEADLINE_SECONDS=10 \
./5spot

Non-leader replicas watch for leadership changes and take over automatically within one LEASE_DURATION_SECONDS window if the leader stops renewing.
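
To see which replica currently holds leadership, the Lease can be read directly. The sketch below assumes `kubectl` access to the cluster and falls back to the documented default Lease name and namespace. `lease_holder` is a hypothetical helper, not a 5-Spot command.

```shell
# Hypothetical helper: print the holder identity recorded in the Lease.
# Per the table above, the holder identity is the leader's POD_NAME.
lease_holder() (
  kubectl get lease "${LEASE_NAME:-5spot-leader}" \
    -n "${POD_NAMESPACE:-5spot-system}" \
    -o jsonpath='{.spec.holderIdentity}'
)

# Usage (requires cluster access):
#   lease_holder
```

Running it before and after deleting the leader pod is one way to observe a takeover.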

Note: Leader election and multi-instance sharding (OPERATOR_INSTANCE_COUNT > 1) are alternative HA strategies. Use leader election for active/standby HA; use instance sharding to distribute load across all replicas.
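
To illustrate how sharding spreads work where leader election does not, here is a sketch of one common assignment scheme: hash the resource name and take it modulo the instance count. This scheme is an assumption for illustration only, since this page does not describe how 5-Spot assigns resources to instances. `shard_for` and `owns` are hypothetical names.

```shell
# Assumption: ownership = hash(name) mod OPERATOR_INSTANCE_COUNT.
# 5-Spot's real assignment scheme may differ.
shard_for() (
  name="$1"
  count="${2:-${OPERATOR_INSTANCE_COUNT:-1}}"
  # cksum prints "<crc> <size>"; keep the CRC as the hash value.
  hash=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
  echo $((hash % count))
)

# True when the given instance ID owns the named resource.
# Arguments: NAME [COUNT] [INSTANCE_ID]
owns() (
  [ "$(shard_for "$1" "$2")" -eq "${3:-${OPERATOR_INSTANCE_ID:-0}}" ]
)

# Usage:
#   shard_for my-machine 3     # prints 0, 1, or 2
#   owns my-machine 3 1 && echo "instance 1 reconciles my-machine"
```

Under such a scheme every replica is active and exactly one of them reconciles any given resource, whereas with leader election only the leader reconciles anything.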

Reclaim agent (DaemonSet)¶

The node-side 5spot-reclaim-agent is a separate binary deployed via DaemonSet (deploy/node-agent/daemonset.yaml), with its own flags and environment variables, distinct from the controller's. It is opt-in: nothing happens on a Node until the controller stamps the 5spot.finos.org/reclaim-agent: enabled label, which it does only when a ScheduledMachine on that Node has a non-empty spec.killIfCommands. See the emergency-reclaim concept doc for the full design.

Environment variables¶

Variable	Default	Description
`NODE_NAME`	(required, injected via downward API)	Name of the Node the agent is running on. The agent only PATCHes this Node.
`RECLAIM_PROC_ROOT`	`/proc`	Path the agent treats as `/proc`. Override for sandboxed/test runs only.
`RECLAIM_DETECTOR`	`auto`	Process-event source. `auto` picks `netlink` on Linux and `poll` elsewhere. See the Detector subsection below.
`MACHINE_ID_PATH`	`/etc/machine-id`	Path the agent reads for the host machine-id (host-identity verification, security-audit Phase 4). The DaemonSet mounts the host file at `/host/etc/machine-id` and sets this to that path.
`SKIP_HOST_ID_CHECK`	`false`	If `true`, skip the `Node.status.nodeInfo.machineID` cross-check before PATCH. Use only when `/etc/machine-id` is genuinely unavailable; production must stay strict.
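
When debugging a failed host-identity check, it can help to reproduce the comparison by hand. The sketch below compares the machine-id file against a value you supply. `host_id_matches` is a hypothetical helper, and the agent's own comparison logic may differ in detail.

```shell
# Hypothetical debugging helper, not the agent's implementation.
# Argument: the expected ID (the value of Node.status.nodeInfo.machineID).
# Exit status: 0 = match, 1 = mismatch, 2 = machine-id file unreadable.
host_id_matches() (
  path="${MACHINE_ID_PATH:-/etc/machine-id}"
  expected="$1"
  [ -r "$path" ] || exit 2
  # Strip the trailing newline and any other whitespace before comparing.
  actual=$(tr -d '[:space:]' < "$path")
  [ "$actual" = "$expected" ]
)

# Usage (requires kubectl access; NODE_NAME as injected into the agent):
#   host_id_matches "$(kubectl get node "$NODE_NAME" \
#     -o jsonpath='{.status.nodeInfo.machineID}')" && echo "host identity OK"
```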

Command-line arguments¶

5spot-reclaim-agent [OPTIONS]

Options:
  --proc-root <PATH>           Filesystem root mapped to /proc
                                 [default: /proc] [env: RECLAIM_PROC_ROOT]
  --node-name <NAME>           Node to annotate
                                 [env: NODE_NAME]
  --detector <DETECTOR>        Process-event source: auto | netlink | poll
                                 [default: auto] [env: RECLAIM_DETECTOR]
  --machine-id-path <PATH>     Host machine-id file
                                 [default: /etc/machine-id] [env: MACHINE_ID_PATH]
  --skip-host-id-check         Skip the host-identity cross-check before PATCH
                                 (defence-in-depth; default off)
                                 [env: SKIP_HOST_ID_CHECK]
  --oneshot                    Run the detector once and exit
                                 (one-shot tests / smoke verification)
  -h, --help                   Print help
  -V, --version                Print version

Detector¶

Two detection back-ends ship with the agent. Both produce identical matches and go through the same Node-PATCH path; only the event source differs.

Mode	Mechanism	Latency	Idle CPU	Linux only?	Extra capability
`poll`	Walks `/proc` every `poll_interval_ms`	up to one poll interval (250 ms default)	~0	No	None
`netlink`	Subscribes to the kernel proc connector (`PROC_EVENT_EXEC`)	<10 ms (kernel-pushed)	sleeps until kernel wakes it	Yes	`CAP_NET_ADMIN`

auto (the default):

- Linux → netlink
- macOS / any non-Linux → poll (the netlink subscriber's constructor returns Unsupported on those platforms)
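
The auto rule amounts to a small lookup. The sketch below mirrors it in shell to predict which back-end a given host would use. `resolve_detector` is a hypothetical helper, and using `uname -s` to identify the platform is an assumption about how to detect Linux from a shell, not a description of the agent's code.

```shell
# Hypothetical helper mirroring the documented auto rule.
# Arguments: [MODE] [OS]; defaults to RECLAIM_DETECTOR (or auto) and `uname -s`.
resolve_detector() (
  mode="${1:-${RECLAIM_DETECTOR:-auto}}"
  os="${2:-$(uname -s)}"
  case "$mode" in
    auto)
      if [ "$os" = "Linux" ]; then echo netlink; else echo poll; fi
      ;;
    # Explicit modes pass through unchanged; note that the agent itself
    # reports Unsupported for netlink on non-Linux platforms.
    netlink|poll)
      echo "$mode"
      ;;
    *)
      echo "unknown detector: $mode" >&2
      exit 1
      ;;
  esac
)

resolve_detector auto Linux    # prints netlink
resolve_detector auto Darwin   # prints poll
```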

When to pin --detector=poll explicitly:

- Heavy-exec workloads (make -j32, compile farms, CI workers): netlink sees every short-lived process even if it exits in microseconds; poll only sees processes that survive to the next tick. Under exec storms poll can be cheaper.
- CAP_NET_ADMIN is unacceptable in your environment (PSA restricted profile, hardened cluster policy). The capability is granted only on opted-in nodes via the securityContext in the DaemonSet's pod template, but you may have organisational reasons to keep it dropped.
- The kernel lacks CONFIG_PROC_EVENTS (very rare; some embedded / hardened distros). The netlink socket opens cleanly but no events are ever delivered. See troubleshooting.
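
The heavy-exec case turns on whether a process is still alive when the next poll tick fires. The sketch below models that with deliberately simple assumptions: ticks fire at exact multiples of the interval and a scan is instantaneous. `poll_sees` is a hypothetical helper for reasoning about timing, not the agent's code.

```shell
# Simplified timing model; all values are in milliseconds.
# Arguments: START LIFETIME [INTERVAL], interval defaulting to 250.
poll_sees() (
  start="$1"
  lifetime="$2"
  interval="${3:-250}"
  # First tick at or after the process start.
  next_tick=$(( (start + interval - 1) / interval * interval ))
  # The process is seen only if it is still alive at that tick.
  [ "$next_tick" -lt $((start + lifetime)) ]
)

poll_sees 200 100 && echo "seen"     # the 250 ms tick falls inside 200..300 ms
poll_sees 100 100 || echo "missed"   # the process exits at 200 ms, before the 250 ms tick
```

In this model a process that lives at least one full interval is always seen, while shorter-lived ones are seen only if they happen to straddle a tick.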

Override at deploy time:

# Switch a running DaemonSet to poll mode. The agent watches its per-node
# ConfigMap, so ConfigMap changes need no pod restart, but environment
# changes need a rollout; kubectl set env triggers one:
kubectl set env -n 5spot-system ds/5spot-reclaim-agent \
  RECLAIM_DETECTOR=poll

ConfigMap Example¶

apiVersion: v1
kind: ConfigMap
metadata:
  name: 5spot-config
  namespace: 5spot-system
data:
  OPERATOR_INSTANCE_COUNT: "1"
  ENABLE_LEADER_ELECTION: "true"
  LEASE_NAME: "5spot-leader"
  LEASE_DURATION_SECONDS: "15"
  LEASE_RENEW_DEADLINE_SECONDS: "10"
  METRICS_PORT: "8080"
  HEALTH_PORT: "8081"
  RUST_LOG: "info"

Deployment Configuration¶

apiVersion: apps/v1
kind: Deployment
metadata:
  name: 5spot-controller
spec:
  replicas: 2  # HA: 1 active leader + 1 standby
  template:
    spec:
      containers:
        - name: controller
          envFrom:
            - configMapRef:
                name: 5spot-config
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

RBAC Configuration¶

Minimum required permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: 5spot-controller
rules:
  # ScheduledMachine resources
  - apiGroups: ["5spot.finos.org"]
    resources: ["scheduledmachines"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["5spot.finos.org"]
    resources: ["scheduledmachines/status"]
    verbs: ["get", "update", "patch"]

  # CAPI Machine resources
  - apiGroups: ["cluster.x-k8s.io"]
    resources: ["machines"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

  # Events for audit trail
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]

  # Secrets (if using SSH keys)
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]

  # Leases for leader election
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update", "patch"]
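
After applying the ClusterRole and its binding, the Lease permissions can be spot-checked with `kubectl auth can-i`. The sketch below assumes the controller runs as a ServiceAccount named `5spot-controller` in `5spot-system`. That ServiceAccount name is an assumption, as this page does not state it, and `check_lease_rbac` is a hypothetical helper.

```shell
# Hypothetical helper: report whether each Lease verb is allowed.
# Assumption: ServiceAccount "5spot-controller"; override with SERVICE_ACCOUNT.
check_lease_rbac() (
  ns="${POD_NAMESPACE:-5spot-system}"
  sa="system:serviceaccount:${ns}:${SERVICE_ACCOUNT:-5spot-controller}"
  for verb in get create update patch; do
    printf '%s leases: ' "$verb"
    # can-i prints yes/no and exits non-zero on "no"; keep looping either way.
    kubectl auth can-i "$verb" leases.coordination.k8s.io \
      --as="$sa" -n "$ns" || true
  done
)

# Usage (requires cluster access and permission to impersonate):
#   check_lease_rbac
```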

Monitoring - Metrics and health checks
Multi-Instance - High availability setup
Troubleshooting - Common issues