Configuration¶
5-Spot can be configured through environment variables and command-line arguments.
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
OPERATOR_INSTANCE_ID |
0 |
Instance ID for multi-instance deployments |
OPERATOR_INSTANCE_COUNT |
1 |
Total number of controller instances |
METRICS_PORT |
8080 |
Port for Prometheus metrics endpoint |
HEALTH_PORT |
8081 |
Port for health check endpoints |
RUST_LOG |
info |
Log level (trace, debug, info, warn, error) |
RUST_LOG_FORMAT |
json |
Log format: json (production/SIEM) or text (local dev) |
POD_NAME |
(injected) | Pod name injected via fieldRef (downward API); used as the leader-election holder identity and Kubernetes Event reporter |
ENABLE_LEADER_ELECTION |
false |
Enable Kubernetes Lease-based leader election for multi-replica HA |
LEASE_NAME |
5spot-leader |
Name of the Kubernetes Lease resource used for leader election |
POD_NAMESPACE |
5spot-system |
Namespace in which to create the leader election Lease (injected via fieldRef) |
LEASE_DURATION_SECONDS |
15 |
How long the Lease is considered valid; a new leader is elected if not renewed in time |
LEASE_RENEW_DEADLINE_SECONDS |
10 |
The leader must renew the Lease within this many seconds; grace = duration − deadline |
LEASE_RETRY_PERIOD_SECONDS |
2 |
Documented for ops parity; not a direct LeaseManager parameter |
Command-Line Arguments¶
5spot-controller [OPTIONS]
Options:
--instance-id <ID> Instance ID (default: 0)
--instance-count <COUNT> Total instances (default: 1)
--metrics-port <PORT> Metrics port (default: 8080)
--health-port <PORT> Health port (default: 8081)
--log-format <FORMAT> Log format: json or text (default: json) [env: RUST_LOG_FORMAT]
--enable-leader-election Enable leader election [env: ENABLE_LEADER_ELECTION]
--lease-name <NAME> Lease resource name (default: 5spot-leader) [env: LEASE_NAME]
--lease-namespace <NS> Lease namespace (default: 5spot-system) [env: POD_NAMESPACE]
--lease-duration-secs <SECS> Lease validity duration (default: 15) [env: LEASE_DURATION_SECONDS]
--lease-renew-deadline-secs <SECS> Renew deadline (default: 10) [env: LEASE_RENEW_DEADLINE_SECONDS]
-v, --verbose Enable verbose logging
-h, --help Print help
-V, --version Print version
Log Format¶
The default json format is designed for SIEM ingestion and log aggregation. Switch to text for human-readable output during local development:
# Local development
RUST_LOG=debug RUST_LOG_FORMAT=text cargo run
# Production (default — structured JSON)
RUST_LOG=info RUST_LOG_FORMAT=json ./5spot
Leader Election¶
When deploying multiple replicas for high availability, enable leader election so only one instance reconciles resources at a time:
# Multi-replica HA deployment
ENABLE_LEADER_ELECTION=true \
LEASE_DURATION_SECONDS=15 \
LEASE_RENEW_DEADLINE_SECONDS=10 \
./5spot
Non-leader replicas watch for leadership changes and take over automatically within one LEASE_DURATION_SECONDS window if the leader stops renewing.
Note: Leader election and multi-instance sharding (
OPERATOR_INSTANCE_COUNT > 1) are alternative HA strategies. Use leader election for active/standby HA; use instance sharding to distribute load across all replicas.
Reclaim agent (DaemonSet)¶
The node-side 5spot-reclaim-agent is a separate binary deployed
via DaemonSet (deploy/node-agent/daemonset.yaml). It has its own
flags and environment variables — distinct from the controller — and
is opt-in: nothing happens on a Node until the controller stamps the
5spot.finos.org/reclaim-agent: enabled label, which it does only
when a ScheduledMachine on that Node has a non-empty
spec.killIfCommands. See the emergency-reclaim concept doc
for the full design.
Environment variables¶
| Variable | Default | Description |
|---|---|---|
NODE_NAME |
(required, injected via downward API) | Name of the Node the agent is running on. The agent only PATCHes this Node. |
RECLAIM_PROC_ROOT |
/proc |
Path the agent treats as /proc. Override for sandboxed/test runs only. |
RECLAIM_DETECTOR |
auto |
Process-event source. auto picks netlink on Linux and poll elsewhere. See the Detector subsection below. |
MACHINE_ID_PATH |
/etc/machine-id |
Path the agent reads for the host machine-id (host-identity verification, security-audit Phase 4). The DaemonSet mounts the host file at /host/etc/machine-id and sets this to that path. |
SKIP_HOST_ID_CHECK |
false |
If true, skip the Node.status.nodeInfo.machineID cross-check before PATCH. Use only when /etc/machine-id is genuinely unavailable; production must stay strict. |
Command-line arguments¶
5spot-reclaim-agent [OPTIONS]
Options:
--proc-root <PATH> Filesystem root mapped to /proc
[default: /proc] [env: RECLAIM_PROC_ROOT]
--node-name <NAME> Node to annotate
[env: NODE_NAME]
--detector <DETECTOR> Process-event source: auto | netlink | poll
[default: auto] [env: RECLAIM_DETECTOR]
--machine-id-path <PATH> Host machine-id file
[default: /etc/machine-id] [env: MACHINE_ID_PATH]
--skip-host-id-check Skip the host-identity cross-check before PATCH
(defence-in-depth; default off)
[env: SKIP_HOST_ID_CHECK]
--oneshot Run the detector once and exit
(one-shot tests / smoke verification)
-h, --help Print help
-V, --version Print version
Detector¶
Two detection back-ends ship with the agent. Both produce identical matches and go through the same Node-PATCH path; only the event source differs.
| Mode | Mechanism | Latency | Idle CPU | Linux only? | Extra capability |
|---|---|---|---|---|---|
poll |
Walks /proc every poll_interval_ms |
up to one poll interval (250 ms default) | ~0 | No | None |
netlink |
Subscribes to the kernel proc connector (PROC_EVENT_EXEC) |
<10 ms (kernel-pushed) | sleeps until kernel wakes it | Yes | CAP_NET_ADMIN |
auto (the default):
- Linux → netlink
- macOS / any non-Linux → poll (the netlink subscriber's
constructor returns Unsupported on those platforms)
When to pin --detector=poll explicitly:
- Heavy-exec workloads (
make -j32, compile farms, CI workers) —netlinksees every short-lived process even if it exits in microseconds;pollonly sees processes that survive to the next tick. Under exec stormspollcan be cheaper. CAP_NET_ADMINis unacceptable in your environment (PSArestrictedprofile, hardened cluster policy). The cap is granted only on opted-in nodes via the DaemonSet's pod-levelsecurityContext, but you may have organisational reasons to keep it dropped.- Kernel without
CONFIG_PROC_EVENTS(very rare; some embedded / hardened distros).netlinksocket opens cleanly but no events are ever delivered. See troubleshooting.
Override at deploy time:
# Switch a running DaemonSet to poll mode (no pod restart needed —
# the agent watches its per-node ConfigMap, but env changes need a
# rollout; use kubectl set env to trigger one):
kubectl set env -n 5spot-system ds/5spot-reclaim-agent \
RECLAIM_DETECTOR=poll
ConfigMap Example¶
apiVersion: v1
kind: ConfigMap
metadata:
name: 5spot-config
namespace: 5spot-system
data:
OPERATOR_INSTANCE_COUNT: "1"
ENABLE_LEADER_ELECTION: "true"
LEASE_NAME: "5spot-leader"
LEASE_DURATION_SECONDS: "15"
LEASE_RENEW_DEADLINE_SECONDS: "10"
METRICS_PORT: "8080"
HEALTH_PORT: "8081"
RUST_LOG: "info"
Deployment Configuration¶
apiVersion: apps/v1
kind: Deployment
metadata:
name: 5spot-controller
spec:
replicas: 2 # HA: 1 active leader + 1 standby
template:
spec:
containers:
- name: controller
envFrom:
- configMapRef:
name: 5spot-config
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
RBAC Configuration¶
Minimum required permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: 5spot-controller
rules:
# ScheduledMachine resources
- apiGroups: ["5spot.finos.org"]
resources: ["scheduledmachines"]
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["5spot.finos.org"]
resources: ["scheduledmachines/status"]
verbs: ["get", "update", "patch"]
# CAPI Machine resources
- apiGroups: ["cluster.x-k8s.io"]
resources: ["machines"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Events for audit trail
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "patch"]
# Secrets (if using SSH keys)
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list", "watch"]
# Leases for leader election
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "create", "update", "patch"]
Related¶
- Monitoring - Metrics and health checks
- Multi-Instance - High availability setup
- Troubleshooting - Common issues