
Health States and Heartbeat

The Telovix Console computes a sensor's health from several independent signals: heartbeat timing, trust certificate state, eBPF engine state, resource metrics, kernel guard findings, and BPF event loss counters. Understanding which signal caused a given health state is the first step in diagnosing any fleet issue.


Heartbeat timing

The sensor sends a heartbeat every 15 seconds over mTLS. The Console records the timestamp of each successful heartbeat as last_seen_at.

The sensor also maintains a persistent WebSocket connection that sends a JSON ping every 10 seconds. The Console considers any sensor with an active WebSocket connection to be definitively alive, regardless of the heartbeat staleness window.

| Condition | Result |
| --- | --- |
| Active WebSocket connection | Sensor treated as healthy independent of heartbeat age |
| No heartbeat for half the stale threshold (default: 45 seconds) | heartbeat_delayed reason added (Watch level, only when WebSocket is NOT active) |
| No heartbeat for the stale threshold (default: 90 seconds) | Status becomes stale |
| No heartbeat for 4x the stale threshold (default: 360 seconds, ~6 minutes) | Health state becomes offline |

The stale threshold defaults to 90 seconds and is adjustable in Console Settings. The degraded window and offline threshold are derived from it automatically.
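Since the degraded and offline windows are derived from the single configured stale threshold, the relationships above can be sketched as a small helper. This is an illustrative sketch: the function and key names are not the Console's actual configuration keys.

```python
def heartbeat_windows(stale_secs: int = 90) -> dict:
    """Derive the delayed/stale/offline windows from the configured
    stale threshold, per the table above."""
    return {
        "delayed_after": stale_secs // 2,  # heartbeat_delayed (Watch)
        "stale_after": stale_secs,         # status becomes stale
        "offline_after": stale_secs * 4,   # health state becomes offline
    }

windows = heartbeat_windows(90)
# delayed at 45 s, stale at 90 s, offline at 360 s
```

Raising the stale threshold in Console Settings therefore also moves the delayed and offline boundaries proportionally.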

Alert delivery latency is also reported per sensor:

  • 500 ms (P95) when the WebSocket stream is active
  • 15,000 ms (heartbeat interval) when only heartbeat delivery is active

Sensor status (status)

The status field reflects the sensor's fleet lifecycle state:

| Value | Meaning |
| --- | --- |
| healthy | Sensor is sending heartbeats on schedule |
| stale | No heartbeat received within the stale threshold (default 90 seconds) |
| disabled | Operator explicitly disabled this sensor from the Console |
| revoked | Operator revoked this sensor's identity; all mTLS requests are blocked |

The status field is distinct from health_state. A sensor can have status: healthy but health_state: watch due to resource pressure or a certificate approaching expiry.


Sensor health state (health_state)

The health_state is a computed value derived from multiple signals. It has five levels in ascending severity order:

| Health state | Severity | Typical reasons |
| --- | --- | --- |
| healthy | 0 | No issues detected |
| watch | 1 | Non-critical concerns: certificate renewal recommended, delayed heartbeat (without WebSocket), BPF events lost, high CPU/memory/load, kernel guard warnings, sensor disabled |
| degraded | 2 | Active trust alert, certificate renewal due, trust degraded, stale heartbeat (before the offline threshold), BPF loss rate above 50 per mille, runtime error (pack failure, event delivery failed) |
| critical | 3 | Sensor revoked, eBPF engine unreachable or crashed, kernel guard failed |
| offline | 4 | No heartbeat for more than stale_secs × 4 (default ~6 minutes) |

The Console always uses the most severe applicable state. A single critical condition overrides all Watch-level signals.
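The "most severe wins" rule reduces to a maximum over severity levels. A minimal sketch, using the level names and severity numbers from the table above (the function name is illustrative):

```python
# Severity ordering from the health state table above
SEVERITY = {"healthy": 0, "watch": 1, "degraded": 2, "critical": 3, "offline": 4}

def overall_health(reason_levels: list) -> str:
    """Return the most severe applicable state; healthy if no reasons fired."""
    if not reason_levels:
        return "healthy"
    return max(reason_levels, key=SEVERITY.__getitem__)

overall_health(["watch", "critical", "watch"])  # -> "critical"
```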

Health reasons

Each health state includes a health_reasons array explaining which specific conditions were detected:

| Reason code | Level | Meaning |
| --- | --- | --- |
| trust_revoked | Critical | Sensor or trust state is revoked |
| runtime_unreachable | Critical | eBPF engine process is unreachable or exited unexpectedly |
| kernel_guard_failed | Critical | Kernel guard hard check failed |
| trust_degraded | Degraded | Trust metadata indicates the control path is not fully healthy |
| renewal_due | Degraded | Certificate is within 24 hours of expiry |
| active_trust_failure | Degraded | An active trust alert is open for this sensor |
| runtime_error | Degraded | Pack preparation, pack activation, or event delivery failed |
| sensor_stale | Degraded | Heartbeats have stopped (before the offline threshold) |
| sensor_offline | Offline | No heartbeat for longer than the offline threshold |
| bpf_events_lost | Degraded / Watch | eBPF ring buffer overflow (Degraded if >50‰, Watch if any loss) |
| heartbeat_delayed | Watch | Heartbeat late but not yet stale, WebSocket not active |
| cpu_high | Watch | CPU usage above 85% |
| memory_high | Watch | Memory usage above 90% of total |
| load_high | Watch | 1-minute load average above 2 × CPU core count |
| kernel_guard_warning | Watch | Kernel guard findings present but not failing |
| renewal_recommended | Watch | Certificate is within 72 hours of expiry |
| sensor_disabled | Watch | Sensor was disabled by an operator |

Not a health signal: no assigned policy pack. A sensor with no pack assigned is in observe-only mode, which is a valid operational state. The Console does not flag it as Watch.


Trust health (trust_health)

Trust health reflects the state of the sensor's mTLS client certificate:

| Value | Meaning |
| --- | --- |
| healthy | Certificate is valid and not within the renewal window |
| renewal_recommended | Within 72 hours of expiry (configurable in Console Settings) |
| renewal_due | Within 24 hours of expiry, already expired, or a recent trust error (within the last 15 minutes) requires attention |
| degraded | Trust metadata or recent connection failures indicate the control path is not fully healthy |
| revoked | The Console has permanently revoked this sensor's identity |
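The expiry-driven part of this classification can be sketched as follows. This is a simplified illustration of the 72-hour and 24-hour windows only; revocation, degraded trust, and the recent-trust-error condition are separate signals and are omitted here, and the function name is not part of the Console API.

```python
from datetime import datetime, timedelta, timezone

def classify_renewal(expires_at, now,
                     recommended_window=timedelta(hours=72),
                     due_window=timedelta(hours=24)):
    """Classify certificate expiry against the renewal windows above."""
    remaining = expires_at - now
    if remaining <= due_window:          # within 24 hours, or already expired
        return "renewal_due"
    if remaining <= recommended_window:  # within 72 hours
        return "renewal_recommended"
    return "healthy"
```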

Trust state (trust_state)

The raw enrollment and rotation state:

| Value | Meaning |
| --- | --- |
| bootstrap_pending | Sensor has enrolled but the first heartbeat has not yet been received |
| trusted | Normal operating state; certificate is current |
| rotated | Certificate was recently renewed; the old certificate is still in the overlap window |
| trust_revoked | Operator revoked this sensor |

Manual renewal state (manual_renewal_state)

| Value | Meaning |
| --- | --- |
| idle | No manual renewal in progress |
| requested | Operator triggered a manual renewal; waiting for the sensor to pick it up |
| in_progress | Sensor is executing the renewal (new CSR sent, awaiting the new certificate) |
| succeeded | Manual renewal completed successfully |
| failed | Manual renewal failed; check manual_renewal_last_error for details |

eBPF engine state

The runtime_state field reflects the health of the embedded eBPF engine subprocess:

| Value | Health impact | Meaning |
| --- | --- | --- |
| live | None | Normal operation; tetragon_live_adapter mode |
| compatibility | None | Legacy adapter mode (tetragon_adapter); collection still active |
| simulated | None | Dev fixture mode; not used in production |
| runtime_unreachable | Critical | Engine process has stopped responding |
| runtime_error | Critical | Engine startup or initialization failed |
| runtime_wait_failed | Critical | Engine socket never appeared (startup timeout) |
| runtime_exited | Critical | Engine process exited unexpectedly |
| pack_preparation_failed | Degraded | Policy pack could not be prepared for this sensor |
| pack_activation_failed | Degraded | Policy pack was prepared but could not be applied to the engine |
| event_delivery_failed | Degraded | Events could not be forwarded to the Console |

Only live mode supports enforcement. Enforcement cannot be enabled on a sensor whose runtime_mode is not live.


BPF event loss

The Console tracks two BPF loss metrics from each heartbeat:

  • last_bpf_loss_per_mille: BPF ring buffer lost events per 1,000 events in the most recent session
  • last_bpf_lost_session: raw count of lost events from the last engine session

Any non-zero value triggers a bpf_events_lost Watch reason. A value above 50 per mille (5%) escalates to Degraded. BPF event loss means the sensor is silently missing kernel events; investigate and resolve this before relying on the sensor for security decisions.
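The escalation rule can be sketched directly from these two sentences. A minimal illustration, assuming the per-mille figure is lost events per 1,000 observed events (the function name is illustrative):

```python
def bpf_loss_reason(lost: int, total: int):
    """Map ring-buffer loss to the bpf_events_lost reason level:
    any loss -> Watch, above 50 per mille -> Degraded, none -> no reason."""
    if total == 0 or lost == 0:
        return None
    per_mille = lost * 1000 / total
    level = "degraded" if per_mille > 50 else "watch"
    return ("bpf_events_lost", level)

bpf_loss_reason(60, 1000)  # -> ("bpf_events_lost", "degraded")
```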


Resource pressure signals

The Console reads resource metrics from each heartbeat and applies the following thresholds:

| Metric | Threshold | Health reason |
| --- | --- | --- |
| CPU usage | > 85% | cpu_high (Watch) |
| Memory usage | > 90% of total | memory_high (Watch) |
| 1-minute load average | > 2 × CPU core count | load_high (Watch) |

These are Watch-level signals only. They do not escalate to Degraded or Critical on their own.
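The three thresholds are independent, so a sensor can accumulate more than one resource reason at once. A sketch of the checks above (parameter names are illustrative, not heartbeat field names):

```python
def resource_reasons(cpu_pct: float, mem_pct: float,
                     load_1m: float, cores: int) -> list:
    """Apply the Watch-level resource thresholds from the table above."""
    reasons = []
    if cpu_pct > 85:            # CPU usage above 85%
        reasons.append("cpu_high")
    if mem_pct > 90:            # memory above 90% of total
        reasons.append("memory_high")
    if load_1m > 2 * cores:     # 1-minute load above 2x core count
        reasons.append("load_high")
    return reasons

resource_reasons(92.0, 95.5, 1.2, 4)  # -> ["cpu_high", "memory_high"]
```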


Kernel guard

The kernel guard runs checks every 30 seconds (after an initial 5-minute delay to let system services finish loading) and reports results in the heartbeat:

  • kernel_guard_ok: false - Critical (hard failure)
  • kernel_guard_findings not empty but kernel_guard_ok not false - Watch (soft warning)

Kernel guard findings include: kprobe baseline count changes, BPF filesystem mount state, unprivileged BPF enabled, and kernel module count changes since baseline.
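The two-tier mapping ("false is a hard failure, findings without a hard failure are a warning") can be sketched as follows; the function name is illustrative:

```python
def kernel_guard_reason(guard_ok, findings: list):
    """kernel_guard_ok: false -> Critical hard failure; findings present
    without a hard failure -> Watch warning; otherwise no reason."""
    if guard_ok is False:
        return ("kernel_guard_failed", "critical")
    if findings:
        return ("kernel_guard_warning", "watch")
    return None
```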


The needs_attention flag

Each sensor has a needs_attention boolean that the Console uses to surface sensors requiring operator review. It is set to true when any of the following conditions holds:

  • An active trust alert is present
  • Trust health is not healthy
  • Trust state is trust_revoked
  • Enforcement state is enforce_ready or enforced (enforcement active sensors warrant monitoring)
  • Runtime state is one of: runtime_unreachable, runtime_error, runtime_wait_failed, runtime_exited, pack_preparation_failed, pack_activation_failed, event_delivery_failed
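The flag is a simple OR over those conditions. A sketch under the assumption that each condition arrives as a field on the sensor record (the parameter names and the "observe" enforcement value here are illustrative, not Console field values):

```python
# Runtime states listed above that force needs_attention
FAILED_RUNTIME_STATES = {
    "runtime_unreachable", "runtime_error", "runtime_wait_failed",
    "runtime_exited", "pack_preparation_failed", "pack_activation_failed",
    "event_delivery_failed",
}

def needs_attention(active_trust_alert: bool, trust_health: str,
                    trust_state: str, enforcement_state: str,
                    runtime_state: str) -> bool:
    """OR together the five conditions listed above."""
    return (active_trust_alert
            or trust_health != "healthy"
            or trust_state == "trust_revoked"
            or enforcement_state in {"enforce_ready", "enforced"}
            or runtime_state in FAILED_RUNTIME_STATES)
```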

Backpressure

The Console can signal backpressure to sensors in the heartbeat response. When PostgreSQL write latency is elevated, the Console sets backpressure_secs to a non-zero value in the heartbeat response. The sensor waits that many additional seconds before its next heartbeat. This is a load-shedding mechanism and does not indicate a health problem with the sensor itself.
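From the sensor's side, the effect is simply an extended wait before the next heartbeat. A minimal sketch, assuming a 15-second base interval and a backpressure_secs value taken from the last heartbeat response (the function name is illustrative):

```python
def next_heartbeat_delay(base_interval: int = 15,
                         backpressure_secs: int = 0) -> int:
    """Base heartbeat interval plus any backpressure the Console
    requested in the last heartbeat response."""
    return base_interval + max(0, backpressure_secs)

next_heartbeat_delay(15, 30)  # sensor waits 45 seconds
```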


Troubleshooting

Sensor shows stale

```bash
# On the sensor host
systemctl status telovix-sensor
journalctl -u telovix-sensor -n 50 --no-pager | grep -E "heartbeat|error|connect|TLS"

# Test Console reachability from the sensor host
curl -k https://<console-host>:15483/healthz  # port 15483 is the Telovix self-hosted default

# Check the system clock (mTLS is sensitive to clock skew)
timedatectl status
```

Sensor shows critical (runtime_unreachable)

```bash
# The engine subprocess has likely crashed
journalctl -u telovix-sensor -n 100 --no-pager | grep -E "engine|telovix|error|panic|BPF"

# Check the BPF filesystem
ls /sys/fs/bpf/
grep bpf /proc/mounts
```

Sensor shows degraded (bpf_events_lost)

BPF ring buffer overflow means the kernel is generating more events than the sensor can consume. This can indicate:

  • A process creating an abnormally high rate of syscalls (fork bomb, busy loop)
  • The sensor host is under very high load
  • A large SBOM scan running concurrently

Review the sensor's resource metrics and event pipeline counters in Sensors > [sensor] > Resources.

Trust health shows renewal_due

The sensor renews its certificate automatically at the next heartbeat when the renewal window opens. If automatic renewal is failing, check the manual_renewal_state and manual_renewal_last_error fields from the sensor trust detail view. You can also trigger a manual renewal from Sensors > [sensor] > Renew Certificate.



Released under the Telovix Commercial License.