Health States and Heartbeat
The Telovix Console computes a sensor's health from several independent signals: heartbeat timing, trust certificate state, eBPF engine state, resource metrics, kernel guard findings, and BPF event loss counters. Understanding which signal caused a given health state is the first step in diagnosing any fleet issue.
Heartbeat timing
The sensor sends a heartbeat every 15 seconds over mTLS. The Console records the timestamp of each successful heartbeat as last_seen_at.
The sensor also maintains a persistent WebSocket connection that sends a JSON ping every 10 seconds. The Console considers any sensor with an active WebSocket connection to be definitively alive, regardless of the heartbeat staleness window.
| Condition | Result |
|---|---|
| Active WebSocket connection | Sensor treated as healthy independent of heartbeat age |
| No heartbeat for half the stale threshold (default: 45 seconds) | heartbeat_delayed reason added (Watch level, only when WebSocket is NOT active) |
| No heartbeat for the stale threshold (default: 90 seconds) | Status becomes stale |
| No heartbeat for 4× the stale threshold (default: 360 seconds, 6 minutes) | Health state becomes offline |
The stale threshold defaults to 90 seconds and is adjustable in Console Settings. The heartbeat-delayed window (half the threshold) and the offline threshold (four times the threshold) are derived from it automatically.
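The derivation above can be sketched as a small helper. This is an illustrative sketch, not the Console's actual implementation; the function and key names are assumptions.

```python
# Illustrative: derive the three heartbeat windows from the configurable
# stale threshold, per the table above. Names are hypothetical.
def heartbeat_windows(stale_secs: int = 90) -> dict:
    return {
        "delayed_secs": stale_secs // 2,  # heartbeat_delayed reason (Watch)
        "stale_secs": stale_secs,         # status becomes stale
        "offline_secs": stale_secs * 4,   # health state becomes offline
    }
```

With the 90-second default, this yields the documented 45 / 90 / 360-second windows.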
Alert delivery latency is also reported per sensor:
- 500 ms (P95) when the WebSocket stream is active
- 15,000 ms (heartbeat interval) when only heartbeat delivery is active
Sensor status (status)
The status field reflects the sensor's fleet lifecycle state:
| Value | Meaning |
|---|---|
| healthy | Sensor is sending heartbeats on schedule |
| stale | No heartbeat received within the stale threshold (default 90 seconds) |
| disabled | Operator explicitly disabled this sensor from the Console |
| revoked | Operator revoked this sensor's identity; all mTLS requests are blocked |
The status field is distinct from health_state. A sensor can have status: healthy but health_state: watch due to resource pressure or a certificate approaching expiry.
Sensor health state (health_state)
The health_state is a computed value derived from multiple signals. It has five levels in ascending severity order:
| Health state | Severity | Typical reasons |
|---|---|---|
| healthy | 0 | No issues detected |
| watch | 1 | Non-critical concerns: certificate renewal recommended, delayed heartbeat (without WebSocket), BPF events lost, high CPU/memory/load, kernel guard warnings, sensor disabled |
| degraded | 2 | Active trust alert, certificate renewal due, trust degraded, stale heartbeat (before the offline threshold), BPF loss rate above 50 per mille, runtime error (pack failure, event delivery failed) |
| critical | 3 | Sensor revoked, eBPF engine unreachable or crashed, kernel guard failed |
| offline | 4 | No heartbeat for more than stale_secs × 4 (default 360 seconds) |
The Console always uses the most severe applicable state. A single critical condition overrides all Watch-level signals.
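The "most severe state wins" rule can be expressed as a one-line reduction. This is an illustrative sketch under the severity ordering from the table above; the function name is an assumption.

```python
# Illustrative: overall health_state is the most severe level among all
# detected reason levels. Severity values follow the documented table.
SEVERITY = {"healthy": 0, "watch": 1, "degraded": 2, "critical": 3, "offline": 4}

def overall_health(reason_levels: list[str]) -> str:
    # No detected reasons means the sensor is healthy; otherwise the
    # single most severe reason determines the state.
    if not reason_levels:
        return "healthy"
    return max(reason_levels, key=SEVERITY.__getitem__)
```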
Health reasons
Each health state includes a health_reasons array explaining which specific conditions were detected:
| Reason code | Level | Meaning |
|---|---|---|
| trust_revoked | Critical | Sensor or trust state is revoked |
| runtime_unreachable | Critical | eBPF engine process is unreachable or exited unexpectedly |
| kernel_guard_failed | Critical | Kernel guard hard check failed |
| trust_degraded | Degraded | Trust metadata indicates the control path is not fully healthy |
| renewal_due | Degraded | Certificate is within 24 hours of expiry |
| active_trust_failure | Degraded | An active trust alert is open for this sensor |
| runtime_error | Degraded | Pack preparation, pack activation, or event delivery failed |
| sensor_stale | Degraded | Heartbeats have stopped (before the offline threshold) |
| sensor_offline | Offline | No heartbeat for longer than the offline threshold |
| bpf_events_lost | Degraded / Watch | eBPF ring buffer overflow (Degraded if >50‰, Watch if any loss) |
| heartbeat_delayed | Watch | Heartbeat late but not yet stale, WebSocket not active |
| cpu_high | Watch | CPU usage above 85% |
| memory_high | Watch | Memory usage above 90% of total |
| load_high | Watch | 1-minute load average above 2 × CPU core count |
| kernel_guard_warning | Watch | Kernel guard findings present but not failing |
| renewal_recommended | Watch | Certificate is within 72 hours of expiry |
| sensor_disabled | Watch | Sensor was disabled by an operator |
Not a health signal: no assigned policy pack. A sensor with no pack assigned is in observe-only mode, which is a valid operational state. The Console does not flag it as Watch.
Trust health (trust_health)
Trust health reflects the state of the sensor's mTLS client certificate:
| Value | Meaning |
|---|---|
| healthy | Certificate is valid and not within the renewal window |
| renewal_recommended | Within 72 hours of expiry (configurable in Console Settings) |
| renewal_due | Within 24 hours of expiry, already expired, or a recent trust error (within the last 15 minutes) requires attention |
| degraded | Trust metadata or recent connection failures indicate the control path is not fully healthy |
| revoked | The Console has permanently revoked this sensor's identity |
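The expiry-based part of this classification can be sketched as follows. This is a hedged illustration using the default 72-hour and 24-hour windows; it deliberately omits the trust-error and degraded-metadata branches, and all names are assumptions.

```python
# Illustrative only: classify trust health from hours until certificate
# expiry, using the documented default windows (72 h / 24 h).
def classify_trust_health(hours_to_expiry: float, revoked: bool = False) -> str:
    if revoked:
        return "revoked"
    if hours_to_expiry <= 24:
        return "renewal_due"       # includes already-expired certificates
    if hours_to_expiry <= 72:
        return "renewal_recommended"
    return "healthy"
```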
Trust state (trust_state)
The raw enrollment and rotation state:
| Value | Meaning |
|---|---|
| bootstrap_pending | Sensor has enrolled but the first heartbeat has not yet been received |
| trusted | Normal operating state; certificate is current |
| rotated | Certificate was recently renewed; old certificate is still in the overlap window |
| trust_revoked | Operator revoked this sensor |
Manual renewal state (manual_renewal_state)
| Value | Meaning |
|---|---|
| idle | No manual renewal in progress |
| requested | Operator triggered a manual renewal; waiting for the sensor to pick it up |
| in_progress | Sensor is executing the renewal (new CSR sent, awaiting new cert) |
| succeeded | Manual renewal completed successfully |
| failed | Manual renewal failed; check manual_renewal_last_error for details |
eBPF engine state
The runtime_state field reflects the health of the embedded eBPF engine subprocess:
| Value | Health impact | Meaning |
|---|---|---|
| live | None | Normal operation; tetragon_live_adapter mode |
| compatibility | None | Legacy adapter mode (tetragon_adapter); collection still active |
| simulated | None | Dev fixture mode; not used in production |
| runtime_unreachable | Critical | Engine process has stopped responding |
| runtime_error | Critical | Engine startup or initialization failed |
| runtime_wait_failed | Critical | Engine socket never appeared (startup timeout) |
| runtime_exited | Critical | Engine process exited unexpectedly |
| pack_preparation_failed | Degraded | Policy pack could not be prepared for this sensor |
| pack_activation_failed | Degraded | Policy pack was prepared but could not be applied to the engine |
| event_delivery_failed | Degraded | Events could not be forwarded to the Console |
Only live mode supports enforcement. Enforcement cannot be enabled on a sensor whose runtime_mode is not live.
BPF event loss
The Console tracks two BPF loss metrics from each heartbeat:
- last_bpf_loss_per_mille: BPF ring buffer lost events per 1,000 events in the most recent session
- last_bpf_lost_session: raw count of lost events from the last engine session
Any non-zero value triggers a bpf_events_lost Watch reason. A value above 50 per mille (5%) escalates to Degraded. BPF event loss means the sensor is silently missing kernel events; investigate and resolve this before relying on the sensor for security decisions.
Resource pressure signals
The Console reads resource metrics from each heartbeat and applies the following thresholds:
| Metric | Threshold | Health reason |
|---|---|---|
| CPU usage | > 85% | cpu_high (Watch) |
| Memory usage | > 90% of total | memory_high (Watch) |
| 1-minute load average | > 2 × CPU core count | load_high (Watch) |
These are Watch-level signals only. They do not escalate to Degraded or Critical on their own.
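A minimal sketch of these threshold checks, assuming percentage metrics and a core count are available from the heartbeat; parameter names are illustrative.

```python
# Illustrative: the three Watch-level resource checks from the table above.
def resource_reasons(cpu_pct: float, mem_pct: float,
                     load_1m: float, cores: int) -> list[str]:
    reasons = []
    if cpu_pct > 85:
        reasons.append("cpu_high")
    if mem_pct > 90:
        reasons.append("memory_high")
    if load_1m > 2 * cores:
        reasons.append("load_high")
    return reasons  # each maps to a Watch-level health reason
```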
Kernel guard
The kernel guard runs checks every 30 seconds (after an initial 5-minute delay to let system services finish loading) and reports results in the heartbeat:
- kernel_guard_ok: false - Critical (hard failure)
- kernel_guard_findings not empty but kernel_guard_ok not false - Watch (soft warning)
Kernel guard findings include: kprobe baseline count changes, BPF filesystem mount state, unprivileged BPF enabled, and kernel module count changes since baseline.
The needs_attention flag
Each sensor has a needs_attention boolean that the Console uses to surface sensors requiring operator review. It is set to true when any of the following are true:
- An active trust alert is present
- Trust health is not healthy
- Trust state is trust_revoked
- Enforcement state is enforce_ready or enforced (enforcement-active sensors warrant monitoring)
- Runtime state is one of: runtime_unreachable, runtime_error, runtime_wait_failed, runtime_exited, pack_preparation_failed, pack_activation_failed, event_delivery_failed
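The rule can be read as a single boolean expression. The sketch below is illustrative; field values follow the document, but the function itself is an assumption.

```python
# Illustrative: needs_attention is true when any of the documented
# conditions hold.
BAD_RUNTIME_STATES = {
    "runtime_unreachable", "runtime_error", "runtime_wait_failed",
    "runtime_exited", "pack_preparation_failed", "pack_activation_failed",
    "event_delivery_failed",
}

def needs_attention(active_trust_alert: bool, trust_health: str,
                    trust_state: str, enforcement_state: str,
                    runtime_state: str) -> bool:
    return (
        active_trust_alert
        or trust_health != "healthy"
        or trust_state == "trust_revoked"
        or enforcement_state in {"enforce_ready", "enforced"}
        or runtime_state in BAD_RUNTIME_STATES
    )
```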
Backpressure
The Console can signal backpressure to sensors in the heartbeat response. When PostgreSQL write latency is elevated, the Console sets backpressure_secs to a non-zero value in the heartbeat response. The sensor waits that many additional seconds before its next heartbeat. This is a load-shedding mechanism and does not indicate a health problem with the sensor itself.
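From the sensor's side the mechanism is simply an additive delay. A hedged sketch, assuming the heartbeat response is a JSON object with an optional backpressure_secs field as described above; the function name is illustrative.

```python
# Illustrative: compute the delay before the next heartbeat, honoring
# any backpressure_secs the Console returned in the heartbeat response.
def next_heartbeat_delay(base_interval_secs: int, response: dict) -> int:
    # A missing or zero backpressure_secs means no extra wait.
    return base_interval_secs + int(response.get("backpressure_secs", 0))
```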
Troubleshooting
Sensor shows stale
```bash
# On the sensor host
systemctl status telovix-sensor
journalctl -u telovix-sensor -n 50 --no-pager | grep -E "heartbeat|error|connect|TLS"

# Test Console reachability from the sensor host
curl -k https://<console-host>:15483/healthz  # port 15483 is the Telovix self-hosted default

# Check the system clock (mTLS is sensitive to clock skew)
timedatectl status
```
Sensor shows critical (runtime_unreachable)
```bash
# Engine subprocess likely crashed
journalctl -u telovix-sensor -n 100 --no-pager | grep -E "engine|telovix|error|panic|BPF"

# Check the BPF filesystem
ls /sys/fs/bpf/
grep bpf /proc/mounts
```
Sensor shows degraded (bpf_events_lost)
BPF ring buffer overflow means the kernel is generating more events than the sensor can consume. This can indicate:
- A process creating an abnormally high rate of syscalls (fork bomb, busy loop)
- The sensor host is under very high load
- A large SBOM scan running concurrently
Review the sensor's resource metrics and event pipeline counters in Sensors > [sensor] > Resources.
Trust health shows renewal_due
The sensor renews its certificate automatically at the next heartbeat when the renewal window opens. If automatic renewal is failing, check the manual_renewal_state and manual_renewal_last_error fields from the sensor trust detail view. You can also trigger a manual renewal from Sensors > [sensor] > Renew Certificate.