
Health States and Heartbeat

The Telovix Console computes a sensor's health from several independent signals: heartbeat timing, trust certificate state, eBPF engine state, resource metrics, kernel guard findings, and BPF event loss counters. Understanding which signal caused a given health state is the first step in diagnosing any fleet issue.


Heartbeat timing

The sensor sends a heartbeat every 15 seconds over mTLS. The Console records the timestamp of each successful heartbeat as last_seen_at.

The sensor also maintains a persistent WebSocket connection that sends a JSON ping every 10 seconds. The Console considers any sensor with an active WebSocket connection to be definitively alive, regardless of the heartbeat staleness window.

| Condition | Result |
| --- | --- |
| Active WebSocket connection | Sensor treated as healthy independent of heartbeat age |
| No heartbeat for half the stale threshold (default: 45 seconds) | heartbeat_delayed reason added (Watch level, only when WebSocket is NOT active) |
| No heartbeat for the stale threshold (default: 90 seconds) | Status becomes stale |
| No heartbeat for 4x the stale threshold (default: 360 seconds, ~6 minutes) | Health state becomes offline |

The stale threshold defaults to 90 seconds and is adjustable in Console Settings. The degraded window and offline threshold are derived from it automatically.
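Since the degraded and offline windows are derived from the single configured stale threshold, the relationships above can be sketched as a small helper. This is an illustrative sketch: the function and key names are not the Console's actual configuration keys.

```python
def heartbeat_windows(stale_secs: int = 90) -> dict:
    """Derive the delayed/stale/offline windows from the configured
    stale threshold, per the table above."""
    return {
        "delayed_after": stale_secs // 2,  # heartbeat_delayed (Watch)
        "stale_after": stale_secs,         # status becomes stale
        "offline_after": stale_secs * 4,   # health state becomes offline
    }

windows = heartbeat_windows(90)
# delayed at 45 s, stale at 90 s, offline at 360 s
```

Raising the stale threshold in Console Settings therefore also moves the delayed and offline boundaries proportionally.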

Alert delivery latency is also reported per sensor:

  • 500 ms (P95) when the WebSocket stream is active
  • 15,000 ms (heartbeat interval) when only heartbeat delivery is active

Sensor status (status)

The status field reflects the sensor's fleet lifecycle state:

| Value | Meaning |
| --- | --- |
| healthy | Sensor is sending heartbeats on schedule |
| stale | No heartbeat received within the stale threshold (default 90 seconds) |
| disabled | Operator explicitly disabled this sensor from the Console |
| revoked | Operator revoked this sensor's identity; all mTLS requests are blocked |

The status field is distinct from health_state. A sensor can have status: healthy but health_state: watch due to resource pressure or a certificate approaching expiry.


Sensor health state (health_state)

The health_state is a computed value derived from multiple signals. It has five levels in ascending severity order:

| Health state | Severity | Typical reasons |
| --- | --- | --- |
| healthy | 0 | No issues detected |
| watch | 1 | Non-critical concerns: certificate renewal recommended, delayed heartbeat (without WebSocket), BPF events lost, high CPU/memory/load, kernel guard warnings, sensor disabled |
| degraded | 2 | Active trust alert, certificate renewal due, trust degraded, stale heartbeat (before the offline threshold), BPF loss rate above 50 per mille, runtime error (pack failure, event delivery failed) |
| critical | 3 | Sensor revoked, eBPF engine unreachable or crashed, kernel guard failed |
| offline | 4 | No heartbeat for more than stale_secs × 4 (default ~6 minutes) |

The Console always uses the most severe applicable state. A single critical condition overrides all Watch-level signals.
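The "most severe wins" rule reduces to a maximum over severity levels. A minimal sketch, using the level names and severity numbers from the table above (the function name is illustrative):

```python
# Severity ordering from the health state table above
SEVERITY = {"healthy": 0, "watch": 1, "degraded": 2, "critical": 3, "offline": 4}

def overall_health(reason_levels: list) -> str:
    """Return the most severe applicable state; healthy if no reasons fired."""
    if not reason_levels:
        return "healthy"
    return max(reason_levels, key=SEVERITY.__getitem__)

overall_health(["watch", "critical", "watch"])  # -> "critical"
```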

Health reasons

Each health state includes a health_reasons array explaining which specific conditions were detected:

| Reason code | Level | Meaning |
| --- | --- | --- |
| trust_revoked | Critical | Sensor or trust state is revoked |
| runtime_unreachable | Critical | eBPF engine process is unreachable or exited unexpectedly |
| kernel_guard_failed | Critical | Kernel guard hard check failed |
| trust_degraded | Degraded | Trust metadata indicates the control path is not fully healthy |
| renewal_due | Degraded | Certificate is within 24 hours of expiry |
| active_trust_failure | Degraded | An active trust alert is open for this sensor |
| runtime_error | Degraded | Pack preparation, pack activation, or event delivery failed |
| sensor_stale | Degraded | Heartbeats have stopped (before the offline threshold) |
| sensor_offline | Offline | No heartbeat for longer than the offline threshold |
| bpf_events_lost | Degraded / Watch | eBPF ring buffer overflow (Degraded if >50‰, Watch if any loss) |
| heartbeat_delayed | Watch | Heartbeat late but not yet stale, WebSocket not active |
| cpu_high | Watch | CPU usage above 85% |
| memory_high | Watch | Memory usage above 90% of total |
| load_high | Watch | 1-minute load average above 2 × CPU core count |
| kernel_guard_warning | Watch | Kernel guard findings present but not failing |
| renewal_recommended | Watch | Certificate is within 72 hours of expiry |
| sensor_disabled | Watch | Sensor was disabled by an operator |

Not a health signal: no assigned policy pack. A sensor with no pack assigned is in observe-only mode, which is a valid operational state. The Console does not flag it as Watch.


Trust health (trust_health)

Trust health reflects the state of the sensor's mTLS client certificate:

| Value | Meaning |
| --- | --- |
| healthy | Certificate is valid and not within the renewal window |
| renewal_recommended | Within 72 hours of expiry (configurable in Console Settings) |
| renewal_due | Within 24 hours of expiry, already expired, or a recent trust error (within the last 15 minutes) requires attention |
| degraded | Trust metadata or recent connection failures indicate the control path is not fully healthy |
| revoked | The Console has permanently revoked this sensor's identity |
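The expiry-driven part of this classification can be sketched as follows. This is a simplified illustration of the 72-hour and 24-hour windows only; revocation, degraded trust, and the recent-trust-error condition are separate signals and are omitted here, and the function name is not part of the Console API.

```python
from datetime import datetime, timedelta, timezone

def classify_renewal(expires_at, now,
                     recommended_window=timedelta(hours=72),
                     due_window=timedelta(hours=24)):
    """Classify certificate expiry against the renewal windows above."""
    remaining = expires_at - now
    if remaining <= due_window:          # within 24 hours, or already expired
        return "renewal_due"
    if remaining <= recommended_window:  # within 72 hours
        return "renewal_recommended"
    return "healthy"
```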

Trust state (trust_state)

The raw enrollment and rotation state:

| Value | Meaning |
| --- | --- |
| bootstrap_pending | Sensor has enrolled but the first heartbeat has not yet been received |
| trusted | Normal operating state; certificate is current |
| rotated | Certificate was recently renewed; the old certificate is still in the overlap window |
| trust_revoked | Operator revoked this sensor |

Manual renewal state (manual_renewal_state)

| Value | Meaning |
| --- | --- |
| idle | No manual renewal in progress |
| requested | Operator triggered a manual renewal; waiting for the sensor to pick it up |
| in_progress | Sensor is executing the renewal (new CSR sent, awaiting the new certificate) |
| succeeded | Manual renewal completed successfully |
| failed | Manual renewal failed; check manual_renewal_last_error for details |

eBPF engine state

The runtime_state field reflects the health of the embedded eBPF engine subprocess:

| Value | Health impact | Meaning |
| --- | --- | --- |
| live | None | Normal operation; tetragon_live_adapter mode |
| compatibility | None | Legacy adapter mode (tetragon_adapter); collection still active |
| simulated | None | Dev fixture mode; not used in production |
| runtime_unreachable | Critical | Engine process has stopped responding |
| runtime_error | Critical | Engine startup or initialization failed |
| runtime_wait_failed | Critical | Engine socket never appeared (startup timeout) |
| runtime_exited | Critical | Engine process exited unexpectedly |
| pack_preparation_failed | Degraded | Policy pack could not be prepared for this sensor |
| pack_activation_failed | Degraded | Policy pack was prepared but could not be applied to the engine |
| event_delivery_failed | Degraded | Events could not be forwarded to the Console |

Only live mode supports enforcement. Enforcement cannot be enabled on a sensor whose runtime_mode is not live.


BPF event loss

The Console tracks two BPF loss metrics from each heartbeat:

  • last_bpf_loss_per_mille: BPF ring buffer lost events per 1,000 events in the most recent session
  • last_bpf_lost_session: raw count of lost events from the last engine session

Any non-zero value triggers a bpf_events_lost Watch reason. A value above 50 per mille (5%) escalates to Degraded. BPF event loss means the sensor is silently missing kernel events; investigate and resolve this before relying on the sensor for security decisions.
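The escalation rule can be sketched directly from these two sentences. A minimal illustration, assuming the per-mille figure is lost events per 1,000 observed events (the function name is illustrative):

```python
def bpf_loss_reason(lost: int, total: int):
    """Map ring-buffer loss to the bpf_events_lost reason level:
    any loss -> Watch, above 50 per mille -> Degraded, none -> no reason."""
    if total == 0 or lost == 0:
        return None
    per_mille = lost * 1000 / total
    level = "degraded" if per_mille > 50 else "watch"
    return ("bpf_events_lost", level)

bpf_loss_reason(60, 1000)  # -> ("bpf_events_lost", "degraded")
```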


Resource pressure signals

The Console reads resource metrics from each heartbeat and applies the following thresholds:

| Metric | Threshold | Health reason |
| --- | --- | --- |
| CPU usage | > 85% | cpu_high (Watch) |
| Memory usage | > 90% of total | memory_high (Watch) |
| 1-minute load average | > 2 × CPU core count | load_high (Watch) |

These are Watch-level signals only. They do not escalate to Degraded or Critical on their own.
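The three thresholds are independent, so a sensor can accumulate more than one resource reason at once. A sketch of the checks above (parameter names are illustrative, not heartbeat field names):

```python
def resource_reasons(cpu_pct: float, mem_pct: float,
                     load_1m: float, cores: int) -> list:
    """Apply the Watch-level resource thresholds from the table above."""
    reasons = []
    if cpu_pct > 85:            # CPU usage above 85%
        reasons.append("cpu_high")
    if mem_pct > 90:            # memory above 90% of total
        reasons.append("memory_high")
    if load_1m > 2 * cores:     # 1-minute load above 2x core count
        reasons.append("load_high")
    return reasons

resource_reasons(92.0, 95.5, 1.2, 4)  # -> ["cpu_high", "memory_high"]
```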


Kernel guard

The kernel guard runs checks every 30 seconds (after an initial 5-minute delay to let system services finish loading) and reports results in the heartbeat:

  • kernel_guard_ok: false - Critical (hard failure)
  • kernel_guard_findings not empty but kernel_guard_ok not false - Watch (soft warning)

Kernel guard findings include: kprobe baseline count changes, BPF filesystem mount state, unprivileged BPF enabled, and kernel module count changes since baseline.
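The two-tier mapping ("false is a hard failure, findings without a hard failure are a warning") can be sketched as follows; the function name is illustrative:

```python
def kernel_guard_reason(guard_ok, findings: list):
    """kernel_guard_ok: false -> Critical hard failure; findings present
    without a hard failure -> Watch warning; otherwise no reason."""
    if guard_ok is False:
        return ("kernel_guard_failed", "critical")
    if findings:
        return ("kernel_guard_warning", "watch")
    return None
```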


The needs_attention flag

Each sensor has a needs_attention boolean that the Console uses to surface sensors requiring operator review. It is set to true when any of the following conditions holds:

  • An active trust alert is present
  • Trust health is not healthy
  • Trust state is trust_revoked
  • Enforcement state is enforce_ready or enforced (enforcement active sensors warrant monitoring)
  • Runtime state is one of: runtime_unreachable, runtime_error, runtime_wait_failed, runtime_exited, pack_preparation_failed, pack_activation_failed, event_delivery_failed
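The flag is a simple OR over those conditions. A sketch under the assumption that each condition arrives as a field on the sensor record (the parameter names and the "observe" enforcement value here are illustrative, not Console field values):

```python
# Runtime states listed above that force needs_attention
FAILED_RUNTIME_STATES = {
    "runtime_unreachable", "runtime_error", "runtime_wait_failed",
    "runtime_exited", "pack_preparation_failed", "pack_activation_failed",
    "event_delivery_failed",
}

def needs_attention(active_trust_alert: bool, trust_health: str,
                    trust_state: str, enforcement_state: str,
                    runtime_state: str) -> bool:
    """OR together the five conditions listed above."""
    return (active_trust_alert
            or trust_health != "healthy"
            or trust_state == "trust_revoked"
            or enforcement_state in {"enforce_ready", "enforced"}
            or runtime_state in FAILED_RUNTIME_STATES)
```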

Backpressure

The Console can signal backpressure to sensors in the heartbeat response. When PostgreSQL write latency is elevated, the Console sets backpressure_secs to a non-zero value in the heartbeat response. The sensor waits that many additional seconds before its next heartbeat. This is a load-shedding mechanism and does not indicate a health problem with the sensor itself.
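From the sensor's side, the effect is simply an extended wait before the next heartbeat. A minimal sketch, assuming a 15-second base interval and a backpressure_secs value taken from the last heartbeat response (the function name is illustrative):

```python
def next_heartbeat_delay(base_interval: int = 15,
                         backpressure_secs: int = 0) -> int:
    """Base heartbeat interval plus any backpressure the Console
    requested in the last heartbeat response."""
    return base_interval + max(0, backpressure_secs)

next_heartbeat_delay(15, 30)  # sensor waits 45 seconds
```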


Troubleshooting

Sensor shows stale

```bash
# On the sensor host
systemctl status telovix-sensor
journalctl -u telovix-sensor -n 50 --no-pager | grep -E "heartbeat|error|connect|TLS"

# Test Console reachability from the sensor host
curl -k https://<console-host>:15483/healthz  # port 15483 is the Telovix self-hosted default

# Check the system clock (mTLS is sensitive to clock skew)
timedatectl status
```

Sensor shows critical (runtime_unreachable)

```bash
# The engine subprocess has likely crashed
journalctl -u telovix-sensor -n 100 --no-pager | grep -E "engine|telovix|error|panic|BPF"

# Check the BPF filesystem
ls /sys/fs/bpf/
grep bpf /proc/mounts
```

Sensor shows degraded (bpf_events_lost)

BPF ring buffer overflow means the kernel is generating more events than the sensor can consume. This can indicate:

  • A process creating an abnormally high rate of syscalls (fork bomb, busy loop)
  • The sensor host is under very high load
  • A large SBOM scan running concurrently

Review the sensor's resource metrics and event pipeline counters in Sensors > [sensor] > Resources.

Trust health shows renewal_due

The sensor renews its certificate automatically at the next heartbeat when the renewal window opens. If automatic renewal is failing, check the manual_renewal_state and manual_renewal_last_error fields from the sensor trust detail view. You can also trigger a manual renewal from Sensors > [sensor] > Renew Certificate.



Released under the Telovix Commercial License.