Kubernetes pod liveness probe killing healthy containers
A container that is processing requests, not OOMKilled, and not crashed can still be restarted repeatedly by the kubelet because a liveness probe failed. The application is alive, but the probe says it is not. This usually shows up as a pod stuck in CrashLoopBackOff with Liveness probe failed events, even though application logs show no fatal error. The restarts waste resources, break active connections, and can trigger cascading load on the cluster as other pods absorb the shifted traffic.
After reading this guide, you will be able to distinguish a genuinely unhealthy container from a falsely failing liveness probe, identify whether the root cause is probe configuration, resource pressure, or kubelet execution lag, and fix it without guessing.
What this means
A liveness probe is meant to detect containers that are deadlocked or otherwise unable to recover. The kubelet executes the probe at a configured interval. If the probe fails enough consecutive times, the kubelet kills and restarts the container. When the container is actually healthy but the probe fails anyway, the restart is a false positive.
The failure mechanism is straightforward but unforgiving. The kubelet runs each probe in its own goroutine. If the probe exceeds its timeout, returns a non-success status, or cannot execute at all, that attempt counts as a failure. After failureThreshold consecutive failures, the container is restarted. The probe gives no partial credit. A container under GC pressure, a kubelet that is CPU-starved, or an endpoint that is simply slow to respond can all trigger a restart.
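The threshold arithmetic above can be sketched as a rough estimate of how long a container may be unresponsive before it is restarted. The function name is hypothetical. The formula assumes probe attempts start every periodSeconds and that each failing attempt runs to its full timeout. Actual timing depends on kubelet scheduling.

```shell
# Hypothetical helper: estimate seconds from the first failing probe attempt
# until the kubelet reaches failureThreshold, assuming every attempt times out.
# Usage: probe_tolerance_seconds <periodSeconds> <failureThreshold> <timeoutSeconds>
probe_tolerance_seconds() {
  period=$1
  threshold=$2
  timeout=$3
  # (threshold - 1) full periods separate the first and last failing attempt,
  # and the last attempt takes up to <timeout> seconds to be declared failed.
  echo $(( (threshold - 1) * period + timeout ))
}

# Kubernetes defaults: periodSeconds=10, failureThreshold=3, timeoutSeconds=1
probe_tolerance_seconds 10 3 1   # prints 21
```

With the defaults, a stall of a little over 20 seconds, such as a long GC pause combined with CPU throttling, can be enough to trigger a restart.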
Startup probes, if configured, gate liveness probes. Until the startup probe succeeds, the kubelet does not evaluate the liveness probe at all. If your application has a slow startup phase and you rely only on initialDelaySeconds, you are exposed to this failure mode.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Aggressive probe config | Container restarts seconds after starting; CrashLoopBackOff appears quickly | Probe timeoutSeconds, periodSeconds, and failureThreshold |
| Slow application startup | Restarts happen only during rollouts or cold starts; pod never stays Running long enough | Whether a startupProbe is defined |
| GC pause or CPU throttling | Probe failures correlate with application GC logs or CPU spikes | Node CPU pressure and container CFS throttling metrics |
| Wrong probe target | Immediate, permanent failures from pod creation | Whether the probe port and path match the running application |
| Kubelet probe execution lag | Intermittent failures when the node is heavily loaded | Kubelet CPU usage and sync loop duration |
| Memory pressure masking | Container responds to probes slowly before eventually being OOMKilled | Container memory usage versus its limit |
Quick checks
# Check pod events for liveness probe failures
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'
# Check container restart count and last termination reason
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount} {.status.containerStatuses[0].lastState.terminated.reason}'
# Inspect probe configuration directly
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}'
# Compare liveness and readiness configs
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].readinessProbe}'
# Check if a startupProbe is configured
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].startupProbe}'
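# Check node conditions (MemoryPressure, DiskPressure, PIDPressure, Ready); there is no CPU pressure condition
kubectl describe node <node-name> | grep -A 8 "Conditions:"
# Check node CPU usage separately (requires metrics-server)
kubectl top node <node-name>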
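# Check container CPU throttling on the node
# (path shown is for cgroup v1 with the cgroupfs driver and a Burstable pod; it differs under cgroup v2 or the systemd driver)
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.stat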
# Check kubelet logs for probe results on the node
journalctl -u kubelet --since "10 minutes ago" | grep -i "probe.*failed\|liveness"
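The raw counters in cpu.stat are easier to judge as a ratio. The following is a minimal sketch that turns nr_throttled and nr_periods into a percentage. The function name is hypothetical, and it assumes the file contains those two keys, which is the case for the CFS statistics on both cgroup v1 and v2.

```shell
# Hypothetical helper: print the percentage of CFS periods in which the
# container was throttled, given the path to a cgroup cpu.stat file.
throttled_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }
  ' "$1"
}
```

Run it on the node with the cpu.stat path from the check above. The counters are cumulative since the cgroup was created, so compare two readings taken a few minutes apart to see whether throttling is happening now.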
How to diagnose it
Follow this flow to confirm whether the container is truly unhealthy or the probe is lying.
1. Confirm the container is actually healthy. Check application logs for panics or fatal errors. If the application is serving traffic, processing messages, or completing work, it is likely healthy and the probe is a false positive.
2. Check pod events for the exact probe failure reason. Look for Liveness probe failed events. The message usually states whether it was a timeout, a connection refused, or an HTTP error code. This determines which branch to follow.
3. Inspect the probe configuration. Look at timeoutSeconds, periodSeconds, failureThreshold, and initialDelaySeconds. If timeoutSeconds is 1 and your application has tail latency above 1 second under load, that is the problem.
4. Determine if the failure is during startup. If restarts only happen during pod creation or deployment rollouts, the application startup time exceeds the probe window. The fix is a startupProbe, not a larger initialDelaySeconds.
5. Check for node and container resource pressure. Look at node CPU and memory conditions. Check if the container is being CPU-throttled or is near its memory limit. A throttled container cannot respond to probes quickly. A container nearing its memory limit may experience slow allocations or GC pressure.
6. Check kubelet health on the node. If the kubelet is under CPU pressure or its sync loop is delayed, probe execution itself can lag. Check kubelet_sync_loop_duration_seconds and kubelet CPU usage on the node.
7. Correlate restart times with application behavior. If restarts align with GC pauses, batch job spikes, or traffic surges, the probe is too sensitive for the application’s normal operating envelope.
8. Verify the probe endpoint manually. kubectl exec into the pod and curl the probe endpoint locally. If it responds correctly inside the container but fails from the kubelet, suspect networking or port binding issues.
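The branching on the failure message described in the steps above can be sketched as a small classifier. The function name and the matched substrings are assumptions based on commonly seen event messages. The exact wording can vary between Kubernetes versions and probe types, so treat the patterns as a starting point.

```shell
# Hypothetical helper: map a probe-failure event message to a diagnosis branch.
# The substrings matched here are assumed typical wordings, not a stable API.
classify_probe_failure() {
  case "$1" in
    *"context deadline exceeded"*|*"Client.Timeout"*|*"i/o timeout"*)
      echo "timeout" ;;             # suspect latency, throttling, or GC pauses
    *"connection refused"*)
      echo "connection-refused" ;;  # suspect wrong port or app not listening
    *"statuscode"*)
      echo "http-error" ;;          # endpoint reachable but returned a failure status
    *)
      echo "other" ;;
  esac
}
```

For example, passing the message text of a Liveness probe failed event that ends in "connect: connection refused" prints connection-refused, which points at the probe port or bind address instead of resource pressure.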
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Container restart count | Direct indicator of probe-induced restarts | Restart count increasing steadily |
| Container waiting reason CrashLoopBackOff | The container is being restarted repeatedly | Pod enters or remains in CrashLoopBackOff |
| Kubelet probe failure rate | Shows raw liveness probe failures | prober_probe_total with result="failed" increasing |
| Container CPU throttling | Throttled containers respond slowly to probes | container_cpu_cfs_throttled_periods_total ratio to container_cpu_cfs_periods_total above 5% |
| Container memory usage | Near-limit memory causes slowdowns or OOM | Usage trending above 80% of limit |
| Node CPU utilization | High node CPU delays kubelet probe goroutines | Node CPU sustained above 80% |
| Kubelet sync loop duration | Slow sync delays all kubelet operations, including probes | kubelet_sync_loop_duration_seconds p99 above 10 seconds |
| Kubelet PLEG relist duration | PLEG lag can delay pod state propagation and probe scheduling | kubelet_pleg_relist_duration_seconds approaching the 3-minute threshold |
Fixes
If the cause is aggressive probe configuration
Increase timeoutSeconds from the default of 1 to a value that covers the probe endpoint’s p99 latency under load, typically 3 to 5 seconds. Increase periodSeconds only if you need to reduce probe overhead; lowering it detects a dead container sooner but shortens the window before a restart and increases kubelet load. Ensure failureThreshold (default 3) allows for transient slowness without immediately restarting. Do not set failureThreshold to an artificially high number to mask a real problem; if the container is actually deadlocked, you want it restarted.
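One way to apply such a change without editing the manifest by hand is a JSON Patch. The sketch below only builds the patch document. The function name is hypothetical, and it assumes the workload is a Deployment whose first container already defines a livenessProbe.

```shell
# Hypothetical helper: build a JSON Patch that sets timeoutSeconds and
# failureThreshold on the first container's existing liveness probe.
# Usage: liveness_patch <timeoutSeconds> <failureThreshold>
liveness_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":%d},' "$1"
  printf '{"op":"add","path":"/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":%d}]\n' "$2"
}

# Apply with:
#   kubectl patch deployment <deployment-name> --type=json -p "$(liveness_patch 3 3)"
```

Patching the pod template triggers a rollout, so apply it when a restart of the pods is acceptable, and carry the same values back into the source manifest so the next deploy does not revert them.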
If the cause is slow startup
Add a startupProbe that checks the same endpoint as the liveness probe. Choose failureThreshold and periodSeconds so that their product covers the worst-case startup duration. The kubelet will not run the liveness probe until the startup probe succeeds. This is the correct mechanism for slow-starting applications. A large initialDelaySeconds is a fixed delay: it wastes time when the container starts faster than expected and gives no protection when it starts slower.
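As an illustration of sizing that budget, the sketch below rounds the worst-case startup time up to a whole number of probe periods and prints a matching startupProbe fragment. The function name, the /healthz path, and port 8080 are hypothetical placeholders; substitute the endpoint your liveness probe already uses.

```shell
# Hypothetical helper: print a startupProbe fragment whose budget
# (failureThreshold * periodSeconds) covers the worst-case startup time.
# Usage: startup_probe_yaml <worst_case_startup_seconds> <periodSeconds> <path> <port>
startup_probe_yaml() {
  worst_case=$1
  period=$2
  path=$3
  port=$4
  # Round up so the budget is never shorter than the worst case.
  threshold=$(( (worst_case + period - 1) / period ))
  cat <<EOF
startupProbe:
  httpGet:
    path: ${path}
    port: ${port}
  periodSeconds: ${period}
  failureThreshold: ${threshold}
EOF
}

# Example: 95-second worst-case startup, probed every 10 seconds
startup_probe_yaml 95 10 /healthz 8080
```

For the example values this yields failureThreshold: 10, a 100-second budget. Consider adding some headroom on top of the measured worst case, since cold nodes and image pulls can stretch startup further.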
If the cause is resource pressure
Raise the container’s CPU limit if CFS throttling is occurring during probe execution. Raise the memory limit if the container is approaching it and experiencing GC pressure or OOM kills. Ensure the container has resource requests set so the scheduler does not pack it onto an already saturated node. For language runtimes with stop-the-world GC, tune the runtime’s memory settings to reduce pause duration.
If the cause is probe misconfiguration
Verify that the probe port matches the port the application is actually listening on. Verify that the HTTP path returns a status in the 200-399 range and does so quickly. Do not point a liveness probe at an endpoint that depends on downstream services, databases, or external APIs. A liveness probe should test whether the container itself is alive, not whether the entire dependency chain is healthy. Keep readiness probes separate: readiness should catch dependency failures, while liveness should catch deadlocks.
If the cause is kubelet or node pressure
If the node is CPU-saturated, the kubelet may not schedule probe goroutines promptly. If the node is under memory pressure, the kubelet may be busy evicting pods. If PLEG is unhealthy or the sync loop is slow, probe execution is delayed. Cordon the node if necessary and investigate why the kubelet cannot keep up. On dense nodes, reduce pod count or increase node resources.
Prevention
- Define a startupProbe for any application that takes more than a few seconds to become ready.
- Define liveness probes that check only internal application state, never external dependencies.
- Monitor container restart counts per deployment and alert when they increase.
- Set resource requests and limits based on observed startup and steady-state profiles, not guesswork.
- Review probe configurations in CI before deployment; enforce minimum timeoutSeconds and appropriate failureThreshold values.
- Monitor kubelet probe latency and node CPU saturation as leading indicators.
How Netdata helps
- Correlate container restart spikes with CPU throttling, memory pressure, and node saturation on the same timeline to distinguish probe failures from real crashes.
- Monitor kubelet probe failure rates alongside pod health transitions without aggregating away per-pod behavior.
- Track container CFS throttling and memory usage alongside pod phase changes to confirm resource pressure as the root cause.
- Alert on node CPU, memory, and PID pressure that delays kubelet probe execution before it kills containers.
Related guides
- How the Kubernetes control plane works: a mental model for operators: /guides/kubernetes/how-kubernetes-control-plane-works/
- Kubernetes API server slow or unresponsive: causes and fixes: /guides/kubernetes/kubernetes-api-server-slow/
- Kubernetes API server memory pressure: OOM cycle and tuning: /guides/kubernetes/kubernetes-api-server-memory-pressure/
- Kubernetes conntrack exhaustion: dropped connections under load: /guides/kubernetes/kubernetes-conntrack-exhaustion/
- Kubernetes API server etcd latency: detection and cascading failures: /guides/kubernetes/kubernetes-api-server-etcd-latency/
flowchart TD
A[Liveness probe fails] --> B{During startup?}
B -->|Yes| C[Check startupProbe config]
B -->|No| D{Timeout or error?}
D -->|Timeout| E[Check resource pressure<br/>CPU throttling, GC pauses]
D -->|Connection refused| F[Check probe port and path]
E --> G{Node or kubelet slow?}
G -->|Yes| H[Check kubelet CPU and sync loop]
G -->|No| I[Increase timeoutSeconds<br/>and failureThreshold]
F --> J[Fix probe target]
C --> K[Add or tune startupProbe]
H --> L[Relieve node pressure]
I --> M[Monitor restart count]
J --> M
K --> M
L --> M





