Kubernetes container runtime shim failures: containerd, CRI-O troubleshooting

Pods stuck in ContainerCreating, nodes flapping NotReady, and PLEG timeouts that clear only after a node reboot usually point to the container runtime shim layer, not the kubelet or network. The shim sits between the kubelet and the low-level runtime. When it hangs, crashes, or leaks, the kubelet cannot enumerate containers, start sandboxes, or reap terminated pods. Existing containers may keep running, but the node stops accepting new work.

This guide covers how to distinguish a shim failure from a CNI or kubelet issue, identify orphaned shims before they exhaust node PIDs, and recover without unnecessary reboots. It focuses on containerd and CRI-O.

What this means

The kubelet drives pod lifecycle through the CRI (Container Runtime Interface) over a local Unix socket. containerd listens on /run/containerd/containerd.sock; CRI-O listens on /run/crio/crio.sock. The runtime delegates each workload to a monitor process: containerd starts a containerd-shim (containerd-shim-runc-v2 in current releases, normally one per pod sandbox), and CRI-O starts a conmon process per container. The monitor maintains OCI runtime state and reports exit codes back to the runtime daemon.

The kubelet’s PLEG (Pod Lifecycle Event Generator) periodically asks the runtime to list all containers and sandboxes. If a shim or monitor is hung, orphaned, or slow to respond, that enumeration stalls. When no relist completes within the PLEG health threshold (3 minutes in current kubelet releases), the kubelet marks the node NotReady. The runtime daemon may still be running, but it is blocked on monitor I/O or state.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Hung or orphaned shim/monitor | PLEG relist duration climbs; node flaps NotReady; containerd-shim or conmon count exceeds running containers | ps monitor count vs crictl ps -a count |
| Runtime socket unresponsive | crictl hangs or returns connection errors; kubelet logs show CRI timeout | crictl info and socket file presence |
| Runtime daemon crash or deadlock | Node Ready goes False; no new pods start; existing pods may still run | systemctl status containerd or crio |
| PID exhaustion from shim accumulation | fork/exec ... resource temporarily unavailable; node cannot spawn processes | Running PID count vs /proc/sys/kernel/pid_max |
| Cgroup driver mismatch | Containers start but limits are ignored; unexpected OOM kills | kubelet and runtime cgroup driver configs |

Quick checks

These commands are read-only unless noted.

# Check CRI socket exists and is accessible
ls -la /run/containerd/containerd.sock /run/crio/crio.sock 2>/dev/null

# Test runtime responsiveness (containerd)
time crictl --runtime-endpoint unix:///run/containerd/containerd.sock info

# Test runtime responsiveness (CRI-O)
time crictl --runtime-endpoint unix:///run/crio/crio.sock info

# Count containerd shim processes
ps aux | grep -c '[c]ontainerd-shim'

# Count CRI-O monitor processes
ps aux | grep -c '[c]onmon'

# List all containers via CRI, independent of kubelet
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a

# Check runtime service health
systemctl status containerd
systemctl status crio

# Check kubelet logs for PLEG timeout or relist errors
journalctl -u kubelet --since "5 minutes ago" | grep -iE 'pleg|relist'

# Query PLEG relist duration from metrics (requires kubelet or API server access)
kubectl get --raw "/api/v1/nodes/<node_name>/proxy/metrics" | grep kubelet_pleg_relist_duration_seconds

# Compare task count to the system limit (pid_max bounds thread IDs as well as process IDs)
echo "tasks: $(ps -eL -o tid= | wc -l) / $(cat /proc/sys/kernel/pid_max)"

# Check runtime daemon logs for errors
journalctl -u containerd --since "5 minutes ago" | grep -iE 'error|fail|shim'
journalctl -u crio --since "5 minutes ago" | grep -iE 'error|fail|conmon'
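
A hung runtime can make any of the checks above block, which is itself a diagnostic signal. The sketch below wraps a command in coreutils timeout and classifies the result as ok, hung, or error. The helper name classify_probe and the 10-second limit are assumptions chosen for illustration, not part of crictl.

```shell
# classify_probe LIMIT_SECONDS CMD [ARGS...]
# Runs CMD under coreutils `timeout` and prints one of: ok, hung, error.
# classify_probe is a hypothetical helper, not a crictl feature.
classify_probe() {
  local limit="$1" rc
  shift
  timeout "$limit" "$@" >/dev/null 2>&1
  rc=$?
  if [ "$rc" -eq 0 ]; then
    echo ok
  elif [ "$rc" -eq 124 ]; then
    # coreutils timeout exits with 124 when the time limit is reached
    echo hung
  else
    echo error
  fi
}

# Example (assumes crictl is installed and the containerd socket path used above):
# classify_probe 10 crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
```

A result of hung suggests the daemon accepted the connection but never answered; error more often means the socket is missing or refusing connections.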

How to diagnose it

  1. Isolate the scope. If only one node is affected, suspect a local runtime or monitor failure. If many nodes fail simultaneously, look for a control plane, network, or cluster-wide DaemonSet issue.
  2. Verify the CRI socket. Run crictl info against the runtime endpoint. If the command hangs or returns a connection error, the runtime is not accepting CRI requests. This is a hard failure. Check that the socket file exists.
  3. Compare shim/monitor count to container count. For containerd, count containerd-shim processes with ps. For CRI-O, count conmon processes. Compare to the number of containers reported by crictl ps -a (for containerd's v2 shim, which normally runs one process per pod, crictl pods is the closer baseline). A monitor count well above the baseline indicates orphaned monitors holding PID and memory resources.
  4. Check PLEG metrics. Query kubelet_pleg_relist_duration_seconds from the node metrics endpoint or via the API server proxy. If the p99 climbs above 10 seconds, the runtime is slow to enumerate containers. This is the leading indicator before a NotReady transition.
  5. Inspect runtime logs. For containerd, read journalctl -u containerd. For CRI-O, read journalctl -u crio. Look for OOM kills, segfaults, storage driver errors, or repeated shim or conmon start failures.
  6. Check node PID and fd saturation. If the node is near pid_max or the runtime has too many open file descriptors, new shims cannot be spawned. Look for resource temporarily unavailable in runtime or kubelet logs.
  7. Check cgroup driver alignment. The kubelet and the runtime must both use systemd or both use cgroupfs. A mismatch causes containers to start while resource limits are silently ignored, which can lead to unexpected OOM kills and runtime stress.
  8. Correlate with kubelet CRI metrics. Look at kubelet_runtime_operations_duration_seconds and kubelet_runtime_operations_errors_total. High latency or errors on list_containers or list_podsandbox confirm the runtime is the bottleneck, not the kubelet sync loop.

Failure chain:

Shim process hangs or orphans
  -> Runtime slows on container enumeration
  -> PLEG relist duration exceeds deadline
  -> Kubelet reports node NotReady
       -> Scheduler stops sending pods
       -> Kubelet cannot start or terminate containers
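
The PLEG check in step 4 can be approximated without a metrics backend by reading the histogram buckets directly. This is a minimal sketch: it prints how many relists took longer than 10 seconds, computed as the +Inf bucket minus the le="10" bucket. It assumes the kubelet exposes a le="10" bucket for this histogram, and the function name slow_relists is hypothetical. The counters are cumulative since kubelet start, so a p99 still needs a proper query over a time window.

```shell
# slow_relists: reads Prometheus text-format metrics on stdin and prints the
# number of PLEG relists slower than 10s, or "unknown" if buckets are missing.
# Assumption: the histogram has a le="10" bucket.
slow_relists() {
  awk '
    index($1, "kubelet_pleg_relist_duration_seconds_bucket{") == 1 {
      if ($1 ~ /le="10"/)    within = $2
      if ($1 ~ /le="\+Inf"/) total = $2
    }
    END {
      if (total == "" || within == "") { print "unknown"; exit }
      printf "%d\n", total - within
    }
  '
}

# Example, using the API server proxy path from the quick checks:
# kubectl get --raw "/api/v1/nodes/<node_name>/proxy/metrics" | slow_relists
```

Sampling twice a few minutes apart and comparing the two numbers shows whether slow relists are happening now or only happened earlier in the kubelet's lifetime.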

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| PLEG relist duration | Measures how fast the runtime can list containers; drives node readiness | p99 > 10s sustained |
| CRI operation latency | Kubelet view of runtime responsiveness | p99 > 5s for list_containers or list_podsandbox |
| CRI operation errors | Direct indicator of failed CRI calls | Any sustained error rate > 0 |
| Node Ready condition | Aggregates PLEG and runtime health | Transition to False or Unknown for > 1 minute |
| Shim/monitor process count | Orphaned shims leak PIDs and memory | Count > 1.5x running container count |
| PID pressure | Prevents new shim and container creation | PIDPressure=True or usage > 90% of pid_max |
| Kubelet CPU/memory | Resource-starved kubelet cannot drive CRI | CPU throttling or RSS approaching limit |
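
Two of the warning signs above, the 1.5x shim-to-container ratio and the 90% PID usage mark, reduce to integer arithmetic that a node-local script can evaluate. The sketch below shows one way to do that; the function names are hypothetical and the thresholds are simply the ones listed in the table.

```shell
# shim_ratio_exceeded SHIMS CONTAINERS
# Succeeds (exit 0) when SHIMS > 1.5 * CONTAINERS, using integer math only.
shim_ratio_exceeded() {
  local shims="$1" containers="$2"
  [ $((shims * 2)) -gt $((containers * 3)) ]
}

# pid_usage_pct USED MAX
# Prints USED as an integer percentage of MAX (rounded down).
pid_usage_pct() {
  local used="$1" max="$2"
  echo $((used * 100 / max))
}

# Example wiring with the quick checks (assumes crictl and procps are present):
# shims=$(ps aux | grep -c '[c]ontainerd-shim')
# ctrs=$(crictl ps -q | wc -l)
# shim_ratio_exceeded "$shims" "$ctrs" && echo "possible orphaned shims"
# [ "$(pid_usage_pct "$(ps -eL -o tid= | wc -l)" "$(cat /proc/sys/kernel/pid_max)")" -gt 90 ] && echo "PID headroom low"
```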

Fixes

If the cause is a hung or orphaned shim/monitor

WARNING: Killing shims or monitors can leave containers in an unknown state. Target only confirmed orphans. Prefer targeted cleanup over a runtime restart.

Identify the specific process. For containerd, find containerd-shim processes with no corresponding container in crictl ps -a. For CRI-O, find conmon processes with no corresponding container. Cordon the node, then kill the orphan process. After the process exits, the runtime may reap the container. If a pod remains stuck in Terminating, force-delete the pod object from the API server:

kubectl delete pod <pod_name> -n <namespace> --grace-period=0 --force
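
Identifying the specific orphan comes down to a set difference: IDs that have a monitor process minus IDs the runtime still knows about. The sketch below implements only that difference over two files of IDs. How the IDs are collected is an assumption shown in comments: it relies on the containerd shim command line carrying -id <sandbox id> and on conmon carrying -c <container id>, which should be confirmed with ps on the node before acting on the result. The name orphan_ids is hypothetical.

```shell
# orphan_ids MONITOR_IDS_FILE KNOWN_IDS_FILE
# Prints IDs present in MONITOR_IDS_FILE but absent from KNOWN_IDS_FILE.
orphan_ids() {
  local a b
  a=$(mktemp)
  b=$(mktemp)
  sort -u "$1" >"$a"
  sort -u "$2" >"$b"
  # comm -23 keeps lines unique to the first (sorted) file
  comm -23 "$a" "$b"
  rm -f "$a" "$b"
}

# Collecting the inputs for containerd (assumed command-line layout, verify first):
# ps -eo args | grep '[c]ontainerd-shim' | sed -n 's/.* -id \([0-9a-f]*\).*/\1/p' >monitors.txt
# crictl pods -q --no-trunc >known.txt
# orphan_ids monitors.txt known.txt
```

Treat the output as candidates only. A sandbox that is mid-creation or mid-teardown can appear briefly, so re-run the comparison before killing anything.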

If the cause is runtime daemon failure

Restarting the runtime daemon is disruptive. Cordon the node first to prevent new pod scheduling. Only restart after read-only checks confirm the daemon is not responding to CRI.

# Disruptive: restart the runtime daemon
systemctl restart containerd
# or
systemctl restart crio

Existing containers managed by independent shim or monitor processes may survive the restart, but new pods will be blocked until the runtime recovers. Verify recovery with crictl info before uncordoning.
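
Verifying recovery before uncordoning can be scripted as a bounded retry loop. This is a sketch under the assumption that a successful crictl info is an adequate readiness signal for your environment; wait_until is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
# wait_until ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Retries CMD until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while :; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example: uncordon only after the runtime answers CRI again.
# wait_until 30 2 crictl --runtime-endpoint unix:///run/containerd/containerd.sock info \
#   && kubectl uncordon <node_name>
```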

If the cause is PID exhaustion

Increase pid_max for immediate relief. This change is not persistent.

# Requires root; raises the system-wide limit to the 64-bit maximum (2^22)
echo 4194304 > /proc/sys/kernel/pid_max

Persist the change in /etc/sysctl.conf or a drop-in under /etc/sysctl.d/. Then identify and clean up orphaned shim or monitor processes. Set kubelet --pod-max-pids (podPidsLimit in the kubelet configuration file) to cap the number of processes any single pod can create. Review workloads for fork bombs or runaway thread pools.
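
Persisting the limit as a drop-in might look like the following sketch. The file name 90-pid-max.conf and the helper name are assumptions; any name under /etc/sysctl.d/ that sorts where you want it works the same way.

```shell
# write_pid_max_dropin VALUE [PATH]
# Writes a sysctl drop-in that sets kernel.pid_max to VALUE.
# The default PATH is a hypothetical file name, not a distribution standard.
write_pid_max_dropin() {
  local value="$1" path="${2:-/etc/sysctl.d/90-pid-max.conf}"
  printf 'kernel.pid_max = %s\n' "$value" >"$path"
}

# Example (as root): write the drop-in, then reload all sysctl configuration.
# write_pid_max_dropin 4194304 && sysctl --system
```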

If the cause is cgroup driver mismatch

Align the kubelet and runtime configurations so both specify the same cgroup driver. Restart the runtime and kubelet after changing the driver. A mismatched driver causes containers to run without effective resource limits, which amplifies memory and CPU pressure and can cascade into runtime instability.
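
Checking alignment means reading the driver from each side's configuration. The sketch below parses the cgroupDriver key from a kubelet config file, the SystemdCgroup option from a containerd config, and cgroup_manager from a CRI-O config. The function names are hypothetical, the file locations in the example are typical defaults rather than guarantees, and a missing key is not handled beyond the comments: each component then falls back to its own default, which varies by version.

```shell
# kubelet_driver FILE: prints the cgroupDriver value, or nothing if the key is absent.
kubelet_driver() {
  sed -n 's/^[[:space:]]*cgroupDriver:[[:space:]]*"\{0,1\}\([A-Za-z]*\)"\{0,1\}.*$/\1/p' "$1" | head -n 1
}

# containerd_driver FILE: prints systemd when SystemdCgroup = true is set, else cgroupfs.
containerd_driver() {
  if grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"; then
    echo systemd
  else
    echo cgroupfs
  fi
}

# crio_driver FILE: prints the cgroup_manager value, or nothing if the key is absent.
crio_driver() {
  sed -n 's/^[[:space:]]*cgroup_manager[[:space:]]*=[[:space:]]*"\([a-z]*\)".*$/\1/p' "$1" | head -n 1
}

# Example (paths are common defaults and may differ on your nodes):
# k=$(kubelet_driver /var/lib/kubelet/config.yaml)
# r=$(containerd_driver /etc/containerd/config.toml)
# [ "$k" = "$r" ] && echo "aligned: $k" || echo "MISMATCH: kubelet=$k runtime=$r"
```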

Prevention

  • Monitor PLEG relist duration and CRI operation latency per node pool. Alert on sustained deviation from baseline before the node transitions to NotReady.
  • Monitor the ratio of shim/monitor processes to running containers. Automated alerts on orphans prevent PID exhaustion.
  • Keep kubelet, container runtime, and kernel versions within the supported skew window for your Kubernetes version.
  • Cordon the node before any runtime restart. Do not treat a runtime restart as a harmless first response.
  • Enforce per-pod PID limits and maintain node-level PID headroom to absorb shim leaks.

How Netdata helps

  • Correlate kubelet_pleg_relist_duration_seconds spikes with node CPU, memory, and disk I/O to distinguish runtime slowness from resource pressure.
  • Track kubelet_runtime_operations_errors_total to surface CRI failures without manual log diving.
  • Monitor PID usage and PIDPressure conditions alongside process counts to catch leaks before they exhaust the node.
  • Overlay container runtime daemon CPU and memory with kubelet metrics to pinpoint whether the runtime or the shim is the bottleneck.