Kubernetes CSI driver failures: detection, recovery, and version skew

When a workload pod hangs in ContainerCreating with FailedMount or FailedAttach events, the root cause is often a CSI driver pod that has crashed, a node plugin that is missing on the target node, or a version mismatch between the driver and its sidecars. Unlike application pods, CSI drivers sit on the critical path for every volume operation. Their failure is not isolated; it blocks pod startup, volume provisioning, and recovery on other nodes.

This article treats CSI drivers as what they are at runtime: specialized pods that register with the kubelet and the API server. You will learn how to confirm driver registration, distinguish controller failures from node-level mount hangs, safely clear stuck VolumeAttachments, and prevent version skew from causing silent RPC failures.

What this means

A CSI driver typically runs as two pod-based components. The controller is a Deployment or StatefulSet that handles volume provisioning, attachment, and expansion by calling out to storage APIs and updating PersistentVolume and VolumeAttachment objects. The node plugin is a DaemonSet that runs on every worker node and implements mount, unmount, and node-stage operations via the kubelet CSI gRPC interface.

When the controller fails, new PVCs stay Pending and volumes cannot be attached to nodes. When a node plugin fails, the kubelet on that node cannot complete mount operations. Because the kubelet delegates volume setup to the driver, a hung or crashed node plugin can stall the volume manager and block subsequent pods from starting, even if the node itself is otherwise healthy.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| CSI node plugin CrashLoopBackOff or OOM | Pods on one node stuck in ContainerCreating; FailedMount events | kubectl get pods -n <csi-ns> -l app=<node-plugin> --field-selector spec.nodeName=<node> |
| CSI controller unavailable | New PVCs stay Pending; no new volumes provisioned | Controller pod status and logs in the driver namespace |
| Stuck VolumeAttachment | Volume cannot re-attach after pod migration or node failure | kubectl get volumeattachment for objects older than the normal attach or detach timeout |
| Driver registration failure | Event: driver name not found in registered CSI drivers | kubectl get csidrivers and compare the exact driver name to the StorageClass provisioner |
| Version skew between driver and sidecars | RPC errors or silent feature drops in driver logs | Container images for the driver and its sidecars within the same pod |
| Kubelet volume manager blocked | Multiple volume-backed pods hang on one node; CSI driver pod appears healthy | Kubelet logs for MountVolume.SetUp or operationExecutor timeouts |

Quick checks

# Confirm the driver is registered in the cluster
kubectl get csidrivers

# List controller pods that handle provisioning and attachment
kubectl get pods -n <csi-namespace> -l app=<csi-controller>

# List node plugin pods that handle mounts
kubectl get pods -n <csi-namespace> -l app=<csi-node-plugin>

# Find workload pods reporting mount failures
kubectl get events --all-namespaces --field-selector reason=FailedMount

# Check for VolumeAttachments that may be stuck
kubectl get volumeattachment -o custom-columns='NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached,CREATED:.metadata.creationTimestamp'

# Inspect controller logs for RPC or leader-election errors
kubectl logs -n <csi-namespace> deployment/<csi-controller> --all-containers=true

# Inspect previous container logs after a crash
kubectl logs -n <csi-namespace> <csi-node-pod> --previous

# Check per-container resource usage of the node plugin for OOM indicators (requires metrics-server)
kubectl top pod -n <csi-namespace> <csi-node-pod> --containers

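# Check kubelet storage operation latency on the affected node.
# The kubelet port normally rejects unauthenticated requests, so go through the API server proxy.
kubectl get --raw "/api/v1/nodes/<node>/proxy/metrics" | grep storage_operation_duration_seconds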

How to diagnose it

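  1. Confirm driver registration. Run kubectl get csidrivers. This lists cluster-wide CSIDriver objects, so a missing driver usually means the driver's manifests were not fully applied. Per-node registration is recorded separately in each node's CSINode object (kubectl get csinode <node>); a driver absent there means the node plugin on that node has not registered with the kubelet. Compare the driver name exactly against the StorageClass provisioner field. A mismatch here is a frequent source of the driver name not found event.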

  2. Check controller and node plugin pod health. Controller pods must be Running and ready. Node plugin pods must be Running on every schedulable node. If a node plugin is missing on a specific node, check DaemonSet scheduling constraints, node selectors, and taints that might prevent scheduling.

  3. Check workload events. Run kubectl describe pod <stuck-pod> and look for FailedAttach or FailedMount. FailedAttach usually points to the controller or the attach path. FailedMount usually points to the node plugin or the kubelet volume manager.

  4. Check VolumeAttachment state. If a volume was attached to a failed node and never detached, the VolumeAttachment object can block re-attachment elsewhere. Look for objects with an age far exceeding the normal attach or detach cycle for your driver.

  5. Read driver logs with --previous. If the pod is in CrashLoopBackOff, the previous container’s exit reason and last log lines reveal configuration errors, permission failures, or RPC version mismatches.

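  6. Check node-level signals. If the node plugin is running but mounts still hang, check kubelet logs for MountVolume.SetUp timeouts and storage_operation_errors_total. Newer Kubernetes releases drop that counter and report failures through the status label on storage_operation_duration_seconds instead, so confirm which one your kubelet exposes. A single hung mount can stall the kubelet volume manager and block subsequent pods on that node.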

  7. Check for version skew. Compare the driver image tag with sidecar container tags in the same pod. Mismatched gRPC interfaces between sidecars and the driver produce cryptic RPC errors.
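
The name comparison in step 1 can be scripted. The following is a minimal sketch, not a definitive check: the function names are hypothetical, and it assumes every StorageClass in the cluster is meant to be backed by a CSI driver that ships a CSIDriver object. In-tree or external non-CSI provisioners will also show up in the output and can be ignored.

```shell
# Hypothetical helper: print provisioner names that have no CSIDriver object.
# $1 = newline-separated StorageClass provisioners
# $2 = newline-separated CSIDriver names
unregistered_provisioners() {
  provisioners=$1
  drivers=$2
  printf '%s\n' "$provisioners" | sort -u | while IFS= read -r p; do
    [ -n "$p" ] || continue
    # -x matches the whole line and -F disables regex, so the match is exact.
    printf '%s\n' "$drivers" | grep -qxF -- "$p" || printf '%s\n' "$p"
  done
}

# Hypothetical wrapper that feeds live cluster data into the helper.
check_csi_registration() {
  unregistered_provisioners \
    "$(kubectl get storageclass -o jsonpath='{range .items[*]}{.provisioner}{"\n"}{end}')" \
    "$(kubectl get csidrivers -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')"
}
```

Running check_csi_registration with no output suggests every provisioner has a matching CSIDriver object. Any name it prints is worth comparing character by character against the driver's documentation.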

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| CSI controller pod restart rate | Controller failure blocks all provisioning and attachment operations | More than 0 restarts in 15 minutes for a critical driver |
| CSI node plugin pods NotReady per node | Node-local mount and unmount failures strand workloads on that node | Any node plugin pod not Running for more than 2 minutes |
| FailedMount or FailedAttach event rate | Direct symptom of CSI driver or kubelet volume path failure | Sustained rate above 0 for a storage class over 5 minutes |
| VolumeAttachment object age | Old attachments indicate stuck detach or controller backlog | Any VolumeAttachment older than 10 minutes after the consuming pod is deleted |
| Kubelet storage_operation_duration_seconds | Mount latency reflects node plugin responsiveness | p99 above 30 seconds for volume mount operations |
| Kubelet storage_operation_errors_total | Quantifies node-level storage operation failures | Any sustained increase over a 5-minute window |
| Pod CPU throttling on CSI driver pods | Throttled sidecars or driver containers can time out RPC calls | container_cpu_cfs_throttled_periods_total above 25 percent of total periods |
| Node DiskPressure or MemoryPressure | Pressure can evict or OOMKill CSI driver pods, breaking storage on the node | MemoryPressure=True or DiskPressure=True on a worker |
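
The throttling threshold is a ratio of two counters rather than a raw value. As a worked sketch, assuming you have already taken the increase of container_cpu_cfs_throttled_periods_total and of container_cpu_cfs_periods_total over the same window, the percentage can be computed as follows. The function name is hypothetical.

```shell
# Percentage of CFS periods in which the container was throttled.
# $1 = increase in container_cpu_cfs_throttled_periods_total over the window
# $2 = increase in container_cpu_cfs_periods_total over the same window
throttle_percent() {
  awk -v t="$1" -v p="$2" 'BEGIN {
    # No scheduling periods in the window means there is nothing to report.
    if (p <= 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * t / p
  }'
}
```

For example, 30 throttled periods out of 100 total gives 30.0, which is above the 25 percent warning sign in the table.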

Fixes

If the cause is a crashed or OOMKilled CSI driver pod

Check the pod status for OOMKilled. If so, increase the container memory limit and let the DaemonSet or Deployment recreate the pod. If the termination reason is Error, read the --previous logs for misconfigurations such as missing secrets, invalid flags, or RBAC denials. Restart the pod only after you have addressed the root cause.
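
To see which container in a multi-container driver pod was OOMKilled, you can read the last termination reason of each container. This is a small sketch with hypothetical function names; it relies only on the containerStatuses field of the pod status.

```shell
# Read "container<TAB>reason" lines on stdin and print the containers
# whose last termination reason was OOMKilled.
oom_killed_containers() {
  awk -F '\t' '$2 == "OOMKilled" { print $1 }'
}

# Hypothetical wrapper: $1 = namespace, $2 = pod name.
# Containers that have never terminated print an empty reason.
last_termination_reasons() {
  kubectl get pod -n "$1" "$2" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}
```

A typical invocation would be last_termination_reasons <csi-namespace> <csi-node-pod> | oom_killed_containers. Raise the memory limit only on the containers it names, since sidecars and the driver container usually have very different footprints.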

If the cause is a stuck VolumeAttachment

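Identify the VolumeAttachment blocking the volume. The external-attacher sidecar normally holds a finalizer on these objects, so a delete completes only after the attacher has confirmed the detach; a delete that hangs usually means the controller is down or cannot reach the storage backend. If the workload pod has already been deleted and the volume is truly unused, delete the object to allow re-attachment: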

# WARNING: Only delete if the volume is unused. Deleting an in-use attachment risks data corruption.
kubectl delete volumeattachment <name>
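
Before deleting anything, it helps to narrow the candidates. One read-only heuristic, sketched below with hypothetical function names, is to list attachments that reference a node which no longer exists in the cluster. This is only one indicator of a stale attachment; it does not prove the volume is unused on the storage backend.

```shell
# Read "name node attached" lines on stdin and print attachments whose
# node is not in the space-separated list of live nodes passed as $1.
attachments_on_missing_nodes() {
  awk -v nodes="$1" '
    BEGIN {
      n = split(nodes, arr, " ")
      for (i = 1; i <= n; i++) live[arr[i]] = 1
    }
    NF >= 2 && !($2 in live) { print $1 " (node " $2 ")" }
  '
}

# Hypothetical wrapper that feeds live cluster data into the filter.
list_orphaned_attachments() {
  kubectl get volumeattachment --no-headers \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached' |
    attachments_on_missing_nodes "$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')"
}
```

Treat the output as a list to investigate, not a list to delete.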

If the cause is driver registration failure

Verify the driver name in the StorageClass matches the name returned by kubectl get csidrivers. Ensure the node plugin DaemonSet is scheduled on the target node and that its registration socket path is accessible to the kubelet. If the node plugin container fails with RunContainerError, check for shared mount propagation issues on the plugin registration path.
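
Whether the node plugin has registered on a specific node is visible in that node's CSINode object, which lists the drivers the kubelet knows about. A minimal sketch, with hypothetical function names:

```shell
# Succeed when driver name $1 appears in the space-separated list $2.
driver_in_list() {
  for d in $2; do
    [ "$d" = "$1" ] && return 0
  done
  return 1
}

# Hypothetical wrapper: $1 = driver name, $2 = node name.
driver_registered_on_node() {
  driver_in_list "$1" \
    "$(kubectl get csinode "$2" -o jsonpath='{.spec.drivers[*].name}')"
}
```

For example, driver_registered_on_node <driver-name> <node> || echo "not registered" flags a node where the plugin pod may be Running but its registration with the kubelet has not completed.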

If the cause is version skew

Align the CSI driver image and all sidecar images to a validated release bundle. Do not mix sidecar major versions with driver versions from different release cycles. After updating image tags, roll the pods sequentially and verify that the driver remains listed in kubectl get csidrivers and that new PVCs reach Bound.
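
During and after a rollout, a quick way to spot a partial upgrade is to look for containers that run different images across the driver's pods. The sketch below uses hypothetical function names and only detects drift between pods. It cannot tell you which driver and sidecar versions are compatible; that still comes from the driver's release notes.

```shell
# Read "container image" lines on stdin and print each container name
# that appears with more than one distinct image.
mixed_image_containers() {
  awk '
    !seen[$1 SUBSEP $2]++ { count[$1]++; images[$1] = images[$1] " " $2 }
    END { for (c in count) if (count[c] > 1) print c ":" images[c] }
  ' | sort
}

# Hypothetical wrapper: $1 = namespace, $2 = label selector.
list_csi_images() {
  kubectl get pods -n "$1" -l "$2" -o \
    jsonpath='{range .items[*].spec.containers[*]}{.name}{" "}{.image}{"\n"}{end}'
}
```

A typical invocation would be list_csi_images <csi-namespace> app=<csi-node-plugin> | mixed_image_containers. Empty output means every pod matched by the selector runs the same image for each container.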

If the cause is a kubelet volume manager hang

If multiple pods with volumes are stuck in ContainerCreating and the CSI driver pod appears healthy, the kubelet volume manager may be blocked on a hung mount. Check kubelet logs for operationExecutor timeouts. Cordon the node, drain workloads safely, and restart the kubelet to clear the volume manager state. This is disruptive and should follow your node maintenance runbook.
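
The sequence above can be written down as a reviewable plan before anyone runs it. The function below is a sketch that only prints commands and executes nothing. The function name, the 10-minute drain timeout, and the use of ssh with systemctl to reach the kubelet are assumptions; substitute whatever access method and limits your node maintenance runbook prescribes.

```shell
# Print the recovery steps for node $1 without executing them.
kubelet_recovery_plan() {
  node=$1
  cat <<EOF
kubectl cordon $node
kubectl drain $node --ignore-daemonsets --delete-emptydir-data --timeout=10m
ssh $node sudo systemctl restart kubelet
kubectl get node $node
kubectl uncordon $node
EOF
}
```

The drain timeout matters here: pods whose volumes cannot be unmounted may never terminate cleanly, so an unbounded drain can hang on the same fault you are trying to clear. Uncordon only after the node reports Ready again.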

Prevention

  • Pin versions. Deploy CSI drivers and sidecars as a validated bundle. Treat sidecar updates as infrastructure changes that require regression testing.
  • Monitor driver pod health. Alert on CSI controller and node plugin restart counts, CrashLoopBackOff states, and OOM kills in the driver namespace.
  • Watch VolumeAttachment age. Alert when any VolumeAttachment exists longer than 10 minutes after its associated pod is deleted.
  • Set resource requests and limits. CSI driver pods need stable CPU and memory to serve mount RPCs within kubelet deadlines. Under-provisioning causes timeouts that resemble driver bugs.
  • Tolerate node maintenance. Ensure the node plugin DaemonSet tolerates common taints so driver containers stay running during cordon and drain operations.
  • Test upgrades outside production. Validate that new sidecar versions do not break existing volume attachments before rolling to production clusters.
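
As one way to act on the driver health bullet above, the sketch below flags DaemonSets in the driver namespace that have fewer ready pods than desired. The function names are hypothetical. Note that the desired count excludes nodes the DaemonSet cannot schedule onto, so a node left out by a taint or node selector will not show up here and still needs the toleration review described above.

```shell
# Read "name desired ready" lines on stdin and print DaemonSets with
# fewer ready pods than desired.
daemonsets_missing_pods() {
  awk '$3 + 0 < $2 + 0 { print $1 ": " $3 "/" $2 " ready" }'
}

# Hypothetical wrapper: $1 = driver namespace.
check_node_plugin_coverage() {
  kubectl get daemonset -n "$1" --no-headers \
    -o custom-columns='NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady' |
    daemonsets_missing_pods
}
```

Running check_node_plugin_coverage <csi-namespace> from a scheduled job and alerting on any output is a lightweight stopgap where a full monitoring pipeline is not yet in place.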

How Netdata helps

  • Correlate CSI pod CPU throttling and memory usage with kubelet storage_operation_duration_seconds spikes to distinguish driver saturation from general node resource pressure.
  • Alert on pod restart loops in the CSI driver namespace before workload pods fail to start.
  • Track per-node kubelet storage operation error rates to pinpoint whether a mount failure is isolated to one node or cluster-wide.
  • Visualize workload FailedMount event bursts alongside node conditions to isolate the failure domain quickly.

Decision flow for a workload pod stuck in ContainerCreating, starting from its events:

  • FailedAttach: check the CSI controller and VolumeAttachment objects.
      • Controller unhealthy: fix the controller crash or version skew.
      • Controller healthy: force delete the stuck VolumeAttachment if safe.
  • FailedMount: check the node plugin pod on the target node.
      • Node plugin unhealthy: fix the node plugin crash or OOM.
      • Node plugin healthy: check kubelet storage operation latency and logs.