Kubernetes CSI driver failures: detection, recovery, and version skew
When a workload pod hangs in ContainerCreating with FailedMount or FailedAttach events, the root cause is often a CSI driver pod that has crashed, a node plugin that is missing on the target node, or a version mismatch between the driver and its sidecars. Unlike application pods, CSI drivers sit on the critical path for every volume operation. Their failure is not isolated; it blocks scheduling, provisioning, and recovery.
This article treats CSI drivers as what they are at runtime: specialized pods that register with the kubelet and the API server. You will learn how to confirm driver registration, distinguish controller failures from node-level mount hangs, safely clear stuck VolumeAttachments, and prevent version skew from causing silent RPC failures.
What this means
A CSI driver typically runs as two pod-based components. The controller is a Deployment or StatefulSet that handles volume provisioning, attachment, and expansion by calling out to storage APIs and updating PersistentVolume and VolumeAttachment objects. The node plugin is a DaemonSet that runs on every worker node and implements mount, unmount, and node-stage operations via the kubelet CSI gRPC interface.
When the controller fails, new PVCs stay Pending and volumes cannot be attached to nodes. When a node plugin fails, the kubelet on that node cannot complete mount operations. Because the kubelet delegates volume setup to the driver, a hung or crashed node plugin can stall the volume manager and block subsequent pods from starting, even if the node itself is otherwise healthy.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| CSI node plugin CrashLoopBackOff or OOM | Pods on one node stuck in ContainerCreating; FailedMount events | kubectl get pods -n <csi-ns> -l app=<node-plugin> --field-selector spec.nodeName=<node> |
| CSI controller unavailable | New PVCs stay Pending; no new volumes provisioned | Controller pod status and logs in the driver namespace |
| Stuck VolumeAttachment | Volume cannot re-attach after pod migration or node failure | kubectl get volumeattachment for objects older than the normal attach or detach timeout |
| Driver registration failure | Event: driver name not found in registered CSI drivers | kubectl get csidrivers and compare the exact driver name to the StorageClass provisioner |
| Version skew between driver and sidecars | RPC errors or silent feature drops in driver logs | Container images for the driver and its sidecars within the same pod |
| Kubelet volume manager blocked | Multiple volume-backed pods hang on one node; CSI driver pod appears healthy | Kubelet logs for MountVolume.SetUp or operationExecutor timeouts |
Quick checks
# Confirm the driver is registered in the cluster
kubectl get csidrivers
# List controller pods that handle provisioning and attachment
kubectl get pods -n <csi-namespace> -l app=<csi-controller>
# List node plugin pods that handle mounts
kubectl get pods -n <csi-namespace> -l app=<csi-node-plugin>
# Find workload pods reporting mount failures
kubectl get events --all-namespaces --field-selector reason=FailedMount
# Check for VolumeAttachments that may be stuck
kubectl get volumeattachment -o custom-columns='NAME:.metadata.name,ATTACHED:.status.attached,CREATED:.metadata.creationTimestamp'
# Inspect controller logs for RPC or leader-election errors
kubectl logs -n <csi-namespace> deployment/<csi-controller> --all-containers=true
# Inspect previous container logs after a crash
kubectl logs -n <csi-namespace> <csi-node-pod> --previous
# Check node plugin resource usage for OOM indicators
kubectl top pod -n <csi-namespace> <csi-node-pod>
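# Check kubelet storage operation latency on the affected node (proxied through the API server, because port 10250 normally requires authentication)
kubectl get --raw "/api/v1/nodes/<node>/proxy/metrics" | grep storage_operation_duration_seconds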
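The FailedMount query above returns one event per affected pod, which makes it hard to see whether the failures cluster on a single node. The following is a minimal sketch that counts events per node. The helper name count_by_node is hypothetical, and the sketch assumes the kubelet fills in .source.host on the events it emits.

```shell
# Count how many lines (events) name each node, busiest node first.
# stdin: one node name per line, one line per event.
count_by_node() {
  grep -v '^$' | sort | uniq -c | sort -k1,1nr -k2,2 | awk '{ print $2 "\t" $1 }'
}

# Example against a live cluster (assumes .source.host holds the node name):
# kubectl get events --all-namespaces --field-selector reason=FailedMount \
#   -o jsonpath='{range .items[*]}{.source.host}{"\n"}{end}' | count_by_node
```

One node dominating the output suggests a node plugin or kubelet problem on that node. An even spread across nodes points toward the controller or the storage backend.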
How to diagnose it
1. Confirm driver registration. Run kubectl get csidrivers. If the driver is missing, the node plugin has not registered or the CSIDriver object was not created. Compare the driver name exactly against the StorageClass provisioner field. A mismatch here is a frequent source of the driver name not found event.
2. Check controller and node plugin pod health. Controller pods must be Running and ready. Node plugin pods must be Running on every schedulable node. If a node plugin is missing on a specific node, check DaemonSet scheduling constraints, node selectors, and taints that might prevent scheduling.
3. Check workload events. Run kubectl describe pod <stuck-pod> and look for FailedAttach or FailedMount. FailedAttach usually points to the controller or the attach path. FailedMount usually points to the node plugin or the kubelet volume manager.
4. Check VolumeAttachment state. If a volume was attached to a failed node and never detached, the VolumeAttachment object can block re-attachment elsewhere. Look for objects with an age far exceeding the normal attach or detach cycle for your driver.
5. Read driver logs with --previous. If the pod is in CrashLoopBackOff, the previous container’s exit reason and last log lines reveal configuration errors, permission failures, or RPC version mismatches.
6. Check node-level signals. If the node plugin is running but mounts still hang, check kubelet logs for MountVolume.SetUp timeouts and storage_operation_errors_total. A single hung mount can stall the kubelet volume manager and block subsequent pods on that node.
7. Check for version skew. Compare the driver image tag with sidecar container tags in the same pod. Mismatched gRPC interfaces between sidecars and the driver produce cryptic RPC errors.
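The name comparison in the first step is easy to get wrong by eye, so it can help to script it. This is a sketch only: the helper name unregistered_provisioners and the /tmp file paths are hypothetical. In-tree or static provisioners such as kubernetes.io/no-provisioner will also show up in the output even though they are not CSI drivers.

```shell
# Print StorageClass provisioners that have no CSIDriver object of the same name.
# $1: file with one provisioner name per line
# $2: file with one registered CSI driver name per line
unregistered_provisioners() {
  # Load the driver names first (ARGV[1] is "$2"), then print every
  # non-empty provisioner line that is not an exact match.
  awk 'FILENAME == ARGV[1] { registered[$0] = 1; next }
       NF && !($0 in registered)' "$2" "$1" | sort -u
}

# Example against a live cluster:
# kubectl get storageclass -o jsonpath='{range .items[*]}{.provisioner}{"\n"}{end}' > /tmp/provisioners.txt
# kubectl get csidrivers -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' > /tmp/drivers.txt
# unregistered_provisioners /tmp/provisioners.txt /tmp/drivers.txt
#
# Per-node registration can be inspected separately:
# kubectl get csinode <node> -o jsonpath='{.spec.drivers[*].name}'
```

An empty result means every StorageClass names a driver that has a CSIDriver object. It does not prove the node plugin registered on a particular node, which is what the csinode query is for.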
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| CSI controller pod restart rate | Controller failure blocks all provisioning and attachment operations | More than 0 restarts in 15 minutes for a critical driver |
| CSI node plugin pods NotReady per node | Node-local mount and unmount failures strand workloads on that node | Any node plugin pod not Running for more than 2 minutes |
| FailedMount or FailedAttach event rate | Direct symptom of CSI driver or kubelet volume path failure | Sustained rate above 0 for a storage class over 5 minutes |
| VolumeAttachment object age | Old attachments indicate stuck detach or controller backlog | Any VolumeAttachment older than 10 minutes after the consuming pod is deleted |
| Kubelet storage_operation_duration_seconds | Mount latency reflects node plugin responsiveness | p99 above 30 seconds for volume mount operations |
| Kubelet storage_operation_errors_total | Quantifies node-level storage operation failures | Any sustained increase over a 5-minute window |
| Pod CPU throttling on CSI driver pods | Throttled sidecars or driver containers can time out on RPC calls | container_cpu_cfs_throttled_periods_total growing by more than 25 percent of container_cpu_cfs_periods_total over the same window |
| Node DiskPressure or MemoryPressure | Pressure can evict or OOMKill CSI driver pods, breaking storage on the node | MemoryPressure=True or DiskPressure=True on a worker |
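The throttling threshold in the table is a ratio of two counters, not a raw value. As a sketch of the arithmetic, assuming you already have the increase of each counter over the same window (the helper name throttle_pct is hypothetical):

```shell
# Percentage of CFS periods in which the container was throttled.
# $1: increase in container_cpu_cfs_throttled_periods_total over the window
# $2: increase in container_cpu_cfs_periods_total over the same window
throttle_pct() {
  awk -v throttled="$1" -v periods="$2" 'BEGIN {
    if (periods <= 0) { print "0.0"; exit }   # no periods elapsed, nothing to report
    printf "%.1f\n", 100 * throttled / periods
  }'
}

# Example: 30 throttled periods out of 120 in the window
# throttle_pct 30 120
```

The counters are cumulative, so the inputs need to be deltas between two scrapes rather than the absolute values.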
Fixes
If the cause is a crashed or OOMKilled CSI driver pod
Check the pod status for OOMKilled. If so, increase the container memory limit and let the DaemonSet or Deployment recreate the pod. If the termination reason is Error, read the --previous logs for misconfigurations such as missing secrets, invalid flags, or RBAC denials. Restart the pod only after you have addressed the root cause.
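A CSI pod usually runs several containers, so it is worth confirming which one was OOMKilled before raising a limit. The sketch below filters the last termination reason per container. The helper name oom_killed_containers is hypothetical, and the kubectl set resources line assumes the DaemonSet is not managed by Helm or an operator that would revert the change.

```shell
# Print the names of containers whose last termination reason was OOMKilled.
# stdin: tab-separated lines of "<container-name>\t<last-termination-reason>"
oom_killed_containers() {
  awk -F '\t' '$2 == "OOMKilled" { print $1 }'
}

# Example against a live cluster:
# kubectl get pod <csi-node-pod> -n <csi-namespace> \
#   -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' \
#   | oom_killed_containers
#
# Then raise the limit on that container only (placeholder values):
# kubectl set resources daemonset/<csi-node-plugin> -n <csi-namespace> \
#   -c <container> --limits=memory=<new-limit>
```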
If the cause is a stuck VolumeAttachment
Identify the VolumeAttachment blocking the volume. If the workload pod has already been deleted and the volume is truly unused, you can delete the object to allow re-attachment:
# WARNING: Only delete if the volume is unused. Deleting an in-use attachment risks data corruption.
kubectl delete volumeattachment <name>
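Before deleting anything, it helps to shortlist candidates by age and see which node and PersistentVolume each one references. The following sketch assumes GNU date (for date -d) and uses a hypothetical helper name, stale_attachments. Age alone does not prove an attachment is stuck, so confirm that no pod still uses the volume.

```shell
# Print VolumeAttachments older than a threshold.
# stdin: whitespace-separated lines of "<name> <node> <pv> <creationTimestamp>"
# $1: maximum age in seconds
# $2: current time as a Unix epoch (optional, defaults to now)
stale_attachments() {
  local max_age="$1"
  local now="${2:-$(date -u +%s)}"
  local name node pv created created_epoch age
  while read -r name node pv created; do
    [ -n "$created" ] || continue
    # GNU date parses RFC 3339 timestamps such as 2024-01-01T00:00:00Z
    created_epoch="$(date -u -d "$created" +%s)" || continue
    age=$(( now - created_epoch ))
    if [ "$age" -gt "$max_age" ]; then
      printf '%s\t%s\t%s\t%ss\n' "$name" "$node" "$pv" "$age"
    fi
  done
}

# Example against a live cluster (10-minute threshold):
# kubectl get volumeattachment --no-headers \
#   -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,CREATED:.metadata.creationTimestamp' \
#   | stale_attachments 600
#
# Inspect finalizers and any detach error on a candidate before deleting it:
# kubectl get volumeattachment <name> \
#   -o jsonpath='{.metadata.finalizers}{"\n"}{.status.detachError.message}{"\n"}'
```

If the object carries a finalizer, which the external-attacher sidecar typically adds, the delete request waits until the attacher confirms the detach. A delete that never completes is therefore another sign that the controller side is unhealthy.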
If the cause is driver registration failure
Verify the driver name in the StorageClass matches the name returned by kubectl get csidrivers. Ensure the node plugin DaemonSet is scheduled on the target node and that its registration socket path is accessible to the kubelet. If the node plugin container fails with RunContainerError, check for shared mount propagation issues on the plugin registration path.
If the cause is version skew
Align the CSI driver image and all sidecar images to a validated release bundle. Do not combine sidecar major versions with a driver release they were not validated against. After updating image tags, roll the pods sequentially and verify the driver remains listed in kubectl get csidrivers and that new PVCs reach Bound.
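To see what is actually deployed, list each container's image and pull out the tag. This is a sketch with a hypothetical helper name, image_tag, and made-up sample image references. Sidecars and drivers are versioned independently, so the tags are not expected to match each other. Compare them against the combination your driver's release notes list as validated.

```shell
# Extract the tag from an image reference.
# Handles registries with a port and references pinned by digest.
image_tag() {
  local ref="${1%%@*}"       # drop any @sha256:... digest
  local last="${ref##*/}"    # keep the final path component only
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   printf '%s\n' "(untagged)" ;;
  esac
}

# Example against a live cluster: print container name and image per pod,
# then pass individual image references to image_tag.
# kubectl get pods -n <csi-namespace> -l app=<csi-controller> \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.containers[*]}{"  "}{.name}{"\t"}{.image}{"\n"}{end}{end}'
```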
If the cause is a kubelet volume manager hang
If multiple pods with volumes are stuck in ContainerCreating and the CSI driver pod appears healthy, the kubelet volume manager may be blocked on a hung mount. Check kubelet logs for operationExecutor timeouts. Cordon the node, drain workloads safely, and restart the kubelet to clear the volume manager state. This is disruptive and should follow your node maintenance runbook.
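The cordon, drain, and restart sequence can be rehearsed before it is run. The sketch below prints the commands by default and executes them only when DRY_RUN=0. The function names are hypothetical, and it assumes SSH access to the node, a systemd-managed kubelet, and drain flags that suit your workloads. Your own runbook takes precedence.

```shell
# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Cordon the node, drain it, then restart the kubelet.
# Stops at the first step that fails.
recover_node() {
  local node="$1"
  run kubectl cordon "$node" &&
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m &&
  run ssh "$node" sudo systemctl restart kubelet
}

# Rehearse first, then execute:
# recover_node <node>
# DRY_RUN=0 recover_node <node>
```

The sketch deliberately leaves the node cordoned. Uncordon it with kubectl uncordon once a volume-backed test pod starts cleanly on that node.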
Prevention
- Pin versions. Deploy CSI drivers and sidecars as a validated bundle. Treat sidecar updates as infrastructure changes that require regression testing.
- Monitor driver pod health. Alert on CSI controller and node plugin restart counts, CrashLoopBackOff states, and OOM kills in the driver namespace.
- Watch VolumeAttachment age. Alert when any VolumeAttachment exists longer than 15 minutes after its associated pod is deleted.
- Set resource requests and limits. CSI driver pods need stable CPU and memory to serve mount RPCs within kubelet deadlines. Under-provisioning causes timeouts that resemble driver bugs.
- Tolerate node maintenance. Ensure the node plugin DaemonSet tolerates common taints so driver containers stay running during cordon and drain operations.
- Test upgrades outside production. Validate that new sidecar versions do not break existing volume attachments before rolling to production clusters.
How Netdata helps
- Correlate CSI pod CPU throttling and memory usage with kubelet storage_operation_duration_seconds spikes to distinguish driver saturation from general node resource pressure.
- Alert on pod restart loops in the CSI driver namespace before workload pods fail to schedule.
- Track per-node kubelet storage operation error rates to pinpoint whether a mount failure is isolated to one node or cluster-wide.
- Visualize workload FailedMount event bursts alongside node conditions to isolate the failure domain quickly.
Related guides
- How the Kubernetes control plane works: a mental model for operators
- Kubernetes API server slow or unresponsive: causes and fixes
- Kubernetes API server etcd latency: detection and cascading failures
- Kubernetes conntrack exhaustion: dropped connections under load
flowchart TD
A[Workload pod stuck in ContainerCreating] --> B{Check events}
B -->|FailedAttach| C[Check CSI controller and VolumeAttachment]
B -->|FailedMount| D[Check node plugin pod on target node]
C --> E{Controller healthy?}
E -->|No| F[Fix controller crash or version skew]
E -->|Yes| G[Force delete stuck VolumeAttachment if safe]
D --> H{Node plugin healthy?}
H -->|No| I[Fix node plugin crash or OOM]
H -->|Yes| J[Check kubelet storage operation latency and logs]





