Kubernetes pod network isolation: when one node loses pod connectivity
One node in your cluster drops off the pod network. Pods still show Running, but they cannot reach Services, other pods, or external endpoints. The rest of the cluster is unaffected. This is single-node pod network isolation, and it is almost always a node-local CNI or data path failure.
Symptoms are subtle compared to a full cluster outage. Applications log connection timeouts. Health checks fail. New pods stick in ContainerCreating. Because the node often stays Ready, operators may blame the application rather than the network layer.
This guide shows how to confirm the failure is node-local, identify the root cause, and restore connectivity safely.
What this means
Kubernetes delegates pod networking to a CNI plugin, usually running as a DaemonSet. The plugin creates the veth pair, attaches it to the host bridge or overlay, and sets routes. When the plugin fails on one node, existing pods usually keep their assigned IPs, but the path between the pod network namespace and the host collapses. New pods cannot start because kubelet cannot create the network sandbox.
The node may stay Ready because kubelet and the container runtime are healthy. However, the NetworkUnavailable condition may flip to True, indicating the CNI has not finished wiring the node. From the cluster perspective, the node is up but its workloads are unreachable.
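The condition can be read directly with a JSONPath query rather than by scanning describe output. The following is a minimal sketch; the function name node_network_unavailable is illustrative, and some CNI plugins may not maintain this condition at all, so an empty or False result is not proof that the node is healthy.

```shell
# Print the status of the NetworkUnavailable condition for one node.
# Prints "True", "False", or nothing if the condition is not reported.
node_network_unavailable() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")].status}'
}

# Usage (the node name is a placeholder):
#   node_network_unavailable <node-name>
```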
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| CNI DaemonSet pod failure on the node | Existing pods lose cluster connectivity; new pods stuck in ContainerCreating | CNI pod status on the node |
| CNI config version skew with containerd 1.6.0-1.6.3 | FailedCreatePodSandBox events containing incompatible CNI versions | /etc/cni/net.d/ config cniVersion and containerd version |
| Conntrack table exhaustion | Intermittent connection timeouts under load; nf_conntrack: table full in dmesg | /proc/sys/net/netfilter/nf_conntrack_count vs nf_conntrack_max |
| Firewall blocking CNI ports | Cross-node traffic drops; intra-node traffic may still work | Host and cloud firewall rules for CNI ports |
| MTU mismatch between host and pod interfaces | Small ICMP pings succeed; HTTP requests hang after the TCP handshake | ip link output on host and inside a pod |
| Stale conntrack entries after endpoint churn | UDP DNS timeouts after CoreDNS pods are replaced | conntrack -L for entries pointing to old endpoint IPs |
Quick checks
# Check node network conditions
kubectl describe node <node-name> | grep -A 10 "^Conditions:"
# Check the CNI pod on the affected node
kubectl get pods -n kube-system --field-selector spec.nodeName=<node-name> -o wide
# Look for sandbox creation failures tied to the node
kubectl get events -A -o wide --field-selector involvedObject.kind=Pod,reason=FailedCreatePodSandBox --sort-by='.lastTimestamp' | grep <node-name>
# Inspect CNI configuration version
grep -h cniVersion /etc/cni/net.d/* | head -n 5
# Check conntrack utilization
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) * 100 / $(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc
# Check host interface MTU
ip link show $(ip route show default | head -n1 | awk '{print $5}') | grep mtu
# Check CNI-specific ports (Flannel VXLAN example)
ss -ulnp | grep -E "8472|8285"
# Check containerd version for known CNI compatibility issues
containerd --version
How to diagnose it
1. Confirm the blast radius is a single node. Start a debug pod on the affected node using nodeName: <node> and attempt to reach a known good pod on another node. If the failure is isolated to one node, the problem is local to its CNI or data path.
2. Check the NetworkUnavailable condition. Run kubectl describe node. If NetworkUnavailable is True, the CNI control plane has not established the node’s network.
3. Inspect the CNI pod. Use labels that match your plugin, such as k8s-app=calico-node, k8s-app=cilium, app=flannel, or k8s-app=weave-net. If the pod is CrashLoopBackOff, OOMKilled, or not present, the node cannot program pod networking.
4. Read kubelet sandbox events. Look for NetworkPlugin cni failed to set up pod in the event message. This confirms kubelet invoked the CNI binary but the setup operation returned an error, pointing to a config, binary, or runtime compatibility issue.
5. Validate CNI configuration on disk. Check /etc/cni/net.d/. Ensure the cniVersion declared in the config is supported by the installed plugin binaries. If you are running containerd 1.6.0 through 1.6.3, the config must declare version 1.0.0 or later; otherwise sandbox creation fails with an incompatible version error. Upgrading containerd to 1.6.4 or later resolves this.
6. Test the data path from inside a pod. Exec into a Running pod on the node and run ip link to inspect its interface MTU. Then send large pings with ping -M do -s 1472 <target-pod-ip> if the image supports it, or start a TCP transfer with curl. If small packets pass but large payloads hang after the handshake, an MTU mismatch is silently dropping traffic.
7. Check conntrack state and kernel logs. Search dmesg for nf_conntrack: table full, dropping packet. If conntrack is exhausted, new connections are silently dropped. Also look for stale UDP entries. When CoreDNS pods are replaced, old conntrack entries for the previous pod IPs can persist, causing DNS queries to be sent to dead endpoints.
8. Verify inter-node firewall paths. If the cluster uses Flannel, confirm UDP port 8472 (VXLAN) or 8285 (UDP backend) is open between all nodes. For AWS VPC CNI, verify the instance has not hit its ENI limit, which prevents IP assignment and causes sandbox creation failures.
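The first diagnostic step, pinning a debug pod to the suspect node, could be sketched as below. Setting spec.nodeName bypasses the scheduler so the pod lands on that node regardless of other placement rules. The function name, pod name, image, and sleep duration are illustrative assumptions, not values from any particular cluster.

```shell
# Emit a Pod manifest pinned to one node via spec.nodeName.
# Pod name, image, and sleep duration are assumptions for illustration.
debug_pod_manifest() {
  node="$1"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: netdebug-${node}
spec:
  nodeName: ${node}
  restartPolicy: Never
  containers:
  - name: netdebug
    image: busybox:1.36
    command: ["sleep", "3600"]
EOF
}

# Usage (placeholders in angle brackets):
#   debug_pod_manifest <node-name> | kubectl apply -f -
#   kubectl exec netdebug-<node-name> -- wget -qO- -T 5 http://<target-pod-ip>:<port>/
```

If the debug pod itself stays in ContainerCreating, that is already a useful signal: the node cannot build a network sandbox, which points at the CNI rather than at the data path.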
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Node condition NetworkUnavailable | Directly indicates the CNI plugin has not finished wiring the node | Status True for more than 60 seconds |
| CNI DaemonSet pod restarts per node | A crashing CNI pod isolates the node immediately | Restart count increasing on the node’s CNI pod |
| FailedCreatePodSandBox event rate | Confirms kubelet cannot delegate network setup to the CNI plugin | Events concentrated on a single node |
| Conntrack utilization ratio | A full table silently drops new connections across all pods on the node | Sustained above 80% of nf_conntrack_max |
| Cross-node pod-to-pod latency and loss | Detects partial data path breaks like MTU mismatches or firewall blocks | Loss or latency spikes only from the affected node |
| kubelet run_podsandbox operation errors | Surfaces CNI invocation failures from the runtime side | kubelet_runtime_operations_errors_total increasing for run_podsandbox |
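The conntrack warning threshold from the table can be evaluated with integer shell arithmetic, which avoids a dependency on bc. This is a minimal sketch: the function names conntrack_pct and conntrack_status are illustrative, and the 80% cutoff is the one suggested in the table above.

```shell
# Print conntrack utilization as an integer percentage.
# Arguments default to the live kernel counters; pass explicit values to test.
conntrack_pct() {
  count="${1:-$(cat /proc/sys/net/netfilter/nf_conntrack_count)}"
  max="${2:-$(cat /proc/sys/net/netfilter/nf_conntrack_max)}"
  echo $(( count * 100 / max ))
}

# Print "WARN <n>%" at or above 80% utilization, otherwise "OK <n>%".
conntrack_status() {
  pct="$(conntrack_pct "$1" "$2")"
  if [ "$pct" -ge 80 ]; then
    echo "WARN ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}

# Usage on a node (reads /proc):
#   conntrack_status
```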
Fixes
If the cause is CNI plugin failure
Delete the CNI pod on the affected node and let the DaemonSet recreate it. If the pod is OOMKilled, raise its memory limit or reduce pod density. If the node has exceeded its ENI or IP pool limit, common with AWS VPC CNI, free unused IPs or select an instance type that supports more ENIs.
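Deleting only the CNI pod on the affected node can be done by combining a label selector with a field selector. The sketch below assumes the CNI DaemonSet runs in kube-system, as in the quick checks above; some installations use a different namespace. The function name restart_cni_pod is illustrative.

```shell
# Delete the CNI pod on one node so the DaemonSet recreates it.
# $1 = node name, $2 = label selector for your CNI plugin.
# Assumes the DaemonSet lives in kube-system; adjust if yours does not.
restart_cni_pod() {
  node="$1"
  selector="$2"
  kubectl delete pod -n kube-system -l "$selector" \
    --field-selector "spec.nodeName=${node}"
}

# Usage (placeholders in angle brackets):
#   restart_cni_pod <node-name> k8s-app=calico-node
```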
If the cause is CNI configuration drift
Restore the correct CNI config file in /etc/cni/net.d/. Ensure the cniVersion in the config matches the plugin binary capabilities. On nodes running containerd 1.6.0 through 1.6.3, upgrade containerd to 1.6.4 or later, or ensure both the config and the loopback plugin declare version 1.0.0.
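To compare what each config file declares, the cniVersion field can be pulled out with sed. This is a rough sketch that assumes the usual JSON layout of CNI config files; the function name cni_version is illustrative. An empty result means the file declares no cniVersion at all.

```shell
# Print the first cniVersion value declared in a CNI config file.
# Prints nothing if the file has no cniVersion field.
cni_version() {
  sed -n 's/.*"cniVersion"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Usage on a node:
#   for f in /etc/cni/net.d/*; do
#     echo "$f: $(cni_version "$f")"
#   done
```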
If the cause is conntrack exhaustion
Immediately increase the table size:
sysctl -w net.netfilter.nf_conntrack_max=<higher_value>
Then identify the source: high connection churn, TIME_WAIT accumulation, or a connection leak. For UDP-heavy workloads, reduce net.netfilter.nf_conntrack_udp_timeout_stream. See Kubernetes conntrack exhaustion for detailed tuning.
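A value set with sysctl -w lasts only until the next reboot. One way to keep it is a drop-in file under /etc/sysctl.d, sketched below. The function name persist_conntrack_max and the file name 90-conntrack-max.conf are hypothetical; pick names that fit your node image conventions.

```shell
# Write a sysctl drop-in that sets nf_conntrack_max.
# $1 = value, $2 = target directory (defaults to /etc/sysctl.d).
# The file name is a hypothetical choice for this example.
persist_conntrack_max() {
  value="$1"
  dir="${2:-/etc/sysctl.d}"
  printf 'net.netfilter.nf_conntrack_max = %s\n' "$value" \
    > "${dir}/90-conntrack-max.conf"
}

# Usage on a node (as root), then reload all sysctl files:
#   persist_conntrack_max <higher_value>
#   sysctl --system
```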
If the cause is a firewall or security group
Open the CNI-specific ports between all cluster nodes. For Flannel, allow UDP 8472 for VXLAN and UDP 8285 for the UDP backend. Verify that host-level iptables or nftables rules and cloud security groups are not dropping encapsulated traffic.
If the cause is an MTU mismatch
Set the pod interface MTU in the CNI configuration so that pod packets, plus any encapsulation overhead, fit within the MTU of the host’s primary interface or the VPC. For example, if both the host interface and the pod interface use 1500 but the CNI adds VXLAN encapsulation (about 50 bytes over IPv4), full-size packets exceed the host MTU and are dropped after the TCP handshake. Adjust the CNI config and restart the CNI pods to apply the change.
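The arithmetic behind MTU sizing is simple enough to script. When the CNI encapsulates traffic, the pod MTU has to leave room for the encapsulation header, and a don't-fragment ping has to leave room for the IPv4 and ICMP headers. A small sketch follows; the function names are illustrative, and the 50-byte default is the usual VXLAN-over-IPv4 overhead, which other encapsulations do not share.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
ping_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Pod MTU for an encapsulated overlay: host MTU minus encapsulation overhead.
# $2 defaults to 50 bytes (VXLAN over IPv4); pass another value if needed.
overlay_pod_mtu() {
  echo $(( $1 - ${2:-50} ))
}

# Usage: probe whether a 1450-byte pod MTU actually passes end to end.
#   ping -M do -s "$(ping_payload_for_mtu 1450)" <target-pod-ip>
```

For a 1500-byte MTU the first function yields 1472, which is the payload size used in the diagnosis step.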
If the cause is stale conntrack entries
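Flush stale entries manually with conntrack -D -d <old-pod-ip> if necessary. For traffic that reached the pod through a Service IP, the old pod IP appears as the reply source of the entry rather than its original destination, so match it with conntrack -D -r <old-pod-ip> instead. To prevent recurrence, ensure DNS workloads use graceful termination periods long enough for clients to switch, and consider NodeLocal DNSCache to reduce cross-node UDP conntrack pressure.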
Prevention
- Monitor the CNI DaemonSet with per-node alerts for pods that are not in the Running state.
- Alert on NetworkUnavailable=True on any node.
- Track conntrack utilization on every node and size nf_conntrack_max relative to peak connection counts; the default 65536 is often too low for dense nodes.
- Pin CNI and containerd versions in your node image pipeline and validate compatibility before rollout.
- Set explicit MTU values in CNI configs during node bootstrap.
- Document required inter-node firewall rules for your CNI plugin in your network runbook.
- Monitor FailedCreatePodSandBox event rate per node to catch CNI regressions before workloads are scheduled.
How Netdata helps
- Netdata tracks node-level conntrack utilization and alerts when the table approaches its limit, before connections start dropping.
- The kernel error chart surfaces dmesg messages such as nf_conntrack: table full without requiring manual node logins.
- Per-node pod health monitoring correlates CNI pod restarts with sandbox creation failures in the same time window.
- Network latency and packet-drop metrics per interface help isolate MTU or firewall issues to a specific node.
Related guides
- Kubernetes conntrack exhaustion: dropped connections under load
- Kubernetes DNS resolution failures inside pods
- Kubernetes DaemonSet pods Pending: scheduling and tolerations
- Kubernetes Deployment rollout stuck: stalled rollouts and ready replicas
flowchart TD
A[Pods on one node lose connectivity] --> B{NetworkUnavailable True?}
B -->|Yes| C[Check CNI pod status and logs]
B -->|No| D[Check conntrack and MTU]
C --> E{CNI pod Running?}
E -->|No| F[Restart or fix CNI pod resources]
E -->|Yes| G[Check CNI config version and containerd compatibility]
D --> H{Conntrack table full?}
H -->|Yes| I[Increase nf_conntrack_max and reduce churn]
H -->|No| J[Check MTU and inter-node firewall ports]





