Kubernetes kube-proxy and CNI rule conflicts: detection and fix

Pods stuck in ContainerCreating while the node reports Ready. Services time out despite existing endpoints. Intermittent connection resets during rolling updates. These symptoms usually indicate a conflict between kube-proxy and the container network interface (CNI) plugin over netfilter rules. Both components program the same kernel tables, compete for the same locks, and can corrupt each other’s chains. The data plane degrades while the control plane stays healthy.

This article covers how kube-proxy and CNI plugins interact on the netfilter path, the failure modes that arise when they conflict, and how to distinguish a rule conflict from a generic network outage.

What this means

kube-proxy implements Kubernetes Services by programming netfilter rules. In iptables mode, it writes chains such as KUBE-SERVICES, KUBE-SVC-*, and KUBE-SEP-* into the nat and filter tables. In ipvs mode, it still maintains a small iptables ruleset for masquerading and filtering. CNI plugins such as Calico, Cilium, Flannel, Weave, and AWS VPC CNI independently write their own chains into the same tables to handle pod ingress, egress, IP masquerading, and NetworkPolicy enforcement.

Because both subsystems share the xtables lock and the same kernel tables, three conflict classes appear in production:

  1. Lock contention. kube-proxy runs iptables-restore to atomically replace its chains. That operation holds the xtables lock exclusively. If a CNI plugin or NetworkPolicy controller tries to update rules concurrently, it blocks until kube-proxy finishes. When rule counts are high, the lock can be held for seconds, delaying pod sandbox creation and service updates.

  2. Rule corruption. Containerized CNI agents and host-level iptables tooling may use different iptables versions or backends. A save/modify/restore cycle from one component can strip match conditions that another component relies on, turning targeted rules into broad DROP statements.

  3. Sync lag under churn. During rolling updates or horizontal scaling, endpoint changes arrive faster than kube-proxy can program them. If the CNI stack is also rewriting rules for new pods, the combined churn extends the window during which traffic is routed to dead endpoints or new pods lack connectivity.

The result is pod network isolation: the kubelet reports the node as Ready, but new pods cannot start, existing pods lose Service connectivity, and packets are silently dropped in the kernel.
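To make the rule-corruption class concrete, the sketch below scans iptables -S output for rules in KUBE-* chains that jump to DROP with no match condition left. The helper name find_bare_drops is hypothetical, and the filter is deliberately narrow: it ignores comment matches and flags only rules with nothing else between the chain name and -j DROP.

```shell
# find_bare_drops: read `iptables -S` output on stdin and print each rule in a
# KUBE-* chain that drops unconditionally. Comment matches are stripped first,
# so a rule that kept only its comment is still reported (without the comment).
# Hypothetical helper name; not part of kube-proxy or iptables.
find_bare_drops() {
  sed -E 's/ -m comment --comment ("[^"]*"|[^ ]+)//' |
    awk '$1 == "-A" && $2 ~ /^KUBE-/ && $3 == "-j" && $4 == "DROP" && NF == 4'
}
```

On a node this could be run as iptables -t filter -S | find_bare_drops. Any output deserves a comparison against a healthy node before drawing conclusions, since the expected contents of KUBE-* chains vary by Kubernetes version.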

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| xtables lock contention | kube-proxy sync duration spikes; CNI plugin timeouts; pods stuck in ContainerCreating | kubeproxy_sync_proxy_rules_duration_seconds p99 |
| iptables version mismatch between host and containers | Chains lose match conditions after a component restart or upgrade, producing bare DROP rules | iptables -t filter -S for malformed KUBE-* or CNI chains |
| Rapid endpoint and pod churn | Sync duration climbs during deployments; endpoint rules lag behind API state | kubeproxy_sync_proxy_rules_endpoint_changes_pending |
| CNI plugin crash or eviction | New pods fail sandbox creation; existing pods retain network but new pods do not | CNI DaemonSet pod status on the node |
| NetworkPolicy rule bloat | Multiple controllers compete for the filter table; sync latency grows linearly with rule count | iptables -t filter -S piped to wc -l |

Quick checks

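Run these checks on the affected node before restarting anything. They assume that the Kubernetes node name matches the output of hostname and that the label selectors shown (k8s-app=kube-proxy, k8s-app=calico-node) match your cluster; substitute your own node name and labels otherwise.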

# Check kube-proxy sync latency and last successful sync timestamp
curl -s localhost:10249/metrics | grep -E "kubeproxy_sync_proxy_rules_duration_seconds|kubeproxy_sync_proxy_rules_last_timestamp_seconds"

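# Check this node's kube-proxy logs for xtables lock contention
kubectl logs -n kube-system "$(kubectl get pods -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=$(hostname) -o name)" --tail=500 | grep -i "xtables lock"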

# Check CNI pod health on the node
kubectl get pods -n kube-system -l k8s-app=calico-node --field-selector spec.nodeName=$(hostname)

# Count kube-proxy iptables rules
iptables -t nat -S | grep -c "^-A KUBE-"

# Check conntrack table utilization (percentage)
echo $(( 100 * $(cat /proc/sys/net/netfilter/nf_conntrack_count) / $(cat /proc/sys/net/netfilter/nf_conntrack_max) ))

# List pods stuck in ContainerCreating on this node (each pod printed once)
kubectl get pods --all-namespaces --field-selector spec.nodeName=$(hostname),status.phase=Pending -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .state.waiting.reason == "ContainerCreating")) | .metadata.namespace + "/" + .metadata.name'

If kube-proxy is running in ipvs mode, replace the iptables rule count with:

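# Count IPVS virtual servers and real servers
ipvsadm -Ln | grep -c "^TCP\|^UDP"
# Exclude the "-> RemoteAddress:Port" header row from the real server count
ipvsadm -Ln | grep "^\s*->" | grep -vc "RemoteAddress"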

If the node is experiencing CNI-related sandbox failures, check the CNI-specific logs. For Calico, read the calico-node pod logs. For containerd-based runtimes, check journalctl -u containerd for CNI invocation errors.
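The last-sync timestamp from the first quick check is easier to interpret as an age. The following is a minimal sketch, assuming the metrics endpoint returns standard Prometheus text; sync_age_seconds is a hypothetical helper name, not a kube-proxy tool.

```shell
# sync_age_seconds NOW: read Prometheus text on stdin and print the whole
# seconds elapsed between the last kube-proxy sync and NOW (epoch seconds).
# Exits non-zero if the metric is not present in the input.
sync_age_seconds() {
  awk -v now="$1" '
    $1 == "kubeproxy_sync_proxy_rules_last_timestamp_seconds" {
      printf "%d\n", now - $2   # the value is often in exponent form, e.g. 1.7e+09
      found = 1
    }
    END { if (!found) exit 1 }
  '
}
```

On a node it could be fed with curl -s localhost:10249/metrics | sync_age_seconds "$(date +%s)". An age that keeps growing past the configured sync period suggests the rules are no longer being refreshed.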

How to diagnose it

  1. Confirm kube-proxy sync duration is elevated. Query kubeproxy_sync_proxy_rules_duration_seconds from the metrics endpoint. In iptables mode, healthy syncs typically finish in under one second for clusters with fewer than one thousand Services. If p99 exceeds five seconds, the sync loop is stressed.

  2. Check for xtables lock errors in kube-proxy logs. Look for messages such as Another app is currently holding the xtables lock. If these appear more than once per minute, the node has significant lock contention between kube-proxy and another iptables consumer.

  3. Correlate sync spikes with pod lifecycle events. Check whether endpoint churn is outpacing sync capacity by comparing sync duration with the rate of kubeproxy_sync_proxy_rules_endpoint_changes_total. Cross-reference with Deployment rollout timestamps.

  4. Verify CNI plugin health and rule activity. Check whether the CNI DaemonSet pod on the node is running or restarting. If the CNI plugin is alive, inspect its logs for iptables-restore timeouts or netlink errors. These indicate the CNI plugin is waiting for the xtables lock or failing to apply its own rules.

  5. Inspect iptables chains for corruption or gaps. Run iptables -t nat -S and iptables -t filter -S. Look for KUBE-* chains that are missing expected match conditions, such as a bare -j DROP in KUBE-FIREWALL without a preceding mark match condition. Also verify that expected KUBE-SVC-* chains exist for active Services.

  6. Check whether the conntrack table is under pressure. When the table fills, the kernel drops packets for new connections and logs nf_conntrack: table full, dropping packet. Combined with sync lag, this produces intermittent connection timeouts that look like routing failures but are actually table exhaustion.

  7. Determine whether the issue is node-local or cluster-wide. If only one node is affected, the cause is usually local lock contention or a crashed CNI pod. If all nodes show elevated sync duration simultaneously, look for a cluster-wide event such as a mass Deployment rollout or an API server watch disruption.
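Step 2 uses a threshold of one lock error per minute. One way to measure that rate is sketched below, assuming the logs were fetched with kubectl logs --timestamps so that every line starts with an RFC 3339 timestamp. The helper name lock_errors_per_minute is hypothetical, and the match is a plain case-insensitive search for "xtables lock" rather than a parse of any particular kube-proxy log format.

```shell
# lock_errors_per_minute: read timestamped log lines on stdin and print
# "<minute> <count>" for every minute that has at least one xtables lock line.
# The first 16 characters of an RFC 3339 timestamp are YYYY-MM-DDTHH:MM.
lock_errors_per_minute() {
  grep -i 'xtables lock' | cut -c1-16 | LC_ALL=C sort | uniq -c |
    awk '{ print $2, $1 }'
}
```

Minutes with a count above one match the warning level described above. Comparing those minutes with Deployment rollout times ties this step to step 3.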

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| kubeproxy_sync_proxy_rules_duration_seconds p99 | Tracks total sync time, which grows with how long kube-proxy holds the xtables lock during each sync | p99 > 5 seconds sustained |
| kubeproxy_sync_proxy_rules_last_timestamp_seconds age | A stale timestamp means rules are not being refreshed | Age > 60 seconds |
| kubeproxy_sync_proxy_rules_endpoint_changes_pending | Indicates endpoint churn is outpacing sync capacity | Non-zero and growing for > 2 minutes |
| xtables lock errors in kube-proxy logs | Direct evidence of contention with CNI or other components | > 1 error per minute |
| conntrack table utilization | Shared resource; exhaustion drops new connections silently | > 75% of nf_conntrack_max |
| iptables rule count | Grows with the number of Services and their endpoints in iptables mode; high counts extend lock hold time | Rapid growth or > 10,000 rules |
| CNI plugin pod restart count | A restarting CNI plugin can leave partial or conflicting rules | Any restart on a node with network symptoms |
| Pod sandbox creation latency | CNI timeouts waiting for the xtables lock block pod startup | ContainerCreating > 5 minutes |
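A raw rule count does not show which component owns the rules. The sketch below groups rules by chain-name prefix, assuming each component keeps a recognizable prefix (KUBE for kube-proxy, cali for Calico; other plugins use their own). rules_by_prefix is a hypothetical helper name.

```shell
# rules_by_prefix: read `iptables-save` or `iptables -S` output on stdin and
# print "<count> <prefix>" per chain-name prefix, largest count first.
# The prefix is the chain name up to its first "-" or "_"; built-in chains
# such as INPUT appear under their full name.
rules_by_prefix() {
  awk '$1 == "-A" { split($2, part, "[-_]"); n[part[1]]++ }
       END { for (p in n) print n[p], p }' | LC_ALL=C sort -k1,1nr -k2,2
}
```

For example, iptables-save | rules_by_prefix shows at a glance whether growth is coming from kube-proxy chains or from the CNI and NetworkPolicy side.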

Fixes

If the cause is xtables lock contention

Reduce lock pressure before migrating architectures.

  • Increase kube-proxy’s --iptables-min-sync-period (and --iptables-sync-period if periodic full resyncs dominate) to reduce how often it takes the lock. The tradeoff is that Service and endpoint changes take longer to reach the data plane.
  • Cordon the node to stop new pod scheduling. Fewer new pods means fewer CNI operations competing for the lock.
  • Identify the competing process with lsof /run/xtables.lock. If a NetworkPolicy controller or DaemonSet is flapping, restart it to clear the lock storm.
  • For clusters with thousands of Services, plan a migration to ipvs mode. IPVS uses hash-based lookups and incremental updates, which dramatically reduces xtables lock contention.
  • Evaluate nftables mode if your Kubernetes version and kernel support it. nftables uses per-table transactions and avoids the global xtables lock, though cross-table priority ordering still requires verification with your CNI plugin.

If the cause is rule corruption

Do not restart services blindly. A restart with corrupted rules can make the node unreachable.

  • Dump current rules with iptables-save and compare them against a healthy node. Look for missing match conditions in KUBE-FIREWALL or KUBE-MARK-MASQ chains.
  • If a specific chain is corrupted, cordon the node, flush only the affected kube-proxy chains with iptables -t <table> -F <chain>, then restart kube-proxy to reprogram clean rules. Warning: flushing chains interrupts Service traffic on the node.
  • Ensure that containerized CNI agents and the host use compatible iptables backends. A mismatch between legacy iptables and iptables-nft can cause silent rule stripping during save/restore cycles.
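Comparing dumps from two nodes is noisy unless comment lines and packet counters are removed first. The following is a minimal sketch of that comparison; normalize_rules and diff_rules are hypothetical helper names, and the inputs are files produced by iptables-save on each node.

```shell
# normalize_rules FILE: drop the parts of an iptables-save dump that differ
# between any two nodes (comment lines, [packets:bytes] counters), then sort.
normalize_rules() {
  grep -v '^#' "$1" | sed -E 's/\[[0-9]+:[0-9]+\] ?//' | LC_ALL=C sort
}

# diff_rules HEALTHY SUSPECT: print rules that appear in only one dump.
# Exit status follows diff: 0 when identical, 1 when the dumps differ.
diff_rules() {
  h=$(mktemp) && s=$(mktemp) || return 2
  normalize_rules "$1" > "$h"
  normalize_rules "$2" > "$s"
  diff "$h" "$s"
  rc=$?
  rm -f "$h" "$s"
  return "$rc"
}
```

Some differences are expected, for example per-pod CNI chains and rules for node-local endpoints. The lines worth attention are kube-proxy rules that exist on both nodes but have fewer match conditions on the suspect one.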

If the cause is CNI plugin failure

  • Restart the CNI DaemonSet pod on the affected node. For Calico, delete the calico-node pod. For AWS VPC CNI, delete the aws-node pod.
  • Verify CNI configuration files in /etc/cni/net.d/ have not been overwritten or truncated.
  • Check IPAM allocation status. If the CNI cannot assign a pod IP, sandbox creation fails before kube-proxy rules matter.
  • If the CNI plugin is being OOM-killed, increase its memory limit. CNI pods are often memory-starved on dense nodes.
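The configuration check in the list above can be scripted. This sketch flags empty files and, when jq happens to be installed, files that do not parse as JSON. check_cni_conf is a hypothetical helper name; without jq, only empty files are detected.

```shell
# check_cni_conf DIR: print a line for each CNI config file in DIR that is
# empty or (if jq is available) not valid JSON. Returns 1 if any file is
# flagged, 0 otherwise. DIR is normally /etc/cni/net.d.
check_cni_conf() {
  bad=0
  for f in "$1"/*.conf "$1"/*.conflist "$1"/*.json; do
    [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
    if [ ! -s "$f" ]; then
      echo "EMPTY $f"; bad=1
    elif command -v jq >/dev/null 2>&1 && ! jq empty "$f" >/dev/null 2>&1; then
      echo "INVALID $f"; bad=1
    fi
  done
  return "$bad"
}
```

Running check_cni_conf /etc/cni/net.d on the affected node and on a healthy one gives a quick indication of whether a file was truncated during an upgrade or agent restart.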

If the cause is conntrack exhaustion

  • Increase the table size immediately: sysctl -w net.netfilter.nf_conntrack_max=<higher_value>. Each entry consumes roughly 300 bytes of kernel memory.
  • Identify connection leaks with conntrack -L and conntrack -S. If TIME_WAIT or UDP entries dominate, tune the respective timeouts or fix the application connection pooling.
  • See Kubernetes conntrack exhaustion: dropped connections under load for deeper tuning.
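Before raising nf_conntrack_max, it helps to estimate the memory cost. The arithmetic below uses the rough figure of 300 bytes per entry quoted above; the real per-entry size depends on the kernel build, so treat the result as an estimate. Both function names are hypothetical.

```shell
# conntrack_mem_mib ENTRIES: approximate kernel memory in MiB for a full
# conntrack table of ENTRIES, at roughly 300 bytes per entry.
conntrack_mem_mib() {
  echo $(( $1 * 300 / 1048576 ))
}

# conntrack_pct COUNT MAX: integer utilization percentage of the table.
conntrack_pct() {
  echo $(( 100 * $1 / $2 ))
}
```

For example, conntrack_mem_mib 262144 prints 75 and conntrack_mem_mib 1048576 prints 300, so quadrupling the limit costs about 225 MiB more when the table is full.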

Prevention

  • Monitor kubeproxy_sync_proxy_rules_duration_seconds p99 and alert when it exceeds three seconds for more than five minutes. This catches lock contention before it blocks pod creation.
  • Monitor xtables lock errors from kube-proxy and CNI logs. Any sustained rate indicates architectural scaling pressure.
  • Size nodes with conntrack headroom. Keep utilization below sixty percent during peak traffic to absorb bursts.
  • Keep CNI and kube-proxy versions aligned. Run validation after Kubernetes upgrades to confirm that CNI agents still program rules correctly against the host’s iptables or nftables backend.
  • For clusters running more than two thousand Services, adopt ipvs mode or nftables mode proactively. iptables mode scales linearly and will eventually overwhelm the sync loop.
  • Limit endpoint churn where possible. Avoid rapid scaling events that simultaneously replace hundreds of pods and rewrite thousands of iptables rules.
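The backend alignment check can be part of post-upgrade validation. The following is a minimal sketch, assuming iptables 1.8 or later, which appends (nf_tables) or (legacy) to its --version output. Older releases print no backend and are reported as unknown. Both function names are hypothetical.

```shell
# iptables_backend VERSION_STRING: print "nf_tables", "legacy", or "unknown"
# based on the text printed by `iptables --version`.
iptables_backend() {
  case "$1" in
    *"(nf_tables)"*) echo nf_tables ;;
    *"(legacy)"*)    echo legacy ;;
    *)               echo unknown ;;
  esac
}

# same_backend HOST_VERSION AGENT_VERSION: succeed only when both backends
# are recognized and identical.
same_backend() {
  host_be=$(iptables_backend "$1")
  agent_be=$(iptables_backend "$2")
  [ "$host_be" != unknown ] && [ "$host_be" = "$agent_be" ]
}
```

The host string comes from iptables --version on the node. The agent string would come from running the same command inside the CNI agent container, for example through kubectl exec, provided the image ships an iptables binary. A failed comparison is a prompt for manual review, not proof of corruption.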

How Netdata helps

  • Netdata collects kube-proxy sync duration and rule count metrics from :10249/metrics, exposing p99 latency and queue depth per node.
  • Conntrack utilization charts (net.netfilter.nf_conntrack_count vs max) show saturation before packets drop.
  • Pod-level network metrics and CNI pod health status are visible alongside kube-proxy data, so you can correlate sandbox creation timeouts with sync latency spikes.
  • Node CPU softirq time and network stack latency charts help distinguish xtables lock contention from generic CPU saturation.

Diagnostic flow (Mermaid source):

flowchart TD
    A[Pod network failure or slow startup] --> B{Check kube-proxy sync duration}
    B -->|Elevated| C{Check logs for xtables lock}
    B -->|Normal| D[CNI or other network issue]
    C -->|Lock errors present| E[xtables lock contention]
    C -->|No lock errors| F{Check rule count and chain validity}
    F -->|Orphaned or corrupted chains| G[Rule corruption or version skew]
    F -->|Rules valid but timestamp stale| H[API server watch failure]
    E --> I[Cordon node, increase sync period, plan IPVS or nftables migration]
    G --> J[Flush corrupted chains, restart kube-proxy, align iptables versions]
    D --> K[Check CNI pod health, IPAM, and sandbox logs]