Kubernetes kube-proxy iptables sync stall: causes and recovery

Pods fail to start. Services intermittently route traffic to dead endpoints. The kube-proxy health endpoint still returns HTTP 200, so the DaemonSet looks healthy, yet rules drift further behind with every sync cycle.

An iptables sync stall is not a crash. It is a slowdown or blockage in the control loop that translates Service and EndpointSlice state into kernel NAT rules. When kube-proxy cannot acquire the global xtables lock, when iptables-restore hangs, or when the rule set grows too large to reconcile within the sync period, the node forwards packets using stale rules. New endpoints are invisible. Terminated pods still receive connections. CNI plugins that also need the xtables lock time out, and pod sandbox creation fails.

This guide covers the failure mechanics, how to distinguish sync stalls from other network problems, and the recovery steps.

What this means

kube-proxy in iptables mode reconciles desired state by building a complete set of iptables NAT rules and applying them atomically via iptables-restore. This requires the global xtables lock (/run/xtables.lock). While kube-proxy holds that lock, no other process on the node can modify iptables rules.

The default syncPeriod is 30 seconds. kube-proxy reports itself unhealthy on the :10256/healthz endpoint when a queued rule update has gone unapplied for more than twice that value, or 60 seconds by default. A stall begins when a single sync takes long enough to miss the next cycle, when lock contention causes repeated restore failures, or when the process hangs entirely inside iptables-restore.
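The staleness check implied by that threshold can be sketched as two small shell functions. This is a minimal sketch, not part of kube-proxy: the function names sync_age_seconds and is_stalled are hypothetical, and the input is assumed to be Prometheus text-format metrics on stdin.

```shell
# Sketch: compute how stale kube-proxy's programmed rules are.
# sync_age_seconds and is_stalled are hypothetical helper names.

# Reads Prometheus text-format metrics on stdin.
# $1 is the current epoch in seconds (defaults to `date +%s`).
# Prints the age of the last successful sync in whole seconds.
sync_age_seconds() {
  now="${1:-$(date +%s)}"
  awk -v now="$now" '
    /^kubeproxy_sync_proxy_rules_last_timestamp_seconds[ {]/ {
      ts = $NF; found = 1
    }
    END {
      if (!found) { print "metric not found" > "/dev/stderr"; exit 1 }
      printf "%d\n", now - ts
    }'
}

# is_stalled AGE SYNC_PERIOD
# Succeeds when the age exceeds twice the sync period (both in whole seconds).
is_stalled() {
  [ "$1" -gt $(( 2 * $2 )) ]
}
```

On a node, the metrics would come from the endpoint used elsewhere in this guide, for example: curl -s http://localhost:10249/metrics | sync_age_seconds, followed by is_stalled with the printed age and your configured syncPeriod.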

Since Kubernetes v1.28, the iptables proxier uses incremental updates, rewriting only rules for changed Services and EndpointSlices. On older versions, or at extreme scale, every sync may still rewrite the full table. Either way, the symptoms are identical: the gap between API server state and kernel state grows, and the node silently drops or misroutes connections.

flowchart TD
    A[EndpointSlice change or syncPeriod tick] --> B{Can kube-proxy acquire xtables lock?}
    B -->|Yes| C[Run iptables-restore]
    B -->|No| D[Sync aborts or retries]
    C --> E{Sync duration < syncPeriod?}
    E -->|Yes| F[Rules updated]
    E -->|No| G[Sync backlog grows]
    D --> H[Stale endpoints persist]
    G --> H
    H --> I[CNI plugins time out on lock]
    H --> J[Traffic routed to dead pods]
    I --> K[Pod startup fails or slows]
    J --> L[Connection resets or drops]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| xtables lock contention | kubeproxy_sync_proxy_rules_iptables_restore_failures_total increasing; kube-proxy logs show lock acquisition failures; pods stuck in ContainerCreating with CNI timeouts | lsof /run/xtables.lock and kube-proxy logs for xtables lock |
| Rule bloat (large clusters) | kubeproxy_sync_proxy_rules_duration_seconds p99 approaching or exceeding 30s; thousands of iptables rules | Line count of iptables -t nat -S |
| iptables-restore hang | Sync duration flatlines; iptables-restore process visible in ps for minutes; kube-proxy stops advancing its last sync timestamp | ps aux filtered for iptables-restore, and the sync timestamp metric |
| Endpoint churn exceeding sync capacity | kubeproxy_sync_proxy_rules_endpoint_changes_pending growing during rolling updates | Pending endpoint changes vs processed total |
| API server watch death (silent) | Last sync timestamp frozen; no error logs; new Services unreachable from the node | ss -tnp filtered for kube-proxy and the API server port (typically 6443), and sync timestamp age |

Quick checks

# Check last successful sync age (should be well under 60s)
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds

# Inspect sync duration histogram buckets; compare the upper buckets against the 30s syncPeriod
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_duration_seconds

# Count kube-proxy managed iptables rules
sudo iptables -t nat -S | grep -c "KUBE-"

# Check for lock contention messages (--tail is set because a label selector limits output to the last 10 lines per pod)
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=1000 | grep -i "xtables\|lock"

# Check pending endpoint changes
curl -s http://localhost:10249/metrics | grep endpoint_changes_pending

# Check iptables restore failure rate
curl -s http://localhost:10249/metrics | grep sync_proxy_rules_iptables_restore_failures_total

# Identify what holds the xtables lock
sudo lsof /run/xtables.lock

# Check conntrack utilization (adjacent shared resource)
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) * 100 / $(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc

# Check kube-proxy binary health (200 does not mean fresh rules)
curl -s -o /dev/null -w "%{http_code}" http://localhost:10256/healthz

# Check CNI timeouts that correlate with lock contention
journalctl -u kubelet --since "5 minutes ago" | grep -i "CNI.*timeout\|sandbox"
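
The sync duration metric is exposed as a cumulative histogram, so the grep above prints buckets rather than a percentile. As a rough sketch, the smallest bucket boundary that covers 99% of recorded syncs can be read out with awk. The function name sync_duration_p99_bound is hypothetical, the buckets are assumed to arrive in ascending order, and the counters cover the whole process lifetime, so this is an upper bound over all syncs since kube-proxy started, not a recent p99.

```shell
# Sketch: upper bound on the lifetime p99 of kube-proxy sync duration.
# sync_duration_p99_bound is a hypothetical helper name.
# Reads Prometheus text-format metrics on stdin and prints the smallest
# histogram bucket boundary (in seconds) whose cumulative count reaches
# 99% of all recorded syncs. Assumes buckets are listed in ascending order.
sync_duration_p99_bound() {
  awk '
    /^kubeproxy_sync_proxy_rules_duration_seconds_bucket[{]/ {
      # Extract the value of the le="..." label.
      le = $0
      sub(/^.*le="/, "", le)
      sub(/".*$/, "", le)
      n++
      bound[n] = le
      count[n] = $NF + 0
      if (le == "+Inf") total = $NF + 0
    }
    END {
      if (n == 0 || total == 0) { print "no samples"; exit 1 }
      for (i = 1; i <= n; i++)
        if (count[i] >= 0.99 * total) { print bound[i]; exit 0 }
    }'
}
```

Usage: curl -s http://localhost:10249/metrics | sync_duration_p99_bound. For a p99 over a recent window, a metrics backend that computes rates over the buckets is the better tool.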

How to diagnose it

  1. Confirm kube-proxy is running but stale. Query http://localhost:10256/healthz. A 200 shows that the process is up and is not tracking a long-overdue update; it does not prove that rules are current. If traffic fails while healthz returns 200, you have a silent stall.

    • Why it matters: Restarting a crashed kube-proxy differs from unsticking a live one.
    • Next: check the last sync timestamp.
  2. Measure sync freshness. Inspect kubeproxy_sync_proxy_rules_last_timestamp_seconds and subtract it from the current epoch. A gap greater than twice the configured syncPeriod means the node is programming rules too slowly or not at all.

    • Why it matters: This distinguishes a transient spike from a sustained stall.
    • Next: inspect sync duration percentiles.
  3. Compare sync duration to syncPeriod. Look at the p99 of kubeproxy_sync_proxy_rules_duration_seconds. In iptables mode, values above 10 seconds are concerning; values approaching the 30-second syncPeriod indicate the loop is about to backlog.

    • Why it matters: Once sync duration exceeds syncPeriod, kube-proxy can never catch up on a busy node.
    • Next: determine if the cause is lock contention or rule bloat.
  4. Check for xtables lock contention. Search kube-proxy logs for messages about the xtables lock. Run lsof /run/xtables.lock to see which process holds it. Common culprits include CNI portmap plugins, Flannel, and VPC CNIs that refresh rules periodically.

    • Why it matters: The lock is global. If a CNI plugin holds it, kube-proxy blocks, and vice versa.
    • Next: check CNI plugin logs for iptables timeouts.
  5. Assess rule set scale. Count rules with iptables -t nat -S | wc -l. In iptables mode, performance degrades linearly as rule count grows. Above 10,000 rules, sync times become a scaling bottleneck.

    • Why it matters: This tells you whether the stall is architectural (iptables mode limits) or environmental (lock contention).
    • Next: if rule count is high, evaluate proxy mode migration.
  6. Look for hung iptables-restore processes. Run ps aux | grep iptables-restore. If a process has been running for minutes or hours, the nf_tables backend may be stuck.

    • Why it matters: A hung restore blocks the entire sync loop until the process is killed.
    • Next: kill the hung process and restart kube-proxy.
  7. Check endpoint churn backlog. Query kubeproxy_sync_proxy_rules_endpoint_changes_pending. If the number is non-zero and growing, the API server is generating changes faster than the node can apply them.

    • Why it matters: This happens during large rolling updates or HPA storms.
    • Next: slow the churn or increase sync capacity.
  8. Verify API server watch connectivity. Check rest_client_requests_total for 5xx or 429 errors, and verify with ss that kube-proxy has an active TCP connection to the API server (typically on port 6443).

    • Why it matters: A dead watch causes silent staleness with no iptables errors.
    • Next: restart kube-proxy to re-establish the watch.
  9. Rule out conntrack exhaustion. Check nf_conntrack_count against nf_conntrack_max. Sync stalls often occur alongside connection churn that fills the conntrack table, producing identical timeout symptoms.

    • Why it matters: Fixing kube-proxy does not help if the kernel is dropping packets because the conntrack table is full.
    • Next: if utilization is above 90%, increase the limit or reduce connection churn.
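
The order of the first checks can be condensed into a small triage helper. This is only a sketch of the decision order in steps 2 to 7, using the thresholds quoted in this guide (2 x syncPeriod, 10 seconds, 10,000 rules); the function name triage and its argument order are hypothetical, and all inputs are whole numbers you have already collected.

```shell
# Sketch: condense steps 2-7 into one pass over already-collected numbers.
# triage is a hypothetical helper name.
# Usage: triage AGE_SECONDS P99_SECONDS SYNC_PERIOD RULE_COUNT PENDING_CHANGES
# Returns 0 when the last sync is fresh, 1 when the node looks stalled.
triage() {
  age=$1 p99=$2 period=$3 rules=$4 pending=$5
  if [ "$age" -le $(( 2 * period )) ]; then
    echo "fresh: last sync is within 2 x syncPeriod"
    return 0
  fi
  echo "stale: last sync is ${age}s old"
  [ "$p99" -gt 10 ] && echo "slow sync: p99 ${p99}s exceeds 10s"
  [ "$rules" -gt 10000 ] && echo "rule bloat: ${rules} nat rules"
  [ "$pending" -gt 0 ] && echo "churn backlog: ${pending} endpoint changes pending"
  return 1
}
```

A stale result with none of the follow-up lines points toward the remaining steps: lock contention, a hung iptables-restore, or a dead API server watch.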

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| kubeproxy_sync_proxy_rules_duration_seconds p99 | Measures time to reconcile all iptables rules | p99 > 10s, or > 80% of syncPeriod |
| Age of kubeproxy_sync_proxy_rules_last_timestamp_seconds | Indicates how stale the programmed rules are | Age > 2 x syncPeriod |
| kubeproxy_sync_proxy_rules_endpoint_changes_pending | Backlog of unprocessed endpoint updates | Non-zero and growing |
| kubeproxy_sync_proxy_rules_iptables_restore_failures_total rate | Direct signal of lock contention or restore failure | Any sustained increase |
| iptables rule count | Scaling indicator in iptables mode | > 10,000 rules |
| nf_conntrack_count / nf_conntrack_max | Shared kernel resource consumed by NAT | > 80% utilization |
| kube-proxy healthz / readyz | Binary readiness state | Non-200, or readyz returning 503 |
| rest_client_requests_total errors | API server connectivity from kube-proxy | 5xx or 429 responses |
| CNI sandbox creation latency | Downstream symptom of xtables lock contention | Timeouts during pod creation |
| Pod startup duration on node | End-to-end impact of stalled rules | p99 > 60s |

Fixes

If the cause is xtables lock contention

Identify the competing process with lsof /run/xtables.lock. If it is a CNI plugin or daemon, restart it and reduce its iptables refresh frequency. As a longer-term fix, evaluate migrating the cluster to IPVS or nftables mode. IPVS uses hash-based lookups and programs far fewer iptables rules, so kube-proxy holds the xtables lock for much less time per sync. nftables mode does not take the xtables lock for its own rules, which reduces contention between kube-proxy and CNI plugins.
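
To turn the lsof output into a short list of suspects, a sketch like the following can help. The function name lock_holders is hypothetical, and it assumes the default lsof column layout (COMMAND first, PID second, one header line). Note that having the file open does not prove a process currently holds the lock; it only narrows down who competes for it.

```shell
# Sketch: summarize which processes have the xtables lock file open.
# lock_holders is a hypothetical helper name.
# Reads default-format `lsof` output on stdin (header line, then one line
# per open file descriptor) and prints unique "PID COMMAND" pairs.
lock_holders() {
  awk 'NR > 1 { print $2, $1 }' | sort -u
}
```

Usage: sudo lsof /run/xtables.lock | lock_holders. Running it a few times in a row shows whether the same process keeps reappearing.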

If the cause is rule bloat or high sync duration

Audit the cluster for unnecessary Services and large EndpointSlices. If the cluster has grown beyond the comfortable limit for iptables mode, increase the syncPeriod temporarily to allow full syncs to complete, then plan a migration to IPVS or nftables. Do not increase sync frequency; that worsens the problem.
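
Before auditing Services, it helps to see where the rules actually are. The sketch below groups appended NAT rules by chain family. The function name nat_rule_breakdown is hypothetical; the KUBE-SVC- and KUBE-SEP- prefixes are the per-Service and per-endpoint chains kube-proxy creates in iptables mode, and everything else is lumped into two catch-all groups.

```shell
# Sketch: break down NAT rules by chain family to see what drives rule count.
# nat_rule_breakdown is a hypothetical helper name.
# Reads `iptables -t nat -S` output on stdin and prints "COUNT FAMILY" lines,
# largest first. Only appended rules (-A) are counted, not chain declarations.
nat_rule_breakdown() {
  awk '
    $1 == "-A" {
      chain = $2
      if (chain ~ /^KUBE-SVC-/) key = "KUBE-SVC"
      else if (chain ~ /^KUBE-SEP-/) key = "KUBE-SEP"
      else if (chain ~ /^KUBE-/) key = "KUBE-other"
      else key = "non-kube"
      n[key]++
    }
    END { for (k in n) print n[k], k }' | sort -rn
}
```

Usage: sudo iptables -t nat -S | nat_rule_breakdown. A KUBE-SEP count that dwarfs everything else suggests large EndpointSlices; a large non-kube count suggests another component is contributing to the bloat.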

If the cause is a hung iptables-restore

Kill the hung iptables-restore process, then delete the kube-proxy pod to force a restart and full re-sync. If the node runs an affected iptables version with the nf_tables backend, upgrade iptables or the host image.
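
Finding the hung process by age rather than by eye can be sketched as follows. The function name hung_restores and the default threshold of 60 seconds are assumptions, and the etimes column (elapsed seconds) assumes the procps-ng version of ps found on most Linux distributions.

```shell
# Sketch: list iptables-restore processes that have run longer than a threshold.
# hung_restores is a hypothetical helper name; the 60s default is an assumption.
# Reads `ps -eo pid=,etimes=,args=` output on stdin (PID, elapsed seconds,
# full command line) and prints the PIDs of matching processes.
hung_restores() {
  awk -v t="${1:-60}" '
    $2 + 0 > t && $3 ~ /(^|\/)ip6?tables(-nft|-legacy)?-restore$/ { print $1 }'
}
```

Usage: ps -eo pid=,etimes=,args= | hung_restores 300. Review the printed PIDs before killing anything; a restore that finishes in a few seconds on a large rule set is slow, not hung.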

If the cause is endpoint churn exceeding capacity

Reduce deployment rollout surge or HPA scale-out rate. Spread large deployments across time windows. If the cluster is legitimately high-churn, move to IPVS mode, which supports incremental updates without rewriting the entire table.

If conntrack is exhausted

Immediately increase nf_conntrack_max to buy headroom. Then identify whether the root cause is a connection leak, excessive UDP traffic, or overly long TIME_WAIT timeouts. Tune nf_conntrack_udp_timeout_stream for UDP-heavy workloads.
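
How much headroom to buy is a simple calculation. The sketch below works out current utilization and the table size needed to bring the current entry count down to a chosen utilization. The function names are hypothetical, and any target percentage you pass in is your own choice, not a recommendation from kube-proxy or the kernel.

```shell
# Sketch: conntrack headroom arithmetic with whole numbers.
# conntrack_pct and conntrack_target_max are hypothetical helper names.

# conntrack_pct COUNT MAX
# Prints current utilization as a whole percentage (rounded down).
conntrack_pct() {
  echo $(( $1 * 100 / $2 ))
}

# conntrack_target_max COUNT TARGET_PCT
# Prints the nf_conntrack_max value at which COUNT entries would sit at
# TARGET_PCT utilization (rounded up).
conntrack_target_max() {
  echo $(( ($1 * 100 + $2 - 1) / $2 ))
}
```

For example, 240000 entries against a limit of 262144 is 91% utilization, and bringing that down to 60% needs a limit of 400000. The new value can then be applied with sysctl -w net.netfilter.nf_conntrack_max=400000 and persisted through your usual sysctl configuration.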

Prevention

  • Alert on kubeproxy_sync_proxy_rules_duration_seconds p99 crossing 5 seconds (about 17% of the default 30-second syncPeriod), not just on kube-proxy restarts.
  • Collect and alert on kube-proxy log messages containing xtables lock to catch contention before it causes sync failures.
  • Monitor iptables rule count per node and establish a runway projection. Plan a migration to IPVS or nftables before reaching 10,000 rules.
  • Size nf_conntrack_max for peak traffic plus headroom. Account for TIME_WAIT and UDP entries, not just established TCP connections.
  • If you run Kubernetes v1.28 or later, revisit legacy minSyncPeriod tunings that were previously used to mitigate full-table rewrite overhead. Incremental updates make large values unnecessary and can delay convergence.
  • Exercise failure modes in staging: terminate kube-proxy watches, simulate high endpoint churn, and measure sync latency under load.
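
The runway projection mentioned above can be as simple as a linear estimate. This sketch assumes rule growth is roughly constant per day, which is rarely exact; the function name rule_runway_days is hypothetical, and the 10,000-rule default limit is the planning threshold used in this guide.

```shell
# Sketch: linear runway projection for iptables rule count.
# rule_runway_days is a hypothetical helper name.
# Usage: rule_runway_days CURRENT_RULES GROWTH_PER_DAY [LIMIT]
# Prints the whole number of days until LIMIT (default 10000) is reached.
rule_runway_days() {
  current=$1 per_day=$2 limit=${3:-10000}
  if [ "$current" -ge "$limit" ]; then echo 0; return 0; fi
  if [ "$per_day" -le 0 ]; then echo "no limit projected"; return 0; fi
  echo $(( (limit - current) / per_day ))
}
```

A node at 7000 rules growing by 150 rules per day has about 20 days of runway, which is the kind of number that should trigger migration planning rather than an alert.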

How Netdata helps

  • Correlate sync duration with node CPU softirq time, CNI timeouts, and conntrack utilization.
  • Alert on sync timestamp age without manual metric scraping across nodes.
  • Flag xtables lock contention from kube-proxy logs.
  • Track iptables rule count trends to forecast scaling limits.