Kubernetes kube-proxy iptables sync stall: causes and recovery
Pods fail to start. Services intermittently route traffic to dead endpoints. The kube-proxy health endpoint still returns HTTP 200, so the DaemonSet looks healthy, yet rules drift further behind with every sync cycle.
An iptables sync stall is not a crash. It is a slowdown or blockage in the control loop that translates Service and EndpointSlice state into kernel NAT rules. When kube-proxy cannot acquire the global xtables lock, when iptables-restore hangs, or when the rule set grows too large to reconcile within the sync period, the node forwards packets using stale rules. New endpoints are invisible. Terminated pods still receive connections. CNI plugins that also need the xtables lock time out, and pod sandbox creation fails.
This guide covers failure mechanics, distinguishing sync stalls from other network problems, and recovery steps.
What this means
kube-proxy in iptables mode reconciles desired state by building a complete set of iptables NAT rules and applying them atomically via iptables-restore. This requires the global xtables lock (/run/xtables.lock). While kube-proxy holds that lock, no other process on the node can modify iptables rules.
The default syncPeriod is 30 seconds. Kubernetes considers kube-proxy unhealthy on the :10256/healthz endpoint when network programming exceeds twice that value, or 60 seconds by default. A stall begins when a single sync takes long enough to miss the next cycle, when lock contention causes repeated restore failures, or when the process hangs entirely inside iptables-restore.
Since Kubernetes v1.28, the iptables proxier uses incremental updates, rewriting only rules for changed Services and EndpointSlices. On older versions, or at extreme scale, every sync may still rewrite the full table. Either way, the symptoms are identical: the gap between API server state and kernel state grows, and the node silently drops or misroutes connections.
flowchart TD
A[EndpointSlice change or syncPeriod tick] --> B{Can kube-proxy acquire xtables lock?}
B -->|Yes| C[Run iptables-restore]
B -->|No| D[Sync aborts or retries]
C --> E{Sync duration < syncPeriod?}
E -->|Yes| F[Rules updated]
E -->|No| G[Sync backlog grows]
D --> H[Stale endpoints persist]
G --> H
H --> I[CNI plugins timeout on lock]
H --> J[Traffic routed to dead pods]
I --> K[Pod startup fails or slows]
J --> L[Connection resets or drops]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| xtables lock contention | kubeproxy_sync_proxy_rules_iptables_restore_failures_total increasing; kube-proxy logs show lock acquisition failures; pods stuck in ContainerCreating with CNI timeouts | lsof /run/xtables.lock and kube-proxy logs for xtables lock |
| Rule bloat (large clusters) | kubeproxy_sync_proxy_rules_duration_seconds p99 approaching or exceeding 30s; thousands of iptables rules | iptables -t nat -S \| wc -l |
| iptables-restore hang | Sync duration flatlines; iptables-restore process visible in ps for minutes; kube-proxy stops advancing its last sync timestamp | ps aux \| grep iptables-restore and sync timestamp metric |
| Endpoint churn exceeding sync capacity | kubeproxy_sync_proxy_rules_endpoint_changes_pending growing during rolling updates | Pending endpoint changes vs processed total |
| API server watch death (silent) | Last sync timestamp frozen; no error logs; new Services unreachable from the node | ss -tnp \| grep kube-proxy \| grep 6443 and sync timestamp age |
Quick checks
# Check last successful sync age (should be well under 60s)
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds
# Check p99 sync duration against the 30s syncPeriod
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_duration_seconds
# Count kube-proxy managed iptables rules
sudo iptables -t nat -S | grep -c "KUBE-"
# Check for lock contention messages
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep -i "xtables\|lock"
# Check pending endpoint changes
curl -s http://localhost:10249/metrics | grep endpoint_changes_pending
# Check iptables restore failure rate
curl -s http://localhost:10249/metrics | grep sync_proxy_rules_iptables_restore_failures_total
# Identify what holds the xtables lock
sudo lsof /run/xtables.lock
# Check conntrack utilization (adjacent shared resource)
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) * 100 / $(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc
# Check kube-proxy binary health (200 does not mean fresh rules)
curl -s -o /dev/null -w "%{http_code}" http://localhost:10256/healthz
# Check CNI timeouts that correlate with lock contention
journalctl -u kubelet --since "5 minutes ago" | grep -i "CNI.*timeout\|sandbox"
How to diagnose it
Confirm kube-proxy is running but stale. Query http://localhost:10256/healthz. A 200 means the initial sync completed, not that rules are current. If traffic fails while healthz returns 200, you have a silent stall.
- Why it matters: Restarting a crashed kube-proxy differs from unsticking a live one.
- Next: check the last sync timestamp.
Measure sync freshness. Inspect kubeproxy_sync_proxy_rules_last_timestamp_seconds and subtract it from the current epoch time. A gap greater than twice the configured syncPeriod means the node is programming rules too slowly or not at all.
- Why it matters: This distinguishes a transient spike from a sustained stall.
- Next: inspect sync duration percentiles.
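The freshness check above can be scripted. The following is a minimal sketch, not part of kube-proxy: sync_age and is_stale are hypothetical helper names, and the sketch assumes the metric appears in the Prometheus text output under the name used in this guide (possibly with labels, in which case the oldest series is reported).

```shell
# sync_age NOW_EPOCH
# Reads Prometheus text from stdin and prints the age, in whole seconds, of the
# oldest kubeproxy_sync_proxy_rules_last_timestamp_seconds sample.
# Returns non-zero if the metric is absent.
sync_age() {
  awk -v now="$1" '
    BEGIN { found = 0 }
    index($1, "kubeproxy_sync_proxy_rules_last_timestamp_seconds") == 1 {
      ts = $2 + 0                      # awk parses values such as 1.7e+09
      if (!found || ts < oldest) oldest = ts
      found = 1
    }
    END {
      if (!found) exit 1
      printf "%d\n", now - oldest
    }
  '
}

# is_stale AGE SYNC_PERIOD
# Succeeds (exit 0) when AGE exceeds twice SYNC_PERIOD, both in whole seconds.
is_stale() {
  [ "$1" -gt $(( 2 * $2 )) ]
}

# Example, assuming the metrics address used in the quick checks:
#   age=$(curl -s http://localhost:10249/metrics | sync_age "$(date +%s)")
#   is_stale "$age" 30 && echo "stale: last sync ${age}s ago"
```

Passing the current time as an argument keeps the parsing step deterministic, so the same function can be run against a saved metrics dump from an incident.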
Compare sync duration to syncPeriod. Look at the p99 of kubeproxy_sync_proxy_rules_duration_seconds. In iptables mode, values above 10 seconds are concerning; values approaching the 30-second syncPeriod indicate the loop is about to backlog.
- Why it matters: Once sync duration exceeds syncPeriod, kube-proxy can never catch up on a busy node.
- Next: determine if the cause is lock contention or rule bloat.
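Without a Prometheus server to compute quantiles, a rough ceiling on p99 can be read straight from the histogram buckets. The sketch below is one way to do that; p99_upper_bound is a hypothetical helper, and it assumes a single histogram series whose buckets appear in ascending order with the +Inf bucket last. The result is a bucket boundary rather than an exact quantile, and it covers every sync since kube-proxy started, not only recent ones.

```shell
# p99_upper_bound
# Reads Prometheus text from stdin and prints the "le" boundary of the first
# kubeproxy_sync_proxy_rules_duration_seconds bucket that holds at least 99%
# of all observations. Returns non-zero if no observations are found.
p99_upper_bound() {
  awk '
    BEGIN { n = 0 }
    index($1, "kubeproxy_sync_proxy_rules_duration_seconds_bucket") == 1 {
      if (match($1, /le="[^"]*"/)) {
        le[n] = substr($1, RSTART + 4, RLENGTH - 5)   # text between the quotes
        count[n] = $2 + 0
        n++
      }
    }
    END {
      if (n == 0 || count[n - 1] == 0) exit 1
      total = count[n - 1]            # the last (+Inf) bucket is the total
      for (i = 0; i < n; i++)
        if (count[i] * 100 >= total * 99) { print le[i]; exit 0 }
    }
  '
}

# Example, assuming the metrics address used in the quick checks:
#   curl -s http://localhost:10249/metrics | p99_upper_bound
```

If the printed boundary is at or above the configured syncPeriod, at least 1% of syncs since startup took long enough to land in that bucket or a higher one.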
Check for xtables lock contention. Search kube-proxy logs for messages about the xtables lock. Run lsof /run/xtables.lock to see which process holds it. Common culprits include CNI portmap plugins, Flannel, and VPC CNIs that refresh rules periodically.
- Why it matters: The lock is global. If a CNI plugin holds it, kube-proxy blocks, and vice versa.
- Next: check CNI plugin logs for iptables timeouts.
Assess rule set scale. Count rules with iptables -t nat -S | wc -l. In iptables mode, performance degrades linearly as rule count grows. Above 10,000 rules, sync times become a scaling bottleneck.
- Why it matters: This tells you whether the stall is architectural (iptables mode limits) or environmental (lock contention).
- Next: if rule count is high, evaluate proxy mode migration.
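A raw line count does not show where the rules come from. The sketch below breaks the NAT table down a little further; nat_rule_summary is a hypothetical helper, and it relies on the KUBE-SVC- and KUBE-SEP- chain name prefixes that kube-proxy uses for per-Service and per-endpoint chains in iptables mode.

```shell
# nat_rule_summary
# Reads `iptables -t nat -S` output from stdin and prints one line with the
# number of appended rules, per-Service chains, and per-endpoint chains.
nat_rule_summary() {
  awk '
    BEGIN { rules = 0; svc = 0; sep = 0 }
    $1 == "-A" { rules++ }
    $1 == "-N" && index($2, "KUBE-SVC-") == 1 { svc++ }
    $1 == "-N" && index($2, "KUBE-SEP-") == 1 { sep++ }
    END { printf "rules=%d svc_chains=%d sep_chains=%d\n", rules, svc, sep }
  '
}

# Example:
#   sudo iptables -t nat -S | nat_rule_summary
```

A high sep_chains count relative to svc_chains points at a few Services with very large endpoint sets, while a high svc_chains count points at Service sprawl.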
Look for hung iptables-restore processes. Run ps aux | grep iptables-restore. If a process has been running for minutes or hours, the nf_tables backend may be stuck.
- Why it matters: A hung restore blocks the entire sync loop until the process is killed.
- Next: kill the hung process and restart kube-proxy.
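To avoid judging process age by eye, the elapsed time can be filtered directly. This is a minimal sketch: long_running_restores is a hypothetical helper, the 120-second threshold in the example is an arbitrary choice, and the input format assumes a ps that supports the etimes (elapsed seconds) column, as procps-ng does. It only lists PIDs and does not kill anything.

```shell
# long_running_restores MIN_SECONDS
# Reads "PID ELAPSED_SECONDS COMMAND [ARGS...]" lines from stdin and prints the
# PID of every iptables-restore style process older than MIN_SECONDS.
long_running_restores() {
  awk -v min="$1" '
    $3 ~ "(^|/)ip6?tables(-nft|-legacy)?-restore$" && $2 + 0 > min + 0 {
      print $1
    }
  '
}

# Example:
#   ps -eo pid=,etimes=,args= | long_running_restores 120
```

Matching on the command path rather than the whole line avoids false positives from processes that merely mention iptables-restore in their arguments.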
Check endpoint churn backlog. Query kubeproxy_sync_proxy_rules_endpoint_changes_pending. If the number is non-zero and growing, the API server is generating changes faster than the node can apply them.
- Why it matters: This happens during large rolling updates or HPA storms.
- Next: slow the churn or increase sync capacity.
Verify API server watch connectivity. Check rest_client_requests_total for 5xx or 429 errors, and verify with ss that kube-proxy has an active TCP connection to the API server on port 6443.
- Why it matters: A dead watch causes silent staleness with no iptables errors.
- Next: restart kube-proxy to re-establish the watch.
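The error check on rest_client_requests_total can be reduced to a single number. The sketch below is illustrative only: api_error_total is a hypothetical helper, and it assumes each series carries a code label and no spaces inside its label set. Because the counter is cumulative, compare two samples taken some time apart rather than reading a single value.

```shell
# api_error_total
# Reads Prometheus text from stdin and prints the summed
# rest_client_requests_total count for responses with code 429 or any 5xx.
api_error_total() {
  awk '
    BEGIN { sum = 0 }
    index($1, "rest_client_requests_total{") == 1 &&
      match($1, /code="(429|5[0-9][0-9])"/) { sum += $NF }
    END { printf "%d\n", sum }
  '
}

# Example, assuming the metrics address used in the quick checks:
#   before=$(curl -s http://localhost:10249/metrics | api_error_total)
#   sleep 60
#   after=$(curl -s http://localhost:10249/metrics | api_error_total)
#   echo "API errors in the last minute: $(( after - before ))"
```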
Rule out conntrack exhaustion. Check nf_conntrack_count against nf_conntrack_max. Sync stalls often occur alongside connection churn that fills the conntrack table, producing identical timeout symptoms.
- Why it matters: Fixing kube-proxy does not help if the kernel is dropping packets because the conntrack table is full.
- Next: if utilization is above 90%, increase the limit or reduce connection churn.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| kubeproxy_sync_proxy_rules_duration_seconds p99 | Measures time to reconcile all iptables rules | p99 > 10s, or > 80% of syncPeriod |
| Age of kubeproxy_sync_proxy_rules_last_timestamp_seconds | Indicates how stale the programmed rules are | Age > 2 x syncPeriod |
| kubeproxy_sync_proxy_rules_endpoint_changes_pending | Backlog of unprocessed endpoint updates | Non-zero and growing |
| kubeproxy_sync_proxy_rules_iptables_restore_failures_total rate | Direct signal of lock contention or restore failure | Any sustained increase |
| iptables rule count | Scaling indicator in iptables mode | > 10,000 rules |
| nf_conntrack_count / nf_conntrack_max | Shared kernel resource consumed by NAT | > 80% utilization |
| kube-proxy healthz / readyz | Binary readiness state | Non-200, or readyz returning 503 |
| rest_client_requests_total errors | API server connectivity from kube-proxy | 5xx or 429 responses |
| CNI sandbox creation latency | Downstream symptom of xtables lock contention | Timeouts during pod creation |
| Pod startup duration on node | End-to-end impact of stalled rules | p99 > 60s |
Fixes
If the cause is xtables lock contention
Identify the competing process with lsof /run/xtables.lock. If it is a CNI plugin or daemon, restart it and reduce its iptables refresh frequency. As a longer-term fix, evaluate migrating the cluster to IPVS or nftables mode. IPVS uses hash-based lookups and programs far fewer iptables rules, so kube-proxy holds the global xtables lock much more briefly during updates. nftables mode does not go through the xtables lock and keeps kube-proxy's rules in a table of its own, which reduces contention between kube-proxy and CNI plugins.
If the cause is rule bloat or high sync duration
Audit the cluster for unnecessary Services and large EndpointSlices. If the cluster has grown beyond the comfortable limit for iptables mode, increase the syncPeriod temporarily to allow full syncs to complete, then plan a migration to IPVS or nftables. Do not increase sync frequency; that worsens the problem.
If the cause is a hung iptables-restore
Kill the hung iptables-restore process, then delete the kube-proxy pod to force a restart and full re-sync. If the node runs an affected iptables version with the nf_tables backend, upgrade iptables or the host image.
If the cause is endpoint churn exceeding capacity
Reduce deployment rollout surge or HPA scale-out rate. Spread large deployments across time windows. If the cluster is legitimately high-churn, move to IPVS mode, which supports incremental updates without rewriting the entire table.
If conntrack is exhausted
Immediately increase nf_conntrack_max to buy headroom. Then identify whether the root cause is a connection leak, excessive UDP traffic, or overly long TIME_WAIT timeouts. Tune nf_conntrack_udp_timeout_stream for UDP-heavy workloads.
Prevention
- Alert on kubeproxy_sync_proxy_rules_duration_seconds p99 crossing 5 seconds (or 25% of your configured syncPeriod), not just on kube-proxy restarts.
- Collect and alert on kube-proxy log messages containing "xtables lock" to catch contention before it causes sync failures.
- Monitor iptables rule count per node and establish a runway projection. Plan a migration to IPVS or nftables before reaching 10,000 rules.
- Size nf_conntrack_max for peak traffic plus headroom. Account for TIME_WAIT and UDP entries, not just established TCP connections.
- If you run Kubernetes v1.28 or later, revisit legacy minSyncPeriod tunings that were previously used to mitigate full-table rewrite overhead. Incremental updates make large values unnecessary and can delay convergence.
- Exercise failure modes in staging: terminate kube-proxy watches, simulate high endpoint churn, and measure sync latency under load.
How Netdata helps
- Correlate sync duration with node CPU softirq time, CNI timeouts, and conntrack utilization.
- Alert on sync timestamp age without manual metric scraping across nodes.
- Flag xtables lock contention from kube-proxy logs.
- Track iptables rule count trends to forecast scaling limits.






