Kubernetes conntrack exhaustion: dropped connections under load
Intermittent connection timeouts under load in Kubernetes often trace to a full nf_conntrack table on the node. Existing TCP sessions stay open, but new connections fail silently. DNS resolution becomes unreliable. Application logs show timeouts to healthy dependencies. The root cause is usually not the application, network policy, or CNI, but kernel connection tracking exhaustion.
Every connection that traverses kube-proxy NAT rules creates an entry in the node’s nf_conntrack table. This finite, node-level table is shared by all workloads and invisible to most application monitoring. When it fills, the kernel drops new connection attempts without sending a TCP reset or ICMP error. The application sees a timeout.
What this means
The Linux kernel’s connection tracking subsystem (nf_conntrack) maintains state for every connection that passes through netfilter, and NAT depends on that state. kube-proxy uses DNAT and SNAT to implement Kubernetes Services, so nearly every pod-to-service connection creates a conntrack entry. These entries persist until the connection closes or a timeout fires.
Because conntrack is a node-level resource, all pods, host processes, and kubelet operations share one table. When the table reaches nf_conntrack_max, the kernel cannot allocate new entries. It silently drops SYN packets for TCP and new UDP flows. Existing established connections continue because their entries remain in the table, which makes the failure appear random and workload-dependent rather than systemic.
DNS usually fails first. Pods send short-lived UDP queries to CoreDNS, and CoreDNS sends more to upstream resolvers. UDP has no FIN or RST, so its conntrack entries leave the table only when a timeout expires. A node under pressure therefore often loses DNS before application TCP traffic fails.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Default table size too low for workload | Timeouts begin as traffic increases; small nodes with many pods hit limits first | nf_conntrack_count against nf_conntrack_max |
| High connection churn / TIME_WAIT accumulation | Microservices making many short HTTP requests fill the table with TIME_WAIT entries | conntrack -L -p tcp state distribution |
| UDP traffic accumulation | DNS, StatsD, or logging traffic creating entries that lack natural TCP cleanup | conntrack -L protocol breakdown |
| Connection leaks | Applications opening connections without closing them; entries never free until timeout | conntrack -L -d <pod-ip> age per destination |
| kube-proxy stale endpoint entries | Removed pods still have conntrack entries because cleanup failed or raced | conntrack -L -d <old-pod-ip> after a rollout |
| Bursty traffic or retry storms | A brief spike in new connections overwhelms the remaining table headroom | conntrack -S drop counter rate |
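
For the connection-leak and noisy-neighbor rows above, grouping entries by source address can show which pod holds most of the table. The following is a minimal sketch, assuming the usual conntrack -L line layout in which the first src= field is the original-direction source; the function name top_conntrack_sources is hypothetical.

```shell
# Count conntrack entries per original source address.
# stdin: lines in `conntrack -L` format; $1: how many rows to print (default 10).
# Output: "<entry count> <source IP>", highest count first.
top_conntrack_sources() {
    limit="${1:-10}"
    awk '{
        # The first src= field on a line is the original-direction source.
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^src=/) { n[substr($i, 5)]++; break }
        }
    } END {
        for (ip in n) print n[ip], ip
    }' | sort -rn | head -n "$limit"
}
```

On a node this could be fed from the live table, for example with conntrack -L 2>/dev/null | top_conntrack_sources 5, and the resulting IPs matched against kubectl get pods -o wide.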
Quick checks
Run these checks on a node showing symptoms. They are read-only unless noted.
# Check conntrack utilization
awk '{c=$1} END {getline m < "/proc/sys/net/netfilter/nf_conntrack_max"; printf "%.1f%%\n", c*100/m}' /proc/sys/net/netfilter/nf_conntrack_count
# Check for active drops
conntrack -S
# Check kernel log for table-full messages
dmesg | grep -i "nf_conntrack.*table full"
# List conntrack entries by protocol to spot UDP accumulation
conntrack -L | awk '{print $1}' | sort | uniq -c | sort -rn
# Check TCP state distribution for TIME_WAIT bloat
conntrack -L -p tcp | awk '{print $4}' | sort | uniq -c | sort -rn
# Verify kube-proxy is syncing rules and not stuck
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds
# Check if IPVS conntrack is also filling (IPVS mode only)
ipvsadm -Lcn | wc -l
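
Because conntrack -S reports its counters per CPU, it can help to total the drop field and watch whether it grows. This is a sketch only, assuming the space-separated key=value layout printed by common conntrack-tools versions; both function names are hypothetical.

```shell
# Sum the per-CPU "drop=" counters from `conntrack -S` output read on stdin.
# Fields such as "early_drop=" and "insert_failed=" are deliberately ignored.
sum_conntrack_drops() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^drop=[0-9]+$/) {
                split($i, kv, "=")
                total += kv[2]
            }
        }
    } END { print total + 0 }'
}

# Print how much the summed drop counter grows over an interval.
# $1: interval in seconds (default 10). Needs root and the conntrack CLI.
conntrack_drop_delta() {
    interval="${1:-10}"
    before=$(conntrack -S | sum_conntrack_drops)
    sleep "$interval"
    after=$(conntrack -S | sum_conntrack_drops)
    echo $((after - before))
}
```

A non-zero result from conntrack_drop_delta 10 would suggest the node is dropping packets right now, not only at some point since boot.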
How to diagnose it
1. Confirm the table is full. Check the ratio of nf_conntrack_count to nf_conntrack_max. If utilization is above 85%, the node is in the danger zone. Above 95%, new connections are likely being dropped.
2. Confirm drops are occurring. Run conntrack -S and look for a non-zero drop counter. If the counter is increasing, the kernel is actively dropping new connections. Also check dmesg for the string nf_conntrack: table full, dropping packet. This is definitive proof.
3. Determine whether the problem is isolated or widespread. Check the same metrics on other nodes. Conntrack exhaustion is usually workload-dependent. If only one node is affected, look for a noisy neighbor pod or a connection leak on that node. If many nodes are affected, either the cluster-wide traffic pattern is the cause or the default nf_conntrack_max is too low.
4. Identify which protocol is filling the table. Use conntrack -L grouped by protocol. If UDP dominates, suspect DNS, metrics, or logging traffic. If TCP dominates, inspect the TCP state distribution. A high proportion of TIME_WAIT indicates short-lived HTTP connections without reuse. A high proportion of ESTABLISHED indicates long-lived or leaked connections.
5. Correlate with workload changes. Check if the issue started after a deployment rollout, a scale-up event, or a configuration change that increased connection rates. Rolling updates spike conntrack usage when old and new endpoints coexist and kube-proxy has not yet flushed old entries.
6. Check kube-proxy sync health. A kube-proxy instance with a dead API server watch or a stalled sync loop may fail to clean up conntrack entries for removed endpoints. Verify that kubeproxy_sync_proxy_rules_last_timestamp_seconds is advancing and that the process is not crash-looping.
7. Check for stale endpoint entries. After a rolling update, run conntrack -L -d <old-pod-ip> for IPs of terminated pods. If entries remain, kube-proxy’s cleanup did not run or lost a race. These stale entries consume table space until the TCP or UDP timeout expires.
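
The utilization thresholds from the first diagnostic step above (85% and 95%) can be turned into a small classifier. This is a sketch, not a definitive check: the function name conntrack_severity and its labels are hypothetical, and the comparison uses integer arithmetic only.

```shell
# Classify conntrack table utilization.
# $1: current entry count; $2: table maximum (both integers).
# Prints "<label> <percent>%"; the percentage is rounded down for display.
conntrack_severity() {
    count="$1"
    max="$2"
    if [ "$max" -le 0 ]; then
        echo "unknown"
        return 1
    fi
    pct=$((count * 100 / max))
    # Compare cross-multiplied values so rounding never hides a breach.
    if [ $((count * 100)) -gt $((max * 95)) ]; then
        echo "critical ${pct}%"
    elif [ $((count * 100)) -gt $((max * 85)) ]; then
        echo "danger ${pct}%"
    else
        echo "ok ${pct}%"
    fi
}
```

On a node, the two arguments would come from /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max, for example conntrack_severity "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" "$(cat /proc/sys/net/netfilter/nf_conntrack_max)".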
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| nf_conntrack_count / nf_conntrack_max | Measures table utilization. This is the primary indicator. | Sustained ratio above 70% |
| conntrack -S drop counter | Confirms active packet loss due to exhaustion. | Any increasing value |
| dmesg nf_conntrack: table full | Definitive kernel-level evidence of the failure. | Any occurrence |
| TCP TIME_WAIT ratio | Reveals connection churn from short-lived HTTP flows. | TIME_WAIT exceeds 50% of TCP entries |
| UDP conntrack entry count | UDP lacks connection close signals, so entries accumulate silently. | UDP count growing steadily without traffic decrease |
| kube-proxy sync timestamp age | Stale rules prevent endpoint cleanup, extending conntrack lifetime. | Last sync older than 2 minutes |
| CoreDNS SERVFAIL rate | DNS fails first because UDP queries are small and frequent. | coredns_dns_responses_total{rcode="SERVFAIL"} increasing |
| Application connection timeout rate | The user-visible symptom of silent SYN drops. | Timeouts correlating with specific nodes |
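
The kube-proxy sync timestamp is exported as a Unix time, so its age has to be computed before it can be compared with the 2-minute warning sign. A minimal sketch, assuming the metric is served in Prometheus text format, with or without a label set; the function name kubeproxy_sync_age is hypothetical.

```shell
# Print the age, in whole seconds, of kube-proxy's last rule sync.
# stdin: Prometheus text-format metrics; $1: current Unix time (default: now).
# If several series match (for example one per IP family), the oldest is
# reported. Returns non-zero when the metric is absent.
kubeproxy_sync_age() {
    now="${1:-$(date +%s)}"
    awk -v now="$now" '
        $1 ~ /^kubeproxy_sync_proxy_rules_last_timestamp_seconds([{].*)?$/ {
            # Values may use scientific notation (1.7e+09); awk converts them.
            age = now - $2
            if (!found || age > worst) worst = age
            found = 1
        }
        END {
            if (!found) exit 1
            printf "%d\n", worst
        }'
}
```

A possible use is curl -s http://localhost:10249/metrics | kubeproxy_sync_age, treating any result above 120 as the warning condition from the table.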
Fixes
Immediate relief
Increase the table size. This is safe and takes effect immediately without restarting services.
# Raise the conntrack limit (temporary). 131072 is an example: pick a value above the current nf_conntrack_max
sudo sysctl -w net.netfilter.nf_conntrack_max=131072
# Persist across reboots by adding to sysctl.d
echo "net.netfilter.nf_conntrack_max=131072" | sudo tee /etc/sysctl.d/99-conntrack.conf
Flush stale entries. If you have confirmed that old pod IPs are filling the table after a rollout, you can delete entries for a specific dead IP. This is state-changing but low-risk if the IP is truly terminated.
# Remove entries for a terminated pod IP (state-changing)
conntrack -D -d <old-pod-ip>
If the cause is connection churn
Reduce TCP TIME_WAIT accumulation by enabling connection reuse and pooling in clients. Ensure connections are closed properly.
Tune conntrack timeouts if your workload is dominated by short-lived flows. Lowering nf_conntrack_tcp_timeout_time_wait from the default 120 seconds can help, but this changes kernel behavior globally and should be tested in staging first.
# Reduce TIME_WAIT timeout (test before applying in production)
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
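
To judge how much a shorter timeout might save, the steady-state number of TIME_WAIT entries can be approximated as the connection close rate multiplied by the timeout. The sketch below is back-of-the-envelope arithmetic under that assumption; the function name and the 500 connections per second used in the note are hypothetical.

```shell
# Estimate steady-state TIME_WAIT conntrack entries as
# (connections closed per second) x (TIME_WAIT timeout in seconds).
time_wait_entries() {
    rate="$1"      # hypothetical connection close rate, per second
    timeout="$2"   # nf_conntrack_tcp_timeout_time_wait, in seconds
    echo $((rate * timeout))
}
```

At 500 closed connections per second, time_wait_entries 500 120 gives 60000 entries with the default timeout, while time_wait_entries 500 30 gives 15000.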
If the cause is UDP accumulation
Reduce UDP stream timeout for DNS and metrics traffic. The default nf_conntrack_udp_timeout_stream can be too high for high-churn workloads, causing entries to accumulate.
sudo sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=30
If the cause is kube-proxy stale rules
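Restart kube-proxy on the affected node to force a full resync and conntrack cleanup. Existing kernel rules persist during the brief restart, so established Service traffic keeps flowing, but endpoint changes are not applied on that node until the new pod completes its first sync.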
# Disruptive: restarts kube-proxy pods on the target node
kubectl delete pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=<node>
If the cause is proxy mode scaling limits
In iptables mode, kube-proxy holds the xtables lock during iptables-restore, which can delay syncs and extend the window during which stale conntrack entries persist. If your cluster runs thousands of Services, evaluate migrating to IPVS or nftables mode. These modes reduce sync duration and lock contention, which indirectly improves conntrack cleanup latency.
Prevention
Monitor conntrack utilization per node. Set alerts at 70% of nf_conntrack_max to provide runway before the cliff edge. Do not wait for drops.
Size nf_conntrack_max for your node density. A table of 1,000,000 entries consumes roughly 300 MB of kernel memory. Size the limit to accommodate peak connection count plus TIME_WAIT and UDP overhead, but ensure you have enough system memory.
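
The memory cost of a candidate limit can be estimated before changing it. This sketch assumes roughly 300 bytes per entry, a figure derived from the 1,000,000 entries to 300 MB ratio above; the real per-entry size varies by kernel version and enabled conntrack extensions, and the function name conntrack_mem_mb is hypothetical.

```shell
# Approximate kernel memory used by a full conntrack table, in MB (10^6 bytes).
# $1: number of entries; $2: bytes per entry (default 300, an assumption).
conntrack_mem_mb() {
    entries="$1"
    bytes_per_entry="${2:-300}"
    echo $((entries * bytes_per_entry / 1000000))
}
```

For example, conntrack_mem_mb 262144 prints 78, so a limit of 262144 entries would need on the order of 78 MB under this assumption.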
Fix connection leaks at the application level. Conntrack exhaustion is often a symptom of clients that open connections without closing them. Application health checks, connection pool metrics, and file descriptor counts are leading indicators.
Tune timeouts for your traffic pattern. The defaults assume general-purpose servers. Nodes running high-churn microservices or DNS-heavy workloads benefit from shorter TCP TIME_WAIT and UDP stream timeouts.
Limit unnecessary connection creation. Readiness and liveness probes that create new TCP sessions on every execution contribute to table pressure. Where possible, configure probes to reuse connections or use less frequent intervals.
Review NodePort and ExternalTrafficPolicy usage. Services with externalTrafficPolicy: Local create additional health check connections. NodePort services on busy nodes increase the total connection count because every node must accept the traffic.
How Netdata helps
Netdata monitors these signals per node and correlates them for faster root-cause analysis:
- Conntrack utilization: nf_conntrack_count against nf_conntrack_max per node, with real-time history.
- Kernel drops: Packets dropped by the conntrack subsystem, shown alongside TCP retransmission rates.
- Cross-signal context: Conntrack saturation correlated with CoreDNS SERVFAIL rates, kube-proxy sync latency, and pod-level connection counts.
Related guides
- Kubernetes DNS resolution failures inside pods
- Kubernetes node NotReady: kubelet, runtime, and network diagnosis
- Kubernetes pod stuck ContainerCreating: volume, network, and image issues
- Kubernetes monitoring checklist: the signals every production cluster needs
- Kubernetes API server slow or unresponsive: causes and fixes






