Kubernetes stale conntrack during rolling updates: intermittent connection resets
You deploy a new version of a service. The rollout reports healthy. Error budgets look fine. Then you notice sporadic connection resets in application logs, a handful of timeout errors, or brief latency spikes that correlate exactly with pod terminations. The failures are intermittent, last only seconds, and never trigger a full outage. This is the stale conntrack race during rolling updates.
When a backend pod terminates, kube-proxy removes the endpoint from iptables or IPVS rules, but the kernel connection tracking table may still hold state for active TCP flows. Return packets matched against stale entries bypass NAT reversal, reach the client with an unexpected source IP, and trigger a RST. The result is a brief burst of connection failures that operators often misattribute to application bugs or network flapping.
What this means
kube-proxy implements Kubernetes Services by programming DNAT rules that rewrite traffic from a ClusterIP to a backend Pod IP. The kernel’s conntrack subsystem tracks these NAT mappings so return packets are rewritten back to the Service IP.
During a rolling update, a pod enters Terminating. kube-proxy removes the endpoint from the Service backend set, usually by deleting the KUBE-SEP-* iptables chain or removing the real server from IPVS. For active or recently active TCP connections, the conntrack entry often persists after the endpoint is gone. When a return packet for one of those flows arrives, conntrack still has the old Pod IP in its reply path. Instead of reversing DNAT and rewriting the source to the Service IP, the packet is forwarded to the client with the Pod IP as the source.
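On the node, conntrack often marks such a packet as INVALID, typically because its TCP sequence numbers fall outside the tracked window, and INVALID packets skip NAT reversal. The client therefore receives a packet whose source is the Pod IP rather than the Service IP it connected to. Its TCP stack finds no matching connection for that address and responds with a TCP RST, killing the connection. Because this only affects flows active when the old pod terminated, the failures are intermittent and brief.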
The issue is most visible in clusters with high connection reuse, long-lived requests, or workloads that open many concurrent connections to a Service during a rollout. It is not a kube-proxy crash or a network partition. It is a state synchronization race between the control plane’s view of endpoints and the kernel’s view of active connections.
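To make the two directions of a tracked flow concrete, here is a minimal sketch that filters conntrack -L output for entries whose reply-direction source is a given Pod IP. It assumes the usual conntrack-tools text format, where the first src=/dst= pair describes the original direction (client to ClusterIP) and the second pair describes the reply direction (Pod to client). The function name stale_flows_for_pod is illustrative and not part of any tool.

```shell
# Print flows whose reply-direction source is the given Pod IP.
# Reads `conntrack -L` style lines on stdin; $1 is the Pod IP.
# Assumption: the first src=/dst= pair is the original direction and
# the second src= is the reply source (the DNAT target).
stale_flows_for_pod() {
  awk -v ip="$1" '
    {
      seen_src = 0; orig_dst = ""; reply_src = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^src=/) {
          seen_src++
          if (seen_src == 2) reply_src = substr($i, 5)
        } else if ($i ~ /^dst=/ && orig_dst == "") {
          orig_dst = substr($i, 5)
        }
      }
      # Output: protocol, original destination (Service IP), reply source (Pod IP)
      if (reply_src == ip) print $1, orig_dst, "->", reply_src
    }'
}
```

On a node, piping conntrack -L 2>/dev/null into stale_flows_for_pod <old-pod-ip> should list the Service IPs that still have tracked flows NATed to the terminated pod.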
flowchart TD
A[Rolling update: old pod terminates] --> B[kube-proxy removes endpoint from rules]
B --> C[Kernel conntrack entry for active TCP flow persists]
C --> D[Return packet from old Pod IP arrives]
D --> E[Conntrack state bypasses DNAT reversal]
E --> F[Client receives packet from Pod IP instead of Service IP]
F --> G[Client sends TCP RST]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Race between endpoint removal and conntrack flush | Intermittent RSTs on existing connections exactly when old pods terminate | Stale conntrack entries pointing to a terminated Pod IP |
| Insufficient graceful termination period | RSTs spike immediately when a pod enters Terminating, before in-flight requests complete | terminationGracePeriodSeconds and application SIGTERM handling |
| Application exits immediately on SIGTERM | App stops serving while conntrack still holds active flow state | Container lifecycle hooks and graceful shutdown logic |
| kube-proxy sync lag | Stale rules persist longer than expected, widening the race window | kubeproxy_sync_proxy_rules_last_timestamp_seconds age |
Quick checks
# Check for stale conntrack entries pointing to a terminated Pod IP
conntrack -L | grep <old-pod-ip>
# Watch conntrack entries for a Service ClusterIP during a rollout
watch "conntrack -L -d <service-cluster-ip>"
# Check the last time kube-proxy successfully synced rules
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds
# Verify how long the pod has to drain connections
kubectl get pod <pod> -o jsonpath='{.spec.terminationGracePeriodSeconds}'
# Check whether the kernel marks unusual TCP packets as INVALID
sysctl net.netfilter.nf_conntrack_tcp_be_liberal
What good and bad output looks like
- conntrack -L | grep <old-pod-ip> returning entries after the pod is fully terminated confirms stale state.
- kubeproxy_sync_proxy_rules_last_timestamp_seconds more than 60 seconds old means kube-proxy is not processing endpoint changes.
- terminationGracePeriodSeconds of 30 or less is often too short for applications with long in-flight requests.
- nf_conntrack_tcp_be_liberal returning 0 means the kernel is strict about marking out-of-window packets as INVALID, which amplifies the RST behavior.
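The sync timestamp is a Unix epoch value, so it has to be compared against the current time to be meaningful. The sketch below derives the age from Prometheus text-format metrics on stdin and applies the 60-second threshold mentioned above. The function names are hypothetical, and the optional first argument exists only so the current time can be pinned for testing.

```shell
# Print the age in seconds of kube-proxy's last successful rule sync.
# Reads Prometheus text-format metrics on stdin.
# $1 (optional): current epoch seconds, defaults to `date +%s`.
sync_age_seconds() {
  now="${1:-$(date +%s)}"
  awk -v now="$now" '
    # HELP/TYPE comment lines start with "#", so only the sample line matches.
    $1 == "kubeproxy_sync_proxy_rules_last_timestamp_seconds" {
      printf "%d\n", now - $2
    }'
}

# Succeed if the age is within the threshold, fail otherwise.
# $1: age in seconds, $2 (optional): threshold in seconds, default 60.
sync_is_fresh() {
  [ "$1" -le "${2:-60}" ]
}
```

For example, curl -s http://localhost:10249/metrics | sync_age_seconds prints the age on the node where it runs, and sync_is_fresh can then gate an alert or a rollout step.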
How to diagnose it
Follow this flow to confirm stale conntrack is causing your resets and to identify the contributing factor.
1. Correlate failures with deployment events.
   Check application logs, ingress metrics, or client-side connection reset counters. If RSTs or timeouts cluster within seconds of a pod entering Terminating, the timing points to the race.
2. Identify the affected node and service.
   Map the source IP and destination port of the failing connections to a Kubernetes Service. Determine which node performed the Service DNAT for that traffic; for pod-to-ClusterIP traffic this is normally the client pod's node. This is where the stale conntrack entry lives.
3. Capture conntrack state on the node.
   SSH to the node or use a privileged debug pod. Run conntrack -L | grep <old-pod-ip> for the terminated pod, and conntrack -L -d <service-cluster-ip> for the Service. If entries for the old Pod IP persist after the pod is deleted and no longer appears in kubectl get endpoints, the race is confirmed.
4. Verify kube-proxy processed the endpoint removal.
   Query kubeproxy_sync_proxy_rules_last_timestamp_seconds and check kube-proxy logs. If the timestamp is recent and logs show the endpoint was removed, the issue is the kernel conntrack race, not a sync failure. If the timestamp is stale, investigate kube-proxy sync lag first.
5. Evaluate graceful termination behavior.
   Check whether the application stops accepting new connections and waits for in-flight work to finish after receiving SIGTERM. If the process exits immediately on SIGTERM, it is dying before the conntrack entries naturally expire, widening the window for resets.
6. Check for kernel-level INVALID drops.
   If nf_conntrack_tcp_be_liberal is 0, the kernel is likely marking valid return packets as INVALID because their sequence numbers fall outside the expected window after the endpoint change. Setting this sysctl to 1 tells the kernel to be more lenient and is a standard mitigation.
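The correlation step can be made less subjective by counting how many resets fall shortly after a pod termination. The following sketch assumes you have already exported two files of epoch-second timestamps, one per line: pod termination times (for example from cluster events) and connection reset times (for example from application logs). The function name, the file layout, and the 10-second default window are all assumptions for illustration.

```shell
# Count how many reset timestamps fall within a window after any pod
# termination timestamp. Both files hold one epoch-seconds value per line.
# $1: termination times file (must be non-empty)
# $2: reset times file
# $3 (optional): window in seconds, default 10
resets_near_terminations() {
  awk -v w="${3:-10}" '
    NR == FNR { term[++n] = $1; next }   # first file: load terminations
    {
      total++
      for (i = 1; i <= n; i++) {
        d = $1 - term[i]
        if (d >= 0 && d <= w) { near++; break }
      }
    }
    END { printf "%d of %d resets within %ds of a termination\n", near, total, w }
  ' "$1" "$2"
}
```

A high share of resets inside the window supports the conntrack race hypothesis; a low share suggests looking elsewhere before changing kernel settings.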
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| kubeproxy_sync_proxy_rules_last_timestamp_seconds | Frozen timestamp means endpoint removals are not being programmed | Timestamp more than 60 seconds old |
| kubeproxy_sync_proxy_rules_duration_seconds | Long syncs delay endpoint removal and extend the conntrack race window | p99 sync duration exceeds 5 seconds |
| kubeproxy_sync_proxy_rules_endpoint_changes_total | High endpoint churn forces frequent full syncs and can delay removal | Rate sustained above baseline during rollouts |
| rest_client_requests_total from kube-proxy | API connectivity loss prevents timely rule updates | Error codes 4xx or 5xx from API server |
| Conntrack table utilization | General pressure amplifies drop and reset behavior | nf_conntrack_count above 75 percent of nf_conntrack_max |
| Application connection reset rate | Direct symptom of the failure mode | Spikes correlating with rolling updates |
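The conntrack utilization signal is a simple ratio of two kernel counters. As a rough sketch, the helper below computes it as an integer percentage and flags values above the 75 percent warning level from the table. The function name and output format are hypothetical, and the integer division truncates, so 75.9 percent is reported as 75.

```shell
# Print conntrack table utilization as an integer percentage and flag it
# when it exceeds the threshold.
# $1: current entry count, $2: table maximum
# $3 (optional): warning threshold in percent, default 75
conntrack_utilization() {
  pct=$(( $1 * 100 / $2 ))
  if [ "$pct" -gt "${3:-75}" ]; then
    echo "${pct}% WARN"
  else
    echo "${pct}% ok"
  fi
}
```

On a node, the two inputs can be read from /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max.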
Fixes
If the cause is a conntrack NAT race
The most effective kernel-level mitigation is to set net.netfilter.nf_conntrack_tcp_be_liberal=1 on every node. This prevents the kernel from marking legitimate return packets as INVALID solely because their TCP sequence numbers fall outside the expected window after an endpoint change. Many managed Kubernetes distributions set this automatically; if you run self-managed nodes, add it to your node bootstrap or a sysctl-tuning DaemonSet.
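For self-managed nodes, one way to persist the setting is a sysctl drop-in file. The sketch below writes such a file into a directory you pass in. The file name 99-conntrack-liberal.conf and the function name are arbitrary choices for illustration, and the net.netfilter keys only exist once the nf_conntrack module is loaded.

```shell
# Write a sysctl drop-in that persists the setting across reboots.
# $1: target directory (typically /etc/sysctl.d on the node).
# The file name below is a hypothetical choice, not a required one.
write_be_liberal_conf() {
  conf="$1/99-conntrack-liberal.conf"
  printf '%s\n' 'net.netfilter.nf_conntrack_tcp_be_liberal = 1' > "$conf"
  # Print the path so callers can log or verify it.
  echo "$conf"
}
```

Run as root, write_be_liberal_conf /etc/sysctl.d followed by sysctl --system should apply the value immediately and on subsequent boots.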
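As a defensive filter, you can drop INVALID state packets before they are forwarded off the node. Traffic forwarded between pods and clients traverses the FORWARD chain rather than INPUT, so the rule belongs there:
iptables -t filter -I FORWARD -p tcp -m conntrack --ctstate INVALID -j DROP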
Warning: This drops packets and does not preserve the connection. Test in staging first.
If you are running Kubernetes 1.26 or later, terminating endpoints are enabled by default. This feature keeps a terminating pod registered in the EndpointSlice during its grace period, giving kube-proxy a consistent backend target and reducing orphaned conntrack state. Ensure your cluster is not disabling this behavior.
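To clear stale state during an active incident, delete the entries whose reply direction originates from the old Pod IP. For flows DNATed from a ClusterIP, the Pod IP appears as the reply source rather than the original destination, so filtering with -d alone would miss them:
conntrack -D --reply-src <old-pod-ip>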
If the cause is premature pod shutdown
Increase terminationGracePeriodSeconds to give the application enough time to drain active connections before the container runtime sends SIGKILL. The value should exceed the longest in-flight request duration your application handles.
Ensure the application responds to SIGTERM by closing its listener and waiting for open requests to complete, rather than exiting immediately. If your framework or container image does not handle this natively, add a preStop lifecycle hook that signals the application to begin draining before Kubernetes sends SIGTERM.
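When sizing the grace period, keep in mind that the countdown starts when the pod is marked for termination, so it has to cover both the preStop hook and the time the process spends draining after SIGTERM. A minimal sketch of that arithmetic follows; the function name and the 5-second default margin are assumptions, not recommendations from Kubernetes.

```shell
# Suggest a terminationGracePeriodSeconds value.
# $1: longest expected in-flight request, in seconds
# $2: preStop hook duration, in seconds (0 if there is no hook)
# $3 (optional): safety margin in seconds; the default of 5 is an assumption
recommended_grace_seconds() {
  echo $(( $1 + $2 + ${3:-5} ))
}
```

For example, recommended_grace_seconds 45 10 prints 60 for a 45-second worst-case request with a 10-second preStop hook. That value could then be applied with kubectl patch deployment <deployment> --type merge -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":60}}}}'.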
If the cause is UDP-specific conntrack lag
For UDP services, including CoreDNS, conntrack cleanup behavior differs from TCP. If you are in IPVS mode, stale UDP session affinity can cause traffic to stick to dead backends. Reduce the IPVS UDP timeout to limit persistence:
ipvsadm --set <tcp_timeout> <tcp_fin_timeout> <udp_timeout>
If you are in iptables mode and see UDP drops during rollouts, verify that kube-proxy is flushing UDP conntrack entries on endpoint removal. Manual flushing with conntrack -D can provide temporary relief.
Prevention
- Set net.netfilter.nf_conntrack_tcp_be_liberal=1 on all nodes and persist the setting across reboots via sysctl configuration.
- Validate that every service implements graceful shutdown: stop accepting new connections, complete in-flight work, then exit.
- Size terminationGracePeriodSeconds to match your actual request latency distribution, not just a default value.
- Monitor kube-proxy sync latency and endpoint change rates. A kube-proxy that cannot keep up with endpoint churn leaves a larger window for stale conntrack races.
- Run load tests that include rolling updates under realistic connection concurrency to measure baseline reset rates before changes reach production.
How Netdata helps
In Netdata, look for:
- kubeproxy_sync_proxy_rules_duration_seconds and endpoint change rates climbing before resets.
- Node-level conntrack utilization and INVALID packet rate rising during rollouts.
- Application error spikes aligned with pod termination events in the Deployment timeline.
Related guides
- For broader conntrack capacity issues, see Kubernetes conntrack exhaustion: detection and recovery.
- If your rolling update is not progressing as expected, see Kubernetes Deployment rollout stuck: stalled rollouts and ready replicas.
- For DNS timeouts that may share root causes with UDP conntrack behavior, see Kubernetes DNS resolution failures inside pods.






