Kubernetes kube-proxy IPVS: stale rules and session affinity issues

DNS queries start timing out from one node after a CoreDNS rolling update. A UDP Service returns timeouts for some clients but not others. New Services are unreachable from a specific node while older Services continue to work.

In IPVS mode, kube-proxy programs the kernel’s IPVS table with virtual servers and real servers. The IPVS connection table lives outside kube-proxy’s direct control and outside nf_conntrack. That separation creates two IPVS-specific failure modes: stale rules that diverge from EndpointSlice state, and UDP session affinity that sticks to dead backends long after a pod terminates.

What this means

kube-proxy in IPVS mode does not proxy traffic in userspace. It creates IPVS virtual servers for each Service ClusterIP and registers real servers for each endpoint. The kernel handles forwarding via hash lookups. This scales better than iptables for large clusters, but it introduces two failure modes that behave differently than in iptables mode.

First, stale rules. If kube-proxy’s sync loop falls behind, its API server watch silently dies, or a sync fails partially, the IPVS virtual servers and real servers on a node may diverge from the current EndpointSlice state. Traffic continues to flow through the kernel, but it flows to backends that no longer exist or misses backends that were just created.

Second, UDP session affinity sticking to dead backends. IPVS maintains its own connection table, separate from nf_conntrack. For UDP traffic, IPVS treats a flow from a specific source IP and source port as a session and remembers which real server received the first packet. When that backend pod is terminated, the IPVS connection entry remains active with a default UDP timeout of 300 seconds. New packets from the same client continue to be forwarded to the dead pod IP until that timeout expires. kube-proxy’s conntrack cleanup flushes nf_conntrack entries, but it does not flush the IPVS connection table. This is particularly devastating for CoreDNS and other UDP-based cluster services.

flowchart TD
    A[UDP packet to ClusterIP] --> B{IPVS connection table lookup}
    B -->|Existing entry found| C[Forward to old backend IP]
    B -->|No entry| D[Round-robin to current backend]
    C --> E{Old pod terminated?}
    E -->|Yes| F[Packet blackholed or dropped]
    E -->|No| G[Normal response]
    F --> H[Client retries from same source IP and port]
    H --> A
    D --> I[Create new IPVS connection entry]
    I --> G
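
The "existing entry found" branch above can be made visible by grouping the UDP rows of the IPVS connection table by backend. The following is a minimal sketch, not part of ipvsadm: it assumes the usual six-column `ipvsadm -Lcn` layout (pro, expire, state, source, virtual, destination), and the function name `udp_conns_by_backend` is hypothetical.

```shell
# Count IPVS UDP connection entries per destination (real server).
# Reads `ipvsadm -Lcn` output on stdin. Assumes the six-column layout:
#   pro expire state source virtual destination
# udp_conns_by_backend is a hypothetical helper name, not an ipvsadm feature.
udp_conns_by_backend() {
  awk '$1 == "UDP" && NF >= 6 { count[$6]++ }
       END { for (dest in count) print dest, count[dest] }' | sort
}
```

On a node this could be run as `sudo ipvsadm -Lcn | udp_conns_by_backend`. A destination in the output that no longer belongs to a running pod is a candidate for the stuck-affinity case described above.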

Common causes

Cause	What it looks like	First thing to check
IPVS UDP timeout keeping dead backends	DNS or UDP timeouts from specific nodes after a pod rollout; only some clients affected	`ipvsadm -Lcn` showing connections to terminated pod IPs
Silent API server watch death	New Services unreachable from one node while existing Services work; no obvious errors	Age of `kubeproxy_sync_proxy_rules_last_timestamp_seconds` on the node
Sync loop backlog	Endpoint changes take minutes to appear in IPVS; rolling updates cause intermittent drops	`kubeproxy_sync_proxy_rules_duration_seconds` p99 versus the sync period
IPVS real server leak	Deleted pod IP still appears as a real server with traffic directed to it	`ipvsadm -Ln` real server list compared to current EndpointSlices
Conntrack exhaustion alongside IPVS	UDP packets dropped despite healthy endpoints; DNS fails cluster-wide	`conntrack -S` drop counter and `nf_conntrack_count` versus `nf_conntrack_max`

Quick checks

# List IPVS virtual servers and real servers
sudo ipvsadm -Ln

# Show IPVS connection table entries with source-to-backend mappings
sudo ipvsadm -Lcn

# Check age of the last successful sync from kube-proxy metrics
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds

# Check sync duration percentiles
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_duration_seconds

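# Compare current endpoint IPs to IPVS real servers for a Service
# (CoreDNS is typically exposed as the kube-dns Service in kube-system; adjust the namespace and name for your cluster)
kubectl get endpointslices -n kube-system -l kubernetes.io/service-name=kube-dns -o json | jq -r '.items[].endpoints[].addresses[]'
# Use -u for the UDP virtual server and -t for the TCP one
sudo ipvsadm -Ln -u <cluster-ip>:53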

# Check conntrack entries for a specific ClusterIP
sudo conntrack -L -d <cluster-ip> 2>/dev/null

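# Check kube-proxy logs for IPVS or sync errors
# (with a label selector, kubectl logs returns only the last 10 lines per pod unless --tail is set)
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=500 | grep -iE "error|ipvs|sync"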
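
The raw timestamp metric is a Unix time, so it is easier to read as an age. The sketch below converts it, under the assumption that the metrics endpoint returns Prometheus text format with a single `kubeproxy_sync_proxy_rules_last_timestamp_seconds` sample; the helper name `sync_age_seconds` is hypothetical.

```shell
# Print how many seconds have passed since the last successful kube-proxy sync.
# Reads Prometheus text-format metrics on stdin and takes the current Unix
# time as $1, which keeps the calculation easy to test.
# sync_age_seconds is a hypothetical helper name.
sync_age_seconds() {
  awk -v now="$1" '
    $1 == "kubeproxy_sync_proxy_rules_last_timestamp_seconds" {
      # The value may be in scientific notation (e.g. 1.7e+09); awk parses it.
      printf "%d\n", now - $2
    }'
}
```

A possible invocation on the node is `curl -s http://localhost:10249/metrics | sync_age_seconds "$(date +%s)"`. Passing the current time as an argument, rather than calling `date` inside the function, is a design choice that lets the same helper be checked against recorded metrics.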

How to diagnose it

1. Confirm the node is in IPVS mode. Run `sudo ipvsadm -Ln`. If it returns virtual servers, the node is using IPVS. If the output is empty, kube-proxy may be in iptables mode and this article’s IPVS-specific guidance does not apply. Why: iptables mode does not maintain a separate per-flow connection table for UDP. Sticking to dead backends is an IPVS-specific behavior.
2. Check for stuck UDP session affinity. Run `sudo ipvsadm -Lcn | grep <old-pod-ip>` to see if IPVS connections still point to a terminated backend. Why: IPVS tracks UDP flows independently. Even after nf_conntrack entries expire, IPVS can retain the source-to-backend mapping for the duration of its UDP timeout.
3. Verify kube-proxy sync health. Query `kubeproxy_sync_proxy_rules_last_timestamp_seconds` on `http://localhost:10249/metrics`. If the value is older than 2-3 minutes, the sync loop is stalled or the API watch is dead. Why: A kube-proxy process can pass its healthz check while operating on frozen state. The last successful sync timestamp is the correctness signal; healthz is only a liveness signal.
4. Compare API state to IPVS state. Get the current EndpointSlice addresses for the affected Service and compare them against the real servers shown in `sudo ipvsadm -Ln`. Why: Discrepancies reveal whether the issue is stale rules from sync lag, or active IPVS connection entries that have outlived their endpoint.
5. Check for conntrack overlap. Run `sudo conntrack -L -d <cluster-ip>` and cross-reference the entries with `sudo ipvsadm -Lcn`. Why: Both tables can contain stale entries simultaneously. Cleaning nf_conntrack without addressing the IPVS table leaves the affinity problem unresolved.
6. Determine whether the scope is node-local or cluster-wide. If one node is affected, suspect a local kube-proxy watch failure or sync stall. If all nodes show the same stale entries or DNS timeouts, suspect a cluster-wide endpoint change event, a conntrack saturation wave, or a configuration issue. Why: IPVS state is per-node. Localized symptoms point to node-level kube-proxy failure rather than a cluster control plane outage.
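
The comparison of API state to IPVS state can be scripted once both sides are saved to files. This is a rough sketch under a few assumptions: IPv4 addresses only, one endpoint IP per line in the first file (for example, the output of the kubectl/jq quick check), and standard `ipvsadm -Ln` output in the second. The helper name `stale_real_servers` is hypothetical.

```shell
# List IPVS real servers whose IP is not among the current endpoint addresses.
#   $1: file with one endpoint IP per line
#   $2: file containing `ipvsadm -Ln` output for the Service
# Assumes IPv4 addresses; stale_real_servers is a hypothetical helper name.
stale_real_servers() {
  awk '
    FILENAME == ARGV[1] { live[$1] = 1; next }   # first file: endpoint IPs
    $1 == "->" && $2 != "RemoteAddress:Port" {   # second file: real server rows
      split($2, addr, ":")
      if (!(addr[1] in live)) print $2
    }' "$1" "$2"
}
```

Any line printed is a real server that IPVS still holds but the EndpointSlices no longer list. Matching on `FILENAME` rather than the common `NR == FNR` idiom keeps the result correct when the endpoint file is empty, in which case every real server is reported.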

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`kubeproxy_sync_proxy_rules_duration_seconds` p99	Measures time to reconcile IPVS rules	p99 exceeding 5-10 seconds indicates sync lag
`kubeproxy_sync_proxy_rules_last_timestamp_seconds`	Shows freshness of the last successful sync	Timestamp older than 2 minutes means stale rules
IPVS active connection count (`ipvsadm -Ln`)	Reveals load distribution and stuck flows	ActiveConn on a weight=0 or missing real server
`rest_client_requests_total` from kube-proxy	Tracks API server connectivity	5xx, 403, or 429 errors indicate watch problems
`nf_conntrack_count` vs `nf_conntrack_max`	Shared kernel resource used alongside IPVS	Above 75% increases risk of silent connection drops
Conntrack `drop` counter (`conntrack -S`)	Confirms packet loss from full table	Any increment means new connections are failing
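
The conntrack utilization in the table above is a ratio of two kernel counters. A minimal sketch of that arithmetic follows; the helper name `conntrack_pct` is hypothetical, and it uses integer math, so the result is rounded down.

```shell
# Integer percentage of the conntrack table in use (rounded down).
#   $1: current entry count, $2: table maximum
# conntrack_pct is a hypothetical helper name.
conntrack_pct() {
  if [ "$2" -le 0 ]; then
    echo "max must be positive" >&2
    return 1
  fi
  echo $(( $1 * 100 / $2 ))
}
```

On a node with the nf_conntrack module loaded, the inputs can be read from procfs: `conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"`.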

Fixes

If UDP session affinity is stuck to a dead backend

Reduce the IPVS UDP timeout so stale entries expire faster. The default is 300 seconds.

# Reduce UDP timeout to 60 seconds (emergency; affects all IPVS UDP services on the node)
# Arguments are TCP, TCP_FIN, and UDP timeouts respectively
sudo ipvsadm --set 900 120 60

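This change takes effect immediately but does not persist across reboots, and kube-proxy may reset it on restart if its own IPVS timeouts are configured (`--ipvs-udp-timeout`, or `ipvs.udpTimeout` in the kube-proxy configuration). Record which nodes you changed, then revert or persist the value through your node configuration management after the incident.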
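
Because the change has to be reverted later, it helps to record the current values first. The sketch below assumes that `ipvsadm -L --timeout` prints a line of the form `Timeout (tcp tcpfin udp): 900 120 300`; the helper name `ipvs_timeouts` is hypothetical.

```shell
# Extract the three IPVS timeouts (tcp tcpfin udp) from
# `ipvsadm -L --timeout` output on stdin.
# Assumes a line like: Timeout (tcp tcpfin udp): 900 120 300
# ipvs_timeouts is a hypothetical helper name.
ipvs_timeouts() {
  awk '/^Timeout/ { print $(NF-2), $(NF-1), $NF }'
}
```

One way to use it is `before=$(sudo ipvsadm -L --timeout | ipvs_timeouts)` ahead of the change and `sudo ipvsadm --set $before` (unquoted, so the three values split into separate arguments) to restore it. According to the ipvsadm man page, a value of 0 in `--set` preserves the current timeout for that position, so `sudo ipvsadm --set 0 0 60` would change only the UDP timeout.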

To clear existing stuck entries immediately, delete the specific virtual server. kube-proxy recreates it on the next sync.

# WARNING: Traffic to this ClusterIP will drop until kube-proxy recreates the virtual server.
# Use only when waiting for timeout expiration is not acceptable.
# -u selects the UDP virtual server; use -t instead for a TCP one.
sudo ipvsadm -D -u <cluster-ip>:<port>

If you cannot tolerate the brief drop, lower the timeout and wait for the stale entries to expire.

If kube-proxy sync is stalled

Restart kube-proxy on the affected node to force a full re-sync.

# Delete the kube-proxy pod; the DaemonSet will recreate it
kubectl delete pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=<node-name>

After restart, allow 30 to 60 seconds for the initial sync to complete. The first sync programs all rules from scratch and will take longer than incremental syncs.

If conntrack is exhausted alongside IPVS issues

IPVS mode still uses nf_conntrack for masquerading. Increase the table limit as immediate relief.

# Immediate relief (lost on reboot unless also persisted, for example under /etc/sysctl.d)
sudo sysctl -w net.netfilter.nf_conntrack_max=262144

Then investigate connection leaks or churn that are filling the table. Conntrack exhaustion affects all traffic on the node, not just the Service layer.

If IPVS real servers outlive their endpoints

When the sync loop cannot remove a real server, verify that the EndpointSlice has removed the backend. If the API state is correct but `ipvsadm -Ln` still shows the old real server, restart kube-proxy to force a full state rebuild.

Prevention

Monitor `kubeproxy_sync_proxy_rules_last_timestamp_seconds` on every node and alert when it is older than two minutes. A passing healthz check does not guarantee that rules are current.
Track IPVS UDP timeout defaults in your node baseline. If you run CoreDNS or other UDP services in IPVS mode, lower the UDP timeout proactively rather than waiting for an incident.
Size `nf_conntrack_max` for your node workload density. IPVS mode still relies on conntrack for masquerading.
During rolling updates of UDP-backed Services, watch `ipvsadm -Lcn` for connection counts to terminating pods. If counts are high, consider a controlled virtual server deletion before the update.
Ensure API server load balancers support long-lived connections. Dropped watch connections are a leading cause of silent sync death.
Periodically compare EndpointSlice state against `ipvsadm -Ln` output as a consistency check, especially after node recoveries or kube-proxy restarts.

How Netdata helps

Netdata correlates the signals that isolate IPVS failures from generic network issues:

Per-node conntrack utilization and drop rates alongside kube-proxy process health to distinguish table exhaustion from rule staleness.
kube-proxy metrics endpoint scraping for sync duration and last sync timestamp freshness.
Kernel-level connection and socket metrics alongside container health to surface when IPVS entries stick to terminating backends.

Related guides

For conntrack table exhaustion diagnosis, see Kubernetes conntrack exhaustion: dropped connections under load.
For sync loop issues in iptables mode, see Kubernetes kube-proxy iptables sync stall: causes and recovery.
For DNS-specific failure patterns inside pods, see Kubernetes DNS resolution failures inside pods.
For the full signal taxonomy for Kubernetes networking, see Kubernetes monitoring checklist: the signals every production cluster needs.
