Kubernetes kube-proxy IPVS: stale rules and session affinity issues

DNS queries start timing out from one node after a CoreDNS rolling update. A UDP Service returns timeouts for some clients but not others. New Services are unreachable from a specific node while older Services continue to work.

In IPVS mode, kube-proxy programs the kernel’s IPVS table with virtual servers and real servers. The IPVS connection table lives outside kube-proxy’s direct control and outside nf_conntrack. That separation creates two IPVS-specific failure modes: stale rules that diverge from EndpointSlice state, and UDP session affinity that sticks to dead backends long after a pod terminates.

What this means

kube-proxy in IPVS mode does not proxy traffic in userspace. It creates IPVS virtual servers for each Service ClusterIP and registers real servers for each endpoint. The kernel handles forwarding via hash lookups. This scales better than iptables for large clusters, but it introduces two failure modes that behave differently than in iptables mode.

First, stale rules. If kube-proxy’s sync loop falls behind, its API server watch silently dies, or a sync fails partially, the IPVS virtual servers and real servers on a node may diverge from the current EndpointSlice state. Traffic continues to flow through the kernel, but it flows to backends that no longer exist or misses backends that were just created.

Second, UDP session affinity sticking to dead backends. IPVS maintains its own connection table, separate from nf_conntrack. For UDP traffic, IPVS treats a flow from a specific source IP and source port as a session and remembers which real server received the first packet. When that backend pod is terminated, the IPVS connection entry remains active with a default UDP timeout of 300 seconds. New packets from the same client continue to be forwarded to the dead pod IP until that timeout expires. kube-proxy’s conntrack cleanup flushes nf_conntrack entries, but it does not flush the IPVS connection table. CoreDNS and other UDP-based cluster services suffer most: a client that retries from the same source IP and port keeps matching the stale entry and keeps hitting the dead backend.

flowchart TD
    A[UDP packet to ClusterIP] --> B{IPVS connection table lookup}
    B -->|Existing entry found| C[Forward to old backend IP]
    B -->|No entry| D[Round-robin to current backend]
    C --> E{Old pod terminated?}
    E -->|Yes| F[Packet blackholed or dropped]
    E -->|No| G[Normal response]
    F --> H[Client retries from same source IP and port]
    H --> A
    D --> I[Create new IPVS connection entry]
    I --> G
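
To see which backends the connection table is still steering UDP flows to, the entries can be grouped by destination. The following is a minimal sketch, not a definitive tool: the function name udp_conns_by_backend is hypothetical, and it assumes the usual six-column layout of ipvsadm -Lcn (pro, expire, state, source, virtual, destination).

```shell
#!/bin/sh
# Count UDP connection entries per backend for one virtual server.
# Reads `ipvsadm -Lcn` output on stdin; $1 is the virtual address as ip:port.
# Assumes the columns: pro expire state source virtual destination.
udp_conns_by_backend() {
    awk -v vip="$1" '
        $1 == "UDP" && $5 == vip { count[$6]++ }
        END { for (d in count) print count[d], d }
    ' | sort -k2
}

# Usage on a node:
#   sudo ipvsadm -Lcn | udp_conns_by_backend <cluster-ip>:53
```

A destination that still has entries but no longer appears in the Service's EndpointSlices is where the stuck flows are going.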

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| IPVS UDP timeout keeping dead backends | DNS or UDP timeouts from specific nodes after a pod rollout; only some clients affected | ipvsadm -Lcn showing connections to terminated pod IPs |
| Silent API server watch death | New Services unreachable from one node while existing Services work; no obvious errors | Age of kubeproxy_sync_proxy_rules_last_timestamp_seconds on the node |
| Sync loop backlog | Endpoint changes take minutes to appear in IPVS; rolling updates cause intermittent drops | kubeproxy_sync_proxy_rules_duration_seconds p99 versus the sync period |
| IPVS real server leak | Deleted pod IP still appears as a real server with traffic directed to it | ipvsadm -Ln real server list compared to current EndpointSlices |
| Conntrack exhaustion alongside IPVS | UDP packets dropped despite healthy endpoints; DNS fails cluster-wide | conntrack -S drop counter and nf_conntrack_count versus nf_conntrack_max |

Quick checks

# List IPVS virtual servers and real servers
sudo ipvsadm -Ln
# Show IPVS connection table entries with source-to-backend mappings
sudo ipvsadm -Lcn
# Check age of the last successful sync from kube-proxy metrics
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds
# Check sync duration percentiles
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_duration_seconds
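# Compare current endpoint IPs to IPVS real servers for a service
# (the CoreDNS Service is commonly named kube-dns in kube-system; adjust both to your cluster)
kubectl get endpointslices -n kube-system -l kubernetes.io/service-name=kube-dns -o json | jq -r '.items[].endpoints[].addresses[]'
# -u selects the UDP virtual server; use -t for the TCP one
sudo ipvsadm -Ln -u <cluster-ip>:53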
# Check conntrack entries for a specific ClusterIP
sudo conntrack -L -d <cluster-ip> 2>/dev/null
# Check kube-proxy logs for IPVS or sync errors
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep -iE "error|ipvs|sync"
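
The raw metric is a Unix timestamp, so it has to be subtracted from the current time before it can be compared to a threshold. A minimal sketch of that conversion follows; the function name sync_age_seconds is hypothetical, and the metrics address is the one used in the checks above.

```shell
#!/bin/sh
# Print seconds elapsed since kube-proxy's last successful rule sync.
# Reads Prometheus text-format metrics on stdin; $1 is "now" in epoch seconds.
sync_age_seconds() {
    awk -v now="$1" '
        $1 == "kubeproxy_sync_proxy_rules_last_timestamp_seconds" {
            printf "%d\n", now - $2   # $2 is often in scientific notation
            found = 1
        }
        END { if (!found) exit 1 }    # metric missing: report failure
    '
}

# Usage on a node:
#   curl -s http://localhost:10249/metrics | sync_age_seconds "$(date +%s)"
```

The function exits non-zero when the metric is absent, so a missing metric is not mistaken for a fresh sync.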

How to diagnose it

  1. Confirm the node is in IPVS mode. Run sudo ipvsadm -Ln. If it returns virtual servers, the node is using IPVS. If the output is empty, kube-proxy may be in iptables mode and this article’s IPVS-specific guidance does not apply. Why: iptables mode does not maintain a separate per-flow connection table for UDP. Sticking to dead backends is an IPVS-specific behavior.

  2. Check for stuck UDP session affinity. Run sudo ipvsadm -Lcn | grep <old-pod-ip> to see if IPVS connections still point to a terminated backend. Why: IPVS tracks UDP flows independently. Even after nf_conntrack entries expire, IPVS can retain the source-to-backend mapping for the duration of its UDP timeout.

  3. Verify kube-proxy sync health. Query kubeproxy_sync_proxy_rules_last_timestamp_seconds on http://localhost:10249/metrics. The value is a Unix timestamp; if it is more than 2 to 3 minutes behind the current time, the sync loop is stalled or the API watch is dead. Why: A kube-proxy process can pass its healthz check while operating on frozen state. The last successful sync timestamp is the correctness signal; healthz is only a liveness signal.

  4. Compare API state to IPVS state. Get the current EndpointSlice addresses for the affected Service and compare them against the real servers shown in sudo ipvsadm -Ln. Why: Discrepancies reveal whether the issue is stale rules from sync lag, or active IPVS connection entries that have outlived their endpoint.

  5. Check for conntrack overlap. Run sudo conntrack -L -d <cluster-ip> and cross-reference the entries with sudo ipvsadm -Lcn. Why: Both tables can contain stale entries simultaneously. Cleaning nf_conntrack without addressing the IPVS table leaves the affinity problem unresolved.

  6. Determine whether the scope is node-local or cluster-wide. If one node is affected, suspect a local kube-proxy watch failure or sync stall. If all nodes show the same stale entries or DNS timeouts, suspect a cluster-wide endpoint change event, a conntrack saturation wave, or a configuration issue. Why: IPVS state is per-node. Localized symptoms point to node-level kube-proxy failure rather than a cluster control plane outage.
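
The comparison in step 4 can be sketched as a small filter. This is an illustration under stated assumptions rather than a finished tool: the function name stale_real_servers is hypothetical, it assumes the standard ipvsadm -Ln layout in which real servers appear on lines starting with "->", and it handles IPv4 addresses only.

```shell
#!/bin/sh
# Print real servers from `ipvsadm -Ln` output (stdin) whose IP is not among
# the live endpoint IPs passed as arguments. IPv4 only.
stale_real_servers() {
    live=" $* "                                # space-delimited for exact matching
    awk -v live="$live" '
        $1 == "->" && $2 != "RemoteAddress:Port" {
            ip = $2
            sub(/:[0-9]+$/, "", ip)            # strip the trailing :port
            if (index(live, " " ip " ") == 0) print $2
        }
    '
}

# Usage on a node (pass the EndpointSlice addresses as arguments):
#   sudo ipvsadm -Ln -u <cluster-ip>:53 | stale_real_servers <endpoint-ip> <endpoint-ip>
```

Empty output means every programmed real server matches a live endpoint; anything printed is a candidate stale rule to cross-check against the sync timestamp.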

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| kubeproxy_sync_proxy_rules_duration_seconds p99 | Measures time to reconcile IPVS rules | p99 exceeding 5-10 seconds indicates sync lag |
| kubeproxy_sync_proxy_rules_last_timestamp_seconds | Shows freshness of the last successful sync | Timestamp older than 2 minutes means stale rules |
| IPVS active connection count (ipvsadm -Ln) | Reveals load distribution and stuck flows | ActiveConn on a weight=0 or missing real server |
| rest_client_requests_total from kube-proxy | Tracks API server connectivity | 5xx, 403, or 429 errors indicate watch problems |
| nf_conntrack_count vs nf_conntrack_max | Shared kernel resource used alongside IPVS | Above 75% increases risk of silent connection drops |
| Conntrack drop counter (conntrack -S) | Confirms packet loss from full table | Any increment means new connections are failing |
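
The conntrack utilization figure is a ratio of two kernel counters. A minimal sketch of the arithmetic, assuming the counters are read from the standard /proc/sys/net/netfilter files; the function name conntrack_pct is hypothetical.

```shell
#!/bin/sh
# Print conntrack table utilization as an integer percentage.
# $1 is nf_conntrack_count, $2 is nf_conntrack_max.
conntrack_pct() {
    [ "$2" -gt 0 ] || return 1          # avoid dividing by zero
    echo $(( $1 * 100 / $2 ))
}

# Usage on a node:
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

A result at or above 75 corresponds to the warning sign in the table.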

Fixes

If UDP session affinity is stuck to a dead backend

Reduce the IPVS UDP timeout so stale entries expire faster. The default is 300 seconds.

# Reduce UDP timeout to 60 seconds (emergency; affects all IPVS UDP services on the node)
# Arguments are TCP, TCP_FIN, and UDP timeouts respectively
sudo ipvsadm --set 900 120 60

This change is immediate but non-persistent across reboots. Document the node and revert or persist via your node configuration management after the incident.
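
Because the change has to be reverted later, it helps to record the current values first. The sketch below turns the output of ipvsadm -L --timeout into the matching revert command. It assumes that output is a single line of the form "Timeout (tcp tcpfin udp): 900 120 300"; the function name ipvs_revert_command is hypothetical.

```shell
#!/bin/sh
# Read `ipvsadm -L --timeout` output on stdin and print the command that
# restores those timeouts. Fails if the expected line is not present.
ipvs_revert_command() {
    awk -F': ' '
        /^Timeout \(tcp tcpfin udp\)/ { print "ipvsadm --set " $2; found = 1 }
        END { if (!found) exit 1 }
    '
}

# Usage on a node, before lowering the UDP timeout:
#   sudo ipvsadm -L --timeout | ipvs_revert_command > /tmp/ipvs-revert.txt
```

Keeping the printed command with the incident notes makes the revert a copy and paste once the rollout has settled.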

To clear existing stuck entries immediately, delete the specific virtual server. kube-proxy recreates it on the next sync.

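# WARNING: Traffic to this ClusterIP will drop until kube-proxy recreates the virtual server.
# Use only when waiting for timeout expiration is not acceptable.
# -u selects the UDP virtual server, which is the one holding the stuck entries; use -t for a TCP one.
sudo ipvsadm -D -u <cluster-ip>:<port>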

If you cannot tolerate the brief drop, lower the timeout and wait for the stale entries to expire.

If kube-proxy sync is stalled

Restart kube-proxy on the affected node to force a full re-sync.

# Delete the kube-proxy pod; the DaemonSet will recreate it
kubectl delete pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=<node-name>

After restart, allow 30 to 60 seconds for the initial sync to complete. The first sync programs all rules from scratch and will take longer than incremental syncs.

If conntrack is exhausted alongside IPVS issues

IPVS mode still uses nf_conntrack for masquerading. Increase the table limit as immediate relief.

# Immediate relief (not persistent across reboots; size the value for your node)
sudo sysctl -w net.netfilter.nf_conntrack_max=262144

Then investigate connection leaks or churn that are filling the table. Conntrack exhaustion affects all traffic on the node, not just the Service layer.

If IPVS real servers outlive their endpoints

When the sync loop cannot remove a real server, verify that the EndpointSlice has removed the backend. If the API state is correct but ipvsadm -Ln still shows the old real server, restart kube-proxy to force a full state rebuild.

Prevention

  • Monitor kubeproxy_sync_proxy_rules_last_timestamp_seconds on every node and alert when it is older than two minutes. A passing healthz check does not guarantee that rules are current.
  • Track IPVS UDP timeout defaults in your node baseline. If you run CoreDNS or other UDP services in IPVS mode, lower the UDP timeout proactively rather than waiting for an incident.
  • Size nf_conntrack_max for your node workload density. IPVS mode still relies on conntrack for masquerading.
  • During rolling updates of UDP-backed Services, watch ipvsadm -Lcn for connection counts to terminating pods. If counts are high, consider a controlled virtual server deletion before the update.
  • Ensure API server load balancers support long-lived connections. Dropped watch connections are a leading cause of silent sync death.
  • Periodically compare EndpointSlice state against ipvsadm -Ln output as a consistency check, especially after node recoveries or kube-proxy restarts.

How Netdata helps

Netdata correlates the signals that isolate IPVS failures from generic network issues:

  • Per-node conntrack utilization and drop rates alongside kube-proxy process health to distinguish table exhaustion from rule staleness.
  • kube-proxy metrics endpoint scraping for sync duration and last sync timestamp freshness.
  • Kernel-level connection and socket metrics alongside container health to surface when IPVS entries stick to terminating backends.