Kubernetes API server watch storm: re-list cascades and connection floods

A sudden wall of LIST requests pins API server CPU, drives up memory, spikes etcd read latency, and makes controllers lag. The culprit is usually a watch storm: hundreds or thousands of clients simultaneously re-listing because their watch connections failed or fell behind. Each re-list triggers an expensive etcd range scan and serializes all matching objects. Under load, this saturates CPU, fills network bandwidth, and can trigger API Priority and Fairness (APF) throttling or memory pressure. If the API server restarts before the storm subsides, the cycle repeats.

What this means

Kubernetes informers use a list-then-watch cycle. The client LISTs to establish a baseline, then opens a long-lived WATCH to receive deltas. The API server serves most WATCH requests from an in-memory cache holding a bounded history window per resource type. If a watcher falls behind that window, or if the requested resourceVersion is too old, the server returns 410 Gone and closes the connection. The client must restart with a full LIST.
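The list-then-watch cycle and the 410 fallback can be illustrated with a toy simulation. This is a minimal sketch, not client-go or API server code: `FakeWatchCache`, `Informer`, and `Gone` are hypothetical names, and the history window is modeled as a fixed number of events.

```python
from collections import deque


class Gone(Exception):
    """Stands in for an HTTP 410 Gone response to a WATCH request."""


class FakeWatchCache:
    """Toy server: current objects plus a bounded window of recent events."""

    def __init__(self, window):
        self.objects = {}
        self.rv = 0
        self.history = deque(maxlen=window)  # oldest events fall off the left

    def write(self, key, value):
        self.rv += 1
        self.objects[key] = value
        self.history.append((self.rv, key, value))

    def list_objects(self):
        # Full snapshot plus the resourceVersion it is consistent with.
        return dict(self.objects), self.rv

    def watch_from(self, since_rv):
        # Every event after since_rv must still be inside the window.
        oldest = self.history[0][0] if self.history else self.rv + 1
        if since_rv < self.rv and since_rv + 1 < oldest:
            raise Gone(f"resourceVersion {since_rv} is too old")
        return [event for event in self.history if event[0] > since_rv]


class Informer:
    """Toy client: LIST once, then apply deltas; re-list on 410 Gone."""

    def __init__(self, server):
        self.server = server
        self.relists = 0
        self._list()

    def _list(self):
        self.store, self.rv = self.server.list_objects()
        self.relists += 1

    def sync(self):
        try:
            events = self.server.watch_from(self.rv)
        except Gone:
            self._list()  # the expensive path that a watch storm multiplies
            return
        for rv, key, value in events:
            self.store[key] = value
            self.rv = rv


server = FakeWatchCache(window=3)
server.write("pod-a", "Pending")
informer = Informer(server)        # initial LIST
server.write("pod-a", "Running")
informer.sync()                    # still inside the window: deltas only
for i in range(5):                 # the client falls five events behind
    server.write(f"pod-{i}", "Pending")
informer.sync()                    # outside the window: 410, then re-list
print(informer.relists)            # prints 2
```

In this model a single slow client costs one extra LIST. The storm arises when many clients cross the window boundary at the same moment.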

A watch storm begins when many clients lose WATCH connections simultaneously and issue fallback LISTs together. Each re-list is an etcd range scan followed by serialization of every matching object. In large clusters, a single LIST can allocate temporary memory several times the size of the response. When hundreds of informers, controllers, and kubelets do this at once, CPU saturates, etcd read latency rises, and bandwidth fills with multi-megabyte payloads.
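A back-of-the-envelope estimate makes the scale concrete. The sketch below assumes a temporary allocation multiplier of 3x per response and assumes that the inflight limit caps how many LISTs are served at once. Both are illustrative assumptions, not measured values, and `relist_burst_bytes` is a hypothetical helper.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def relist_burst_bytes(clients, response_bytes, max_inflight, alloc_multiplier=3.0):
    """Rough upper bound on temporary memory while a re-list wave is served.

    alloc_multiplier is an assumed factor for decode/encode buffers.
    """
    concurrent = min(clients, max_inflight)  # requests served at the same time
    return int(concurrent * response_bytes * alloc_multiplier)


burst = relist_burst_bytes(clients=500, response_bytes=20 * MiB, max_inflight=400)
print(f"{burst / GiB:.1f} GiB")  # prints 23.4 GiB
```

Even with generous rounding, the estimate shows why a memory limit sized for steady state can be exceeded within seconds of a mass re-list.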

Connection floods compound the problem. After re-listing, clients immediately re-establish WATCH streams. If the API server has just restarted, its watch caches are empty. Clients that reconnect with a stale resourceVersion force another 410, triggering yet another re-list before the first wave subsides.

flowchart TD
    A[API server restart or cache overflow] --> B[Watch connections close]
    B --> C[Clients receive 410 Gone or TCP RST]
    C --> D[Mass re-list: expensive etcd scans]
    D --> E[LIST response serialization: CPU/memory spike]
    E --> F[etcd read latency rises]
    F --> G[New WATCH requests at stale RV fail]
    G --> H[Second wave of re-lists]
    E --> I[Inflight requests saturate / APF queues fill]
    I --> J[429 rejections and client backoff storms]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| API server restart or rolling upgrade | LIST rate spikes immediately after a restart; /readyz shows informer-sync not ready | kubectl get --raw='/readyz?verbose' and watch cache population metrics |
| Watch cache overflow | Sustained 410 responses; LIST bursts targeting one resource type | apiserver_request_total with code="410" and object churn rate |
| Direct etcd watch leak | etcd memory grows unbounded; all watches on the apiserver terminate abruptly | etcd watcher count and memory; verify no clients access etcd directly |
| HA resourceVersion divergence | Clients reconnecting to a different apiserver instance get 410 despite low churn | Compare /readyz and latency across individual instances |
| Client retry storm | Rapid WATCH connection churn with no clear control plane trigger | apiserver_longrunning_requests and audit logs for repeating clients |

Quick checks

# LIST request rate
kubectl get --raw='/metrics' | grep 'apiserver_request_total{' | grep 'verb="LIST"'

# Active watch connections
kubectl get --raw='/metrics' | grep 'apiserver_longrunning_requests{' | grep 'verb="WATCH"'

# 410 Gone rate
kubectl get --raw='/metrics' | grep 'apiserver_request_total{' | grep 'code="410"'

# Inflight request saturation
kubectl get --raw='/metrics' | grep '^apiserver_current_inflight_requests'

# APF queue depth
kubectl get --raw='/metrics' | grep '^apiserver_flowcontrol_current_inqueue_requests'

# etcd list latency
kubectl get --raw='/metrics' | grep '^etcd_request_duration_seconds.*list'

# LIST response sizes
kubectl get --raw='/metrics' | grep '^apiserver_response_sizes.*verb="LIST"'

# Watch cache metrics (exact metric names vary by Kubernetes version)
kubectl get --raw='/metrics' | grep -E '^(apiserver_)?watch_cache'

# API server connection count on the node (run as root so ss -p can show process names)
ss -tnp | grep -c kube-apiserver
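The grep pipelines above return raw counter lines, one per label combination. To total them by verb or status code, a small parser helps. The following is a minimal sketch assuming the standard Prometheus text format without timestamps; `sum_by` is a hypothetical helper, and the sample lines are trimmed to three labels (real output carries more).

```python
import re
from collections import defaultdict

SAMPLE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by(metrics_text, metric, *label_names):
    """Sum the samples of one metric, grouped by the given label values."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match or match.group(1) != metric:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL.findall(match.group(2)))
        key = tuple(labels.get(name, "") for name in label_names)
        totals[key] += float(match.group(3))
    return dict(totals)


# Illustrative input; in practice, pass the output of
# kubectl get --raw='/metrics'
sample = '''
# TYPE apiserver_request_total counter
apiserver_request_total{code="200",resource="pods",verb="LIST"} 1200
apiserver_request_total{code="200",resource="nodes",verb="LIST"} 300
apiserver_request_total{code="410",resource="pods",verb="WATCH"} 45
apiserver_request_total{code="200",resource="pods",verb="GET"} 9000
'''

by_verb = sum_by(sample, "apiserver_request_total", "verb")
by_code = sum_by(sample, "apiserver_request_total", "code")
print(by_verb[("LIST",)], by_code[("410",)])  # prints 1500.0 45.0
```

These counters are cumulative since process start, so a rate requires two scrapes: subtract the totals and divide by the interval between them.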

How to diagnose it

  1. Confirm the storm. Look for a spike in apiserver_request_total{verb="LIST"} and elevated apiserver_request_duration_seconds for LIST. Correlate with apiserver_response_sizes to verify that large payloads are driving latency. If p99 response size jumps at the same time as p99 latency, serialization is the bottleneck.

  2. Identify the trigger. Check whether the spike follows an API server restart or rolling upgrade. Compare the API server pod start times against the LIST spike. If /readyz reports informer-sync as failing, the watch cache is still warming. If there was no restart, look for a jump in apiserver_request_total{code="410"} indicating watch cache overflow.

  3. Check watch cache state. After a restart, watch cache population metrics will be near zero until the cache catches up. If the cache is warm but 410s persist, the cache window is likely too small for the rate of object churn.

  4. Check etcd health. If etcd_request_duration_seconds for list operations is elevated while the watch cache is cold, etcd is serving the re-lists directly. Check etcd_disk_wal_fsync_duration_seconds and etcd_network_peer_round_trip_time_seconds to rule out disk or network latency. If etcd memory is climbing and watches are being terminated globally, investigate whether clients are accessing etcd directly outside the API server.

  5. Check for saturation. If apiserver_current_inflight_requests exceeds 80% of the configured --max-requests-inflight limit, or if APF queues are rejecting requests, the storm is causing self-throttling. Check apiserver_flowcontrol_rejected_requests_total to see which priority levels are being throttled.

  6. Inspect client behavior. Use audit logs or API server logs to identify whether a specific controller, operator, or node agent is responsible for most LIST traffic. Look for the same user-agent or source IP re-listing repeatedly without exponential backoff. Check if the client is ignoring 410s and re-listing in a tight loop.

  7. Verify HA consistency. In HA clusters, compare /readyz and per-instance LIST latency across API server instances. If one instance returns 410s while others do not, or if its latency is an outlier, its watch cache may be lagging behind etcd. Check that your load balancer is not flipping clients between instances during reconnect storms.
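For step 6, the heaviest re-listers can be ranked from the audit log. This sketch assumes audit logging is enabled with the JSON log backend and that each line is one audit event with the `verb`, `stage`, `userAgent`, and `objectRef.resource` fields. `top_listers` is a hypothetical helper, and the sample events are invented for illustration.

```python
import json
from collections import Counter


def top_listers(audit_lines, n=5):
    """Count completed LIST requests per (userAgent, resource)."""
    counts = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if event.get("verb") != "list" or event.get("stage") != "ResponseComplete":
            continue
        resource = (event.get("objectRef") or {}).get("resource", "?")
        counts[(event.get("userAgent", "?"), resource)] += 1
    return counts.most_common(n)


# Invented sample events, trimmed to the fields the function reads.
sample = [
    '{"stage":"ResponseComplete","verb":"list","userAgent":"my-operator/v1.2","objectRef":{"resource":"pods"}}',
    '{"stage":"ResponseComplete","verb":"list","userAgent":"my-operator/v1.2","objectRef":{"resource":"pods"}}',
    '{"stage":"ResponseComplete","verb":"watch","userAgent":"kubelet/v1.29","objectRef":{"resource":"pods"}}',
    '{"stage":"ResponseComplete","verb":"list","userAgent":"kubelet/v1.29","objectRef":{"resource":"nodes"}}',
]
ranking = top_listers(sample)
print(ranking)
```

A client that dominates this ranking during the spike, especially with evenly spaced timestamps, is a candidate for the tight re-list loop described in step 6.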

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| LIST request rate | Direct measure of re-list load | Sustained rate greater than 5x baseline |
| LIST latency p99 | Serialization and etcd scan cost | p99 greater than 5 seconds for core resources |
| LIST response sizes | Large objects amplify memory and CPU cost | Spikes correlating with latency |
| Active WATCH connections | Disconnects trigger re-lists | Sharp drop followed by rapid rebound |
| 410 Gone rate | Watch cache history exceeded or RV stale | Sustained increase after a restart |
| Inflight requests | Global concurrency saturation | Greater than 80% of --max-requests-inflight |
| APF queue depth | Priority traffic being delayed | Non-zero queues for system or leader-election levels |
| etcd list latency | Storage backend bottleneck | p99 greater than 200 ms |
| Watch cache object count | Cold cache forces etcd reads | Near zero after restart; flatline during growth |
| API server memory | LIST allocates temporary memory proportional to response size | RSS greater than 80% of container limit |
| Goroutine count | Connection churn or goroutine leaks | Growth without corresponding traffic increase |
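Several of the warning signs listed above are simple threshold comparisons, and can be sketched as a single check. The `Snapshot` fields and the `tripped_signals` helper are hypothetical, and the thresholds are copied from the warning signs above rather than tuned for any particular cluster.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    list_rate: float              # LIST requests per second, current
    baseline_list_rate: float     # LIST requests per second, typical
    list_p99_seconds: float       # LIST latency p99
    inflight: int                 # apiserver_current_inflight_requests
    max_inflight: int             # value of --max-requests-inflight
    etcd_list_p99_seconds: float  # etcd list latency p99


def tripped_signals(s):
    """Return the names of the warning signs this snapshot exceeds."""
    found = []
    if s.baseline_list_rate > 0 and s.list_rate > 5 * s.baseline_list_rate:
        found.append("LIST request rate")
    if s.list_p99_seconds > 5:
        found.append("LIST latency p99")
    if s.max_inflight > 0 and s.inflight > 0.8 * s.max_inflight:
        found.append("Inflight requests")
    if s.etcd_list_p99_seconds > 0.2:
        found.append("etcd list latency")
    return found


snapshot = Snapshot(
    list_rate=120, baseline_list_rate=15, list_p99_seconds=2.5,
    inflight=350, max_inflight=400, etcd_list_p99_seconds=0.35,
)
print(tripped_signals(snapshot))
# prints ['LIST request rate', 'Inflight requests', 'etcd list latency']
```

A single tripped signal is often noise. Several tripping together is the composite pattern that distinguishes a watch storm from an ordinary traffic bump.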

Fixes

If the cause is a restart or rolling upgrade

Do not restart the API server again. Allow the watch cache to warm. Ensure the container memory limit has headroom for the re-list burst, which can temporarily spike RSS 2-3x above baseline. If the API server is crash-looping due to OOM, raise the memory limit before the next restart, and set GOMEMLIMIT to roughly 90% of the container limit so the Go runtime triggers garbage collection earlier.
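The 90% figure translates into a concrete GOMEMLIMIT value as follows. This is a small sketch: `gomemlimit` is a hypothetical helper, and it relies on the Go runtime accepting a byte count with a unit suffix such as MiB.

```python
GiB = 1024 ** 3


def gomemlimit(container_limit_bytes, fraction=0.9):
    """Format a GOMEMLIMIT value in whole MiB at a fraction of the container limit."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    mib = int(container_limit_bytes * fraction) // (1024 * 1024)
    return f"{mib}MiB"


print("GOMEMLIMIT=" + gomemlimit(16 * GiB))  # prints GOMEMLIMIT=14745MiB
```

GOMEMLIMIT is a soft limit: it makes the garbage collector work harder as the heap approaches the value, but it does not prevent an OOM kill if live memory genuinely exceeds the container limit.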

If the cause is watch cache overflow

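Where your Kubernetes version honors explicit sizes, increase --watch-cache-sizes for the affected resource type. Recent releases size the watch cache dynamically and document that all non-zero values are equivalent (zero disables caching for that resource), so check the kube-apiserver flag reference for your version first. The flag is read at startup and cannot be changed dynamically, so an API server restart is required to apply new values. Schedule the restart during a maintenance window or ensure the remaining instances can absorb the shifted load. Before restarting, ensure the new capacity exceeds the peak object count for that resource and leaves room for churn, since the churn rate determines how quickly the cache cycles.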

If the cause is direct etcd watches

Identify clients that are bypassing the API server watch cache, including direct etcd watch sessions. These watches hold unbounded buffers and can exhaust etcd’s watch window, starving all other watchers and forcing global termination. Reconfigure the client to use the API server or to adopt a current resourceVersion.

If the cause is HA resourceVersion divergence

Verify that your load balancer is not flipping clients between API server instances during reconnect storms. If one instance is consistently behind, investigate why its watch cache is not catching up. Temporarily reducing the number of serving instances can force clients onto a stable, synced instance while the lagging instance recovers. Warning: this concentrates load on the remaining instances and can worsen the storm. Use only if the remaining instances have confirmed CPU, memory, and network headroom.

If the cause is APF or inflight saturation

Raise --max-requests-inflight and --max-mutating-requests-inflight only if the host has CPU and memory headroom. Do not increase limits on an already memory-starved API server. Tune APF PriorityLevelConfiguration to give system and leader-election flows more concurrency shares so that bulk re-list traffic does not starve critical control plane loops.

Prevention

  • Size API server memory limits for burst load, not steady state. Re-list storms can allocate temporary memory several times the size of the response.
  • Where your Kubernetes version honors explicit sizes, tune --watch-cache-sizes proactively as object counts grow. Monitor apiserver_storage_objects per resource type to predict cache pressure.
  • Ensure APF flow schemas isolate bulk LIST traffic from system-critical traffic. Monitor apiserver_flowcontrol_rejected_requests_total.
  • Audit controllers and operators for direct etcd WATCH behavior and for aggressive reconnection logic without exponential backoff.
  • Use API server readiness checks in your load balancer to avoid routing traffic to instances with cold watch caches.
  • Monitor etcd disk latency and leader stability. Slow etcd extends the time the watch cache takes to catch up after a restart.
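When auditing reconnection logic, the property to look for is exponential backoff with randomization, so that clients disconnected at the same instant do not retry at the same instant. A minimal sketch of the "full jitter" variant follows. `backoff_delays` and its parameters are illustrative and are not client-go's implementation.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=random):
    """Yield full-jitter delays: uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, ceiling)


# A seeded generator keeps the example reproducible; real clients should
# use an unseeded source so that they do not all pick the same delays.
delays = list(backoff_delays(base=1.0, cap=10.0, attempts=6, rng=random.Random(42)))
for attempt, delay in enumerate(delays):
    print(f"attempt {attempt}: wait {delay:.2f}s before re-listing")
```

The cap bounds the worst-case recovery time, while the jitter spreads a synchronized wave of re-lists across the whole interval instead of concentrating it at the interval boundaries.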

How Netdata helps

  • Correlate LIST latency, API server RSS, inflight requests, and etcd list latency on one timeline to distinguish cache pressure from etcd saturation.
  • Track active WATCH connections and watch event rates to spot disconnections before they trigger re-lists.
  • Alert on composite signals: LIST latency spikes combined with 410 rates and memory pressure.