Kubernetes API server watch storm: re-list cascades and connection floods
A sudden wall of LIST requests pins API server CPU, drives memory up, spikes etcd read latency, and makes controllers lag. The culprit is usually a watch storm: hundreds or thousands of clients re-listing at the same time because their watch connections failed or fell behind. Each re-list triggers an expensive etcd range scan and serializes all matching objects. Under load, this saturates CPU, fills network bandwidth, and can trigger API Priority and Fairness (APF) throttling or memory pressure. If the API server restarts before the storm subsides, the cycle repeats.
What this means
Kubernetes informers use a list-then-watch cycle. The client LISTs to establish a baseline, then opens a long-lived WATCH to receive deltas. The API server serves most WATCH requests from an in-memory cache holding a bounded history window per resource type. If a watcher falls behind that window, or if the requested resourceVersion is too old, the server returns 410 Gone and closes the connection. The client must restart with a full LIST.
A watch storm begins when many clients lose WATCH connections simultaneously and issue fallback LISTs together. Each re-list is an etcd range scan followed by serialization of every matching object. In large clusters, a single LIST can allocate temporary memory roughly equal to several multiples of the response size. When hundreds of informers, controllers, and kubelets do this at once, CPU saturates, etcd read latency rises, and bandwidth fills with multi-megabyte payloads.
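To make the memory arithmetic above concrete, here is a minimal sketch of a worst-case estimate. The function name `estimate_burst_mib` and the multiplier of 3 are illustrative assumptions standing in for "several multiples"; measure your own cluster before relying on any specific figure.

```sh
# Rough upper bound on transient memory during a re-list burst, in MiB.
# Args: concurrent re-listing clients, LIST response size (MiB),
#       allocation multiplier (assumed; the real factor varies by workload).
estimate_burst_mib() {
  echo $(( $1 * $2 * $3 ))
}

# 500 clients re-listing a 20 MiB response with an assumed 3x multiplier
estimate_burst_mib 500 20 3   # prints 30000
```

The estimate assumes every client re-lists at the same instant, which is the pessimistic case; real bursts are usually spread over a few seconds.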
Connection floods compound the problem. After re-listing, clients immediately re-establish WATCH streams. If the API server has just restarted, its watch caches are empty. Clients that reconnect with a stale resourceVersion force another 410, triggering yet another re-list before the first wave subsides.
flowchart TD
A[API server restart or cache overflow] --> B[Watch connections close]
B --> C[Clients receive 410 Gone or TCP RST]
C --> D[Mass re-list: expensive etcd scans]
D --> E[LIST response serialization: CPU/memory spike]
E --> F[etcd read latency rises]
F --> G[New WATCH requests at stale RV fail]
G --> H[Second wave of re-lists]
E --> I[Inflight requests saturate / APF queues fill]
I --> J[429 rejections and client backoff storms]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| API server restart or rolling upgrade | LIST rate spikes immediately after a restart; /readyz shows informer-sync not ready | kubectl get --raw='/readyz?verbose' and watch cache population metrics |
| Watch cache overflow | Sustained 410 responses; LIST bursts targeting one resource type | apiserver_request_total with code="410" and object churn rate |
| Direct etcd watch leak | etcd memory grows unbounded; all watches on the apiserver terminate abruptly | etcd watcher count and memory; verify no clients access etcd directly |
| HA resourceVersion divergence | Clients reconnecting to a different apiserver instance get 410 despite low churn | Compare /readyz and latency across individual instances |
| Client retry storm | Rapid WATCH connection churn with no clear control plane trigger | apiserver_longrunning_requests and audit logs for repeating clients |
Quick checks
# LIST request rate
kubectl get --raw='/metrics' | grep 'apiserver_request_total{' | grep 'verb="LIST"'
# Active watch connections
kubectl get --raw='/metrics' | grep 'apiserver_longrunning_requests{' | grep 'verb="WATCH"'
# 410 Gone rate
kubectl get --raw='/metrics' | grep 'apiserver_request_total{' | grep 'code="410"'
# Inflight request saturation
kubectl get --raw='/metrics' | grep '^apiserver_current_inflight_requests'
# APF queue depth
kubectl get --raw='/metrics' | grep '^apiserver_flowcontrol_current_inqueue_requests'
# etcd list latency
kubectl get --raw='/metrics' | grep '^etcd_request_duration_seconds.*list'
# LIST response sizes
kubectl get --raw='/metrics' | grep '^apiserver_response_sizes.*verb="LIST"'
# Watch cache metrics (exact metric names vary by Kubernetes version)
kubectl get --raw='/metrics' | grep '^apiserver_watch_cache_'
# API server connection count on the node
ss -tnp | grep kube-apiserver | wc -l
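The raw LIST counter output is hard to read during an incident. A small helper can total it per resource, as sketched below. The function name `list_by_resource` is hypothetical, and the sketch assumes the usual Prometheus text format with a `resource` label on `apiserver_request_total`.

```sh
# Sum apiserver_request_total samples with verb="LIST" per resource.
# Reads Prometheus text format on stdin; prints "resource total",
# largest first. Values are cumulative counters, not rates.
list_by_resource() {
  awk '
    index($0, "apiserver_request_total{") == 1 && index($0, "verb=\"LIST\"") {
      # Anchor on "{" or "," so subresource="..." is not matched by mistake.
      if (match($0, /[{,]resource="[^"]*"/)) {
        # Skip the 11-character prefix and drop the closing quote.
        res = substr($0, RSTART + 11, RLENGTH - 12)
        total[res] += $NF
      }
    }
    END { for (r in total) printf "%s %d\n", r, total[r] }
  ' | sort -k2,2nr
}
```

Against a live cluster this could be run as `kubectl get --raw='/metrics' | list_by_resource | head`. Because the counters are cumulative since process start, compare two runs a minute apart to see which resource is actually being re-listed.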
How to diagnose it
1. Confirm the storm. Look for a spike in apiserver_request_total{verb="LIST"} and elevated apiserver_request_duration_seconds for LIST. Correlate with apiserver_response_sizes to verify that large payloads are driving latency. If p99 response size jumps at the same time as p99 latency, serialization is the bottleneck.
2. Identify the trigger. Check whether the spike follows an API server restart or rolling upgrade. Compare the API server pod start times against the LIST spike. If /readyz reports informer-sync as failing, the watch cache is still warming. If there was no restart, look for a jump in apiserver_request_total{code="410"} indicating watch cache overflow.
3. Check watch cache state. After a restart, watch cache population metrics will be near zero until the cache catches up. If the cache is warm but 410s persist, the cache window is likely too small for the rate of object churn.
4. Check etcd health. If etcd_request_duration_seconds for list operations is elevated while the watch cache is cold, etcd is serving the re-lists directly. Check etcd_disk_wal_fsync_duration_seconds and etcd_network_peer_round_trip_time_seconds to rule out disk or network latency. If etcd memory is climbing and watches are being terminated globally, investigate whether clients are accessing etcd directly outside the API server.
5. Check for saturation. If apiserver_current_inflight_requests exceeds 80% of the configured --max-requests-inflight limit, or if APF queues are rejecting requests, the storm is causing self-throttling. Check apiserver_flowcontrol_rejected_requests_total to see which priority levels are being throttled.
6. Inspect client behavior. Use audit logs or API server logs to identify whether a specific controller, operator, or node agent is responsible for most LIST traffic. Look for the same user-agent or source IP re-listing repeatedly without exponential backoff. Check if the client is ignoring 410s and re-listing in a tight loop.
7. Verify HA consistency. In HA clusters, compare /readyz and per-instance LIST latency across API server instances. If one instance returns 410s while others do not, or if its latency is an outlier, its watch cache may be lagging behind etcd. Check that your load balancer is not flipping clients between instances during reconnect storms.
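The 80% saturation check in the steps above can be sketched as a small filter over the metrics output. The function name `inflight_saturation` is hypothetical, and the sketch assumes the metric carries a `request_kind` label with the values `readOnly` and `mutating`; pass your configured --max-requests-inflight and --max-mutating-requests-inflight values as the two arguments.

```sh
# Report inflight usage as a percentage of the configured limits.
# Args: read-only limit, mutating limit. Reads Prometheus text on stdin.
inflight_saturation() {
  awk -v ro="$1" -v mu="$2" '
    index($0, "apiserver_current_inflight_requests{") == 1 {
      kind = ""
      if (index($0, "request_kind=\"readOnly\""))      { kind = "readOnly"; limit = ro }
      else if (index($0, "request_kind=\"mutating\"")) { kind = "mutating"; limit = mu }
      if (kind != "" && limit > 0) {
        pct = 100 * $NF / limit
        # Flag anything above the 80% threshold used in this guide.
        printf "%s %.0f%% %s\n", kind, pct, (pct > 80 ? "SATURATED" : "ok")
      }
    }
  '
}
```

A possible invocation is `kubectl get --raw='/metrics' | inflight_saturation 400 200`, where 400 and 200 are placeholders for your own flag values. With APF enabled the limits are distributed across priority levels, so treat the result as a coarse signal rather than an exact measure.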
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| LIST request rate | Direct measure of re-list load | Sustained rate greater than 5x baseline |
| LIST latency p99 | Serialization and etcd scan cost | p99 greater than 5 seconds for core resources |
| LIST response sizes | Large objects amplify memory and CPU cost | Spikes correlating with latency |
| Active WATCH connections | Disconnects trigger re-lists | Sharp drop followed by rapid rebound |
| 410 Gone rate | Watch cache history exceeded or RV stale | Sustained increase after a restart |
| Inflight requests | Global concurrency saturation | Greater than 80% of --max-requests-inflight |
| APF queue depth | Priority traffic being delayed | Non-zero queues for system or leader-election levels |
| etcd list latency | Storage backend bottleneck | p99 greater than 200 ms |
| Watch cache object count | Cold cache forces etcd reads | Near zero after restart; flatline during growth |
| API server memory | LIST allocates temporary memory proportional to response size | RSS greater than 80% of container limit |
| Goroutine count | Connection churn or goroutine leaks | Growth without corresponding traffic increase |
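The LIST request rate warning sign in the table compares a rate against a baseline, while the metric itself is a cumulative counter. A minimal sketch of that conversion follows; the function name `list_rate_check` and the sample numbers are illustrative assumptions, and the baseline is whatever you have observed in steady state.

```sh
# Convert two counter samples into a per-second rate and compare it
# with 5x a known baseline.
# Args: previous sample, current sample, seconds between them, baseline req/s.
list_rate_check() {
  awk -v prev="$1" -v curr="$2" -v secs="$3" -v base="$4" 'BEGIN {
    # A counter that went backwards means the API server restarted;
    # treat the previous sample as zero in that case.
    if (curr < prev) prev = 0
    rate = (curr - prev) / secs
    printf "%.1f req/s %s\n", rate, (rate > 5 * base ? "WARN" : "ok")
  }'
}

list_rate_check 120000 150000 60 40   # prints "500.0 req/s WARN"
```

A monitoring system computes this continuously; the function is only meant for a quick manual check from two scrapes taken during an incident.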
Fixes
If the cause is a restart or rolling upgrade
Do not restart the API server again. Allow the watch cache to warm. Ensure the container memory limit has headroom for the re-list burst, which can temporarily spike RSS 2-3x above baseline. If the API server is crash-looping due to OOM, raise the memory limit before it restarts, and set GOMEMLIMIT to roughly 90% of the container limit so the Go runtime triggers garbage collection earlier.
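The 90% guideline for GOMEMLIMIT is simple integer arithmetic, sketched below. The function name `gomemlimit_for` is hypothetical, and how you inject the variable (for example, an env entry in the API server manifest) depends on how your control plane is deployed.

```sh
# Derive a GOMEMLIMIT value at roughly 90% of the container memory limit.
# Arg: container memory limit in MiB. Output uses the MiB suffix,
# which the Go runtime accepts for GOMEMLIMIT.
gomemlimit_for() {
  echo "$(( $1 * 90 / 100 ))MiB"
}

gomemlimit_for 8192   # prints 7372MiB
```

GOMEMLIMIT is a soft limit: it makes the garbage collector work harder as the heap approaches the value, but it does not prevent an OOM kill if live memory genuinely exceeds the container limit.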
If the cause is watch cache overflow
Increase --watch-cache-sizes for the affected resource type. The flag is read only at startup, so an API server restart is required to apply new values. On recent Kubernetes releases the watch cache is sized dynamically, and the kube-apiserver flag reference states that zero (which disables caching for that resource) is the only meaningful size, so check the documentation for your version before relying on this fix. Schedule the restart during a maintenance window or ensure remaining instances can absorb the shifted load. Before restarting, ensure the new capacity exceeds the peak object count for that resource; churn rate determines how quickly the cache cycles.
If the cause is direct etcd watches
Identify clients that are bypassing the API server watch cache, including direct etcd watch sessions. These watches hold unbounded buffers and can exhaust etcd’s watch window, starving all other watchers and forcing global termination. Reconfigure the client to use the API server or to adopt a current resourceVersion.
If the cause is HA resourceVersion divergence
Verify that your load balancer is not flipping clients between API server instances during reconnect storms. If one instance is consistently behind, investigate why its watch cache is not catching up. Temporarily reducing the number of serving instances can force clients onto a stable, synced instance while the lagging instance recovers. Warning: this concentrates load on the remaining instances and can worsen the storm. Use only if the remaining instances have confirmed CPU, memory, and network headroom.
If the cause is APF or inflight saturation
Raise --max-requests-inflight and --max-mutating-requests-inflight only if the host has CPU and memory headroom. Do not increase limits on an already memory-starved API server. Tune APF PriorityLevelConfiguration to give system and leader-election flows more concurrency shares so that bulk re-list traffic does not starve critical control plane loops.
Prevention
- Size API server memory limits for burst load, not steady state. Re-list storms can allocate temporary memory equal to several multiples of the response size.
- Tune --watch-cache-sizes proactively as object counts grow. Monitor apiserver_storage_objects per resource type to predict cache pressure.
- Ensure APF flow schemas isolate bulk LIST traffic from system-critical traffic. Monitor apiserver_flowcontrol_rejected_requests_total.
- Audit controllers and operators for direct etcd WATCH behavior and for aggressive reconnection logic without exponential backoff.
- Use API server readiness checks in your load balancer to avoid routing traffic to instances with cold watch caches.
- Monitor etcd disk latency and leader stability. Slow etcd extends the time the watch cache takes to catch up after a restart.
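For custom clients or scripts that re-list on failure, the exponential backoff mentioned in the list above can be sketched as follows. This is a generic capped-backoff-with-jitter pattern, not the algorithm any particular Kubernetes client library uses; the function names `backoff_cap` and `backoff_delay` are hypothetical.

```sh
# Upper bound for the retry delay: base * 2^attempt, capped at max.
# Args: attempt number (from 0), base delay (s), maximum delay (s).
backoff_cap() {
  awk -v n="$1" -v base="$2" -v max="$3" 'BEGIN {
    d = base * 2 ^ n
    if (d > max) d = max
    print d
  }'
}

# Jittered delay: a random value between 0 and the cap, so that
# clients that failed together do not all retry together.
backoff_delay() {
  awk -v cap="$(backoff_cap "$1" "$2" "$3")" -v seed="$(( $$ + $1 ))" \
    'BEGIN { srand(seed); printf "%.2f\n", rand() * cap }'
}

# Caps for attempts 0..6 with a 1 s base and 30 s ceiling: 1 2 4 8 16 30 30
for attempt in 0 1 2 3 4 5 6; do backoff_cap "$attempt" 1 30; done
```

The jitter matters as much as the exponent: without it, clients that were disconnected by the same restart retry in lockstep and recreate the storm at each backoff interval.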
How Netdata helps
- Correlate LIST latency, API server RSS, inflight requests, and etcd list latency on one timeline to distinguish cache pressure from etcd saturation.
- Track active WATCH connections and watch event rates to spot disconnections before they trigger re-lists.
- Alert on composite signals: LIST latency spikes combined with 410 rates and memory pressure.
Related guides
- Kubernetes API server memory pressure: OOM cycle and tuning
- Kubernetes API server slow or unresponsive: causes and fixes
- Kubernetes API server etcd latency: detection and cascading failures
- Kubernetes API server rate limiting: APF priority levels and starvation
- Kubernetes conntrack exhaustion: dropped connections under load






