Kubernetes API server etcd latency: detection and cascading failures

When etcd slows down, the entire control plane slows with it. A few extra milliseconds of disk fsync latency turn into hung kubectl commands, backed-up controller queues, and eventually a cluster that cannot schedule pods or update endpoints. This guide shows how to detect the etcd latency cascade, confirm whether storage is the root cause, and break the feedback loop before the cluster becomes effectively read-only.

What this means

etcd serializes every Kubernetes mutation. Every API server write becomes a Raft proposal that must fsync to the WAL before etcd acknowledges it. When the disk under etcd is slow, every fsync waits longer. The API server holds mutating requests open until etcd responds. Requests pile up in the inflight queue. Once the queue hits the limit, the API server returns 429 Too Many Requests. Controllers that depend on writes (scheduler, replica set controller, and others) fall behind and retry. Retries generate more write load. The result is a feedback loop: slow disk -> slow etcd -> slow API server -> retry storm -> amplified etcd load.
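To make the feedback loop concrete, here is a toy discrete-time model in Python. It is only a sketch: the function name simulate_cascade, its parameters, and every number in it are illustrative assumptions, not measurements and not a description of API server internals. On each tick, etcd completes a fixed number of in-flight writes, new requests are admitted up to the inflight limit, the remainder are rejected (the 429 path), and a fraction of the rejected requests return as retries on the next tick.

```python
def simulate_cascade(arrival_per_tick, service_per_tick, inflight_limit,
                     retry_fraction, ticks):
    """Toy model of the cascade. Returns a list of (inflight, rejected) per tick.

    All parameters are hypothetical knobs for illustration only.
    """
    inflight = 0
    retries = 0
    history = []
    for _ in range(ticks):
        # Offered load is fresh traffic plus retries of earlier rejections.
        offered = arrival_per_tick + retries
        # etcd acknowledges at most service_per_tick in-flight writes per tick.
        inflight -= min(inflight, service_per_tick)
        # Admit requests only while there is room under the inflight limit.
        admitted = min(offered, inflight_limit - inflight)
        rejected = offered - admitted
        inflight += admitted
        # A fraction of rejected requests come back on the next tick.
        retries = int(rejected * retry_fraction)
        history.append((inflight, rejected))
    return history


# Healthy disk: etcd can complete more writes per tick than arrive.
healthy = simulate_cascade(50, 100, 200, 1.0, 10)
# Slow disk: etcd completes fewer writes per tick than arrive.
slow = simulate_cascade(50, 20, 200, 1.0, 10)
print("healthy:", healthy[-1])
print("slow:   ", slow[-1])
```

In the healthy run nothing is rejected. In the slow run the inflight count climbs to the limit and the rejection count then grows on every tick, because retries stack on top of new traffic.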

The failure is asymmetric. Read operations served from the API server watch cache may still respond quickly, so kubectl get can look healthy while kubectl create or kubectl delete hangs. This asymmetry makes the cascade easy to misdiagnose as an API server problem rather than a storage problem.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Disk I/O saturation on the etcd host | WAL fsync p99 climbing above 10ms; leader election storms | iostat -x 1 on etcd nodes |
| Network-attached storage latency | Variable fsync spikes; cloud burst credit exhaustion | Disk type and burst balance |
| etcd database approaching quota | DB size near 80% of the default 2GB; writes fail with NOSPACE alarm | etcdctl endpoint status --write-out=table |
| etcd compaction or defragmentation | Periodic latency spikes aligned with maintenance windows | etcd logs for "compact" or "defrag" |
| Network partition between API server and etcd | Uniform mutating latency elevation; API server readyz etcd check fails | etcdctl endpoint health and peer RTT |

Quick checks

Run these in order. All are read-only.

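# Check etcd WAL fsync latency (etcd metrics endpoint).
# The curl commands in this list assume plain HTTP on the client port. If etcd
# enforces TLS there, use https with --cacert/--cert/--key, or query the
# --listen-metrics-urls address instead (kubeadm typically sets http://127.0.0.1:2381).
curl -s http://localhost:2379/metrics | grep ^etcd_disk_wal_fsync_duration_seconds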

# Check etcd backend commit latency
curl -s http://localhost:2379/metrics | grep ^etcd_disk_backend_commit_duration_seconds

# Check etcd cluster health and member status
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

# Check etcd DB size, leader status, and Raft index
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --cluster --write-out=table

# Check API server readyz etcd sub-check
kubectl get --raw='/readyz?verbose' | grep -A2 etcd

# Check disk I/O wait on the etcd host
iostat -x 1 5

# Check API server inflight requests (mutating and read-only)
kubectl get --raw='/metrics' | grep ^apiserver_current_inflight_requests

# Check 429 rejection rate
kubectl get --raw='/metrics' | grep 'apiserver_request_total' | grep 'code="429"'

# Check etcd leader changes
curl -s http://localhost:2379/metrics | grep ^etcd_server_leader_changes_seen_total

# Check pending Raft proposals
curl -s http://localhost:2379/metrics | grep ^etcd_server_proposals_pending
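The metrics endpoint exposes these latencies as cumulative histogram buckets, not as a ready-made p99. The Python sketch below estimates a quantile from the raw text using linear interpolation inside the matching bucket, similar in spirit to Prometheus's histogram_quantile. The function name and the sample values are assumptions for illustration. It assumes the metric has a single series (only the le label), and because the buckets are cumulative since process start, the result reflects the lifetime of the process, not the last few minutes.

```python
import math
import re


def histogram_quantile(metrics_text, metric, q):
    """Estimate quantile q (0 < q <= 1) from Prometheus text-format buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    prefix = metric + "_bucket{"
    buckets = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        labels, value = line.rsplit("}", 1)
        match = re.search(r'le="([^"]+)"', labels)
        if match is None:
            continue
        upper = math.inf if match.group(1) == "+Inf" else float(match.group(1))
        buckets.append((upper, float(value)))
    buckets.sort()
    if not buckets or buckets[-1][1] == 0:
        return None  # no observations yet
    rank = q * buckets[-1][1]
    prev_upper, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_upper
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_upper + (upper - prev_upper) * fraction
        prev_upper, prev_count = upper, count
    return prev_upper


# Synthetic sample, not real cluster output.
SAMPLE = """
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 50
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 90
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 98
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 100
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 100
etcd_disk_wal_fsync_duration_seconds_sum 0.25
etcd_disk_wal_fsync_duration_seconds_count 100
"""

p99 = histogram_quantile(SAMPLE, "etcd_disk_wal_fsync_duration_seconds", 0.99)
print(f"estimated p99: {p99 * 1000:.1f} ms")
```

For a view of recent latency, a monitoring system that computes the quantile over a rate window is the better tool; this sketch is only useful for a quick look during an incident.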

How to diagnose it

Confirm the cascade and find the root cause.

  1. Confirm mutating API latency is elevated. Check apiserver_request_duration_seconds for POST, PUT, and PATCH verbs. If p99 is above 500ms sustained, the control plane is degrading. If it is above 1s, the cluster is in active failure.

  2. Check etcd WAL fsync latency. Look at etcd_disk_wal_fsync_duration_seconds on the etcd metrics endpoint. In a healthy cluster, p99 is below 10ms. Above 100ms is critical. This is the root cause signal. If it is elevated, the problem is under etcd, not in the API server.

  3. Check etcd leader stability. Look at etcd_server_leader_changes_seen_total. This is a counter, so in a stable cluster it should stay flat. If it is incrementing, followers are likely missing the leader's heartbeats: disk latency is delaying them past the default 100ms heartbeat interval and, in the worst case, past the 1000ms election timeout.

  4. Check etcd database size versus quota. Run etcdctl endpoint status --write-out=table. Compare DB SIZE to the configured --quota-backend-bytes (default 2GB). If the database is above 80% of quota, etcd is approaching the NOSPACE alarm. Once the quota is exceeded, etcd raises the alarm and rejects writes until space is reclaimed and the alarm is disarmed.

  5. Check API server inflight requests and 429 rate. Look at apiserver_current_inflight_requests and apiserver_request_total{code="429"}. If inflight is climbing toward the limit (default 200 mutating, 400 read-only) and 429s are appearing, the API server is saturated because it is waiting on etcd.

  6. Check disk I/O on the etcd host. Run iostat -x 1 and look for high %util, elevated await, or queue depth near the device limit. If disk utilization is near 100%, the storage subsystem is the bottleneck. If the disk is network-attached, check for burst credit exhaustion.

  7. Distinguish from admission webhook slowdown. Check apiserver_admission_webhook_admission_duration_seconds. If webhook latency is normal while mutating API latency is high, etcd is the culprit. If webhook latency is also elevated, the bottleneck may be a slow webhook instead.
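The steps above can be condensed into a small triage function. This is a sketch, not a tool: the function name triage, its parameters, and the finding strings are hypothetical, and the thresholds are simply the ones quoted in the steps (500ms and 1s for mutating p99, 10ms and 100ms for WAL fsync p99, 80% of quota). The inputs are values you have already collected by hand.

```python
DEFAULT_QUOTA_BYTES = 2 * 1024**3  # etcd's default --quota-backend-bytes


def triage(mutating_p99_s, wal_fsync_p99_s, leader_changes_delta,
           db_size_bytes, webhook_elevated, quota_bytes=DEFAULT_QUOTA_BYTES):
    """Return a list of findings, checked in the same order as the steps."""
    # Step 1: is mutating latency elevated at all?
    if mutating_p99_s <= 0.5:
        return ["mutating latency below 500ms: cascade not confirmed"]
    findings = ["active failure (p99 > 1s)" if mutating_p99_s > 1.0
                else "control plane degrading (p99 > 500ms)"]
    # Step 2: WAL fsync latency is the root cause signal.
    if wal_fsync_p99_s > 0.1:
        findings.append("WAL fsync critical (> 100ms): problem is under etcd")
    elif wal_fsync_p99_s > 0.01:
        findings.append("WAL fsync elevated (> 10ms): inspect etcd disk I/O")
    # Step 3: any increase in the leader change counter is suspicious.
    if leader_changes_delta > 0:
        findings.append("leader changes increasing: heartbeats are being missed")
    # Step 4: database size versus quota.
    if db_size_bytes > 0.8 * quota_bytes:
        findings.append("DB size above 80% of quota: NOSPACE risk")
    # Step 7: rule a slow admission webhook in or out.
    if webhook_elevated:
        findings.append("webhook latency elevated: a slow webhook may be the bottleneck")
    return findings


for finding in triage(1.5, 0.15, 2, int(1.9 * 1024**3), False):
    print("-", finding)
```

Steps 5 and 6 (inflight saturation and host disk I/O) are left out because they confirm the effect and the cause on the host, and are better read directly from the metrics and from iostat.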

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| etcd_disk_wal_fsync_duration_seconds | WAL fsync is on the critical path of every write | p99 > 10ms trending upward |
| etcd_disk_backend_commit_duration_seconds | Backend commit affects read performance and compaction | p99 > 25ms sustained |
| etcd_request_duration_seconds | API server's client-side view of etcd latency | p99 > 100ms for writes |
| apiserver_request_duration_seconds (mutating verbs) | End-to-end latency of writes through the API server | p99 > 500ms sustained |
| etcd_server_leader_changes_seen_total | Leader elections cause brief write outages | Any increase in a stable cluster |
| etcd_mvcc_db_total_size_in_bytes | Approaching quota causes write rejection | > 50% of --quota-backend-bytes |
| apiserver_current_inflight_requests | Indicates API server saturation | > 80% of configured limit |
| apiserver_request_total{code="429"} | Confirms APF or inflight saturation | Sustained rate above zero |
| etcd_server_proposals_pending | Rising value means Raft cannot reach consensus fast enough | Value increasing over time |
| etcd_network_peer_round_trip_time_seconds | High peer RTT causes leader instability | p99 > 1ms between peers |

Fixes

If the cause is disk I/O saturation

Identify competing I/O workloads on the etcd host. If etcd is stacked with the API server or shares a disk with logging agents, move etcd to dedicated SSD or NVMe storage. Do not run etcd on network-attached storage in production. If one member's disk is degraded and that member is the leader, move leadership to a healthy member (for example with etcdctl move-leader) while the disk is repaired or replaced.

If the cause is database size or fragmentation

Compact old revisions with etcdctl compact, which requires a target revision (the current revision appears in the output of etcdctl endpoint status --write-out=json). Then defragment one member at a time, followers first and leader last, because defragmentation blocks the member while it runs and can cause latency spikes. After freeing space, clear any active NOSPACE alarm with etcdctl alarm disarm. Consider increasing --quota-backend-bytes if the cluster legitimately needs more than 2GB.
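Before compacting and defragmenting, it helps to know the current revision, how much of the quota is used, and how much space defragmentation could reclaim. The Python sketch below reads those values from the output of etcdctl endpoint status --write-out=json. The field names (Endpoint, Status, header.revision, dbSize, dbSizeInUse) are what recent etcd 3.x releases emit as far as I know, but treat them as an assumption and check them against your version. The function name db_report and the sample document are made up for illustration.

```python
import json

DEFAULT_QUOTA_BYTES = 2 * 1024**3  # etcd's default --quota-backend-bytes


def db_report(status_json, quota_bytes=DEFAULT_QUOTA_BYTES):
    """Summarize each member from `etcdctl endpoint status --write-out=json`."""
    reports = []
    for member in json.loads(status_json):
        status = member["Status"]
        size = status["dbSize"]
        # dbSizeInUse may be missing on older releases; fall back to dbSize.
        in_use = status.get("dbSizeInUse", size)
        reports.append({
            "endpoint": member["Endpoint"],
            # Current revision, the upper bound for an etcdctl compact target.
            "revision": status["header"]["revision"],
            # Fraction of the backend quota occupied on disk.
            "quota_used": size / quota_bytes,
            # Fraction of the file that defragmentation could give back.
            "reclaimable": (size - in_use) / size if size else 0.0,
        })
    return reports


# Synthetic sample, not real cluster output.
SAMPLE = json.dumps([{
    "Endpoint": "https://127.0.0.1:2379",
    "Status": {
        "header": {"revision": 48213},
        "dbSize": 1610612736,      # 1.5 GiB on disk
        "dbSizeInUse": 402653184,  # 0.375 GiB of live data
    },
}])

for report in db_report(SAMPLE):
    print(report)
```

A high reclaimable fraction suggests defragmentation will help. A high quota_used with a low reclaimable fraction suggests the data itself is large, and compaction or a larger quota is the more relevant fix.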

If the cause is periodic compaction or defragmentation

Compaction causes predictable latency spikes. Ensure the Kubernetes API server --etcd-compaction-interval and etcd’s own --auto-compaction-retention are aligned and not conflicting. Schedule defragmentation during maintenance windows, not during peak load.

If the cause is network latency or partition

Check etcd_network_peer_round_trip_time_seconds between members. If RTT is above 1ms, investigate the network path. Ensure etcd members are deployed with odd cardinality (3 or 5) so the cluster can tolerate member loss without losing quorum. If the API server cannot reach etcd, verify network policies, firewalls, and certificate validity on the etcd client paths.

Prevention

  • Monitor etcd_disk_wal_fsync_duration_seconds with the same urgency as API server latency. Alert when p99 exceeds 10ms.
  • Keep etcd database size below 50% of quota. Track the trend and schedule compaction before reaching 75%.
  • Run etcd on dedicated local SSD or NVMe. In a stacked topology, keep the etcd data directory on its own disk instead of sharing it with workloads, logging, or the API server.
  • Alert on any etcd leader change in a stable cluster. Even one per hour indicates disk or network stress.
  • Ensure client certificate rotation is working. Expired etcd client or peer certificates can appear as latency or connectivity failures.
  • Size API server inflight limits and APF concurrency shares to leave headroom for bursts. Sustained utilization above 50% of inflight capacity should trigger capacity review.
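The inflight headroom check in the last bullet can be scripted against the API server metrics text. The Python sketch below is a rough aid under stated assumptions: the names inflight_utilization and review_needed are hypothetical, the limits are the defaults quoted earlier (200 mutating, 400 read-only), and the sample input is synthetic. With API Priority and Fairness enabled, the two limits are combined into one total concurrency budget, so treat the per-kind ratios as an approximation.

```python
import re

# Defaults for --max-mutating-requests-inflight and --max-requests-inflight.
DEFAULT_LIMITS = {"mutating": 200, "readOnly": 400}

_INFLIGHT = re.compile(
    r'^apiserver_current_inflight_requests\{[^}]*request_kind="(\w+)"[^}]*\}\s+(\S+)$'
)


def inflight_utilization(metrics_text, limits=None):
    """Return {request_kind: fraction of the configured limit in use}."""
    limits = limits or DEFAULT_LIMITS
    utilization = {}
    for line in metrics_text.splitlines():
        match = _INFLIGHT.match(line.strip())
        if match and match.group(1) in limits:
            kind = match.group(1)
            utilization[kind] = float(match.group(2)) / limits[kind]
    return utilization


def review_needed(utilization, threshold=0.5):
    """List the request kinds whose utilization exceeds the review threshold."""
    return sorted(kind for kind, used in utilization.items() if used > threshold)


# Synthetic sample, not real cluster output.
SAMPLE = """
# TYPE apiserver_current_inflight_requests gauge
apiserver_current_inflight_requests{request_kind="mutating"} 170
apiserver_current_inflight_requests{request_kind="readOnly"} 100
"""

usage = inflight_utilization(SAMPLE)
print(usage, "review:", review_needed(usage))
```

A single scrape only shows an instant. The bullet above is about sustained utilization, so the same calculation is more meaningful when applied to values averaged over time.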

How Netdata helps

  • Correlate etcd_disk_wal_fsync_duration_seconds with apiserver_request_duration_seconds on the same timeline to confirm the cascade.
  • Track apiserver_current_inflight_requests and 429 rates alongside etcd metrics to watch saturation build before an outage.
  • Monitor disk I/O wait, utilization, and queue depth on etcd nodes to distinguish disk saturation from application-level slowdown.
  • Alert on etcd leader changes and database size trends.

```mermaid
flowchart TD
    A[Slow disk I/O] --> B[etcd WAL fsync delay]
    B --> C[Leader misses heartbeat]
    C --> D[Raft leader election]
    D --> E[Brief write unavailability]
    E --> F[API server mutating requests timeout]
    F --> G[Inflight requests accumulate]
    G --> H[429 Too Many Requests]
    H --> I[Controllers retry]
    I --> J[Amplified write load on etcd]
    J --> B
```