Kubernetes etcd defragmentation: when, how, and what breaks
etcd’s database file grows over time even when you are not adding objects. Compaction removes old revisions logically, but the bbolt backend does not shrink the file on disk. Without defragmentation, the gap between physical file size and actual data footprint widens until the cluster hits a NOSPACE alarm and rejects all writes. This guide covers measuring that gap, sequencing defragmentation across an HA cluster without triggering leader elections, and the monitoring signals that predict when you need to act.
What this means
In Kubernetes, etcd stores every object revision in a bbolt database. When the API server compacts history, old revisions are marked as free space inside the file, but the filesystem block allocation remains. Defragmentation rewrites the database to reclaim that dead space. The operation is atomic and blocking: the targeted member stops serving reads and writes until the rewrite completes. In a three-member cluster, the remaining two members continue to serve traffic. The risk is timing. If the defragmented member is the leader and the operation exceeds the Raft election timeout, followers start an election. That causes a brief API server outage and a latency spike for mutating requests. The API server sends every write to etcd synchronously, so latency on the etcd member it is talking to shows up as API latency.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Compaction without defragmentation | etcd_mvcc_db_total_size_in_bytes grows while in-use size stays flat | Ratio of total to in-use DB size |
| High object churn | Rapid etcd_mvcc_revision increases; events or leases dominate | apiserver_storage_objects by resource type |
| Missing auto-compaction | Linear DB growth with no periodic flattening | etcd --auto-compaction-retention and API server --etcd-compaction-interval |
| Low default quota | Default etcd quota is 2 GB; many bootstrap tools do not override it | etcd_mvcc_db_total_size_in_bytes vs --quota-backend-bytes |
Quick checks
Production etcd clusters use TLS. The etcdctl and curl commands below omit --cacert, --cert, --key, and scheme flags for brevity. Supply them or set the corresponding environment variables before running these against a real cluster.
- Check DB size and leader status per member.
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table
- Compare total and in-use size from metrics. Good: ratio below 2:1. Bad: ratio above 2:1 or total size above 80% of quota.
curl -s http://localhost:2379/metrics | grep -E 'etcd_mvcc_db_total_size_in_bytes|etcd_mvcc_db_total_size_in_use_in_bytes'
- Check for active alarms.
ETCDCTL_API=3 etcdctl alarm list
- Verify auto-compaction is configured.
# Check etcd auto-compaction settings (output may be truncated in ps; fall back to the static Pod manifest if needed)
ps aux | grep etcd | grep -o -e '--auto-compaction-[^ ]*'
- Verify API server compaction interval.
# Check API server-side compaction interval
ps aux | grep kube-apiserver | grep -o -e '--etcd-compaction-interval=[^ ]*'
- Check WAL fsync latency before blocking a member. A p99 above 100 ms means defragmentation will take longer and election risk is higher.
# WAL fsync is a histogram; query your metrics backend for p99, or inspect upper buckets
curl -s http://localhost:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds
- Identify high-churn resource types.
# Find which resource types are consuming the most storage
kubectl get --raw /metrics | grep '^apiserver_storage_objects'
- Check etcd metrics for signs of stress.
# Look for failed proposals or slow applies
curl -s http://localhost:2379/metrics | grep -E 'etcd_server_proposals_failed_total|etcd_server_slow_apply_total'
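The two size metrics from the checks above can be turned into the ratio and quota figures with a small helper. This is a minimal sketch, not part of etcd: the function name frag_report is hypothetical, it reads Prometheus text-format metrics on standard input, and it assumes the 2 GiB default quota unless you pass your own --quota-backend-bytes value as the first argument.

```shell
# Hypothetical helper: reads etcd /metrics text on stdin and prints the
# fragmentation ratio and the share of the quota that the file occupies.
# $1: quota in bytes (assumed to be the 2 GiB default when omitted)
frag_report() {
  awk -v quota="${1:-2147483648}" '
    $1 == "etcd_mvcc_db_total_size_in_bytes"        { total = $2 + 0 }
    $1 == "etcd_mvcc_db_total_size_in_use_in_bytes" { inuse = $2 + 0 }
    END {
      if (total == 0 || inuse == 0) { print "missing metrics"; exit 1 }
      printf "ratio=%.2f quota_pct=%.1f\n", total / inuse, 100 * total / quota
    }'
}
```

For example, `curl -s http://localhost:2379/metrics | frag_report 8589934592` would report against an 8 GiB quota. A ratio above 2.00 or a quota_pct above 80.0 matches the "Bad" thresholds given above.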
How to diagnose it
- Confirm physical bloat. Query etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes. If the ratio is greater than 2:1, the database is fragmented.
- Verify compaction is actually running. Check etcd flags for --auto-compaction-mode and --auto-compaction-retention, and the API server for --etcd-compaction-interval. If these are absent or mismatched, compaction is not reclaiming logical space and the file will grow indefinitely.
- Assess disk health before maintenance. Query p99 of etcd_disk_wal_fsync_duration_seconds from your metrics backend. Values above 100 ms indicate the disk is already slow; defragmentation will extend the member’s blocking window and increase the chance of a leader election timeout.
- Validate quorum health. Do not defragment if any member is unhealthy or if the cluster is not at full strength. A two-member cluster cannot tolerate another member being offline.
- Locate the leader. Use etcdctl endpoint status --cluster -w table. Plan to defragment followers first, then move leadership to a follower, and defragment the former leader last.
- Estimate the maintenance window. Duration depends on database size and disk throughput. If the defragmented member is the leader and the blocking window exceeds the Raft election timeout, followers start an election.
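The ordering rule from the steps above (followers first, current leader last) can be sketched as a small filter. The function name defrag_order and its input format (one "endpoint is_leader" pair per line) are assumptions made for illustration. The jq expression in the comment is one possible way to produce that input from etcdctl's JSON output and should be checked against your etcdctl version.

```shell
# Hypothetical helper: reads "endpoint true|false" lines on stdin, where the
# second field says whether that member is the current leader, and prints the
# endpoints in a safe defragmentation order: followers first, leader last.
#
# One possible way to build the input (field names as emitted by etcdctl -w json):
#   etcdctl endpoint status --cluster -w json |
#     jq -r '.[] | "\(.Endpoint) \(.Status.leader == .Status.header.member_id)"'
defrag_order() {
  awk '
    $2 == "true" { leader = $1; next }   # hold the leader back
    { print $1 }                         # followers keep their input order
    END { if (leader != "") print leader }'
}
```

The leader still has to hand over leadership before its turn; the filter only fixes the sequence.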
flowchart TD
A[Fragmentation ratio > 2:1 or quota > 80%] --> B{Cluster is 3-member HA?}
B -- No --> C[Defrag single member during maintenance window]
B -- Yes --> D[Identify leader: etcdctl endpoint status]
D --> E[Defrag follower 1]
E --> F[Wait for recovery]
F --> G[Defrag follower 2]
G --> H[Move leadership to a follower]
H --> I[Defrag former leader]
I --> J[Verify endpoint health and DB size]
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| etcd_mvcc_db_total_size_in_bytes vs etcd_mvcc_db_total_size_in_use_in_bytes | Measures physical bloat vs actual data | Ratio > 2:1 or total size > 80% of quota |
| etcd_disk_wal_fsync_duration_seconds | Defrag is I/O intensive; slow disks extend the blocking window | p99 > 100 ms before maintenance |
| etcd_server_leader_changes_seen_total | Defragmenting the leader without stepping it down can trigger election | Any increase during a maintenance window |
| etcd_server_has_leader | Confirms the member recovered after being blocked | = 0 after maintenance |
| apiserver_request_duration_seconds (mutating verbs) | A member blocking on defrag or a leader election raises API write latency | p99 > 1 s sustained during defrag |
| etcd_mvcc_db_total_size_in_use_in_bytes growth rate | Even after defrag, unchecked growth brings you back to quota pressure | > 100 MB/day without corresponding workload growth |
| apiserver_storage_objects (events) | Events have a 1-hour TTL but can outpace cleanup and fill etcd | Events count > 50,000 or growing unboundedly |
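The growth-rate signal in the table translates into a rough time-to-quota estimate. The sketch below is plain arithmetic under stated assumptions: growth is linear, the database has just been defragmented so total size is close to in-use size, and the function name days_to_quota is hypothetical.

```shell
# Hypothetical helper: estimates how many days remain before the quota is hit.
# $1: current in-use size in bytes
# $2: observed growth in bytes per day
# $3: quota in bytes (--quota-backend-bytes)
days_to_quota() {
  awk -v used="$1" -v growth="$2" -v quota="$3" 'BEGIN {
    if (growth <= 0) { print "no growth"; exit 0 }
    printf "%.1f\n", (quota - used) / growth
  }'
}
```

With 1 GiB in use, 100 MiB of growth per day, and the 2 GiB default quota, `days_to_quota 1073741824 104857600 2147483648` gives roughly ten days of headroom.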
Fixes
If the cause is fragmentation buildup
Compact current revisions if auto-compaction is not enabled, then defragment one member at a time. On etcd 3.4 and 3.5, run:
# Defragment a single member (blocking operation)
ETCDCTL_API=3 etcdctl defrag --endpoints=https://<member-ip>:2379
Sequence: defragment follower 1, wait for it to recover, defragment follower 2, move leadership to a follower, then defragment the former leader. Do not defragment the leader first.
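In etcd 3.6, the offline form etcdctl defrag --data-dir was removed in favor of etcdutl defrag; online defragmentation with etcdctl defrag --endpoints remains available. Use etcdutl for offline defragmentation, or the distribution-specific live-defrag tooling where your platform provides it.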
If the target is the current leader
Move leadership to a follower before defragmenting the former leader. The exact transfer mechanism depends on your etcd version and distribution tooling, but the rule is absolute: never defragment the leader first in a busy cluster. If the leader is defragmented without stepping it down and the operation exceeds the election timeout, the cluster will elect a new leader mid-operation and API server mutating latency will spike.
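On etcd versions that ship the etcdctl move-leader subcommand, the hand-off can be sketched as follows. The wrapper names run and handoff_then_defrag and the DRY_RUN variable are hypothetical. TLS flags are omitted as elsewhere in this guide, and the target member ID (hex, as printed by etcdctl member list) is something you supply.

```shell
export ETCDCTL_API=3

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# $1: client endpoint of the current leader
# $2: hex member ID of the follower that should take over leadership
handoff_then_defrag() {
  leader_ep="$1"
  target_id="$2"
  # move-leader has to be sent to the current leader's endpoint.
  run etcdctl --endpoints="$leader_ep" move-leader "$target_id" || return 1
  # Show cluster status so the new leader can be confirmed before blocking the old one.
  run etcdctl --endpoints="$leader_ep" endpoint status --cluster -w table || return 1
  # Only now defragment the former leader.
  run etcdctl --endpoints="$leader_ep" defrag
}
```

Running it once with DRY_RUN=1 prints the three commands without touching the cluster, which is a cheap way to review endpoints and member IDs before the real run.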
If the cause is a NOSPACE alarm
When etcd raises a NOSPACE alarm it rejects all writes. Recovery is:
# Compact to the current revision
ETCDCTL_API=3 etcdctl compact $(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
# Defragment each member sequentially
ETCDCTL_API=3 etcdctl defrag --endpoints=https://<member-ip>:2379
# Disarm the alarm after space is freed
ETCDCTL_API=3 etcdctl alarm disarm
If the database is still near the limit after cleanup, increase --quota-backend-bytes, but only after you have addressed the root cause of growth.
If the cause is slow disk extending defrag time
Schedule defragmentation during a maintenance window. If etcd_disk_wal_fsync_duration_seconds p99 is above 100 ms, treat the disk as a risk factor. Move etcd data to a dedicated local SSD or NVMe volume. Network-attached storage is not suitable for etcd defragmentation at scale.
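When no metrics backend is at hand, the 100 ms check can be approximated straight from the histogram buckets. This is a rough sketch with a hypothetical function name, wal_fsync_p99_bucket. It reports the upper bound of the first bucket that covers 99% of all samples, and because the counters are cumulative since the process started, the result is a lifetime figure rather than a recent window. A rate-based p99 from your metrics backend remains the better signal.

```shell
# Hypothetical helper: reads etcd /metrics text on stdin and prints the upper
# bound (in seconds) of the histogram bucket that contains the 99th percentile
# of etcd_disk_wal_fsync_duration_seconds.
wal_fsync_p99_bucket() {
  awk '
    /^etcd_disk_wal_fsync_duration_seconds_bucket/ {
      # Extract the le="..." upper bound and the cumulative sample count.
      le = $1
      sub(/^.*le="/, "", le)
      sub(/".*$/, "", le)
      n++; bound[n] = le; count[n] = $2 + 0
    }
    /^etcd_disk_wal_fsync_duration_seconds_count/ { total = $2 + 0 }
    END {
      if (total == 0) { print "no samples"; exit 1 }
      # Buckets are exposed in ascending order; stop at the first one
      # whose cumulative count reaches 99% of all samples.
      for (i = 1; i <= n; i++)
        if (count[i] >= 0.99 * total) { print bound[i]; exit 0 }
    }'
}
```

A printed bound above 0.1 means the lifetime p99 may exceed the 100 ms threshold used in this guide.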
Prevention
- Align compaction settings. Ensure etcd runs with --auto-compaction-mode=periodic and --auto-compaction-retention set appropriately, and verify that the API server’s --etcd-compaction-interval does not conflict.
- Monitor the fragmentation ratio. Alert when total DB size exceeds 80% of --quota-backend-bytes, or when the ratio of total to in-use size exceeds 2:1.
- Control object churn. Reduce event TTL if your cluster generates high event volume, and enforce cleanup of completed Jobs and orphaned CRD instances.
- Do not double-schedule defragmentation. Some distributions handle defragmentation via an operator. Verify whether your distribution automates this before adding external jobs.
- Size quota for production. The default etcd quota of 2 GB is often too small for production workloads. Increase --quota-backend-bytes before you need it.
- Run defragmentation during low-traffic windows. This minimizes the impact of the brief member unavailability and reduces the chance that API server retries amplify load during the maintenance.
How Netdata helps
- Surfaces etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes side-by-side so you can see the fragmentation ratio without manual curls.
- Correlates etcd_disk_wal_fsync_duration_seconds with apiserver_request_duration_seconds on the same timeline, making it obvious when a defrag or compaction spike propagated to the API server.
- Tracks etcd_server_has_leader and etcd_server_leader_changes_seen_total so you can confirm that maintenance did not destabilize the Raft consensus.
- Alerts on quota utilization and on sustained etcd latency, giving you a buffer to act before the NOSPACE alarm fires.
Related guides
- Kubernetes API server etcd latency: detection and cascading failures
- Kubernetes API server slow or unresponsive: causes and fixes
- Kubernetes API server memory pressure: OOM cycle and tuning
- Kubernetes API server rate limiting: APF priority levels and starvation
- Kubernetes controller-manager leader election failures






