Kubernetes etcd defragmentation: when, how, and what breaks

etcd’s database file grows over time even when you are not adding objects. Compaction removes old revisions logically, but the bbolt backend does not shrink the file on disk. Without defragmentation, the gap between physical file size and actual data footprint widens until the cluster hits a NOSPACE alarm and rejects all writes. This guide covers measuring that gap, sequencing defragmentation across an HA cluster without triggering leader elections, and the monitoring signals that predict when you need to act.

What this means

In Kubernetes, etcd stores every object revision in a bbolt database. When the API server compacts history, old revisions are marked as free space inside the file, but the filesystem block allocation remains. Defragmentation rewrites the database to reclaim that dead space. The operation is atomic and blocking: the targeted member stops serving reads and writes until the rewrite completes. In a three-member cluster, the remaining two members continue to serve traffic. The risk is timing. If the defragmented member is the leader and the operation exceeds the Raft election timeout, followers start an election. That causes a brief API server outage and a latency spike for mutating requests. The API server sends every write to etcd synchronously, so latency on the etcd member it is talking to shows up as API latency.

Common causes

Cause	What it looks like	First thing to check
Compaction without defragmentation	`etcd_mvcc_db_total_size_in_bytes` grows while in-use size stays flat	Ratio of total to in-use DB size
High object churn	Rapid `etcd_mvcc_revision` increases; events or leases dominate	`apiserver_storage_objects` by resource type
Missing auto-compaction	Linear DB growth with no periodic flattening	etcd `--auto-compaction-retention` and API server `--etcd-compaction-interval`
Low default quota	Default etcd quota is 2 GB; many bootstrap tools do not override it	`etcd_mvcc_db_total_size_in_bytes` vs `--quota-backend-bytes`

Quick checks

Production etcd clusters use TLS. The etcdctl and curl commands below omit --cacert, --cert, and --key, and the curl examples use plain http://, for brevity. Supply the TLS flags and https:// endpoints, or set the corresponding environment variables (ETCDCTL_CACERT, ETCDCTL_CERT, ETCDCTL_KEY), before running these against a real cluster.

Check DB size and leader status per member.

ETCDCTL_API=3 etcdctl endpoint status --cluster -w table

Compare total and in-use size from metrics.

curl -s http://localhost:2379/metrics | grep -E 'etcd_mvcc_db_total_size_in_bytes|etcd_mvcc_db_total_size_in_use_in_bytes'

Good: ratio below 2:1. Bad: ratio above 2:1 or total size above 80% of quota.
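
The ratio and quota checks can be scripted rather than read by eye. The following is a minimal sketch, assuming the metrics endpoint returns plain Prometheus text; `frag_ratio` and `quota_pct` are hypothetical helper names, not part of etcd.

```shell
# Sketch: both helpers read Prometheus text from stdin.
# frag_ratio and quota_pct are hypothetical names, not etcd tooling.

# Print total:in-use as a decimal, so 3.00 means 3:1.
frag_ratio() {
  awk '
    /^#/ { next }   # skip HELP/TYPE comment lines
    $1 ~ /^etcd_mvcc_db_total_size_in_bytes/        { total = $2 + 0 }
    $1 ~ /^etcd_mvcc_db_total_size_in_use_in_bytes/ { inuse = $2 + 0 }
    END {
      if (inuse <= 0) { print "in-use size not found" > "/dev/stderr"; exit 1 }
      printf "%.2f\n", total / inuse
    }
  '
}

# Print total size as a percentage of the quota passed in $1 (bytes).
quota_pct() {
  awk -v quota="$1" '
    /^#/ { next }
    $1 ~ /^etcd_mvcc_db_total_size_in_bytes/ { total = $2 + 0 }
    END {
      if (quota + 0 <= 0) { print "quota must be positive" > "/dev/stderr"; exit 1 }
      printf "%.1f\n", 100 * total / quota
    }
  '
}
```

For example, `curl -s http://localhost:2379/metrics | frag_ratio` prints the ratio, and piping the same output into `quota_pct` with your --quota-backend-bytes value prints the percentage of quota in use.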

Check for active alarms.
```
ETCDCTL_API=3 etcdctl alarm list
```

Verify auto-compaction is configured.

# Check etcd auto-compaction settings (output may be truncated in ps; fall back to the static Pod manifest if needed)
ps aux | grep etcd | grep -o '\-\-auto-compaction-[^ ]*'

Verify API server compaction interval.

# Check API server-side compaction interval
ps aux | grep kube-apiserver | grep -o '\-\-etcd-compaction-interval=[^ ]*'

Check WAL fsync latency before blocking a member.

# WAL fsync is a histogram; query your metrics backend for p99, or inspect upper buckets
curl -s http://localhost:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds

p99 above 100 ms means defragmentation will take longer and election risk is higher.
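
If you have no metrics backend to compute the quantile, the raw buckets still give an upper bound. This sketch prints the smallest bucket boundary that covers at least 99% of observations. It assumes the standard Prometheus histogram text layout, and `wal_fsync_p99_bound` is a hypothetical helper name. Buckets are cumulative since the process started, so the result reflects lifetime behavior rather than the last few minutes.

```shell
# Sketch: upper bound for p99 WAL fsync latency from histogram buckets.
# Reads Prometheus text on stdin; wal_fsync_p99_bound is a hypothetical name.
wal_fsync_p99_bound() {
  awk '
    /^etcd_disk_wal_fsync_duration_seconds_bucket/ {
      # Extract the bucket boundary from the le label.
      le = $1
      sub(/^.*le="/, "", le)
      sub(/".*$/, "", le)
      n++
      bound[n] = le
      count[n] = $2 + 0
      # The +Inf bucket holds the total number of observations.
      if (le == "+Inf") total = $2 + 0
    }
    END {
      if (total <= 0) { print "no samples found" > "/dev/stderr"; exit 1 }
      # Buckets are cumulative and listed in ascending order.
      for (i = 1; i <= n; i++) {
        if (count[i] >= 0.99 * total) { print bound[i]; exit 0 }
      }
    }
  '
}
```

The printed value is in seconds, so a result above 0.1 corresponds to the 100 ms threshold.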

Identify high-churn resource types.

# Find which resource types are consuming the most storage
kubectl get --raw /metrics | grep '^apiserver_storage_objects'
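
The raw output lists every resource type in no particular order. A small filter, sketched here with the hypothetical name `top_storage_objects`, sorts it by object count. It assumes each sample line carries a resource label, as in `apiserver_storage_objects{resource="events"} 52000`.

```shell
# Sketch: rank resource types by stored object count.
# Reads API server metrics on stdin; $1 is the number of rows (default 5).
# top_storage_objects is a hypothetical helper name.
top_storage_objects() {
  awk '
    /^apiserver_storage_objects[{]/ {
      # Extract the value of the resource label.
      res = $1
      sub(/^.*resource="/, "", res)
      sub(/".*$/, "", res)
      printf "%d %s\n", $2, res
    }
  ' | sort -rn | head -n "${1:-5}"
}
```

Run it as `kubectl get --raw /metrics | top_storage_objects 10`.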

Check etcd metrics for signs of stress.

# Look for failed proposals or slow applies
curl -s http://localhost:2379/metrics | grep -E 'etcd_server_proposals_failed_total|etcd_server_slow_apply_total'

How to diagnose it

Confirm physical bloat. Query etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes. If the ratio is greater than 2:1, the database is fragmented.
Verify compaction is actually running. Check etcd flags for --auto-compaction-mode and --auto-compaction-retention, and the API server for --etcd-compaction-interval (default 5m; a value of 0 disables API-server-driven compaction). If neither side is compacting, old revisions are never released and the file will grow indefinitely.
Assess disk health before maintenance. Query p99 of etcd_disk_wal_fsync_duration_seconds from your metrics backend. Values above 100 ms indicate the disk is already slow; defragmentation will extend the member’s blocking window and increase the chance of a leader election timeout.
Validate quorum health. Do not defragment if any member is unhealthy or if the cluster is not at full strength. A three-member cluster that is already down to two healthy members loses quorum if defragmentation blocks one more.
Locate the leader. Use etcdctl endpoint status --cluster -w table. Plan to defragment followers first, then move leadership to a follower, and defragment the former leader last.
Estimate the maintenance window. Duration depends on database size and disk throughput. If the defragmented member is the leader and the blocking window exceeds the Raft election timeout, followers start an election.

flowchart TD
    A[Fragmentation ratio > 2:1 or quota > 80%] --> B{Cluster is 3-member HA?}
    B -- No --> C[Defrag single member<br/>during maintenance window]
    B -- Yes --> D[Identify leader<br/>etcdctl endpoint status]
    D --> E[Defrag follower 1]
    E --> F[Wait for recovery]
    F --> G[Defrag follower 2]
    G --> H[Move leadership<br/>to a follower]
    H --> I[Defrag former leader]
    I --> J[Verify endpoint health<br/>and DB size]
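
The follower-first ordering can be derived mechanically. This sketch takes one line per member in the form `<endpoint> <true|false>`, where the second field says whether that member is the leader, and prints the endpoints with the leader last. The input format and the name `defrag_order` are assumptions for illustration; how you produce the input depends on your etcdctl version.

```shell
# Sketch: order members so that followers come first and the leader last.
# Input on stdin: "<endpoint> <true|false>" per line (leader flag second).
# defrag_order is a hypothetical helper name.
defrag_order() {
  awk '
    $2 == "true" { leader = $1; next }   # hold the leader back
    { print $1 }                         # followers in input order
    END { if (leader != "") print leader }
  '
}

# One possible way to build the input (assumes jq is installed and that
# the JSON exposes .Status.leader and .Status.header.member_id):
#   etcdctl endpoint status --cluster -w json |
#     jq -r '.[] | "\(.Endpoint) \(.Status.leader == .Status.header.member_id)"' |
#     defrag_order
```

Leadership still has to be moved before the last endpoint in the list is defragmented; the ordering only decides who goes when.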

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`etcd_mvcc_db_total_size_in_bytes` vs `etcd_mvcc_db_total_size_in_use_in_bytes`	Measures physical bloat vs actual data	Ratio > 2:1 or total size > 80% of quota
`etcd_disk_wal_fsync_duration_seconds`	Defrag is I/O intensive; slow disks extend the blocking window	p99 > 100 ms before maintenance
`etcd_server_leader_changes_seen_total`	Defragmenting the leader without stepping it down can trigger an election	Any increase during a maintenance window
`etcd_server_has_leader`	Confirms the member recovered after being blocked	= 0 after maintenance
`apiserver_request_duration_seconds` (mutating verbs)	A member blocking on defrag or a leader election raises API write latency	p99 > 1 s sustained during defrag
`etcd_mvcc_db_total_size_in_use_in_bytes` growth rate	Even after defrag, unchecked growth brings you back to quota pressure	> 100 MB/day without corresponding workload growth
`apiserver_storage_objects` (events)	Events have a 1-hour TTL but can outpace cleanup and fill etcd	Events count > 50,000 or growing unboundedly
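
The growth-rate threshold in the table is simple arithmetic on two samples of the in-use size. A minimal sketch, using decimal megabytes and the hypothetical name `growth_mb_per_day`:

```shell
# Sketch: growth rate between two samples of the in-use DB size.
# $1,$2: earlier sample (bytes, unix seconds); $3,$4: later sample.
# growth_mb_per_day is a hypothetical helper name; 1 MB = 1,000,000 bytes.
growth_mb_per_day() {
  awk -v b1="$1" -v t1="$2" -v b2="$3" -v t2="$4" 'BEGIN {
    if (t2 + 0 <= t1 + 0) { print "timestamps must increase" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", (b2 - b1) / (t2 - t1) * 86400 / 1000000
  }'
}
```

Two samples taken two days apart that differ by 200,000,000 bytes work out to 100.0 MB/day, right at the warning threshold.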

Fixes

If the cause is fragmentation buildup

Compact current revisions if auto-compaction is not enabled, then defragment one member at a time. On etcd 3.4 and 3.5, run:

# Defragment a single member (blocking operation)
ETCDCTL_API=3 etcdctl defrag --endpoints=https://<member-ip>:2379

Sequence: defragment follower 1, wait for it to recover, defragment follower 2, move leadership to a follower, then defragment the former leader. Do not defragment the leader first.
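
"Wait for it to recover" is easier to enforce with a bounded retry loop than with a fixed sleep. This sketch retries an arbitrary health-check command. The helper name `wait_until_healthy`, the attempt count, and the delay are illustrative assumptions.

```shell
# Sketch: retry a health check until it succeeds or attempts run out.
# $1: max attempts, $2: seconds between attempts, rest: command to run.
# wait_until_healthy is a hypothetical helper name.
wait_until_healthy() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Success of the check command ends the wait.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage (not run here; the endpoint is a placeholder):
#   etcdctl defrag --endpoints=https://<member-ip>:2379 &&
#     wait_until_healthy 30 2 etcdctl endpoint health --endpoints=https://<member-ip>:2379
```

Stop the sequence if the function returns non-zero instead of moving on to the next member.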

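In etcd 3.6, the offline form etcdctl defrag --data-dir is removed; use etcdutl defrag --data-dir for offline defragmentation. Online etcdctl defrag against a running member is still available, and some distributions ship their own live-defrag tooling.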

If the target is the current leader

Move leadership to a follower before defragmenting the former leader. With etcdctl, the transfer command is move-leader, which must be sent to the current leader’s endpoint; some distributions wrap this in their own tooling. Never defragment a sitting leader in a busy cluster: if the operation exceeds the election timeout, the cluster elects a new leader mid-operation and API server mutating latency spikes.
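
As one possible way to pick a transfer target with plain etcdctl, the sketch below reads the default comma-separated output of endpoint status and prints the ID of the first non-leader. It assumes the etcd 3.4/3.5 column order, where the member ID is field 2 and IS LEADER is field 5. Other versions may order columns differently, so check the header of the table output first. `pick_transferee` is a hypothetical helper name.

```shell
# Sketch: choose a transfer target from the default ("simple") output of
#   etcdctl endpoint status --cluster
# ASSUMPTION: field 2 is the member ID and field 5 is IS LEADER, which
# matches the etcd 3.4/3.5 column order; verify on your version first.
# pick_transferee is a hypothetical helper name.
pick_transferee() {
  awk -F', ' '$5 == "false" { print $2; exit }'
}

# Usage (not run here; endpoints are placeholders):
#   id=$(etcdctl endpoint status --cluster | pick_transferee)
#   etcdctl move-leader "$id" --endpoints=https://<leader-ip>:2379
```

After the transfer, re-run endpoint status to confirm the leader changed before defragmenting the former leader.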

If the cause is a NOSPACE alarm

When etcd raises a NOSPACE alarm it rejects all writes. Recovery is:

# Compact to the current revision (export so the inner etcdctl call also uses the v3 API)
export ETCDCTL_API=3
etcdctl compact "$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')"

# Defragment each member sequentially
ETCDCTL_API=3 etcdctl defrag --endpoints=https://<member-ip>:2379

# Disarm the alarm after space is freed
ETCDCTL_API=3 etcdctl alarm disarm

If the database is still near the limit after cleanup, increase --quota-backend-bytes, but only after you have addressed the root cause of growth.
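
The compaction step above depends on jq. On a control-plane node without it, the revision can be pulled out with grep instead. This is a sketch that assumes the JSON contains a `"revision":<number>` pair with no whitespace around the colon, which is how etcdctl prints it; `current_revision` is a hypothetical helper name.

```shell
# Sketch: extract the current revision without jq.
# Reads `etcdctl endpoint status --write-out=json` on stdin.
# current_revision is a hypothetical helper name.
current_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
}

# Usage (not run here):
#   rev=$(etcdctl endpoint status --write-out=json | current_revision)
#   etcdctl compact "$rev"
```

Check that the extracted value is non-empty before passing it to compact.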

If the cause is slow disk extending defrag time

Schedule defragmentation during a maintenance window. If etcd_disk_wal_fsync_duration_seconds p99 is above 100 ms, treat the disk as a risk factor. Move etcd data to a dedicated local SSD or NVMe volume. Network-attached storage is not suitable for etcd defragmentation at scale.

Prevention

Align compaction settings. Ensure etcd runs with --auto-compaction-mode=periodic and --auto-compaction-retention set appropriately, and verify that the API server’s --etcd-compaction-interval (default 5m) has not been set to 0 without etcd-side auto-compaction to replace it.
Monitor the fragmentation ratio. Alert when total DB size exceeds 80% of --quota-backend-bytes, or when the ratio of total to in-use size exceeds 2:1.
Control object churn. Reduce the API server’s --event-ttl (default 1h) if your cluster generates high event volume, and enforce cleanup of completed Jobs and orphaned custom resources.
Do not double-schedule defragmentation. Some distributions handle defragmentation via an operator. Verify whether your distribution automates this before adding external jobs.
Size quota for production. The default etcd quota of 2 GB is often too small for production workloads. Increase --quota-backend-bytes before you need it.
Run defragmentation during low-traffic windows. This minimizes the impact of the brief member unavailability and reduces the chance that API server retries amplify load during the maintenance.

How Netdata helps

Surfaces etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes side-by-side so you can see the fragmentation ratio without manual curls.
Correlates etcd_disk_wal_fsync_duration_seconds with apiserver_request_duration_seconds on the same timeline, making it obvious when a defrag or compaction spike propagated to the API server.
Tracks etcd_server_has_leader and etcd_server_leader_changes_seen_total so you can confirm that maintenance did not destabilize the Raft consensus.
Alerts on quota utilization and on sustained etcd latency, giving you a buffer to act before the NOSPACE alarm fires.

