Kubernetes etcd snapshot failures: backup, restore, and verification

An etcd snapshot failure usually surfaces during an incident, not during backup. A snapshot from an unhealthy member, a corrupted transfer to object storage, or a restore that writes no data renders disaster recovery useless. This guide gives the checks, commands, and decision logic to verify snapshot integrity, fix backup failures, and perform clean restores.

What this means

An etcd snapshot captures the entire key-value store at a point in time. Because committed Raft log entries exist on a majority of members, a snapshot from any healthy member contains the full cluster state. The snapshot file includes a SHA-256 hash computed at save time. If that hash does not match after transfer, or if the snapshot is taken while etcd is under NOSPACE alarm or leader instability, the file may be inconsistent.

Restoring a snapshot rebuilds a new data directory offline. Every restored member must start with the same --initial-cluster-token and --initial-cluster topology. For Kubernetes, a restored etcd can serve stale revision data to API server watchers unless the revision counter is bumped. etcd 3.6 removed etcdctl snapshot restore; use etcdutl instead.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Snapshot corruption in transit | etcdutl snapshot status reports integrity or CRC errors | SHA-256 hash or re-run status after download |
| NOSPACE alarm blocking writes | etcdctl snapshot save fails or reflects inconsistent state | etcdctl alarm list for NOSPACE |
| etcd 3.6 tooling mismatch | etcdctl snapshot restore returns “command not found” or deprecation fatal | etcdctl version and whether etcdutl is present |
| Missing etcdutl in container image | Restore scripts fail inside the etcd static pod | Binary presence in the container image or host path |
| Single-member HA restore with mismatched flags | Cluster fails to start or enters split-brain after restore | --initial-cluster-token and --initial-cluster consistency across nodes |
| Raw db file copied without hash | etcdutl snapshot restore fails with hash mismatch | File provenance: was it from snapshot save or member/snap/ |

Quick checks

# Check local etcd member health
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health
# Check DB size and alarms across the cluster
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --cluster --write-out=table
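# List active alarms (needs the same endpoint and TLS flags as above)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  alarm list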
# Verify snapshot integrity after creation or download
etcdutl snapshot status /path/to/snapshot.db
# Expected: HASH, REVISION, TOTAL KEYS, TOTAL SIZE, STORAGE VERSION
# Check which etcd version and tools are available
etcdctl version
etcdutl version
# For kubeadm static pods, check if etcdutl is in the image
crictl exec $(crictl ps --name etcd -q) etcdutl version
# Compare local hash to source after off-host transfer
sha256sum /path/to/snapshot.db
# Defragment to free space (blocks writes briefly per member)
# WARNING: run during a maintenance window
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  defrag --cluster

How to diagnose it

  1. Confirm the source member is healthy. A snapshot from a lagging or partitioned member may contain inconsistent data. Run etcdctl endpoint health against the member, then etcdctl endpoint status --cluster --write-out=table. The leader should be stable: every member should report the same RAFT TERM, and each follower's RAFT INDEX should trail the leader's only slightly. If the member is unhealthy, switch to another member.
  2. Check for active alarms. etcdctl alarm list must return nothing. If NOSPACE is active, etcd rejects writes and accepts only reads and deletes. No write-dependent operation, including a consistent snapshot, will succeed. Compact old revisions and defragment the cluster before retrying.
  3. Validate tooling against etcd version. If you are on etcd 3.5.x, etcdctl snapshot restore works but emits deprecation warnings. If you are on etcd 3.6.0+, the command is removed and etcdutl is required. Verify binary availability before an incident.
  4. Verify integrity after every transfer. etcdutl snapshot status performs a cryptographic integrity check. Run it immediately after creation and again after downloading from S3 or another store. If the hash changed, the file is corrupt. Do not proceed with a restore.
  5. Inspect restore behavior for silent data loss. If the restore reports success but the restored --data-dir lacks db files, the snapshot was not written. Prefer etcdutl snapshot restore directly to avoid delegation bugs in older etcdctl versions.
  6. Validate HA restore topology. For multi-member restores, every node must use the same --initial-cluster-token and the same --initial-cluster list. Each node uses its own --name and --initial-advertise-peer-urls, but the token and member list must be identical. Mismatches cause quorum loss or split-brain.
flowchart TD
    A[Snapshot fails or restore fails] --> B{etcdctl alarm list shows NOSPACE?}
    B -->|Yes| C[Compact and defrag, then retry]
    B -->|No| D{etcdutl snapshot status fails integrity?}
    D -->|Yes| E[Re-take snapshot and verify transfer hash]
    D -->|No| F{Restore fails or data missing?}
    F -->|Yes| G{etcd version >= 3.6?}
    G -->|Yes| H[Use etcdutl snapshot restore with --bump-revision and --mark-compacted]
    G -->|No| I[Use etcdutl snapshot restore directly to avoid the etcdctl delegation bug]
    F -->|No| J[Verify --initial-cluster and --initial-cluster-token match across all HA members]
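
The triage order in the flowchart can be sketched as a small shell function. This is only an illustration under stated assumptions: the function names `version_at_least_3_6` and `next_action` are hypothetical, the caller supplies every answer as `yes` or `no`, and the version string is assumed to have no leading `v`. Nothing here talks to etcd.

```shell
# version_at_least_3_6 VERSION -> exit 0 if VERSION (for example "3.6.1") is >= 3.6
version_at_least_3_6() {
  local major minor
  major=${1%%.*}          # text before the first dot
  minor=${1#*.}           # text after the first dot
  minor=${minor%%.*}      # drop the patch component, if any
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 6 ]; }
}

# next_action NOSPACE INTEGRITY_OK RESTORE_BROKEN ETCD_VERSION
#   NOSPACE, INTEGRITY_OK and RESTORE_BROKEN are "yes" or "no"
next_action() {
  if [ "$1" = yes ]; then
    echo "compact and defrag, then retry"
  elif [ "$2" = no ]; then
    echo "re-take snapshot and verify transfer hash"
  elif [ "$3" = yes ]; then
    if version_at_least_3_6 "$4"; then
      echo "etcdutl snapshot restore with --bump-revision and --mark-compacted"
    else
      echo "etcdutl snapshot restore directly"
    fi
  else
    echo "check --initial-cluster and --initial-cluster-token on all members"
  fi
}
```

For example, `next_action no yes yes 3.6.1` prints the etcdutl restore recommendation with the revision bump flags. The checks run in the same order as the diagram, so an active NOSPACE alarm always wins over later branches.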

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| etcd_mvcc_db_total_size_in_bytes | Approaching quota causes NOSPACE and blocks snapshots | DB size > 75% of --quota-backend-bytes |
| etcd_disk_wal_fsync_duration_seconds | High fsync latency indicates disk pressure that can corrupt or slow snapshots | p99 > 100ms sustained |
| etcd_server_leader_changes_seen_total | Frequent leader changes mean the cluster is unstable; snapshots taken during churn may be inconsistent | > 0 per hour outside maintenance |
| etcd_server_has_leader | A member without a leader cannot guarantee a consistent snapshot | Gauge == 0 |
| API server 5xx rate | etcd distress propagates as API server write failures | Sustained apiserver_request_total{code=~"5.."} > 0.1% |
| Snapshot file hash mismatch | Confirms corruption during transfer or storage | etcdutl snapshot status hash differs from source |
| etcdutl binary presence | etcd 3.6 requires etcdutl for restore and status | Binary missing from host or container image |
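
The 75% DB size warning sign reduces to simple integer arithmetic. The sketch below assumes you already have the current DB size and the configured quota in bytes; the function names `db_usage_percent` and `db_over_threshold` are hypothetical, and the quota must be an explicit nonzero value.

```shell
# db_usage_percent SIZE_BYTES QUOTA_BYTES -> integer percent of the quota in use
db_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# db_over_threshold SIZE_BYTES QUOTA_BYTES [THRESHOLD_PERCENT]
# Exit 0 when usage is strictly above the threshold (default 75).
db_over_threshold() {
  [ "$(db_usage_percent "$1" "$2")" -gt "${3:-75}" ]
}
```

With an illustrative 8 GiB quota (8589934592 bytes), `db_over_threshold 7516192768 8589934592` succeeds because 7 GiB is 87% of the quota. The size can come from the etcd_mvcc_db_total_size_in_bytes metric or, on the versions I have seen, from the dbSize field of the endpoint status JSON output.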

Fixes

If the snapshot is corrupted or fails integrity checks

Re-take the snapshot from a healthy member. Do not attempt to restore a file that fails etcdutl snapshot status. If corruption occurs during transfer, verify network paths and compare SHA-256 sums at source and destination.
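
One way to compare SHA-256 sums at source and destination is to ship a checksum file alongside the snapshot. This is a minimal sketch, assuming sha256sum is available on both hosts; `write_checksum` and `verify_checksum` are hypothetical helper names.

```shell
# write_checksum FILE -> writes FILE.sha256 next to the snapshot (run at the source)
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# verify_checksum FILE -> exit 0 only if FILE still matches FILE.sha256
# (run at the destination after transferring both files)
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}
```

Both helpers work relative to the snapshot's directory, so the checksum file stays valid when the pair is moved to a different path on another host.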

If NOSPACE blocks the backup

Run compaction and defragmentation during a low-traffic window.

REV=$(ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --write-out=json | jq -r '.[0].Status.header.revision')

# WARNING: compaction permanently discards all key history older than revision $REV
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  compact "$REV"

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  defrag --cluster

After resolving the space issue, disarm the alarm explicitly:

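ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  alarm disarm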

Then retry the snapshot.

If etcd 3.6 tooling is missing

Migrate all restore and status scripts to etcdutl. If you run etcd as a kubeadm static pod and the container image does not include etcdutl, download the matching etcd release tarball on the host and run etcdutl from the host path.
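
A restore script can locate etcdutl up front instead of failing halfway through. The sketch below checks PATH first and then any directories the caller passes in; `find_etcdutl` is a hypothetical helper, and the directories in the usage example are assumptions about where you unpacked the release tarball.

```shell
# find_etcdutl [DIR...] -> prints the path of the first usable etcdutl binary
find_etcdutl() {
  local dir
  # Prefer whatever is already on PATH
  if command -v etcdutl >/dev/null 2>&1; then
    command -v etcdutl
    return 0
  fi
  # Fall back to caller-supplied directories, in order
  for dir in "$@"; do
    if [ -x "$dir/etcdutl" ]; then
      echo "$dir/etcdutl"
      return 0
    fi
  done
  echo "etcdutl not found; unpack the matching etcd release on the host" >&2
  return 1
}
```

For example, `ETCDUTL=$(find_etcdutl /usr/local/bin /opt/etcd) || exit 1` aborts the script early when the binary is missing.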

If restore fails silently or writes no data

Use etcdutl snapshot restore directly instead of delegating through etcdctl.
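
After any restore, it is worth checking that the backend database actually landed in the new data directory before starting etcd. A minimal sketch, assuming the usual layout where the database lives at member/snap/db under the data directory; `restore_wrote_data` is a hypothetical helper name.

```shell
# restore_wrote_data DATA_DIR -> exit 0 if DATA_DIR/member/snap/db exists and is non-empty
restore_wrote_data() {
  local db="$1/member/snap/db"
  if [ -s "$db" ]; then
    echo "ok: $db is present"
  else
    echo "restore wrote no backend database under $1" >&2
    return 1
  fi
}
```

Calling `restore_wrote_data /var/lib/etcd` right after the restore command turns a silent failure into a non-zero exit code that automation can act on.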

If HA restore causes split-brain or quorum loss

Restore each member independently with its own --name and --initial-advertise-peer-urls, but pass the exact same --initial-cluster and --initial-cluster-token on every node. For etcd 3.6, --initial-cluster-token is mandatory. Do not restore to one member and attempt to rejoin the others with the old topology.
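
Generating the per-node commands from one shared definition makes it harder for the token or member list to drift between nodes. The sketch below only prints the commands; the member names, peer URLs, and token are hypothetical placeholders, as are the helper names `initial_cluster` and `restore_command`.

```shell
# Hypothetical three-member topology; replace names and peer URLs with your own.
MEMBERS="cp1=https://10.0.0.11:2380 cp2=https://10.0.0.12:2380 cp3=https://10.0.0.13:2380"
TOKEN="etcd-restore-1"   # any string, as long as every node uses the same one

# initial_cluster -> the comma-separated value for --initial-cluster
initial_cluster() {
  echo "$MEMBERS" | tr ' ' ','
}

# restore_command NAME SNAPSHOT DATA_DIR -> prints the etcdutl command for one member
restore_command() {
  local entry peer_url=""
  # Look up this member's own peer URL in the shared list
  for entry in $MEMBERS; do
    [ "${entry%%=*}" = "$1" ] && peer_url=${entry#*=}
  done
  [ -n "$peer_url" ] || { echo "unknown member: $1" >&2; return 1; }
  echo "etcdutl snapshot restore $2" \
    "--name $1" \
    "--initial-cluster $(initial_cluster)" \
    "--initial-cluster-token $TOKEN" \
    "--initial-advertise-peer-urls $peer_url" \
    "--data-dir $3"
}
```

Running `restore_command cp2 snapshot.db /var/lib/etcd` prints the command for the second node. Printing rather than executing lets you review all three commands side by side before touching any data directory.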

If restoring to a Kubernetes cluster

Include revision bump flags to invalidate stale watcher caches:

etcdutl snapshot restore snapshot.db \
  --bump-revision 1000000000 \
  --mark-compacted \
  --data-dir /var/lib/etcd

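Without these flags, API server watch caches and client informers can keep serving objects they cached before the restore, because the restored revisions are lower than the ones they have already seen.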

If raw db files lack a hash

Snapshots taken with etcdctl snapshot save include a hash. Raw database files copied from member/snap/ do not. If you must restore from a raw file, pass --skip-hash-check to etcdutl snapshot restore. This bypasses integrity verification and is a last resort.
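
When a file's provenance is unknown, its size gives a quick hint. The sketch below rests on my understanding of the snapshot format, which you should treat as an assumption: snapshot save appends a 32-byte SHA-256 of the database to the end of the file, so the size of a saved snapshot is 32 bytes past a multiple of 512, while a raw db file is an exact multiple. The helper names are hypothetical.

```shell
# has_trailing_hash FILE -> exit 0 if the size suggests an appended 32-byte SHA-256
has_trailing_hash() {
  local size
  size=$(( $(wc -c < "$1") ))
  [ $(( size % 512 )) -eq 32 ]
}

# trailing_hash_matches FILE -> exit 0 if the last 32 bytes equal the SHA-256 of the rest
trailing_hash_matches() {
  local size body stored computed
  size=$(( $(wc -c < "$1") ))
  [ "$size" -gt 32 ] || return 1
  body=$(( size - 32 ))
  # Last 32 bytes rendered as lowercase hex
  stored=$(tail -c 32 "$1" | od -An -v -tx1 | tr -d ' \n')
  # SHA-256 of everything before the trailing 32 bytes
  computed=$(head -c "$body" "$1" | sha256sum | cut -d' ' -f1)
  [ "$stored" = "$computed" ]
}
```

If `has_trailing_hash` fails, the file was probably copied from member/snap/ and needs --skip-hash-check. These helpers are a pre-check only; etcdutl remains the authority on whether a snapshot is usable.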

Prevention

  • Automate snapshots and verify them. Schedule etcdctl snapshot save via a systemd timer or Kubernetes CronJob, then immediately run etcdutl snapshot status. Push the snapshot to object storage and re-verify after download.
  • Monitor DB size trends. Track etcd_mvcc_db_total_size_in_bytes against your quota. Enable automatic compaction and schedule periodic defragmentation so the DB does not approach the alarm threshold.
  • Keep tooling current. etcd 3.6 removed restore and status from etcdctl. Update runbooks and automation to use etcdutl before you need it.
  • Test restores quarterly. A verified snapshot is only useful if the restore procedure works. Perform a full restore to a temporary environment or isolated nodes to confirm flag correctness and revision bump behavior.
  • Protect static pod manifests. On kubeadm clusters, the kubelet restarts etcd for as long as its manifest is present, so move the etcd manifest out of /etc/kubernetes/manifests/ to stop etcd before a restore. Document this step in your runbook so the kubelet does not restart etcd prematurely.
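
The first bullet, save then verify, could be wrapped in a function that a systemd timer or CronJob calls. This is a sketch under assumptions: `backup_and_verify` is a hypothetical name, ETCDCTL_FLAGS is a hypothetical variable holding your endpoint and TLS flags, and the file naming scheme is illustrative.

```shell
# backup_and_verify DEST_DIR -> saves a snapshot, checks it, records a checksum,
# and prints the snapshot path on stdout (tool output goes to stderr).
backup_and_verify() {
  local file="$1/etcd-$(date +%Y%m%dT%H%M%S).db"
  # ETCDCTL_FLAGS is intentionally unquoted so it splits into separate flags
  ETCDCTL_API=3 etcdctl $ETCDCTL_FLAGS snapshot save "$file" >&2 || return 1
  # Refuse to publish a snapshot that etcdutl cannot read
  if ! etcdutl snapshot status "$file" >&2; then
    echo "snapshot failed verification: $file" >&2
    return 1
  fi
  # Sidecar checksum for re-verification after upload and download
  sha256sum "$file" > "$file.sha256" || return 1
  echo "$file"
}
```

A caller might run `SNAP=$(backup_and_verify /var/backups/etcd) || exit 1` and then upload both `$SNAP` and `$SNAP.sha256` to object storage.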

How Netdata helps

Use Netdata to correlate the following before scheduling snapshots:

  • etcd_mvcc_db_total_size_in_bytes against compaction schedules to predict the next NOSPACE alarm.
  • etcd_disk_wal_fsync_duration_seconds spikes to avoid I/O-bound backup windows.
  • etcd_server_leader_changes_seen_total to skip periods of Raft instability.
  • API server 5xx rate and latency as downstream indicators of etcd distress that can degrade snapshot consistency.