Kubernetes etcd snapshot failures: backup, restore, and verification

An etcd snapshot failure usually surfaces during an incident, not during backup. A snapshot from an unhealthy member, a corrupted transfer to object storage, or a restore that writes no data renders disaster recovery useless. This guide gives the checks, commands, and decision logic to verify snapshot integrity, fix backup failures, and perform clean restores.

What this means

An etcd snapshot captures the entire key-value store at a point in time. Because committed Raft log entries exist on a majority of members, a snapshot from any healthy member contains the full cluster state. The snapshot file includes a SHA-256 hash computed at save time. If that hash does not match after transfer, the file is corrupt. A snapshot taken while etcd is under a NOSPACE alarm or during leader instability may be inconsistent.

Restoring a snapshot rebuilds a new data directory offline. Every restored member must start with the same --initial-cluster-token and --initial-cluster topology. For Kubernetes, a restored etcd can serve stale revision data to API server watchers unless the revision counter is bumped. etcd 3.6 removed etcdctl snapshot restore; use etcdutl instead.

Common causes

Cause	What it looks like	First thing to check
Snapshot corruption in transit	`etcdutl snapshot status` reports integrity or CRC errors	SHA-256 hash or re-run status after download
NOSPACE alarm blocking writes	`etcdctl snapshot save` fails or reflects inconsistent state	`etcdctl alarm list` for `NOSPACE`
etcd 3.6 tooling mismatch	`etcdctl snapshot restore` fails with an unknown-command or deprecation error	`etcdctl version` and whether `etcdutl` is present
Missing `etcdutl` in container image	Restore scripts fail inside the etcd static pod	Binary presence in the container image or host path
Single-member HA restore with mismatched flags	Cluster fails to start or enters split-brain after restore	`--initial-cluster-token` and `--initial-cluster` consistency across nodes
Raw db file copied without hash	`etcdutl snapshot restore` fails with hash mismatch	File provenance: was it from `snapshot save` or `member/snap/`

Quick checks

# Check local etcd member health
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

# Check DB size and alarms across the cluster
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --cluster --write-out=table
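ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  alarm list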

# Verify snapshot integrity after creation or download
etcdutl snapshot status /path/to/snapshot.db
# Expected: HASH, REVISION, TOTAL KEYS, TOTAL SIZE, STORAGE VERSION

# Check which etcd version and tools are available
etcdctl version
etcdutl version
# For kubeadm static pods, check if etcdutl is in the image
crictl exec $(crictl ps --name etcd -q) etcdutl version

# Compare local hash to source after off-host transfer
sha256sum /path/to/snapshot.db

# Defragment to free space (blocks writes briefly per member)
# WARNING: run during a maintenance window
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  defrag --cluster

How to diagnose it

Confirm the source member is healthy. A snapshot from a lagging or partitioned member may contain inconsistent data. Run etcdctl endpoint health against the member, then etcdctl endpoint status --cluster --write-out=table. The leader should be stable: every member should report the same RAFT TERM, and followers should trail the leader only slightly in RAFT INDEX. If the member is unhealthy, take the snapshot from another member.
Check for active alarms. etcdctl alarm list must return nothing. If NOSPACE is active, etcd is read-only. No write-dependent operation, including a consistent snapshot, will succeed. Compact old revisions and defragment the cluster before retrying.
Validate tooling against etcd version. If you are on etcd 3.5.x, etcdctl snapshot restore works but emits deprecation warnings. If you are on etcd 3.6.0+, the command is removed and etcdutl is required. Verify binary availability before an incident.
Verify integrity after every transfer. etcdutl snapshot status performs a cryptographic integrity check. Run it immediately after creation and again after downloading from S3 or another store. If the hash changed, the file is corrupt. Do not proceed with a restore.
Inspect restore behavior for silent data loss. If the restore reports success but the restored --data-dir lacks db files, the snapshot was not written. Prefer etcdutl snapshot restore directly to avoid delegation bugs in older etcdctl versions.
Validate HA restore topology. For multi-member restores, every node must use the same --initial-cluster-token and the same --initial-cluster list. Each node uses its own --name and --initial-advertise-peer-urls, but the token and member list must be identical. Mismatches cause quorum loss or split-brain.

flowchart TD
    A[Snapshot fails or restore fails] --> B{etcdctl alarm list shows NOSPACE?}
    B -->|Yes| C[Compact and defrag, then retry]
    B -->|No| D{etcdutl snapshot status fails integrity?}
    D -->|Yes| E[Re-take snapshot and verify transfer hash]
    D -->|No| F{Restore fails or data missing?}
    F -->|Yes| G{etcd version >= 3.6?}
    G -->|Yes| H[Use etcdutl snapshot restore with --bump-revision and --mark-compacted]
    G -->|No| I[Use etcdutl snapshot restore directly to avoid the etcdctl delegation bug]
    F -->|No| J[Verify --initial-cluster and --initial-cluster-token match across all HA members]

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`etcd_mvcc_db_total_size_in_bytes`	Approaching quota causes `NOSPACE` and blocks snapshots	DB size > 75% of `--quota-backend-bytes`
`etcd_disk_wal_fsync_duration_seconds`	High fsync latency indicates disk pressure that can corrupt or slow snapshots	p99 > 100ms sustained
`etcd_server_leader_changes_seen_total`	Frequent leader changes mean the cluster is unstable; snapshots taken during churn may be inconsistent	Any increase within an hour outside maintenance
`etcd_server_has_leader`	A member without a leader cannot guarantee a consistent snapshot	Gauge == 0
API server 5xx rate	etcd distress propagates as API server write failures	5xx share of `apiserver_request_total` (`code=~"5.."`) sustained above 0.1% of requests
Snapshot file hash mismatch	Confirms corruption during transfer or storage	`etcdutl snapshot status` hash differs from source
`etcdutl` binary presence	etcd 3.6 requires `etcdutl` for restore and status	Binary missing from host or container image
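
The 75% warning sign for DB size can be sketched as a small threshold check. This is a minimal sketch, not part of etcd: the function names `db_quota_percent` and `db_over_threshold` are hypothetical, and the 1.5 GiB size and 2 GiB quota are example values only.

```shell
# Hypothetical helper: integer percentage of the backend quota in use.
db_quota_percent() {
  local db_size_bytes="$1" quota_bytes="$2"
  echo $(( db_size_bytes * 100 / quota_bytes ))
}

# Return 0 (true) when usage exceeds the threshold percentage (default 75).
db_over_threshold() {
  local percent
  percent="$(db_quota_percent "$1" "$2")"
  [ "$percent" -gt "${3:-75}" ]
}

# Example values (assumed): a 1.5 GiB database against a 2 GiB quota.
db_quota_percent 1610612736 2147483648   # prints 75
```

In practice the size would likely come from the metric in the table above or from the status output of a member, and the quota is whatever `--quota-backend-bytes` is set to on your cluster.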

Fixes

If the snapshot is corrupted or fails integrity checks

Re-take the snapshot from a healthy member. Do not attempt to restore a file that fails etcdutl snapshot status. If corruption occurs during transfer, verify network paths and compare SHA-256 sums at source and destination.
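
The source and destination comparison could look like the following. It is a minimal sketch assuming GNU coreutils `sha256sum` is available; the function names and the `.sha256` sidecar file convention are illustrative assumptions, not something etcd provides.

```shell
# Hypothetical helpers: record a checksum next to the snapshot at the source,
# then verify the transferred copy against that recorded value.

# Write the snapshot's SHA-256 into a sidecar file (assumed naming: <file>.sha256).
record_snapshot_checksum() {
  local snapshot="$1"
  sha256sum "$snapshot" | awk '{print $1}' > "${snapshot}.sha256"
}

# Return 0 if the file's current SHA-256 equals the expected hash, 1 otherwise.
verify_snapshot_checksum() {
  local snapshot="$1" expected="$2" actual
  actual="$(sha256sum "$snapshot" | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}
```

One way to use it: call `record_snapshot_checksum` on the host that took the snapshot, ship the sidecar file with the snapshot, and call `verify_snapshot_checksum` with the sidecar's contents after download. A non-zero return means the copy should not be restored.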

If NOSPACE blocks the backup

Run compaction and defragmentation during a low-traffic window.

REV=$(ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --write-out=json | jq -r '.[0].Status.header.revision')

# WARNING: compaction permanently removes all revision history older than $REV
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  compact "$REV"

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  defrag --cluster

After resolving the space issue, disarm the alarm explicitly:

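ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  alarm disarm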

Then retry the snapshot.

If etcd 3.6 tooling is missing

Migrate all restore and status scripts to etcdutl. If you run etcd as a kubeadm static pod and the container image does not include etcdutl, download the matching etcd release tarball on the host and run etcdutl from the host path.
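
A restore script can decide up front whether `etcdutl` is required. The sketch below is an assumption-laden illustration: `requires_etcdutl` and `etcdutl_available` are hypothetical helper names, and the version string is expected in `MAJOR.MINOR.PATCH` form with an optional leading `v`.

```shell
# Return 0 when the given etcd version (e.g. "3.6.1" or "v3.6.1") is 3.6 or
# newer, where snapshot restore is no longer available through etcdctl.
requires_etcdutl() {
  local version="${1#v}" major minor
  major="${version%%.*}"
  minor="${version#*.}"
  minor="${minor%%.*}"
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 6 ]; }
}

# Return 0 when an etcdutl binary is on PATH.
etcdutl_available() {
  command -v etcdutl >/dev/null 2>&1
}
```

Running both checks in a scheduled job, rather than during an incident, surfaces a missing binary while there is still time to install it.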

If restore fails silently or writes no data

Use etcdutl snapshot restore directly instead of delegating through etcdctl.
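
After the restore returns, it is worth confirming that data actually landed on disk. The check below is a minimal sketch: `restore_wrote_data` is a hypothetical name, and it assumes the restored directory follows the usual `member/snap/db` and `member/wal` layout.

```shell
# Return 0 only if the restored data directory contains a non-empty backend
# database file and a WAL directory; return 1 otherwise.
restore_wrote_data() {
  local data_dir="$1"
  [ -s "$data_dir/member/snap/db" ] && [ -d "$data_dir/member/wal" ]
}
```

A failing check after a "successful" restore is the silent-data-loss case described above; do not start etcd on that directory.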

If HA restore causes split-brain or quorum loss

Restore each member independently with its own --name and --initial-advertise-peer-urls, but pass the exact same --initial-cluster and --initial-cluster-token on every node. For etcd 3.6, --initial-cluster-token is mandatory. Do not restore to one member and attempt to rejoin the others with the old topology.
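
To make the "same token, same member list" rule concrete, the sketch below prints the restore command for each member instead of running it. The member names, peer URLs, token value, and the `print_restore_cmd` helper are all illustrative assumptions; substitute your own topology.

```shell
# Assumed example topology: identical on every node.
INITIAL_CLUSTER="cp1=https://10.0.0.1:2380,cp2=https://10.0.0.2:2380,cp3=https://10.0.0.3:2380"
CLUSTER_TOKEN="etcd-restore-example"   # hypothetical token value

# Print (dry run) the restore command for one member. Only --name and
# --initial-advertise-peer-urls differ between members.
print_restore_cmd() {
  local name="$1" peer_url="$2" snapshot="$3"
  printf 'etcdutl snapshot restore %s --name %s --initial-cluster %s --initial-cluster-token %s --initial-advertise-peer-urls %s --data-dir /var/lib/etcd\n' \
    "$snapshot" "$name" "$INITIAL_CLUSTER" "$CLUSTER_TOKEN" "$peer_url"
}

print_restore_cmd cp1 https://10.0.0.1:2380 /path/to/snapshot.db
print_restore_cmd cp2 https://10.0.0.2:2380 /path/to/snapshot.db
print_restore_cmd cp3 https://10.0.0.3:2380 /path/to/snapshot.db
```

Generating all commands from one pair of variables is a deliberate choice here: it removes the chance of a typo in the token or member list on one node.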

If restoring to a Kubernetes cluster

Include revision bump flags to invalidate stale watcher caches:

etcdutl snapshot restore snapshot.db \
  --bump-revision 1000000000 \
  --mark-compacted \
  --data-dir /var/lib/etcd

Without this, API server informers may serve pre-snapshot state.

If raw db files lack a hash

Snapshots taken with etcdctl snapshot save include a hash. Raw database files copied from member/snap/ do not. If you must restore from a raw file, pass --skip-hash-check to etcdutl snapshot restore. This bypasses integrity verification and is a last resort.

Prevention

Automate snapshots and verify them. Schedule etcdctl snapshot save via a systemd timer or Kubernetes CronJob, then immediately run etcdutl snapshot status. Push the snapshot to object storage and re-verify after download.
Monitor DB size trends. Track etcd_mvcc_db_total_size_in_bytes against your quota. Enable automatic compaction and schedule periodic defragmentation so the DB does not approach the alarm threshold.
Keep tooling current. etcd 3.6 removed restore and status from etcdctl. Update runbooks and automation to use etcdutl before you need it.
Test restores quarterly. A verified snapshot is only useful if the restore procedure works. Perform a full restore to a temporary environment or isolated nodes to confirm flag correctness and revision bump behavior.
Protect static pod manifests. On kubeadm clusters, etcd runs as a static pod, so you must move its manifest out of /etc/kubernetes/manifests/ to stop etcd before a restore. Document this step in your runbook so the kubelet does not restart etcd prematurely.
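
The "save, then verify immediately" step from the first item could be wrapped in one function for a systemd timer or CronJob to call. This is a sketch under assumptions: `backup_and_verify` is a hypothetical name, the timestamped file naming is arbitrary, and the endpoint and certificate paths are the kubeadm defaults used elsewhere in this guide.

```shell
# Take a snapshot into the given directory and verify it right away.
# Prints the snapshot path on success; returns non-zero if either step fails.
backup_and_verify() {
  local dest_dir="$1" snapshot
  snapshot="$dest_dir/etcd-$(date +%Y%m%dT%H%M%S).db"

  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    snapshot save "$snapshot" >&2 || return 1

  # Status output goes to stderr so stdout carries only the snapshot path.
  etcdutl snapshot status "$snapshot" >&2 || return 1

  echo "$snapshot"
}
```

A caller might capture the printed path, upload that file to object storage, and then repeat the status check on a downloaded copy.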

How Netdata helps

Use Netdata to correlate the following before scheduling snapshots:

etcd_mvcc_db_total_size_in_bytes against compaction schedules to predict the next NOSPACE alarm.
etcd_disk_wal_fsync_duration_seconds spikes to avoid I/O-bound backup windows.
etcd_server_leader_changes_seen_total to skip periods of Raft instability.
API server 5xx rate and latency as downstream indicators of etcd distress that can degrade snapshot consistency.

Related guides

For etcd latency cascades affecting the API server, see Kubernetes API server etcd latency: detection and cascading failures.
For API server memory pressure during control plane recovery, see Kubernetes API server memory pressure: OOM cycle and tuning.
For API server unresponsiveness that can complicate etcd recovery, see Kubernetes API server slow or unresponsive: causes and fixes.
For certificate issues that can block etcd client authentication, see Kubernetes API server certificate rotation: detection and grace handling.
