ClickHouse Replica is lost: SYSTEM RESTORE REPLICA and recovering a diverged replica
You run SELECT count() on two replicas and get different results. Your application returns inconsistent aggregations depending on which node answers. The replication dashboard looks green: ZooKeeper sessions are active, queue_size is zero, and no replica is readonly. The replica has permanently lost parts, but ZooKeeper does not know because the loss happened outside the replication log. Silent divergence hides behind healthy metrics until queries start returning wrong results.
What this means
ClickHouse replicated tables coordinate through a shared log in ZooKeeper or ClickHouse Keeper. Each replica downloads parts and applies log entries independently. If a part is lost locally – disk corruption, filesystem error, or an accidental DROP PARTITION on one node – the replica does not detect the absence unless a later log entry references that part. With no pending merge or fetch referencing the missing part, the replica reports an empty queue and continues serving queries with incomplete data. Load balancers keep routing traffic to it.
flowchart TD
A[Disk corruption or DROP PARTITION] --> B[Parts removed locally]
B --> C[Replica stays online]
C --> D{ZK session active}
D -->|Yes| E[Replication queue empty]
E --> F[Replica appears healthy]
F --> G[Queries serve incomplete data]
G --> H[Silent divergence undetected]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
Accidental DROP PARTITION on one replica | Row count drops on one node; other replicas unaffected | SELECT count() FROM db.table on each replica |
| Disk corruption or filesystem damage | Parts automatically detached; checksum errors in server logs | SELECT name, reason FROM system.detached_parts WHERE database = 'db' AND table = 'table' and grep logs for Checksum or Broken part |
| Failed fetch never retried | ReplicatedPartFailedFetches incremented; missing part ages out of queue | system.replication_queue for entries with num_tries > 0 and last_exception != '' |
| Local merge or mutation bug (rare) | Part count diverges after heavy mutation on a specific version | Cross-replica comparison of system.parts grouped by partition_id |
Quick checks
Run these on each replica and diff the output.
# Compare row counts across replicas
clickhouse-client -q "SELECT count() FROM db.table"
-- Compare partition-level part counts on each replica
SELECT
partition_id,
sum(rows) AS rows,
count() AS parts
FROM system.parts
WHERE database = 'db' AND table = 'table' AND active = 1
GROUP BY partition_id
ORDER BY partition_id;
-- Check replica health indicators
SELECT
database,
table,
is_readonly,
is_session_expired,
queue_size,
absolute_delay
FROM system.replicas
WHERE database = 'db' AND table = 'table';
-- Look for stuck or failed replication entries
SELECT
database,
table,
type,
num_tries,
last_exception
FROM system.replication_queue
WHERE database = 'db' AND table = 'table' AND num_tries > 0;
-- Check for parts detached due to corruption
SELECT
name,
reason,
modification_time
FROM system.detached_parts
WHERE database = 'db' AND table = 'table';
-- Check for data loss events detected by the server
SELECT event, value
FROM system.events
WHERE event IN ('ReplicatedDataLoss', 'ReplicatedPartChecksFailed');
How to diagnose it
- Confirm divergence. Run
SELECT count() FROM db.tableon every replica. The node with the lower count has lost data. - Drill down to the partition level. Run the partition-level
system.partsquery on each replica and diff the results. Identify exactly whichpartition_idvalues have fewer parts or rows on the diverged node. - Verify replication queue state. Query
system.replicasforqueue_size,is_readonly, andis_session_expired. In silent divergence these look healthy. An empty queue combined with a row count mismatch is the smoking gun. - Inspect
system.detached_partson the diverged replica. Look for reasons likebroken,corruption, orfetch errorthat explain why parts left the active set. - Check
system.eventsforReplicatedDataLoss. Any non-zero value means ClickHouse has already detected that a part was irretrievably lost. - If counts are identical but you suspect corruption, run
CHECK TABLE db.tableon the diverged replica. This is a read-only operation that verifies part checksums without altering data.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Cross-replica row count | Only definitive proof of silent divergence | SELECT count() returns different values for the same table across replicas |
system.replicas.queue_size | Empty queue with count mismatch confirms silent loss | Queue is zero but data counts differ |
system.events.ReplicatedDataLoss | ClickHouse detected irretrievable part loss | Any non-zero value |
system.detached_parts.reason | Explains why parts left the active set | Reasons containing broken, corruption, or fetch |
| Per-partition active parts | Identifies exactly which partitions are missing | Mismatch in count() or sum(rows) per partition_id |
Fixes
Re-check with SYSTEM RESTART REPLICA
The first recovery step is SYSTEM RESTART REPLICA db.table. This forces the replica to restart its replication background threads and re-evaluate which parts it should own against ZooKeeper. If the replication log still references the missing parts and healthy peers have copies, the replica schedules fetches to fill the gaps. This is non-destructive and preserves local data that is still valid.
Run it on the diverged replica, then monitor system.replication_queue for new GET_PART entries:
SELECT
type,
source_replica,
new_part_name,
num_tries,
last_exception
FROM system.replication_queue
WHERE database = 'db' AND table = 'table'
ORDER BY create_time;
If GET_PART entries appear and num_tries stays at zero, the replica is healing. Wait for queue_size to return to zero and re-verify row counts.
Re-initialize with SYSTEM RESTORE REPLICA
If SYSTEM RESTART REPLICA does not resolve the mismatch, or if system.detached_parts shows widespread corruption, use SYSTEM RESTORE REPLICA db.table. This drops the replica’s ZooKeeper registration and recreates it, then schedules fetches for every active part from other replicas.
Warning: This is disruptive. The replica will not contain complete local data until the fetches finish. Reads may return partial results during recovery. The duration depends on data size, part count, and network bandwidth. Remove the replica from your load balancer or Distributed table rotation before running the command. Run it during a maintenance window.
After issuing the command, watch system.replication_queue on the target replica. You should see a large burst of GET_PART entries. Do not return the replica to service until queue_size reaches zero and cross-replica row counts match.
Manual partition recovery (rare)
If only one or two partitions are missing and you know exactly which ones, you can drop the broken partitions and reattach data from backup. This is only practical when the rest of the replica is healthy and the missing data is small. In most production cases, the two SYSTEM commands are safer and faster.
Prevention
- Periodic cross-replica audits. Schedule automated row-count and checksum comparisons across replicas for critical tables. Silent divergence is only detectable by comparing actual data.
- Monitor
ReplicatedDataLossandReplicatedPartChecksFailed. Any increment in these event counters warrants immediate investigation. - Watch
system.detached_parts. Growing detached parts with reasons likebrokenorcorruptionpredict divergence before it becomes silent. - Use
ON CLUSTERfor DDL. Never runDROP PARTITION,ALTER, orTRUNCATEagainst a single replica. Always use theON CLUSTERclause so the operation is logged in ZooKeeper and applied consistently. - Run
CHECK TABLEduring low-traffic windows. This verifies part checksums proactively and can surface corruption before parts are needed for queries or merges.
How Netdata helps
- Correlate hardware-level disk health alerts with replication gaps. A spike in disk errors followed by a diverged replica points to the root cause.
- Track replication queue depth per replica in real time. A sudden drop to zero after a period of fetch failures can signal a failed part that aged out of the queue.
- Alert on detached parts growth. Netdata can surface when
system.detached_partsincreases, giving early warning before divergence affects query results. - Correlate query latency anomalies on individual replicas with potential stale-data serving. A replica with missing parts may return faster but incorrect results.
- Monitor inter-server network throughput during recovery. When
SYSTEM RESTORE REPLICAruns, you should see sustained fetch traffic; if bandwidth is low, recovery will stall.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool







