ClickHouse Replica is lost: SYSTEM RESTORE REPLICA and recovering a diverged replica

You run SELECT count() on two replicas and get different results. Your application returns inconsistent aggregations depending on which node answers. The replication dashboard looks green: ZooKeeper sessions are active, queue_size is zero, and no replica is readonly. The replica has permanently lost parts, but ZooKeeper does not know because the loss happened outside the replication log. Silent divergence hides behind healthy metrics until queries start returning wrong results.

What this means

ClickHouse replicated tables coordinate through a shared log in ZooKeeper or ClickHouse Keeper. Each replica downloads parts and applies log entries independently. If a part is lost locally – disk corruption, filesystem error, or an accidental DROP PARTITION on one node – the replica does not detect the absence unless a later log entry references that part. With no pending merge or fetch referencing the missing part, the replica reports an empty queue and continues serving queries with incomplete data. Load balancers keep routing traffic to it.

flowchart TD
  A[Disk corruption or DROP PARTITION] --> B[Parts removed locally]
  B --> C[Replica stays online]
  C --> D{ZK session active}
  D -->|Yes| E[Replication queue empty]
  E --> F[Replica appears healthy]
  F --> G[Queries serve incomplete data]
  G --> H[Silent divergence undetected]

Common causes

CauseWhat it looks likeFirst thing to check
Accidental DROP PARTITION on one replicaRow count drops on one node; other replicas unaffectedSELECT count() FROM db.table on each replica
Disk corruption or filesystem damageParts automatically detached; checksum errors in server logsSELECT name, reason FROM system.detached_parts WHERE database = 'db' AND table = 'table' and grep logs for Checksum or Broken part
Failed fetch never retriedReplicatedPartFailedFetches incremented; missing part ages out of queuesystem.replication_queue for entries with num_tries > 0 and last_exception != ''
Local merge or mutation bug (rare)Part count diverges after heavy mutation on a specific versionCross-replica comparison of system.parts grouped by partition_id

Quick checks

Run these on each replica and diff the output.

# Compare row counts across replicas
clickhouse-client -q "SELECT count() FROM db.table"
-- Compare partition-level part counts on each replica
SELECT
    partition_id,
    sum(rows) AS rows,
    count() AS parts
FROM system.parts
WHERE database = 'db' AND table = 'table' AND active = 1
GROUP BY partition_id
ORDER BY partition_id;
-- Check replica health indicators
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    queue_size,
    absolute_delay
FROM system.replicas
WHERE database = 'db' AND table = 'table';
-- Look for stuck or failed replication entries
SELECT
    database,
    table,
    type,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE database = 'db' AND table = 'table' AND num_tries > 0;
-- Check for parts detached due to corruption
SELECT
    name,
    reason,
    modification_time
FROM system.detached_parts
WHERE database = 'db' AND table = 'table';
-- Check for data loss events detected by the server
SELECT event, value
FROM system.events
WHERE event IN ('ReplicatedDataLoss', 'ReplicatedPartChecksFailed');

How to diagnose it

  1. Confirm divergence. Run SELECT count() FROM db.table on every replica. The node with the lower count has lost data.
  2. Drill down to the partition level. Run the partition-level system.parts query on each replica and diff the results. Identify exactly which partition_id values have fewer parts or rows on the diverged node.
  3. Verify replication queue state. Query system.replicas for queue_size, is_readonly, and is_session_expired. In silent divergence these look healthy. An empty queue combined with a row count mismatch is the smoking gun.
  4. Inspect system.detached_parts on the diverged replica. Look for reasons like broken, corruption, or fetch error that explain why parts left the active set.
  5. Check system.events for ReplicatedDataLoss. Any non-zero value means ClickHouse has already detected that a part was irretrievably lost.
  6. If counts are identical but you suspect corruption, run CHECK TABLE db.table on the diverged replica. This is a read-only operation that verifies part checksums without altering data.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Cross-replica row countOnly definitive proof of silent divergenceSELECT count() returns different values for the same table across replicas
system.replicas.queue_sizeEmpty queue with count mismatch confirms silent lossQueue is zero but data counts differ
system.events.ReplicatedDataLossClickHouse detected irretrievable part lossAny non-zero value
system.detached_parts.reasonExplains why parts left the active setReasons containing broken, corruption, or fetch
Per-partition active partsIdentifies exactly which partitions are missingMismatch in count() or sum(rows) per partition_id

Fixes

Re-check with SYSTEM RESTART REPLICA

The first recovery step is SYSTEM RESTART REPLICA db.table. This forces the replica to restart its replication background threads and re-evaluate which parts it should own against ZooKeeper. If the replication log still references the missing parts and healthy peers have copies, the replica schedules fetches to fill the gaps. This is non-destructive and preserves local data that is still valid.

Run it on the diverged replica, then monitor system.replication_queue for new GET_PART entries:

SELECT
    type,
    source_replica,
    new_part_name,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE database = 'db' AND table = 'table'
ORDER BY create_time;

If GET_PART entries appear and num_tries stays at zero, the replica is healing. Wait for queue_size to return to zero and re-verify row counts.

Re-initialize with SYSTEM RESTORE REPLICA

If SYSTEM RESTART REPLICA does not resolve the mismatch, or if system.detached_parts shows widespread corruption, use SYSTEM RESTORE REPLICA db.table. This drops the replica’s ZooKeeper registration and recreates it, then schedules fetches for every active part from other replicas.

Warning: This is disruptive. The replica will not contain complete local data until the fetches finish. Reads may return partial results during recovery. The duration depends on data size, part count, and network bandwidth. Remove the replica from your load balancer or Distributed table rotation before running the command. Run it during a maintenance window.

After issuing the command, watch system.replication_queue on the target replica. You should see a large burst of GET_PART entries. Do not return the replica to service until queue_size reaches zero and cross-replica row counts match.

Manual partition recovery (rare)

If only one or two partitions are missing and you know exactly which ones, you can drop the broken partitions and reattach data from backup. This is only practical when the rest of the replica is healthy and the missing data is small. In most production cases, the two SYSTEM commands are safer and faster.

Prevention

  • Periodic cross-replica audits. Schedule automated row-count and checksum comparisons across replicas for critical tables. Silent divergence is only detectable by comparing actual data.
  • Monitor ReplicatedDataLoss and ReplicatedPartChecksFailed. Any increment in these event counters warrants immediate investigation.
  • Watch system.detached_parts. Growing detached parts with reasons like broken or corruption predict divergence before it becomes silent.
  • Use ON CLUSTER for DDL. Never run DROP PARTITION, ALTER, or TRUNCATE against a single replica. Always use the ON CLUSTER clause so the operation is logged in ZooKeeper and applied consistently.
  • Run CHECK TABLE during low-traffic windows. This verifies part checksums proactively and can surface corruption before parts are needed for queries or merges.

How Netdata helps

  • Correlate hardware-level disk health alerts with replication gaps. A spike in disk errors followed by a diverged replica points to the root cause.
  • Track replication queue depth per replica in real time. A sudden drop to zero after a period of fetch failures can signal a failed part that aged out of the queue.
  • Alert on detached parts growth. Netdata can surface when system.detached_parts increases, giving early warning before divergence affects query results.
  • Correlate query latency anomalies on individual replicas with potential stale-data serving. A replica with missing parts may return faster but incorrect results.
  • Monitor inter-server network throughput during recovery. When SYSTEM RESTORE REPLICA runs, you should see sustained fetch traffic; if bandwidth is low, recovery will stall.