$ guides / clickhouse / clickhouse-replica-is-lost ▌

Operations Guides

ClickHouse Replica is lost: SYSTEM RESTORE REPLICA and recovering a diverged replica

You run SELECT count() on two replicas and get different results. Your application returns inconsistent aggregations depending on which node answers. The replication dashboard looks green: ZooKeeper sessions are active, queue_size is zero, and no replica is readonly. The replica has permanently lost parts, but ZooKeeper does not know because the loss happened outside the replication log. Silent divergence hides behind healthy metrics until queries start returning wrong results.

What this means

ClickHouse replicated tables coordinate through a shared log in ZooKeeper or ClickHouse Keeper. Each replica downloads parts and applies log entries independently. If a part is lost locally – disk corruption, filesystem error, or an accidental DROP PARTITION on one node – the replica does not detect the absence unless a later log entry references that part. With no pending merge or fetch referencing the missing part, the replica reports an empty queue and continues serving queries with incomplete data. Load balancers keep routing traffic to it.

flowchart TD
  A[Disk corruption or DROP PARTITION] --> B[Parts removed locally]
  B --> C[Replica stays online]
  C --> D{ZK session active}
  D -->|Yes| E[Replication queue empty]
  E --> F[Replica appears healthy]
  F --> G[Queries serve incomplete data]
  G --> H[Silent divergence undetected]

Common causes

Cause	What it looks like	First thing to check
Accidental `DROP PARTITION` on one replica	Row count drops on one node; other replicas unaffected	`SELECT count() FROM db.table` on each replica
Disk corruption or filesystem damage	Parts automatically detached; checksum errors in server logs	`SELECT name, reason FROM system.detached_parts WHERE database = 'db' AND table = 'table'` and grep logs for `Checksum` or `Broken part`
Failed fetch never retried	`ReplicatedPartFailedFetches` incremented; missing part ages out of queue	`system.replication_queue` for entries with `num_tries > 0` and `last_exception != ''`
Local merge or mutation bug (rare)	Part count diverges after heavy mutation on a specific version	Cross-replica comparison of `system.parts` grouped by `partition_id`

Quick checks

Run these on each replica and diff the output.

# Compare row counts across replicas
clickhouse-client -q "SELECT count() FROM db.table"

-- Compare partition-level part counts on each replica
SELECT
    partition_id,
    sum(rows) AS rows,
    count() AS parts
FROM system.parts
WHERE database = 'db' AND table = 'table' AND active = 1
GROUP BY partition_id
ORDER BY partition_id;

-- Check replica health indicators
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    queue_size,
    absolute_delay
FROM system.replicas
WHERE database = 'db' AND table = 'table';

-- Look for stuck or failed replication entries
SELECT
    database,
    table,
    type,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE database = 'db' AND table = 'table' AND num_tries > 0;

-- Check for parts detached due to corruption
SELECT
    name,
    reason,
    modification_time
FROM system.detached_parts
WHERE database = 'db' AND table = 'table';

-- Check for data loss events detected by the server
SELECT event, value
FROM system.events
WHERE event IN ('ReplicatedDataLoss', 'ReplicatedPartChecksFailed');

How to diagnose it

Confirm divergence. Run SELECT count() FROM db.table on every replica. The node with the lower count has lost data.
Drill down to the partition level. Run the partition-level system.parts query on each replica and diff the results. Identify exactly which partition_id values have fewer parts or rows on the diverged node.
Verify replication queue state. Query system.replicas for queue_size, is_readonly, and is_session_expired. In silent divergence these look healthy. An empty queue combined with a row count mismatch is the smoking gun.
Inspect system.detached_parts on the diverged replica. Look for reasons like broken, corruption, or fetch error that explain why parts left the active set.
Check system.events for ReplicatedDataLoss. Any non-zero value means ClickHouse has already detected that a part was irretrievably lost.
If counts are identical but you suspect corruption, run CHECK TABLE db.table on the diverged replica. This is a read-only operation that verifies part checksums without altering data.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Cross-replica row count	Only definitive proof of silent divergence	`SELECT count()` returns different values for the same table across replicas
`system.replicas.queue_size`	Empty queue with count mismatch confirms silent loss	Queue is zero but data counts differ
`system.events.ReplicatedDataLoss`	ClickHouse detected irretrievable part loss	Any non-zero value
`system.detached_parts.reason`	Explains why parts left the active set	Reasons containing `broken`, `corruption`, or `fetch`
Per-partition active parts	Identifies exactly which partitions are missing	Mismatch in `count()` or `sum(rows)` per `partition_id`

Fixes

Re-check with SYSTEM RESTART REPLICA

The first recovery step is SYSTEM RESTART REPLICA db.table. This forces the replica to restart its replication background threads and re-evaluate which parts it should own against ZooKeeper. If the replication log still references the missing parts and healthy peers have copies, the replica schedules fetches to fill the gaps. This is non-destructive and preserves local data that is still valid.

Run it on the diverged replica, then monitor system.replication_queue for new GET_PART entries:

SELECT
    type,
    source_replica,
    new_part_name,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE database = 'db' AND table = 'table'
ORDER BY create_time;

If GET_PART entries appear and num_tries stays at zero, the replica is healing. Wait for queue_size to return to zero and re-verify row counts.

Re-initialize with SYSTEM RESTORE REPLICA

If SYSTEM RESTART REPLICA does not resolve the mismatch, or if system.detached_parts shows widespread corruption, use SYSTEM RESTORE REPLICA db.table. This drops the replica’s ZooKeeper registration and recreates it, then schedules fetches for every active part from other replicas.

Warning: This is disruptive. The replica will not contain complete local data until the fetches finish. Reads may return partial results during recovery. The duration depends on data size, part count, and network bandwidth. Remove the replica from your load balancer or Distributed table rotation before running the command. Run it during a maintenance window.

After issuing the command, watch system.replication_queue on the target replica. You should see a large burst of GET_PART entries. Do not return the replica to service until queue_size reaches zero and cross-replica row counts match.

Manual partition recovery (rare)

If only one or two partitions are missing and you know exactly which ones, you can drop the broken partitions and reattach data from backup. This is only practical when the rest of the replica is healthy and the missing data is small. In most production cases, the two SYSTEM commands are safer and faster.

Prevention

Periodic cross-replica audits. Schedule automated row-count and checksum comparisons across replicas for critical tables. Silent divergence is only detectable by comparing actual data.
Monitor ReplicatedDataLoss and ReplicatedPartChecksFailed. Any increment in these event counters warrants immediate investigation.
Watch system.detached_parts. Growing detached parts with reasons like broken or corruption predict divergence before it becomes silent.
Use ON CLUSTER for DDL. Never run DROP PARTITION, ALTER, or TRUNCATE against a single replica. Always use the ON CLUSTER clause so the operation is logged in ZooKeeper and applied consistently.
Run CHECK TABLE during low-traffic windows. This verifies part checksums proactively and can surface corruption before parts are needed for queries or merges.

How Netdata helps

Correlate hardware-level disk health alerts with replication gaps. A spike in disk errors followed by a diverged replica points to the root cause.
Track replication queue depth per replica in real time. A sudden drop to zero after a period of fetch failures can signal a failed part that aged out of the queue.
Alert on detached parts growth. Netdata can surface when system.detached_parts increases, giving early warning before divergence affects query results.
Correlate query latency anomalies on individual replicas with potential stale-data serving. A replica with missing parts may return faster but incorrect results.
Monitor inter-server network throughput during recovery. When SYSTEM RESTORE REPLICA runs, you should see sustained fetch traffic; if bandwidth is low, recovery will stall.

The Netdata solution

ClickHouse monitoring with Netdata

Netdata monitors ClickHouse with per-second metrics and ML anomaly detection. Track merge debt, memory usage, replication lag, Keeper/ZooKeeper saturation, and disk headroom against the host signals that drive them.

See ClickHouse monitoring → Start monitoring free

ClickHouse Replica is lost: SYSTEM RESTORE REPLICA and recovering a diverged replica

ClickHouse Replica is lost: SYSTEM RESTORE REPLICA and recovering a diverged replica

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Re-check with SYSTEM RESTART REPLICA

Re-initialize with SYSTEM RESTORE REPLICA

Manual partition recovery (rare)

Prevention

How Netdata helps

Related guides

ClickHouse monitoring with Netdata