ClickHouse ReplicatedDataLoss > 0: detecting and responding to lost parts

ReplicatedDataLoss > 0 is a hard signal in ClickHouse. A nonzero value in system.events means the server has determined that a data part is missing and cannot be retrieved from any available replica. This is not replication lag, a transient fetch failure, or the normal ReplicatedPartFetchesOfMerged optimization.

Queries that touch the affected part can return incomplete results or errors. The immediate risk is silent divergence between replicas, where one replica serves stale or incomplete results without failing the query. Confirm the event, identify the scope, and determine whether a healthy peer still has the part.

What this means

ReplicatedDataLoss increments only after the server exhausts recovery options for a part. Distinguish it from:

  • ReplicatedPartFailedFetches: temporary fetch degradation that may self-resolve when the source recovers.
  • ReplicatedPartChecksFailed: integrity problems that may escalate to declared loss if left unaddressed.
  • ReplicatedPartFetchesOfMerged: normal optimization where replicas fetch already-merged parts from peers instead of merging locally. This counter grows steadily and is expected behavior.

When ReplicatedDataLoss, ReplicatedPartChecksFailed, and ReplicatedPartFailedFetches climb together, a replica is failing to fetch a part, failing to verify it, and ultimately giving up.
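
One way to tell these cases apart is to compare two samples of system.events taken a few minutes apart. The sketch below is a hypothetical helper, not part of ClickHouse: classify_counters and its return strings are illustrative names, and it assumes the server did not restart between samples, since the counters start again from zero on restart.

```python
from typing import Dict

# Counter names as they appear in system.events.
LOSS = "ReplicatedDataLoss"
CHECKS = "ReplicatedPartChecksFailed"
FETCH_FAIL = "ReplicatedPartFailedFetches"
MERGED = "ReplicatedPartFetchesOfMerged"


def classify_counters(before: Dict[str, int], after: Dict[str, int]) -> str:
    """Classify what changed between two samples of system.events.

    A counter missing from a sample is treated as 0, on the assumption
    that an event which has not fired yet may be absent from the result.
    """
    def delta(name: str) -> int:
        return after.get(name, 0) - before.get(name, 0)

    if delta(LOSS) > 0:
        if delta(CHECKS) > 0 and delta(FETCH_FAIL) > 0:
            return "data loss after failed fetches and failed checks"
        return "data loss"
    if delta(CHECKS) > 0:
        return "integrity failures, loss not yet declared"
    if delta(FETCH_FAIL) > 0:
        return "fetch degradation, may self-resolve"
    if delta(MERGED) > 0:
        return "normal merged-part fetches"
    return "no change"
```

Feed it two dictionaries built from the event and value columns. Checking the loss counter first is deliberate: any increase there outranks the leading indicators.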

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Source replica disk corruption or checksum failure | ReplicatedPartChecksFailed climbing alongside ReplicatedDataLoss; parts may appear in system.detached_parts | system.replication_queue.last_exception for checksum or corrupt part errors |
| Source replica unavailable or network partitioned | ReplicatedPartFailedFetches increasing; queue entries retrying against one source | system.replicas for active_replicas < total_replicas or is_session_expired = 1 |
| Part merged or dropped on source before replica fetches it | Fetch fails because the unmerged source part no longer exists; queue shows GET_PART with high num_tries | system.parts on the source for the merged part; check whether ReplicatedPartFetchesOfMerged is incrementing |
| Accidental local deletion or detached parts | Local parts missing but other replicas healthy; queue may be empty while data diverges | Cross-replica row and partition counts |
| Hardware or filesystem-level corruption | Detached parts with corruption-related reasons; errors in OS logs | system.detached_parts.reason and dmesg for hardware errors |

Quick checks

Run these safe, read-only checks to orient yourself during the first minutes of the incident.

-- Check event counters for loss and related errors
SELECT event, value FROM system.events
WHERE event IN (
    'ReplicatedPartFailedFetches',
    'ReplicatedPartChecksFailed',
    'ReplicatedDataLoss',
    'ReplicatedPartFetchesOfMerged'
);

-- Inspect the replication queue for stuck entries
SELECT database, table, type, source_replica, new_part_name,
       num_tries, last_exception,
       create_time, last_attempt_time
FROM system.replication_queue
WHERE num_tries > 3
ORDER BY num_tries DESC;

-- Check replica availability and session state
SELECT database, table, is_leader, is_readonly, is_session_expired,
       active_replicas, total_replicas
FROM system.replicas
WHERE is_session_expired = 1 OR is_readonly = 1
   OR active_replicas < total_replicas;

-- Look for parts detached due to corruption or fetch failures
SELECT database, table, name, reason, modification_time
FROM system.detached_parts
ORDER BY modification_time DESC;

# Shell: search ClickHouse logs for corruption indicators
grep -Ei 'checksum|corrupt|Broken part|Cannot read all data|Mismatch' /var/log/clickhouse-server/*.log | tail -100

# Shell: check OS-level hardware errors
dmesg | tail -100

-- Compare row counts across replicas for suspected tables
-- Run this on each replica and compare results
SELECT count() FROM your_db.your_table;

-- Compare partition-level counts across replicas
-- Run this on each replica; the filter keeps other tables out of the sums
SELECT partition_id, sum(rows) AS rows
FROM system.parts
WHERE active AND database = 'your_db' AND table = 'your_table'
GROUP BY partition_id
ORDER BY partition_id;
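
Comparing the per-partition results by eye gets error-prone with more than a few partitions. The following is a minimal sketch of the comparison, assuming you have already collected the partition_id and row sums from each replica into plain dictionaries. The function name diverged_partitions and the input shape are illustrative choices, not anything ClickHouse provides.

```python
from typing import Dict, Set


def diverged_partitions(
    rows_by_replica: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, int]]:
    """Return partitions whose row counts differ between replicas.

    rows_by_replica maps replica name -> {partition_id: sum(rows)}.
    A partition absent on a replica is reported with 0 rows there.
    """
    all_partitions: Set[str] = set()
    for partitions in rows_by_replica.values():
        all_partitions.update(partitions)

    diverged: Dict[str, Dict[str, int]] = {}
    for partition_id in sorted(all_partitions):
        counts = {
            replica: partitions.get(partition_id, 0)
            for replica, partitions in rows_by_replica.items()
        }
        # More than one distinct count means the replicas disagree.
        if len(set(counts.values())) > 1:
            diverged[partition_id] = counts
    return diverged
```

An empty result means the replicas agree on every partition. A mismatch can also reflect ordinary replication lag, so repeat the comparison once the queue is idle before treating it as loss.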

How to diagnose it

flowchart TD
    A[ReplicatedDataLoss alert] --> B{Distinguish from normal fetches}
    B --> C[Check system.replication_queue]
    C --> D{Stuck entries with exceptions?}
    D -->|Yes| E[Check source replica health]
    D -->|No| F[Compare row counts across replicas]
    E --> G{Source has the part?}
    G -->|Yes| H[Restart replica to re-fetch]
    G -->|No| I[Assess blast radius]
    F --> J[Divergence found] --> I
    F --> K[No divergence] --> L[Investigate detached parts]
    H --> M[Monitor queue for completion]
    I --> N[Restore from peer or rebuild partition]
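
  1. Confirm the event. Query system.events for ReplicatedDataLoss. The counters in system.events are cumulative since the server started, so sample them twice a few minutes apart to see which ones are still moving. If ReplicatedDataLoss is nonzero, treat it as a real loss event and continue. If it is zero or absent and only ReplicatedPartFetchesOfMerged is moving while the others are flat, this is normal optimization traffic, not data loss.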

  2. Identify the affected table and replica. Use system.replication_queue to find entries with high num_tries and non-empty last_exception. The database, table, and source_replica columns tell you which peer the replica was trying to fetch from when it failed.

  3. Check source replica health. On the source replica, verify it is not readonly or session-expired using system.replicas. Check its system.parts to see if the missing part still exists there. If the source has the part but is refusing fetches due to network or load, the loss may be recoverable once the source stabilizes.

  4. Assess whether the part was merged away. If the queue shows GET_PART for an unmerged part that no longer exists on the source, check whether the merged result is available. ClickHouse often fetches the merged part instead via ReplicatedPartFetchesOfMerged. If the merged part is also missing everywhere, proceed to blast-radius assessment.

  5. Measure blast radius. Run SELECT count() FROM table on every replica. If counts differ, the loss has already caused divergence. Use SELECT partition_id, sum(rows) FROM system.parts WHERE active AND database = 'your_db' AND table = 'your_table' GROUP BY partition_id on each replica to identify exactly which partitions are affected. Without the database and table filter, the sums mix parts from every table on the server. Check system.detached_parts to see if the part was locally detached rather than lost from the cluster.

  6. Check for systemic hardware issues. If last_exception mentions checksum failures or system.detached_parts shows corruption-related reasons, check dmesg for disk or memory errors on both the affected replica and the source. Repeated corruption on the same host indicates a hardware problem.
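
Step 2 can be partly automated once the rows from system.replication_queue are available as dictionaries keyed by column name. The sketch below groups retrying entries by table and source peer, so a single failing source stands out. The helper name group_stuck_entries and the default threshold are assumptions made for illustration; the threshold mirrors the num_tries > 3 filter used in the quick checks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]


def group_stuck_entries(rows: Iterable[dict], min_tries: int = 3) -> Dict[Key, List[dict]]:
    """Group retrying queue entries by (database, table, source_replica).

    Only entries with num_tries above min_tries and a non-empty
    last_exception are kept.
    """
    groups: Dict[Key, List[dict]] = defaultdict(list)
    for row in rows:
        if row["num_tries"] > min_tries and row["last_exception"]:
            key = (row["database"], row["table"], row["source_replica"])
            groups[key].append(row)
    # Within each group, put the most-retried entry first.
    for entries in groups.values():
        entries.sort(key=lambda r: r["num_tries"], reverse=True)
    return dict(groups)
```

If most groups share one source_replica, start the health check in step 3 on that host.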

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| ReplicatedDataLoss | Direct indicator that a part is irretrievably lost | Any nonzero value |
| ReplicatedPartChecksFailed | Integrity check failures can escalate into declared data loss | Sustained increase over multiple minutes |
| ReplicatedPartFailedFetches | Degraded replication; source replica may be failing or unreachable | Nonzero rate sustained outside of restart recovery |
| system.replication_queue.num_tries | Permanently stuck entries will not self-heal | Any entry with num_tries > 10 |
| system.detached_parts | Parts removed due to corruption or failed operations | Unexpected growth with corruption-related reasons |
| Cross-replica row counts | Detects silent divergence when the replication queue appears healthy | Mismatch between replicas for the same table |
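
Two of these warning signs translate directly into small checks. The sketch below is one possible reading, assuming the counter is sampled at a fixed interval (for example once a minute) and that "sustained" means it rose in every one of the last few intervals. The function names and the default of three intervals are illustrative assumptions; the num_tries threshold of 10 comes from the table.

```python
from typing import Sequence


def is_sustained_increase(samples: Sequence[int], min_intervals: int = 3) -> bool:
    """True if a cumulative counter rose in each of the last min_intervals intervals.

    samples holds counter values in chronological order, oldest first.
    """
    if min_intervals < 1 or len(samples) < min_intervals + 1:
        return False
    recent = samples[-(min_intervals + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def count_stuck_entries(num_tries: Sequence[int], threshold: int = 10) -> int:
    """Count queue entries whose num_tries exceeds the threshold."""
    return sum(1 for n in num_tries if n > threshold)
```

A single jump followed by flat samples does not count as sustained, which keeps one-off fetch failures during a restart from paging anyone.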

Fixes

If a healthy replica still has the part

When another replica holds the missing part, force the affected replica to re-evaluate its state against ZooKeeper or ClickHouse Keeper and re-fetch.

-- Force re-check against coordination service state
SYSTEM RESTART REPLICA db.table;

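After running this, monitor system.replication_queue to confirm the part is being fetched and that num_tries resets. If SYSTEM RESTART REPLICA does not resolve the loss and the replica's metadata in ZooKeeper or ClickHouse Keeper is itself missing, use:

-- Recreates the replica's metadata in the coordination service from its local parts
SYSTEM RESTORE REPLICA db.table;

Warning: SYSTEM RESTORE REPLICA is documented for the case where local data is present but the coordination-service metadata has been lost, and it works only on replicated tables that are in read-only mode. It is disruptive and I/O-intensive, and it does not by itself bring back a part that no replica holds.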

If no replica has the part

When ReplicatedDataLoss has incremented and no peer can provide the part, the data is confirmed lost. Determine the blast radius: which table, which partition, and what time range. Restore from backup if your organization's backups cover the affected partition. If no backup is available, you may need to drop or detach the affected partition to prevent queries from failing on missing data.

Handle detached parts

If system.detached_parts shows parts with corruption-related reasons, do not reattach them blindly. The reason column explains why ClickHouse removed them. Investigate the underlying cause, which is often hardware or filesystem corruption. If a healthy source replica exists, let the replica re-fetch the part instead of reattaching a potentially corrupt local copy.

Address underlying hardware

If dmesg or system.detached_parts points to disk or memory corruption, replace the affected hardware before restoring the replica. Re-fetching parts onto the same failing disk will reproduce the corruption and trigger another data loss event.

Prevention

  • Monitor ReplicatedPartChecksFailed and ReplicatedPartFailedFetches as leading indicators. Address them before they compound into ReplicatedDataLoss.
  • Monitor system.replication_queue for stuck entries with high num_tries. A queue entry that is not making progress is a data-loss risk.
  • Run periodic cross-replica row count and partition-level comparisons. Silent divergence produces zero queue entries and no standard replication alerts.
  • Do not treat ZooKeeper or ClickHouse Keeper as fire-and-forget infrastructure. Session expiry and coordination latency directly cascade into replication failures.
  • Investigate unexpected system.detached_parts immediately. They are often the first visible sign of disk or filesystem corruption.

How Netdata helps

  • Correlates ReplicatedDataLoss with ReplicatedPartFailedFetches and ReplicatedPartChecksFailed so you can see whether the loss followed a fetch degradation or an integrity failure.
  • Alerts on nonzero ReplicatedDataLoss.
  • Surfaces replication queue depth, stuck entries, and replica session state alongside the event to accelerate root cause analysis.
  • Correlates replication errors with disk I/O, network throughput, and ZooKeeper health signals to distinguish source replica pressure from coordination failures.
  • Tracks per-replica lag and availability, helping you identify which peer should serve as the recovery source.