ClickHouse ReplicatedDataLoss > 0: detecting and responding to lost parts

ReplicatedDataLoss > 0 is a hard signal in ClickHouse. A nonzero value in system.events means the server has determined that a data part is missing and cannot be retrieved from any available replica. This is not replication lag, a transient fetch failure, or the normal ReplicatedPartFetchesOfMerged optimization.

Queries that touch the affected part can return incomplete results or errors. The immediate risk is silent divergence between replicas, where one replica serves stale or incomplete results without failing the query. Confirm the event, identify the scope, and determine whether a healthy peer still has the part.

What this means

ReplicatedDataLoss increments only after the server exhausts recovery options for a part. Distinguish it from:

ReplicatedPartFailedFetches: temporary fetch degradation that may self-resolve when the source recovers.
ReplicatedPartChecksFailed: integrity problems that may escalate to declared loss if left unaddressed.
ReplicatedPartFetchesOfMerged: normal optimization where replicas fetch already-merged parts from peers instead of merging locally. This counter grows steadily and is expected behavior.

When ReplicatedDataLoss, ReplicatedPartChecksFailed, and ReplicatedPartFailedFetches climb together, a replica is failing to fetch a part, failing to verify it, and ultimately giving up.
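Because system.events counters accumulate from server start, the pattern described above is easier to read as the change between two samples than as raw values. The following is a minimal Python sketch of that comparison, not anything ClickHouse provides. The function name classify_counter_deltas and the returned labels are hypothetical, and it assumes you have already collected two snapshots of the counters as dictionaries.

```python
from typing import Dict

LOSS = "ReplicatedDataLoss"
CHECKS = "ReplicatedPartChecksFailed"
FETCHES = "ReplicatedPartFailedFetches"
MERGED = "ReplicatedPartFetchesOfMerged"


def classify_counter_deltas(before: Dict[str, int], after: Dict[str, int]) -> str:
    """Label the change between two snapshots of system.events counters.

    The labels are illustrative, not ClickHouse terminology.
    """
    def delta(name: str) -> int:
        # A counter that has never fired may be absent; treat it as 0.
        return after.get(name, 0) - before.get(name, 0)

    # A counter going backwards suggests the server restarted between samples.
    if any(delta(n) < 0 for n in (LOSS, CHECKS, FETCHES, MERGED)):
        return "counter_reset"
    if delta(LOSS) > 0:
        return "data_loss"
    if delta(CHECKS) > 0 or delta(FETCHES) > 0:
        return "leading_indicator"
    if delta(MERGED) > 0:
        return "normal_merged_fetches"
    return "quiet"
```

A result of "normal_merged_fetches" corresponds to the case where only ReplicatedPartFetchesOfMerged is moving, which the list above describes as expected behavior.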

Common causes

Cause	What it looks like	First thing to check
Source replica disk corruption or checksum failure	`ReplicatedPartChecksFailed` climbing alongside `ReplicatedDataLoss`; parts may appear in `system.detached_parts`	`system.replication_queue.last_exception` for checksum or corrupt part errors
Source replica unavailable or network partitioned	`ReplicatedPartFailedFetches` increasing; queue entries retrying against one source	`system.replicas` for `active_replicas < total_replicas` or `is_session_expired = 1`
Part merged or dropped on source before replica fetches it	Fetch fails because the unmerged source part no longer exists; queue shows `GET_PART` with high `num_tries`	`system.parts` on the source for the merged part; check whether `ReplicatedPartFetchesOfMerged` is incrementing
Accidental local deletion or detached parts	Local parts missing but other replicas healthy; queue may be empty while data diverges	Cross-replica row and partition counts
Hardware or filesystem-level corruption	Detached parts with corruption-related reasons; errors in OS logs	`system.detached_parts.reason` and `dmesg` for hardware errors

Quick checks

Run these safe, read-only checks to orient yourself during the first minutes of the incident.

-- Check event counters for loss and related errors
SELECT event, value FROM system.events
WHERE event IN (
    'ReplicatedPartFailedFetches',
    'ReplicatedPartChecksFailed',
    'ReplicatedDataLoss',
    'ReplicatedPartFetchesOfMerged'
);

-- Inspect the replication queue for stuck entries
SELECT database, table, type, source_replica, new_part_name,
       num_tries, last_exception,
       create_time, last_attempt_time
FROM system.replication_queue
WHERE num_tries > 3
ORDER BY num_tries DESC;

-- Check replica availability and session state
SELECT database, table, is_leader, is_readonly, is_session_expired,
       active_replicas, total_replicas
FROM system.replicas
WHERE is_session_expired = 1 OR is_readonly = 1
   OR active_replicas < total_replicas;

-- Look for parts detached due to corruption or fetch failures
SELECT database, table, name, reason, modification_time
FROM system.detached_parts
ORDER BY modification_time DESC;

# Search ClickHouse logs for corruption indicators (shell command, not SQL)
grep -Ei 'checksum|corrupt|Broken part|Cannot read all data|Mismatch' /var/log/clickhouse-server/*.log | tail -100

# Check OS-level hardware errors (shell command; may need elevated privileges)
dmesg | tail -100

-- Compare row counts across replicas for suspected tables
-- Run this on each replica and compare results
SELECT count() FROM your_db.your_table;

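-- Compare partition-level counts across replicas
-- Run this on each replica; the filter keeps other tables out of the totals
SELECT partition_id, sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = 'your_db' AND table = 'your_table'
GROUP BY partition_id
ORDER BY partition_id;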
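Comparing the partition-level output by eye gets error-prone with more than a handful of partitions. The sketch below shows one way to diff the results in Python once you have collected them from each replica. The function name diverged_partitions and the replica names are hypothetical, and it assumes each replica's result has been loaded into a dictionary of partition_id to row count. Counts can differ briefly while inserts and merges are still replicating, so treat a single mismatch as a reason to re-check rather than as proof of loss.

```python
from typing import Dict, List, Tuple

PartitionCounts = Dict[str, int]  # partition_id -> sum(rows)


def diverged_partitions(
    per_replica: Dict[str, PartitionCounts],
) -> List[Tuple[str, Dict[str, int]]]:
    """Return (partition_id, rows per replica) for partitions whose counts differ."""
    all_partitions = set()
    for counts in per_replica.values():
        all_partitions.update(counts)

    result = []
    for partition_id in sorted(all_partitions):
        # A partition that is absent on a replica counts as 0 rows there.
        rows = {replica: counts.get(partition_id, 0)
                for replica, counts in per_replica.items()}
        if len(set(rows.values())) > 1:
            result.append((partition_id, rows))
    return result


if __name__ == "__main__":
    snapshot = {
        "replica_a": {"202401": 1000, "202402": 500},
        "replica_b": {"202401": 1000, "202402": 420},
    }
    for partition_id, rows in diverged_partitions(snapshot):
        print(partition_id, rows)
```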

How to diagnose it

flowchart TD
    A[ReplicatedDataLoss alert] --> B{Distinguish from normal fetches}
    B --> C[Check system.replication_queue]
    C --> D{Stuck entries with exceptions?}
    D -->|Yes| E[Check source replica health]
    D -->|No| F[Compare row counts across replicas]
    E --> G{Source has the part?}
    G -->|Yes| H[Restart replica to re-fetch]
    G -->|No| I[Assess blast radius]
    F --> J[Divergence found] --> I
    F --> K[No divergence] --> L[Investigate detached parts]
    H --> M[Monitor queue for completion]
    I --> N[Restore from peer or rebuild partition]

Confirm the event. Query system.events for ReplicatedDataLoss and compare it with an earlier sample, because the counters accumulate from server start. Check ReplicatedPartFetchesOfMerged at the same time. If only ReplicatedPartFetchesOfMerged is moving and the others are flat, the current activity is normal optimization traffic, not new data loss.
Identify the affected table and replica. Use system.replication_queue to find entries with high num_tries and non-empty last_exception. The database, table, and source_replica columns tell you which peer the replica was trying to fetch from when it failed.
Check source replica health. On the source replica, verify it is not readonly or session-expired using system.replicas. Check its system.parts to see if the missing part still exists there. If the source has the part but is refusing fetches due to network or load, the loss may be recoverable once the source stabilizes.
Assess whether the part was merged away. If the queue shows GET_PART for an unmerged part that no longer exists on the source, check whether the merged result is available. ClickHouse often fetches the merged part instead via ReplicatedPartFetchesOfMerged. If the merged part is also missing everywhere, proceed to blast-radius assessment.
Measure blast radius. Run SELECT count() FROM your_db.your_table on every replica. If counts differ and the gap does not close as replication catches up, the loss has already caused divergence. Use SELECT partition_id, sum(rows) FROM system.parts WHERE active AND database = 'your_db' AND table = 'your_table' GROUP BY partition_id on each replica to identify exactly which partitions are affected. Check system.detached_parts to see if the part was locally detached rather than lost from the cluster.
Check for systemic hardware issues. If last_exception mentions checksum failures or system.detached_parts shows corruption-related reasons, check dmesg for disk or memory errors on both the affected replica and the source. Repeated corruption on the same host indicates a hardware problem.
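The branching in the flowchart and steps above can be condensed into a small lookup, which some teams find useful in runbook tooling. This is only an illustrative Python sketch: the function name next_step and its parameters are hypothetical, and the returned strings paraphrase the flowchart nodes. Passing None for an argument means that check has not been done yet.

```python
from typing import Optional

BLAST_RADIUS = "assess blast radius, then restore from peer or rebuild partition"


def next_step(stuck_queue_entries: bool,
              source_has_part: Optional[bool] = None,
              counts_diverge: Optional[bool] = None) -> str:
    """Map what is known so far to the next action in the diagnosis flow."""
    if stuck_queue_entries:
        if source_has_part is None:
            return "check source replica health and whether it has the part"
        if source_has_part:
            return "restart replica to re-fetch, then monitor the queue"
        return BLAST_RADIUS
    # No stuck entries: divergence can still exist with a quiet queue.
    if counts_diverge is None:
        return "compare row counts across replicas"
    if counts_diverge:
        return BLAST_RADIUS
    return "investigate detached parts"
```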

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`ReplicatedDataLoss`	Direct indicator that a part is irretrievably lost	Any nonzero value
`ReplicatedPartChecksFailed`	Integrity check failures can escalate into declared data loss	Sustained increase over multiple minutes
`ReplicatedPartFailedFetches`	Degraded replication; source replica may be failing or unreachable	Nonzero rate sustained outside of restart recovery
`system.replication_queue.num_tries`	Permanently stuck entries will not self-heal	Any entry with `num_tries > 10`
`system.detached_parts`	Parts removed due to corruption or failed operations	Unexpected growth with corruption-related reasons
Cross-replica row counts	Detects silent divergence when the replication queue appears healthy	Mismatch between replicas for the same table
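To apply the num_tries warning sign from the table, it can help to group stuck entries by the replica they point at, since many stuck entries against one source suggest a problem with that source. The sketch below assumes the rows of system.replication_queue have already been exported as dictionaries keyed by column name. The function name stuck_by_source is hypothetical, and the default threshold simply reuses the value of 10 from the table.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping

STUCK_TRIES = 10  # warning threshold taken from the table above


def stuck_by_source(queue_rows: Iterable[Mapping[str, Any]],
                    threshold: int = STUCK_TRIES) -> Dict[str, List[str]]:
    """Group queue entries with num_tries above the threshold by source_replica."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in queue_rows:
        if int(row["num_tries"]) > threshold:
            # An empty source_replica is reported under a placeholder key.
            source = str(row.get("source_replica") or "(unknown)")
            grouped[source].append(f'{row["database"]}.{row["table"]}')
    return dict(grouped)
```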

Fixes

If a healthy replica still has the part

When another replica holds the missing part, force the affected replica to re-evaluate its state against ZooKeeper or ClickHouse Keeper and re-fetch.

-- Force re-check against coordination service state
SYSTEM RESTART REPLICA db.table;

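After running this, monitor system.replication_queue to confirm the part is being fetched and that num_tries resets. If the table has instead gone readonly because the replica's metadata is missing from ZooKeeper or ClickHouse Keeper, use:

-- Recreates the replica's metadata in the coordination service from local parts
-- Works only while the table is readonly
SYSTEM RESTORE REPLICA db.table;

Warning: SYSTEM RESTORE REPLICA is intended for lost coordination metadata, not for a part that is simply missing locally. It detaches and reattaches the replica's local parts, which is disruptive and I/O-intensive, and only data that is absent or outdated locally is then fetched from peers.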
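Watching the queue after a restart can be scripted as a small polling loop. The following Python sketch is one possible shape for it, under stated assumptions: fetch_rows is a hypothetical callable you supply that runs the replication queue query for the affected table and returns rows as dictionaries, and the thresholds and timings are placeholders rather than recommended values.

```python
import time
from typing import Any, Callable, List, Mapping


def wait_for_queue_to_drain(
    fetch_rows: Callable[[], List[Mapping[str, Any]]],
    max_tries_allowed: int = 3,
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until no queue entry exceeds max_tries_allowed.

    Returns True when the queue looks healthy, False if the timeout passes first.
    """
    deadline = clock() + timeout_s
    while True:
        rows = fetch_rows()
        if all(int(r["num_tries"]) <= max_tries_allowed for r in rows):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

In practice fetch_rows would wrap whatever client you already use to query ClickHouse. That wiring is left out because it depends on your environment, and injecting sleep and clock keeps the loop easy to exercise without a live server.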

If no replica has the part

When ReplicatedDataLoss has incremented and no peer can provide the part, the data is confirmed lost. Determine the blast radius: which table, which partition, and what time range. Recover from your organization’s backup procedures if they cover the affected partition. If no backup is available, you may need to drop or detach the affected partition to prevent queries from failing on missing data.

Handle detached parts

If system.detached_parts shows parts with corruption-related reasons, do not reattach them blindly. The reason column explains why ClickHouse removed them. Investigate the underlying cause, which is often hardware or filesystem corruption. If a healthy source replica exists, let the replica re-fetch the part instead of reattaching a potentially corrupt local copy.

Address underlying hardware

If dmesg or system.detached_parts points to disk or memory corruption, replace the affected hardware before restoring the replica. Re-fetching parts onto the same failing disk will reproduce the corruption and trigger another data loss event.

Prevention

Monitor ReplicatedPartChecksFailed and ReplicatedPartFailedFetches as leading indicators. Address them before they compound into ReplicatedDataLoss.
Monitor system.replication_queue for stuck entries with high num_tries. A queue entry that is not making progress is a data-loss risk.
Run periodic cross-replica row count and partition-level comparisons. Silent divergence produces zero queue entries and no standard replication alerts.
Do not treat ZooKeeper or ClickHouse Keeper as fire-and-forget infrastructure. Session expiry and coordination latency directly cascade into replication failures.
Investigate unexpected system.detached_parts immediately. They are often the first visible sign of disk or filesystem corruption.

How Netdata helps

Correlates ReplicatedDataLoss with ReplicatedPartFailedFetches and ReplicatedPartChecksFailed so you can see whether the loss followed a fetch degradation or an integrity failure.
Alerts on nonzero ReplicatedDataLoss.
Surfaces replication queue depth, stuck entries, and replica session state alongside the event to accelerate root cause analysis.
Correlates replication errors with disk I/O, network throughput, and ZooKeeper health signals to distinguish source replica pressure from coordination failures.
Tracks per-replica lag and availability, helping you identify which peer should serve as the recovery source.
