ClickHouse replication lag: absolute_delay, queue_size, and catch-up diagnosis
You notice absolute_delay climbing on one replica. SELECTs there return older rows than on peers, and failover to the lagging node risks losing recently inserted data. In ClickHouse, replication lag is not a single failure; it is a symptom with several distinct causes. A replica can fall behind because it cannot pull entries from the Keeper log, because it cannot fetch parts fast enough, because a mutation is blocking the queue, or because the source replica itself is too slow to serve data.
This guide shows how to use system.replicas, system.replication_queue, and background pool metrics to separate a transient catch-up from a structural bottleneck. The goal is to answer two questions fast: where is the replica stuck, and will it recover on its own?
What this means
Three signals tell most of the story:
- absolute_delay in system.replicas is the replica's lag in wall-clock seconds, that is, how far its data trails the freshest inserts. It rises when the replica cannot attach recent parts. During normal operation it should stay near zero. Sustained values above 120 seconds warrant investigation, and values above 300 seconds are abnormal.
- queue_size counts operations the replica has already pulled from the Keeper replication log but has not executed yet. It includes fetches (GET_PART), local merges (MERGE_PARTS), mutations (MUTATE_PART), and drop ranges.
- log_max_index - log_pointer is the number of log entries the replica has not yet pulled. If this delta is large while queue_size is small, the replica has stopped consuming the log, usually due to a lost Keeper session or a readonly state. If queue_size is large while the log delta is small, the replica is pulling entries but cannot execute them.
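The pull-versus-execute reading of these signals can be sketched as a small classifier. This is a minimal Python sketch, not part of ClickHouse: the function name classify_lag, the returned labels, and both thresholds (queue_high, log_high) are illustrative assumptions to tune for your workload.

```python
def classify_lag(queue_size: int, log_max_index: int, log_pointer: int,
                 queue_high: int = 100, log_high: int = 100) -> str:
    """Label where a lagging replica is most likely stuck.

    queue_high and log_high are assumed thresholds, not ClickHouse defaults.
    """
    entries_behind = log_max_index - log_pointer
    queue_is_high = queue_size >= queue_high
    log_is_high = entries_behind >= log_high

    if log_is_high and not queue_is_high:
        # Log delta is large but little was pulled: the replica stopped consuming the log.
        return "not_pulling"
    if queue_is_high and not log_is_high:
        # Entries were pulled but are not being executed.
        return "not_executing"
    if queue_is_high and log_is_high:
        # Behind on both pulling and executing.
        return "behind_on_both"
    return "healthy_or_transient"
```

For example, classify_lag(queue_size=5, log_max_index=10_000, log_pointer=2_000) returns "not_pulling", which points at Keeper session or readonly state rather than at the fetch pool.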
flowchart TD
A[absolute_delay rising] --> B{queue_size high?}
B -->|yes| C{log_max_index - log_pointer high?}
B -->|no| D[Check is_readonly and is_session_expired]
C -->|yes| E[Pulling but falling behind: network, source slow, fetch pool]
C -->|no| F[Executing but blocked: mutations, disk, fetch pool]
D --> G[Recover Keeper session or replica state]
E --> H[Inspect network and source replica]
F --> I[Inspect replication_queue and background pools]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Fetch pool saturation | queue_size growing, log_max_index - log_pointer small, many GET_PART entries waiting | system.metrics for BackgroundFetchesPoolTask / BackgroundFetchesPoolSize |
| Network bandwidth bottleneck | Replication traffic saturates the link, fetches progress slowly, interserver latency high | OS network counters and InterserverConnection count |
| Heavy mutation blocking replication | MUTATE_PART entries dominate the queue or a mutation consumes the merge/fetch pipeline | system.mutations where is_done = 0 |
| Source replica slow or starved | last_exception references source replica timeouts or missing parts | Health of the replica listed in source_replica |
| Keeper session loss or readonly state | is_readonly = 1 or is_session_expired = 1, large log delta with small queue | system.zookeeper_connection and replica session state |
| Insufficient disk space for incoming parts | Fetches cannot land, merges halt, disk free space low | system.disks free space and part growth |
Quick checks
Run these in order. All are read-only.
-- Replication lag overview
SELECT
database,
table,
absolute_delay,
queue_size,
log_max_index - log_pointer AS entries_behind,
is_readonly,
is_session_expired
FROM system.replicas
WHERE engine LIKE '%Replicated%'
ORDER BY absolute_delay DESC;
-- Stuck or failing replication tasks
SELECT
database,
table,
type,
source_replica,
num_tries,
last_exception,
create_time,
last_attempt_time
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;
-- Background pool utilization
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%'
ORDER BY metric;
-- Active mutations competing with replication
SELECT
database,
table,
mutation_id,
command,
create_time,
parts_to_do,
latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
-- Keeper connection state
SELECT
name,
host,
port,
is_expired,
session_uptime_elapsed_seconds,
session_timeout_ms
FROM system.zookeeper_connection;
-- Disk space for incoming parts and merges
SELECT
name,
path,
formatReadableSize(free_space) AS free,
formatReadableSize(total_space) AS total,
round(100 * (1 - free_space / total_space), 1) AS used_pct
FROM system.disks;
-- Replication error counters
SELECT event, value
FROM system.events
WHERE event IN (
'ReplicatedPartFailedFetches',
'ReplicatedPartChecksFailed',
'ReplicatedDataLoss'
);
How to diagnose it
1. Confirm the lag is real and growing. Sample absolute_delay twice over 60-120 seconds. If it is decreasing, the replica is catching up. If it is flat, the replica is keeping pace but not recovering. If it is increasing, the replica is falling further behind.
2. Locate the bottleneck between pull and execute. Compare queue_size with entries_behind (log_max_index - log_pointer).
   - Large entries_behind + small queue_size = the replica is not pulling from the log. Investigate Keeper connectivity and is_session_expired.
   - Small entries_behind + large queue_size = the replica is pulling but cannot execute. Investigate background pools, mutations, disk, and network.
3. Inspect the replication queue for stuck entries. num_tries > 0 with a non-empty last_exception means an operation is failing and retrying. Note the type (GET_PART, MERGE_PARTS, MUTATE_PART) and the source_replica. One stuck entry can block later entries for the same partition.
4. Check background pool saturation. In system.metrics, compare BackgroundFetchesPoolTask to BackgroundFetchesPoolSize. If the fetch pool is consistently near capacity while GET_PART tasks are queued, fetches are the bottleneck. Also check BackgroundMergesAndMutationsPoolTask against its size; mutations can starve merges and indirectly slow replication.
5. Look for heavy mutations. In system.mutations, any row with is_done = 0 and a large or unchanging parts_to_do is consuming background resources. A slow mutation serializes work per part and can stall GET_PART operations behind it.
6. Validate the source replica. For stuck GET_PART entries, check the health of the source replica listed in system.replication_queue. If the source is in a merge death spiral, out of disk space, or has high query load, it cannot serve parts fast enough.
7. Check disk and Keeper health. Even a healthy replica cannot catch up if system.disks shows low free space, or if system.zookeeper_connection shows is_expired = 1. Merges need free space to write output before deleting sources, and fetches need space to land new parts.
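The first step, comparing two absolute_delay samples, can be sketched in Python as follows. The function name delay_trend, the labels, and the one-second tolerance band are assumptions for illustration. The catch-up estimate is a rough linear extrapolation, not something ClickHouse reports.

```python
def delay_trend(first_delay: float, second_delay: float,
                interval_seconds: float, tolerance: float = 1.0):
    """Compare two absolute_delay samples taken interval_seconds apart.

    Returns (label, eta_seconds). eta_seconds is a linear estimate of the
    time until the delay reaches zero, or None when the replica is not
    catching up. tolerance (seconds) is an assumed noise band.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    change = second_delay - first_delay
    if change > tolerance:
        return ("falling_behind", None)
    if change < -tolerance:
        # Seconds of lag recovered per second of wall-clock time.
        rate = -change / interval_seconds
        return ("catching_up", second_delay / rate)
    return ("flat", None)
```

With samples of 600 and 540 seconds taken 60 seconds apart, the sketch returns ("catching_up", 540.0): the replica recovers one second of lag per second, so roughly nine more minutes at that pace. Real catch-up is rarely linear, so treat the estimate as an order of magnitude only.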
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| absolute_delay | Measures wall-clock data freshness on the replica | Sustained > 120 seconds, or > 300 seconds at any time |
| queue_size | Counts pending replication work already pulled from the log | Growing steadily for > 15 minutes |
| log_max_index - log_pointer | Shows whether the replica is consuming the Keeper log | Large and growing while queue_size stays small |
| system.replication_queue entries with num_tries > 0 | Identifies operations that are failing instead of making progress | Any entry with num_tries > 5 and non-empty last_exception |
| BackgroundFetchesPoolTask / BackgroundFetchesPoolSize | Reveals fetch pool saturation | Ratio sustained above ~0.9 while GET_PART queue grows |
| is_readonly / is_session_expired | Indicates the replica has lost write coordination | Any non-zero value sustained longer than a brief reconnect |
| ReplicatedPartFailedFetches | Tracks fetch failures that can silently accumulate | Sustained positive rate, not just a restart blip |
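Two of the warning signs in the table can be sketched as checks over sampled values. This is a hedged Python sketch: the function names, the meaning of "sustained" (three consecutive samples here), and the sampling interval are assumptions. Only the 120-second, 300-second, and 0.9 thresholds come from the table.

```python
def delay_alert(samples, warn_seconds=120, critical_seconds=300,
                sustained_samples=3):
    """Evaluate absolute_delay readings (oldest first, fixed interval).

    "critical" if any reading exceeds critical_seconds; "warning" if the
    last sustained_samples readings all exceed warn_seconds; else "ok".
    """
    if not samples:
        return "ok"
    if max(samples) > critical_seconds:
        return "critical"
    recent = samples[-sustained_samples:]
    if len(recent) == sustained_samples and all(s > warn_seconds for s in recent):
        return "warning"
    return "ok"


def fetch_pool_saturated(pool_task, pool_size, threshold=0.9):
    """True when BackgroundFetchesPoolTask / BackgroundFetchesPoolSize >= threshold."""
    return pool_size > 0 and pool_task / pool_size >= threshold
```

A single spike above 120 seconds does not trigger the warning in this sketch; three consecutive readings do. That matches the intent of alerting on sustained lag rather than on a restart blip.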
Fixes
Fetch pool saturation
If the fetch pool is fully utilized and GET_PART entries are piling up, the replica cannot download parts fast enough. Temporary relief options:
- Throttle insert load on the source replicas so fewer new parts need to be fetched.
- If the cluster has headroom, increase the fetch pool capacity with the server-level setting background_fetches_pool_size. Depending on your ClickHouse version the change may require a restart; verify in your version's documentation before changing it.
- For a replica that is very far behind, consider temporarily redirecting reads away from it so catch-up traffic does not compete with user queries for the same disk and network.
Tradeoff: increasing fetch threads raises CPU, disk, and network load on both source and target replicas.
Network bandwidth bottleneck
If interserver throughput is at the link capacity:
- Schedule large catch-ups during low-traffic windows.
- Reduce the number of concurrent fetches to avoid saturating the link with many slow transfers.
- If possible, provision higher bandwidth between replicas or place replicas in the same availability zone.
Tradeoff: fewer concurrent fetches mean slower catch-up but more predictable latency for distributed queries.
Heavy mutation blocking replication
If a mutation is monopolizing background resources:
- Use KILL MUTATION for non-critical mutations. Changes already applied by the mutation are not rolled back; only the remaining work is cancelled.
- For very large tables, break mutations into smaller time ranges or partitions so each mutation finishes faster.
- Avoid issuing mutations during peak insert hours.
Tradeoff: a killed mutation must be reissued later, but killing it frees background resources for replication right away.
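Before killing anything, it helps to confirm that a mutation is actually stalled and not just slow. The Python sketch below compares two samples of parts_to_do for unfinished mutations. The function name stalled_mutations and the input shape (a dict from mutation_id to parts_to_do) are assumptions for illustration.

```python
def stalled_mutations(before: dict, after: dict) -> list:
    """Return mutation_ids whose parts_to_do did not shrink between samples.

    before and after map mutation_id -> parts_to_do, taken from two reads
    of system.mutations where is_done = 0. Mutations that appear only in
    the second sample are ignored because there is nothing to compare.
    """
    return sorted(
        mutation_id
        for mutation_id, parts in after.items()
        if parts > 0 and mutation_id in before and parts >= before[mutation_id]
    )
```

With before = {"m1": 40, "m2": 12} and after = {"m1": 40, "m2": 3}, the sketch returns ["m1"]: m2 is making progress, while m1 has not moved and is the candidate for latest_fail_reason inspection or KILL MUTATION.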
Source replica slow or starved
If the source replica cannot serve parts:
- Diagnose the source using the same part-count, merge, memory, and disk signals you would use for any node.
- Do not restart the lagging replica as a first fix; that only increases catch-up work.
- If the source is in a merge death spiral or out of disk space, fix the source first.
Keeper session loss or readonly state
If is_session_expired = 1 or is_readonly = 1:
- Check Keeper ensemble health independently with the ruok and mntr commands.
- Pause non-critical DDL operations to reduce Keeper load.
- For a replica stuck in readonly after a session issue, SYSTEM RESTART REPLICA reinitializes the table's Keeper session state and re-checks the local state against Keeper. If the replica's metadata in Keeper is lost while the local data is still present, SYSTEM RESTORE REPLICA recreates that metadata by re-attaching the local parts. It works only on readonly tables.
Tradeoff: SYSTEM RESTORE REPLICA restores consistency, but re-attaching every local part can be slow on large tables, and parts that are outdated or missing locally still have to be fetched from other replicas.
Insufficient disk space
If free space is low:
- Identify the largest tables and detach old partitions if retention policy allows.
- Verify that TTL rules are actually executing; TTL cleanup depends on merges, and stalled merges leave expired data in place.
- Add capacity before the disk hits the hard stop.
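The headroom requirement can be expressed as a simple check. This is a minimal Python sketch: the function name has_merge_headroom and the safety factor of 2.0 are assumptions, not ClickHouse rules. The only idea taken from this guide is that free space should stay well above the largest partition size.

```python
def has_merge_headroom(free_bytes: int, largest_partition_bytes: int,
                       factor: float = 2.0) -> bool:
    """True when free space is at least factor times the largest partition.

    factor is an assumed safety margin; merges write their output before
    the source parts are deleted, so some multiple above 1.0 is needed.
    """
    return free_bytes >= factor * largest_partition_bytes
```

For a disk with 300 GiB free and a largest partition of 200 GiB, the check fails with the default factor, which is a signal to add capacity or trim retention before merges and fetches start stalling.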
Prevention
- Alert on absolute_delay sustained above 120 seconds and on queue_size growth rate, not just absolute depth.
- Monitor the ratio of BackgroundFetchesPoolTask to BackgroundFetchesPoolSize so fetch saturation is visible before lag spikes.
- Keep mutations rare and scheduled outside peak insert windows. Monitor system.mutations for is_done = 0 proactively.
- Maintain disk free space well above the largest partition size; merges and fetches need headroom.
- Monitor Keeper latency and session state as a first-class dependency for any replicated deployment.
- After node restarts or failovers, watch the replication queue trend for at least 15 minutes to confirm catch-up is happening.
How Netdata helps
- Correlate absolute_delay and queue_size with BackgroundFetchesPoolTask, interserver network throughput, and disk I/O on the same charts to see whether lag is caused by CPU, network, or disk.
- Surface Keeper session flaps and is_readonly state changes alongside replication lag so a coordination failure does not look like a data problem.
- Track per-query memory and mutation progress alongside replication metrics to catch the mutation-blocking-replication pattern.
- Alert on sustained replication lag and fetch pool saturation before reads become stale or failovers become unsafe.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool







