ClickHouse replication lag: absolute_delay, queue_size, and catch-up diagnosis

You notice absolute_delay climbing on one replica. SELECTs there return older rows than on peers, and failover to the lagging node risks losing recently inserted data. In ClickHouse, replication lag is not a single failure; it is a symptom with several distinct causes. A replica can fall behind because it cannot pull entries from the Keeper log, because it cannot fetch parts fast enough, because a mutation is blocking the queue, or because the source replica itself is too slow to serve data.

This guide shows how to use system.replicas, system.replication_queue, and background pool metrics to separate a transient catch-up from a structural bottleneck. The goal is to answer two questions fast: where is the replica stuck, and will it recover on its own?

What this means

Three signals tell most of the story:

- absolute_delay in system.replicas is the replica's lag in seconds: how far its newest data trails the most recent inserts. It rises when the replica cannot fetch and attach recent parts. During normal operation it should stay near zero. Sustained values above 120 seconds warrant investigation, and values above 300 seconds are abnormal.
- queue_size counts operations the replica has already pulled from the Keeper replication log but has not executed yet. It includes fetches (GET_PART), local merges (MERGE_PARTS), mutations (MUTATE_PART), and drop ranges.
- log_max_index - log_pointer is the number of log entries the replica has not yet pulled into its queue. If this delta is large while queue_size is small, the replica has stopped consuming the log, usually due to a lost Keeper session or a readonly state. If queue_size is large while the log delta is small, the replica is pulling entries but cannot execute them.

```mermaid
flowchart TD
    A[absolute_delay rising] --> B{queue_size high?}
    B -->|yes| C{log_max_index - log_pointer high?}
    B -->|no| D[Check is_readonly and is_session_expired]
    C -->|yes| E[Pulling but falling behind: network, source slow, fetch pool]
    C -->|no| F[Executing but blocked: mutations, disk, fetch pool]
    D --> G[Recover Keeper session or replica state]
    E --> H[Inspect network and source replica]
    F --> I[Inspect replication_queue and background pools]
```
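
The decision tree can be approximated in a single query against system.replicas. This is a sketch rather than anything built into ClickHouse: the cutoffs (20 queued entries, 100 unpulled log entries) are illustrative assumptions, so adjust them to what is normal for your cluster.

```sql
-- Classify each replicated table by where replication appears to be stuck.
-- The cutoffs 20 and 100 are assumed values for illustration only.
SELECT
    database,
    table,
    absolute_delay,
    queue_size,
    log_max_index - log_pointer AS entries_behind,
    multiIf(
        is_readonly = 1 OR is_session_expired = 1, 'coordination: check Keeper session and readonly state',
        queue_size > 20 AND entries_behind > 100, 'pulling but falling behind',
        queue_size > 20, 'executing but blocked',
        entries_behind > 100, 'not pulling the log',
        'healthy or transient'
    ) AS lag_class
FROM system.replicas
ORDER BY absolute_delay DESC;
```

The coordination check comes first because a readonly or expired replica can show misleading queue numbers until its session recovers.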

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Fetch pool saturation | `queue_size` growing, `log_max_index - log_pointer` small, many `GET_PART` entries waiting | `system.metrics` for `BackgroundFetchesPoolTask` / `BackgroundFetchesPoolSize` |
| Network bandwidth bottleneck | Replication traffic saturates the link, fetches progress slowly, interserver latency high | OS network counters and `InterserverConnection` count |
| Heavy mutation blocking replication | `MUTATE_PART` entries dominate the queue or a mutation consumes the merge/fetch pipeline | `system.mutations` where `is_done = 0` |
| Source replica slow or starved | `last_exception` references source replica timeouts or missing parts | Health of the replica listed in `source_replica` |
| Keeper session loss or readonly state | `is_readonly = 1` or `is_session_expired = 1`, large log delta with small queue | `system.zookeeper_connection` and replica session state |
| Insufficient disk space for incoming parts | Fetches cannot land, merges halt, disk free space low | `system.disks` free space and part growth |

Quick checks

Run these in order. All are read-only.

```sql
-- Replication lag overview
SELECT
    database,
    table,
    absolute_delay,
    queue_size,
    log_max_index - log_pointer AS entries_behind,
    is_readonly,
    is_session_expired
FROM system.replicas
WHERE engine LIKE '%Replicated%'
ORDER BY absolute_delay DESC;
```

```sql
-- Stuck or failing replication tasks
SELECT
    database,
    table,
    type,
    source_replica,
    num_tries,
    last_exception,
    create_time,
    last_attempt_time
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;
```

```sql
-- Background pool utilization
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%'
ORDER BY metric;
```

```sql
-- Active mutations competing with replication
SELECT
    database,
    table,
    mutation_id,
    command,
    create_time,
    parts_to_do,
    latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
```

```sql
-- Keeper connection state
SELECT
    name,
    host,
    port,
    is_expired,
    session_uptime_elapsed_seconds,
    session_timeout_ms
FROM system.zookeeper_connection;
```

```sql
-- Disk space for incoming parts and merges
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total,
    round(100 * (1 - free_space / total_space), 1) AS used_pct
FROM system.disks;
```

```sql
-- Replication error counters
SELECT event, value
FROM system.events
WHERE event IN (
    'ReplicatedPartFailedFetches',
    'ReplicatedPartChecksFailed',
    'ReplicatedDataLoss'
);
```

How to diagnose it

1. Confirm the lag is real and growing. Sample absolute_delay twice over 60-120 seconds. If it is flat or decreasing, the replica is catching up. If it is increasing, the replica is falling further behind.
2. Locate the bottleneck between pull and execute. Compare queue_size with entries_behind (log_max_index - log_pointer).
   - Large entries_behind + small queue_size = the replica is not pulling from the log. Investigate Keeper connectivity and is_session_expired.
   - Small entries_behind + large queue_size = the replica is pulling but cannot execute. Investigate background pools, mutations, disk, and network.
3. Inspect the replication queue for stuck entries. num_tries > 0 with a non-empty last_exception means an operation is failing and retrying. Note the type (GET_PART, MERGE_PARTS, MUTATE_PART) and the source_replica. One stuck entry can block later entries for the same partition.
4. Check background pool saturation. In system.metrics, compare BackgroundFetchesPoolTask to BackgroundFetchesPoolSize. If the fetch pool is consistently near capacity while GET_PART tasks are queued, fetches are the bottleneck. Also check BackgroundMergesAndMutationsPoolTask against its size; mutations can starve merges and indirectly slow replication.
5. Look for heavy mutations. In system.mutations, any row with is_done = 0 and a large or unchanging parts_to_do is consuming background resources. A slow mutation serializes work per part and can stall GET_PART operations behind it.
6. Validate the source replica. For stuck GET_PART entries, check the health of the source replica listed in system.replication_queue. If the source is in a merge death spiral, out of disk space, or has high query load, it cannot serve parts fast enough.
7. Check disk and Keeper health. Even a healthy replica cannot catch up if system.disks shows low free space, or if system.zookeeper_connection shows is_expired = 1. Merges need free space to write output before deleting sources, and fetches need space to land new parts.
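
When the queue is long, reading it entry by entry is slow. A grouped view along the following lines can show which operation type dominates and whether entries are failing or only postponed. It is a sketch that assumes the is_currently_executing and postpone_reason columns are present in system.replication_queue on your version.

```sql
-- Summarize pending replication work per table and entry type.
SELECT
    database,
    table,
    type,
    count() AS entries,
    countIf(num_tries > 0) AS retried,
    max(num_tries) AS max_tries,
    countIf(is_currently_executing) AS executing,
    min(create_time) AS oldest_entry,
    anyIf(postpone_reason, postpone_reason != '') AS sample_postpone_reason
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY entries DESC;
```

Many GET_PART entries with few executing points toward the fetch pool or the source replica. A queue dominated by MUTATE_PART points toward the mutation checks.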

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `absolute_delay` | Measures wall-clock data freshness on the replica | Sustained > 120 seconds, or > 300 seconds at any time |
| `queue_size` | Counts pending replication work already pulled from the log | Growing steadily for > 15 minutes |
| `log_max_index - log_pointer` | Shows whether the replica is consuming the Keeper log | Large and growing while `queue_size` stays small |
| `system.replication_queue` entries with `num_tries > 0` | Identifies operations that are failing instead of making progress | Any entry with `num_tries > 5` and non-empty `last_exception` |
| `BackgroundFetchesPoolTask` / `BackgroundFetchesPoolSize` | Reveals fetch pool saturation | Ratio sustained above ~0.9 while `GET_PART` queue grows |
| `is_readonly` / `is_session_expired` | Indicates the replica has lost write coordination | Any non-zero value sustained longer than a brief reconnect |
| `ReplicatedPartFailedFetches` | Tracks fetch failures that can silently accumulate | Sustained positive rate, not just a restart blip |
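
Words like "sustained" and "growing" imply a trend, which a single read of system.replicas cannot show. If no external monitoring is in place, one option is the server's own metric history. The sketch below assumes system.asynchronous_metric_log is enabled and that it exposes the ReplicasMaxAbsoluteDelay and ReplicasMaxQueueSize metrics through columns named metric and value. Both assumptions depend on version and configuration, so check them on your server first.

```sql
-- Per-minute trend of the worst replica delay and queue depth on this server.
-- Values are server-wide maxima, not per-table figures.
SELECT
    toStartOfMinute(event_time) AS minute,
    maxIf(value, metric = 'ReplicasMaxAbsoluteDelay') AS max_absolute_delay,
    maxIf(value, metric = 'ReplicasMaxQueueSize') AS max_queue_size
FROM system.asynchronous_metric_log
WHERE metric IN ('ReplicasMaxAbsoluteDelay', 'ReplicasMaxQueueSize')
  AND event_time > now() - INTERVAL 30 MINUTE
GROUP BY minute
ORDER BY minute;
```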

Fixes

Fetch pool saturation

If the fetch pool is fully utilized and GET_PART entries are piling up, the replica cannot download parts fast enough. Temporary relief options:

- Throttle insert load on the source replicas so fewer new parts need to be fetched.
- If the cluster has headroom, increase the fetch pool capacity. The setting that controls this (background_fetches_pool_size in current releases) is server-level and may require a restart depending on your ClickHouse version; verify in your version documentation before changing it.
- For a replica that is very far behind, consider temporarily redirecting reads away from it so catch-up traffic does not compete with user queries for the same disk and network.

Tradeoff: increasing fetch threads raises CPU, disk, and network load on both source and target replicas.
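
Before changing any setting, it helps to confirm that the pool is actually full. A minimal sketch of the ratio check, assuming both metrics are present in system.metrics on your version:

```sql
-- Fetch pool utilization: active fetch tasks divided by pool size.
-- Returns NULL for utilization if the pool size metric is missing or zero.
SELECT
    sumIf(value, metric = 'BackgroundFetchesPoolTask') AS active_fetches,
    sumIf(value, metric = 'BackgroundFetchesPoolSize') AS pool_size,
    if(pool_size > 0, round(active_fetches / pool_size, 2), NULL) AS utilization
FROM system.metrics
WHERE metric IN ('BackgroundFetchesPoolTask', 'BackgroundFetchesPoolSize');
```

A single reading near 1.0 is not conclusive. Repeat it a few times while watching the number of queued GET_PART entries.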

Network bandwidth bottleneck

If interserver throughput is at the link capacity:

- Schedule large catch-ups during low-traffic windows.
- Reduce the number of concurrent fetches to avoid saturating the link with many slow transfers.
- If possible, provision higher bandwidth between replicas or place replicas in the same availability zone.

Tradeoff: fewer concurrent fetches mean slower catch-up but more predictable latency for distributed queries.
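
To see whether in-flight fetches are slow rather than simply numerous, the system.replicated_fetches table lists fetches currently in progress. The query below is a sketch that assumes the column names shown here match your version. The average rate is a rough figure derived from bytes read so far and elapsed time.

```sql
-- In-flight part fetches with progress and a rough average transfer rate.
SELECT
    database,
    table,
    result_part_name,
    source_replica_hostname,
    round(elapsed, 1) AS elapsed_s,
    round(progress * 100, 1) AS progress_pct,
    formatReadableSize(total_size_bytes_compressed) AS part_size,
    formatReadableSize(bytes_read_compressed / greatest(elapsed, 0.001)) AS avg_rate_per_s
FROM system.replicated_fetches
ORDER BY elapsed DESC;
```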

Heavy mutation blocking replication

If a mutation is monopolizing background resources:

- Use KILL MUTATION for non-critical mutations. Parts that were already mutated keep their changes; the work done so far is not rolled back, and the remaining parts are left unmutated.
- For very large tables, break mutations into smaller time ranges or partitions so each mutation finishes faster.
- Avoid issuing mutations during peak insert hours.

Tradeoff: a killed mutation must be reissued later to cover the remaining parts, but killing it frees background threads for replication immediately.
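
A cautious way to kill a mutation is to dry-run the filter first. In this sketch the database, table, and mutation_id values are hypothetical placeholders; take the real ones from system.mutations.

```sql
-- Dry run: TEST only reports which mutations match, it stops nothing.
KILL MUTATION
WHERE database = 'default' AND table = 'events' AND mutation_id = '0000000012'
TEST;

-- Actual kill, using exactly the same filter.
KILL MUTATION
WHERE database = 'default' AND table = 'events' AND mutation_id = '0000000012';
```

Filtering on all three columns keeps the statement from matching mutations on other tables that happen to share an ID.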

Source replica slow or starved

If the source replica cannot serve parts:

- Diagnose the source using the same part-count, merge, memory, and disk signals you would use for any node.
- Do not restart the lagging replica as a first fix; that only increases catch-up work.
- If the source is in a merge death spiral or out of disk space, fix the source first.
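
To find which source to look at first, failing fetches can be grouped by the replica they reference. This is a sketch built only on columns used earlier in this guide:

```sql
-- Failing GET_PART entries grouped by the replica named in source_replica.
SELECT
    source_replica,
    count() AS failing_fetches,
    max(num_tries) AS max_tries,
    any(last_exception) AS sample_exception
FROM system.replication_queue
WHERE type = 'GET_PART' AND num_tries > 0
GROUP BY source_replica
ORDER BY failing_fetches DESC;
```

If failures cluster on one source_replica, start with that node. If they spread evenly across sources, suspect the lagging replica or the network instead.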

Keeper session loss or readonly state

If is_session_expired = 1 or is_readonly = 1:

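- Check Keeper ensemble health independently with the ruok and mntr four-letter commands.
- Pause non-critical DDL operations to reduce Keeper load.
- For a replica stuck in readonly after a session issue, SYSTEM RESTART REPLICA reinitializes the table's Keeper session state and forces a re-check against Keeper. SYSTEM RESTORE REPLICA covers a different case: the replica's metadata in Keeper is lost while the local data is intact. It works only on a readonly replica and recreates the Keeper metadata from the local parts; parts missing locally are then fetched from peers through normal replication.

Tradeoff: SYSTEM RESTORE REPLICA restores consistency, but it reattaches every local part, which is I/O intensive on large tables, and any parts that are missing or outdated locally add fetch load across the network.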
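
A minimal sketch of the less invasive path: list the affected tables, then restart replication for one of them. The table name default.events is a hypothetical placeholder.

```sql
-- Tables on this server that have lost coordination with Keeper.
SELECT database, table, is_readonly, is_session_expired
FROM system.replicas
WHERE is_readonly = 1 OR is_session_expired = 1;

-- Reinitialize Keeper session state for one table (hypothetical name).
SYSTEM RESTART REPLICA default.events;
```

Run the SELECT again afterwards. If is_readonly stays at 1, the cause is more likely on the Keeper side than in the replica's local state.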

Insufficient disk space

If free space is low:

- Identify the largest tables and drop old partitions if retention policy allows. Detaching alone does not help, because detached parts stay on the same disk in the table's detached directory.
- Verify that TTL rules are actually executing; TTL cleanup depends on merges, and stalled merges leave expired data in place.
- Add capacity before the disk hits the hard stop.
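
To find where the space went, active parts can be summed per partition. This is a sketch using system.parts; the limit of 20 is arbitrary.

```sql
-- Largest partitions by on-disk size, counting active parts only.
SELECT
    database,
    table,
    partition,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    count() AS active_parts,
    min(modification_time) AS oldest_part_written
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;
```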

Prevention

- Alert on absolute_delay sustained above 120 seconds and on queue_size growth rate, not just absolute depth.
- Monitor the ratio of BackgroundFetchesPoolTask to BackgroundFetchesPoolSize so fetch saturation is visible before lag spikes.
- Keep mutations rare and scheduled outside peak insert windows. Monitor system.mutations for is_done = 0 proactively.
- Maintain disk free space well above the largest partition size; merges and fetches need headroom.
- Monitor Keeper latency and session state as a first-class dependency for any replicated deployment.
- After node restarts or failovers, watch the replication queue trend for at least 15 minutes to confirm catch-up is happening.

How Netdata helps

- Correlate absolute_delay and queue_size with BackgroundFetchesPoolTask, interserver network throughput, and disk I/O on the same charts to see whether lag is driven by the fetch pool, the network, or the disk.
- Surface Keeper session flaps and is_readonly state changes alongside replication lag so a coordination failure does not look like a data problem.
- Track per-query memory and mutation progress alongside replication metrics to catch the mutation-blocking-replication pattern.
- Alert on sustained replication lag and fetch pool saturation before reads become stale or failovers become unsafe.
