ClickHouse replication lag: absolute_delay, queue_size, and catch-up diagnosis

You notice absolute_delay climbing on one replica. SELECTs there return older rows than on peers, and failover to the lagging node risks losing recently inserted data. In ClickHouse, replication lag is not a single failure; it is a symptom with several distinct causes. A replica can fall behind because it cannot pull entries from the Keeper log, because it cannot fetch parts fast enough, because a mutation is blocking the queue, or because the source replica itself is too slow to serve data.

This guide shows how to use system.replicas, system.replication_queue, and background pool metrics to separate a transient catch-up from a structural bottleneck. The goal is to answer two questions fast: where is the replica stuck, and will it recover on its own?

What this means

Three signals tell most of the story:

  • absolute_delay in system.replicas is how far the replica lags, in wall-clock seconds. It rises when the replica cannot attach recent parts. During normal operation it should stay near zero. Sustained values above 120 seconds warrant investigation, and values above 300 seconds are abnormal.
  • queue_size counts operations the replica has already pulled from the Keeper replication log but has not executed yet. It includes fetches (GET_PART), local merges (MERGE_PARTS), mutations (MUTATE_PART), and drop ranges.
  • log_max_index - log_pointer is the number of entries in the Keeper replication log that the replica has not yet pulled into its queue. If this delta is large while queue_size is small, the replica has stopped consuming the log, usually because of a lost Keeper session or a readonly state. If queue_size is large while the log delta is small, the replica is pulling entries but cannot execute them.

The decision flow, as a Mermaid flowchart:

flowchart TD
    A[absolute_delay rising] --> B{queue_size high?}
    B -->|yes| C{log_max_index - log_pointer high?}
    B -->|no| D[Check is_readonly and is_session_expired]
    C -->|yes| E[Pulling but falling behind: network, source slow, fetch pool]
    C -->|no| F[Executing but blocked: mutations, disk, fetch pool]
    D --> G[Recover Keeper session or replica state]
    E --> H[Inspect network and source replica]
    F --> I[Inspect replication_queue and background pools]
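
The same decision tree can be approximated in a single query. This is a minimal sketch: the cut-offs of 100 queued entries and 100 unpulled log entries are illustrative assumptions, not ClickHouse defaults, and should be tuned to your insert rate.

```sql
-- Classify each replicated table along the branches of the flowchart above.
-- The thresholds (100) are hypothetical; adjust them to your workload.
SELECT
    database,
    table,
    absolute_delay,
    queue_size,
    log_max_index - log_pointer AS entries_behind,
    multiIf(
        is_readonly OR is_session_expired,
            'coordination problem: check Keeper session',
        queue_size > 100 AND entries_behind > 100,
            'pulling but falling behind: network, source, fetch pool',
        queue_size > 100,
            'executing but blocked: mutations, disk, fetch pool',
        entries_behind > 100,
            'not consuming the log: check Keeper session',
        'healthy or catching up'
    ) AS diagnosis
FROM system.replicas
ORDER BY absolute_delay DESC;
```

The readonly and expired-session branch is checked first because those states make the queue and log numbers unreliable on their own.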

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Fetch pool saturation | queue_size growing, log_max_index - log_pointer small, many GET_PART entries waiting | system.metrics for BackgroundFetchesPoolTask / BackgroundFetchesPoolSize |
| Network bandwidth bottleneck | Replication traffic saturates the link, fetches progress slowly, interserver latency high | OS network counters and InterserverConnection count |
| Heavy mutation blocking replication | MUTATE_PART entries dominate the queue or a mutation consumes the merge/fetch pipeline | system.mutations where is_done = 0 |
| Source replica slow or starved | last_exception references source replica timeouts or missing parts | Health of the replica listed in source_replica |
| Keeper session loss or readonly state | is_readonly = 1 or is_session_expired = 1, large log delta with small queue | system.zookeeper_connection and replica session state |
| Insufficient disk space for incoming parts | Fetches cannot land, merges halt, disk free space low | system.disks free space and part growth |

Quick checks

Run these in order. All are read-only.

-- Replication lag overview
SELECT
    database,
    table,
    absolute_delay,
    queue_size,
    log_max_index - log_pointer AS entries_behind,
    is_readonly,
    is_session_expired
FROM system.replicas
WHERE engine LIKE '%Replicated%'
ORDER BY absolute_delay DESC;

-- Stuck or failing replication tasks
SELECT
    database,
    table,
    type,
    source_replica,
    num_tries,
    last_exception,
    create_time,
    last_attempt_time
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;

-- Background pool utilization
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%'
ORDER BY metric;

-- Active mutations competing with replication
SELECT
    database,
    table,
    mutation_id,
    command,
    create_time,
    parts_to_do,
    latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;

-- Keeper connection state
SELECT
    name,
    host,
    port,
    is_expired,
    session_uptime_elapsed_seconds,
    session_timeout_ms
FROM system.zookeeper_connection;

-- Disk space for incoming parts and merges
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total,
    round(100 * (1 - free_space / total_space), 1) AS used_pct
FROM system.disks;

-- Replication error counters
SELECT event, value
FROM system.events
WHERE event IN (
    'ReplicatedPartFailedFetches',
    'ReplicatedPartChecksFailed',
    'ReplicatedDataLoss'
);

How to diagnose it

  1. Confirm the lag is real and growing. Sample absolute_delay twice over 60-120 seconds. If it is flat or decreasing, the replica is catching up. If it is increasing, the replica is falling further behind.

  2. Locate the bottleneck between pull and execute. Compare queue_size with entries_behind (log_max_index - log_pointer).

    • Large entries_behind + small queue_size = the replica is not pulling from the log. Investigate Keeper connectivity and is_session_expired.
    • Small entries_behind + large queue_size = the replica is pulling but cannot execute. Investigate background pools, mutations, disk, and network.

  3. Inspect the replication queue for stuck entries. num_tries > 0 with a non-empty last_exception means an operation is failing and retrying. Note the type (GET_PART, MERGE_PARTS, MUTATE_PART) and the source_replica. One stuck entry can block later entries for the same partition.

  4. Check background pool saturation. In system.metrics, compare BackgroundFetchesPoolTask to BackgroundFetchesPoolSize. If the fetch pool is consistently near capacity while GET_PART tasks are queued, fetches are the bottleneck. Also check BackgroundMergesAndMutationsPoolTask against its size; mutations can starve merges and indirectly slow replication.

  5. Look for heavy mutations. In system.mutations, any row with is_done = 0 and a large or unchanging parts_to_do is consuming background resources. A slow mutation serializes work per part and can stall GET_PART operations behind it.

  6. Validate the source replica. For stuck GET_PART entries, check the health of the source replica listed in system.replication_queue. If the source is in a merge death spiral, out of disk space, or has high query load, it cannot serve parts fast enough.

  7. Check disk and Keeper health. Even a healthy replica cannot catch up if system.disks shows low free space, or if system.zookeeper_connection shows is_expired = 1. Merges need free space to write output before deleting sources, and fetches need space to land new parts.
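
For steps 3 to 5, a per-type summary of the queue often shows at a glance whether fetches, merges, or mutations dominate. The query below is a sketch that assumes the num_postponed-era columns is_currently_executing and postpone_reason are present in system.replication_queue on your version.

```sql
-- Summarize the replication queue per table and entry type.
-- postpone_reason is sampled from entries that were postponed at least once.
SELECT
    database,
    table,
    type,
    count() AS entries,
    countIf(is_currently_executing) AS executing,
    min(create_time) AS oldest_entry,
    max(num_tries) AS max_tries,
    anyIf(postpone_reason, postpone_reason != '') AS sample_postpone_reason
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY entries DESC;
```

A large GET_PART count with few entries executing points at the fetch pool or the source replica; a large MUTATE_PART count points at system.mutations.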

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| absolute_delay | Measures wall-clock data freshness on the replica | Sustained > 120 seconds, or > 300 seconds at any time |
| queue_size | Counts pending replication work already pulled from the log | Growing steadily for > 15 minutes |
| log_max_index - log_pointer | Shows whether the replica is consuming the Keeper log | Large and growing while queue_size stays small |
| system.replication_queue entries with num_tries > 0 | Identifies operations that are failing instead of making progress | Any entry with num_tries > 5 and non-empty last_exception |
| BackgroundFetchesPoolTask / BackgroundFetchesPoolSize | Reveals fetch pool saturation | Ratio sustained above ~0.9 while GET_PART queue grows |
| is_readonly / is_session_expired | Indicates the replica has lost write coordination | Any non-zero value sustained longer than a brief reconnect |
| ReplicatedPartFailedFetches | Tracks fetch failures that can silently accumulate | Sustained positive rate, not just a restart blip |
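
Several of these warning signs are about trends rather than point values. If no external monitoring is in place, recent history can be read from the server itself. The sketch below assumes system.asynchronous_metric_log is enabled and that your version exposes the asynchronous metrics ReplicasMaxAbsoluteDelay and ReplicasSumQueueSize; both are server-wide aggregates, not per-table values.

```sql
-- Per-minute trend of server-wide replication lag over the last 15 minutes.
SELECT
    toStartOfMinute(event_time) AS minute,
    maxIf(value, metric = 'ReplicasMaxAbsoluteDelay') AS max_absolute_delay,
    maxIf(value, metric = 'ReplicasSumQueueSize') AS sum_queue_size
FROM system.asynchronous_metric_log
WHERE metric IN ('ReplicasMaxAbsoluteDelay', 'ReplicasSumQueueSize')
  AND event_time > now() - INTERVAL 15 MINUTE
GROUP BY minute
ORDER BY minute;
```

A delay that falls minute over minute indicates catch-up; a queue that grows for the whole window matches the "growing steadily" warning sign.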

Fixes

Fetch pool saturation

If the fetch pool is fully utilized and GET_PART entries are piling up, the replica cannot download parts fast enough. Temporary relief options:

  • Throttle insert load on the source replicas so fewer new parts need to be fetched.
  • If the cluster has headroom, increase the fetch pool capacity. The setting that controls this is server-level and may require a restart depending on your ClickHouse version; verify in your version documentation before changing it.
  • For a replica that is very far behind, consider temporarily redirecting reads away from it so catch-up traffic does not compete with user queries for the same disk and network.

Tradeoff: increasing fetch threads raises CPU, disk, and network load on both source and target replicas.
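
To put a number on fetch pool saturation, the two metrics can be combined into one ratio. The second query is an assumption about recent versions: it presumes the system.server_settings table exists and that the pool is sized by a server setting named background_fetches_pool_size; confirm both against your version's documentation.

```sql
-- Fetch pool utilization from the two metrics discussed above.
SELECT
    maxIf(value, metric = 'BackgroundFetchesPoolTask') AS active_fetches,
    maxIf(value, metric = 'BackgroundFetchesPoolSize') AS pool_size,
    round(active_fetches / pool_size, 2) AS utilization
FROM system.metrics
WHERE metric IN ('BackgroundFetchesPoolTask', 'BackgroundFetchesPoolSize');

-- Assumed setting name and table; verify for your ClickHouse version.
SELECT name, value, changed
FROM system.server_settings
WHERE name = 'background_fetches_pool_size';
```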

Network bandwidth bottleneck

If interserver throughput is at the link capacity:

  • Schedule large catch-ups during low-traffic windows.
  • Reduce the number of concurrent fetches to avoid saturating the link with many slow transfers.
  • If possible, provision higher bandwidth between replicas or place replicas in the same availability zone.

Tradeoff: fewer concurrent fetches mean slower catch-up but more predictable latency for distributed queries.
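
In-flight fetches can be inspected on the lagging replica to see whether many transfers are crawling at once. This is a sketch against system.replicated_fetches; the average rate is a rough estimate derived from bytes read so far and elapsed time.

```sql
-- Fetches currently in progress on this replica, slowest-running first.
SELECT
    database,
    table,
    result_part_name,
    source_replica_hostname,
    round(elapsed, 1) AS elapsed_s,
    round(100 * progress, 1) AS progress_pct,
    formatReadableSize(total_size_bytes_compressed) AS part_size,
    -- greatest() guards against division by zero for fetches that just started
    formatReadableSize(
        toUInt64(bytes_read_compressed / greatest(elapsed, 0.001))
    ) AS avg_rate_per_second
FROM system.replicated_fetches
ORDER BY elapsed DESC;
```

Many rows with similar, low rates suggest a saturated link; a few slow rows from one source_replica_hostname suggest a problem on that source.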

Heavy mutation blocking replication

If a mutation is monopolizing background resources:

  • Use KILL MUTATION for non-critical mutations. Changes already applied are not rolled back: parts that were mutated stay mutated, and the remaining parts are left unprocessed.
  • For very large tables, break mutations into smaller time ranges or partitions so each mutation finishes faster.
  • Avoid issuing mutations during peak insert hours.

Tradeoff: a killed mutation must be reissued later, but killing it frees background capacity for replication right away.
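
A sketch of the first two options follows. The database, table, column, mutation_id, and partition values are hypothetical; take the real mutation_id from system.mutations. The last statement assumes the table is partitioned by toYYYYMM of a date column.

```sql
-- Dry run: TEST only checks permissions and lists the mutations that match.
KILL MUTATION
WHERE database = 'analytics' AND table = 'events' AND mutation_id = '0000000042'
TEST;

-- Cancel the mutation once the dry run lists exactly what you expect.
KILL MUTATION
WHERE database = 'analytics' AND table = 'events' AND mutation_id = '0000000042';

-- Reissue the work one partition at a time so each mutation finishes quickly.
ALTER TABLE analytics.events
    UPDATE status = 'archived'
    IN PARTITION 202401
    WHERE status = 'closed';
```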

Source replica slow or starved

If the source replica cannot serve parts:

  • Diagnose the source using the same part-count, merge, memory, and disk signals you would use for any node.
  • Do not restart the lagging replica as a first fix; that only increases catch-up work.
  • If the source is in a merge death spiral or out of disk space, fix the source first.
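
Two queries cover the part-count and merge signals mentioned above. They are a starting sketch, meant to be run on the source replica rather than on the lagging one.

```sql
-- Partitions with the most active parts; high counts suggest merges
-- are not keeping up with inserts.
SELECT
    database,
    table,
    partition_id,
    count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 10;

-- Merges and mutations currently running, longest first.
SELECT
    database,
    table,
    is_mutation,
    num_parts,
    round(elapsed) AS elapsed_s,
    round(100 * progress, 1) AS progress_pct,
    formatReadableSize(memory_usage) AS memory
FROM system.merges
ORDER BY elapsed DESC;
```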

Keeper session loss or readonly state

If is_session_expired = 1 or is_readonly = 1:

  • Check Keeper ensemble health independently with ruok and mntr commands.
  • Pause non-critical DDL operations to reduce Keeper load.
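  • For a replica stuck in readonly after a session issue, SYSTEM RESTART REPLICA can force a re-check against Keeper state. SYSTEM RESTORE REPLICA addresses a different failure: the replica's metadata in Keeper is lost while the data is still on local disk. It works only on readonly tables, recreates the Keeper metadata from the local parts, and does not re-download parts that are already present and up to date.

Tradeoff: SYSTEM RESTORE REPLICA restores consistency, but it re-attaches every local part, which is I/O intensive on large tables, and any parts missing locally still have to be fetched from peers.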
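
Before restarting anything, it helps to list exactly which tables are affected and what Keeper error they last saw. The sketch below assumes your version exposes zookeeper_exception and last_queue_update in system.replicas; the table name in the SYSTEM statement is hypothetical.

```sql
-- Replicas that have lost coordination, with the last Keeper error if any.
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    zookeeper_exception,
    last_queue_update
FROM system.replicas
WHERE is_readonly OR is_session_expired;

-- Re-initialize Keeper session state for one affected table.
SYSTEM RESTART REPLICA analytics.events;
```

Rerun the first query afterwards; the table should drop out of the result once the session is re-established.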

Insufficient disk space

If free space is low:

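  • Identify the largest tables and, if retention policy allows, drop old partitions or detach them and move the detached data off the disk. Detaching alone does not free space, because detached parts stay on the same disk.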
  • Verify that TTL rules are actually executing; TTL cleanup depends on merges, and stalled merges leave expired data in place.
  • Add capacity before the disk hits the hard stop.
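
A sizing query makes the first step concrete. The table name and partition value in the ALTER statement are hypothetical, and the statement assumes monthly partitioning.

```sql
-- Largest partitions by on-disk size across all tables.
SELECT
    database,
    table,
    partition,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;

-- Detach an old partition. The data moves to the table's detached
-- directory on the same disk and keeps occupying space until it is
-- moved elsewhere or removed.
ALTER TABLE analytics.events DETACH PARTITION 202301;
```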

Prevention

  • Alert on absolute_delay sustained above 120 seconds and on queue_size growth rate, not just absolute depth.
  • Monitor the ratio of BackgroundFetchesPoolTask to BackgroundFetchesPoolSize so fetch saturation is visible before lag spikes.
  • Keep mutations rare and scheduled outside peak insert windows. Monitor system.mutations for is_done = 0 proactively.
  • Maintain disk free space well above the largest partition size; merges and fetches need headroom.
  • Monitor Keeper latency and session state as a first-class dependency for any replicated deployment.
  • After node restarts or failovers, watch the replication queue trend for at least 15 minutes to confirm catch-up is happening.
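
The disk headroom rule can be checked with a query that compares free space on each disk with the largest partition stored on it. This is a sketch: the ratio is a heuristic, and the point at which it should alert is left to you.

```sql
-- Free space per disk relative to the largest partition on that disk.
WITH largest AS
(
    SELECT
        disk_name,
        max(partition_bytes) AS largest_partition_bytes
    FROM
    (
        SELECT
            disk_name,
            database,
            table,
            partition_id,
            sum(bytes_on_disk) AS partition_bytes
        FROM system.parts
        WHERE active
        GROUP BY disk_name, database, table, partition_id
    )
    GROUP BY disk_name
)
SELECT
    d.name AS disk,
    formatReadableSize(d.free_space) AS free,
    formatReadableSize(l.largest_partition_bytes) AS largest_partition,
    round(d.free_space / l.largest_partition_bytes, 1) AS headroom_ratio
FROM system.disks AS d
INNER JOIN largest AS l ON d.name = l.disk_name
ORDER BY headroom_ratio ASC;
```

A headroom_ratio close to or below 1 means a merge of the largest partition, or a fetch of an equally large part, may not fit.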

How Netdata helps

  • Correlate absolute_delay and queue_size with BackgroundFetchesPoolTask, interserver network throughput, and disk I/O on the same charts to see whether lag is caused by fetch pool saturation, network, or disk.
  • Surface Keeper session flaps and is_readonly state changes alongside replication lag so a coordination failure does not look like a data problem.
  • Track per-query memory and mutation progress alongside replication metrics to catch the mutation-blocking-replication pattern.
  • Alert on sustained replication lag and fetch pool saturation before reads become stale or failovers become unsafe.