ClickHouse merge duration climbing: the leading indicator of part explosion

system.merges shows elapsed times in hours. Your dashboards show P99 merge duration climbing over the past 48 hours. Rising merge duration is the earliest signal that your cluster is heading toward a part-count crisis, typically 1 to 3 days before inserts throttle or fail entirely.

Active part count is a lagging indicator. Merge duration rises while the merge subsystem still has work in progress. By the time part counts cross parts_to_delay_insert or parts_to_throw_insert, the backlog is severe. The operational skill is distinguishing legitimate long merges on large tables from systemic slowdowns that cascade into delayed or rejected inserts, then query degradation.

What this means

ClickHouse writes every INSERT as one or more immutable parts. The MergeTree engine continuously merges smaller parts into larger ones to reduce file count and index overhead.

Merge duration climbs when the background pool saturates, disk I/O is contested, or individual merges grow large enough to dominate threads. A single merge running for hours on a multi-terabyte table can be normal. The danger sign is a sustained increase in P99 merge duration across tables that previously completed similar-sized merges in minutes. Once merge throughput falls below the part creation rate, the system enters a debt spiral. ClickHouse first slows inserts at parts_to_delay_insert, then rejects them at parts_to_throw_insert. At that point the cluster is in a merge death spiral.

flowchart TD
    A[Inserts outpace merges] --> B[Merge duration climbs]
    B --> C[Active parts per partition rise]
    C --> D[Insert throttling begins]
    D --> E[Insert rejection: too many parts]
    E --> F[Write halt and query degradation]

Common causes

CauseWhat it looks likeFirst thing to check
Insert micro-batchingMany small inserts create parts faster than merges consolidate them. P99 merge duration climbs while active parts grow steadily.system.events for InsertQuery vs. InsertedRows ratio. Target 1,000+ rows per insert.
Mutation backlogALTER UPDATE/DELETE operations monopolize background pool threads. system.merges shows is_mutation = 1 consuming slots.SELECT * FROM system.mutations WHERE is_done = 0;
Disk I/O saturationMerge threads exist but throughput is low. Queries compete with merges for bandwidth.iostat -xz 1 5 and system.merges.progress stalling.
Background pool exhaustionMerge and mutation tasks fill the pool. New merges queue while parts accumulate.system.metrics for BackgroundMergesAndMutationsPoolTask near pool size.
Disk space pressureClickHouse halts merges when insufficient space exists to write the merged result before deleting sources.system.disks for unreserved_space approaching zero.
TTL merge competitionTables with TTL generate additional background merges that compete with regular merges for the same pool.system.merges during spikes to see if affected tables have TTL configured.

Quick checks

Run these read-only probes to assess current state.

# Currently running merges and elapsed time
clickhouse-client -q "
SELECT database, table, elapsed, progress, num_parts, is_mutation,
       formatReadableSize(total_size_bytes_compressed) AS total_size
FROM system.merges
ORDER BY elapsed DESC
LIMIT 10;
"
# P99 merge duration over last 24 hours (requires system.part_log)
clickhouse-client -q "
SELECT quantile(0.99)(duration_ms) / 1000 AS p99_merge_sec
FROM system.part_log
WHERE event_type = 'MergeParts'
  AND event_time > now() - INTERVAL 24 HOUR;
"
# Active parts per partition (where explosions happen)
clickhouse-client -q "
SELECT database, table, partition_id, count(*) AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 20;
"
# Mutations blocking the pool
clickhouse-client -q "
SELECT database, table, mutation_id, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
"
# Background pool saturation
clickhouse-client -q "
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%';
"
# Disk space available for merge temp writes
clickhouse-client -q "
SELECT name, formatReadableSize(free_space) AS free,
       formatReadableSize(unreserved_space) AS unreserved
FROM system.disks;
"
# Insert batching health
clickhouse-client -q "
SELECT event, value
FROM system.events
WHERE event IN ('InsertQuery', 'InsertedRows');
"

How to diagnose it

  1. Baseline P99 merge duration. Query system.part_log for the past week. A climb from single-digit minutes toward hours is a red flag.
  2. Check live merge velocity. Sample system.merges twice, 60 seconds apart. If progress is flat for a merge with high elapsed, the merge is stuck or I/O bound.
  3. Map merges to tables and partitions. Correlate the tables with the longest elapsed against system.parts active counts.
  4. Check for mutation interference. If system.merges shows is_mutation = 1 entries consuming threads, query system.mutations for is_done = 0. A mutation with parts_to_do not decreasing is starving merges.
  5. Measure background pool saturation. Compare BackgroundMergesAndMutationsPoolTask to BackgroundMergesAndMutationsPoolSize in system.metrics. If the ratio exceeds 0.9 for more than 10 minutes, the pool is a bottleneck.
  6. Verify disk headroom. Merges require enough space to write the full merged result before source parts are unlinked. If system.disks.unreserved_space approaches zero, merges halt.
  7. Validate insert patterns. If InsertedRows / InsertQuery in system.events yields fewer than 1,000 rows per insert, the application is micro-batching.
  8. Distinguish normal from abnormal. A merge with num_parts = 20 and total_size_bytes_compressed in the hundreds of gigabytes legitimately takes hours. A merge with num_parts = 3 and small footprint taking over an hour indicates a slowdown.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
P99 merge duration (system.part_log)Leading indicator of merge backlogTrending from minutes toward hours over 24-48 hours
Active parts per partition (system.parts)Distance to insert throttling and rejectionSustained growth above 500 per partition
Background merge pool utilization (system.metrics)Whether merge threads are saturatedBackgroundMergesAndMutationsPoolTask / pool size > 0.9 for > 10 min
Mutation queue depth (system.mutations)Hidden consumers of merge capacityAny is_done = 0 with parts_to_do flat for > 30 min
Disk unreserved space (system.disks)Merges halt without temp spaceApproaching zero
Insert delay/rejection counters (system.events)Late-stage symptoms of merge debtDelayedInserts or RejectedInserts increasing
Merge progress velocity (system.merges)Real-time merge healthprogress flat across a 60-second sample

Fixes

Reduce part creation rate. Throttle insertors or increase batch sizes to 1,000+ rows per INSERT. This is the safest first step and buys immediate runway.

Kill blocking mutations. Warning: KILL MUTATION aborts the running mutation. If non-critical ALTER UPDATE/DELETE mutations are consuming the pool, use KILL MUTATION to free threads.

Increase merge concurrency. If CPU and disk I/O have headroom, raise background_merges_mutations_concurrency_ratio. Changes do not affect merges already in flight.

Reclaim disk space. If unreserved_space is low, drop old partitions with ALTER TABLE ... DROP PARTITION or add capacity. Warning: DROP PARTITION is irreversible. DETACH PARTITION alone moves data to the detached/ folder and does not free space.

Separate merge and mutation pools. On newer versions, merges and mutations can use separate pools. If mutations are a chronic issue, configure isolation so they cannot starve regular merges.

Review partition granularity. Over-partitioning (for example, by hour instead of day) multiplies part counts. Changing partition keys requires rewriting data but prevents recurrence.

Prevention

  • Monitor P99 merge duration weekly using system.part_log. Trend it before it becomes an incident.
  • Keep active parts per partition below 50% of parts_to_delay_insert during normal operations.
  • Enforce insert batching at the application layer. Single-row inserts are an anti-pattern in ClickHouse.
  • Review system.mutations before it becomes a queue. Avoid treating ClickHouse like an OLTP store.
  • Maintain disk headroom so merges always have temp space.
  • Alert on background pool utilization trends, not just instantaneous spikes.

How Netdata helps

  • Correlate climbing merge duration with part-count growth, disk I/O latency, and insert rate on unified timelines.
  • Alert on derived metrics like part-count growth rate and merge pool utilization without manual polling of system.metrics.
  • Visualize P99 merge duration trends automatically.
  • Surface host-level disk latency and CPU alongside ClickHouse internal signals.