$ guides / clickhouse / clickhouse-background-pool-saturation ▌

Operations Guides

ClickHouse background pool saturation: when merges and mutations starve

Insert latency is climbing and DelayedInserts is ticking up. system.merges shows every slot occupied, yet system.parts keeps growing. When the background merge and mutation pool saturates, new merges queue instead of starting, parts accumulate, and the distance to insert rejections shrinks fast. Distinguish true thread starvation from I/O-bound stalls, identify when mutations are the culprit, and relieve pressure before inserts fail.

What this means

ClickHouse consolidates immutable parts using a dedicated thread pool tracked in system.metrics as BackgroundMergesAndMutationsPoolTask. Concurrency is controlled by background_merges_mutations_concurrency_ratio (default 2); the older background_pool_size is deprecated. Merges and mutations share this pool. Separate pools handle fetches, moves, and distributed sends.

Pool saturation means every thread is occupied. The effects are:

New merges queue.
Merge completion rate drops below part creation rate.
Part count per partition rises.
Inserts delay once parts_to_delay_insert is crossed, then reject at parts_to_throw_insert.

Causes include too much work (mutations, many small inserts), too little concurrency, or threads stuck on disk I/O. The symptoms look identical in pool metrics, but the fixes differ.

flowchart TD
    A[Part count growing] --> B{Pool saturated?}
    B -->|Yes| C{Mutations running?}
    B -->|No| D[Check insert batch size and partition key]
    C -->|Yes| E[Kill or wait for mutations]
    C -->|No| F{I/O await high?}
    F -->|Yes| G[Reduce competing I/O or add capacity]
    F -->|No| H[Increase concurrency ratio or throttle inserts]

Common causes

Cause	What it looks like	First thing to check
Mutation backlog	`system.merges` shows `is_mutation = 1` occupying slots; `system.mutations` has `is_done = 0` with slow or flat `parts_to_do`	`SELECT count(*) FROM system.mutations WHERE is_done = 0`
Insert rate exceeds merge throughput	Steady part-count growth, pool at 100%, no mutations, merges running but not fast enough	`SELECT event, value FROM system.events WHERE event IN ('InsertQuery','InsertedRows')`
I/O saturation masquerading as thread starvation	Pool threads are active but merge `progress` is frozen or crawling; disk await is high	`iostat -xz 1 5` and `SELECT progress, elapsed FROM system.merges`
Single large merge hogging slots	One entry in `system.merges` with very large `num_parts`, `elapsed` over an hour, consuming most memory	`SELECT elapsed, num_parts, memory_usage FROM system.merges ORDER BY elapsed DESC`
Many small inserts creating unmergeable overhead	High `InsertQuery` count but low `InsertedRows` per query; part count spikes faster than usual	Rows per insert from `system.events` deltas

Quick checks

-- Check background pool utilization
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%'
ORDER BY metric;

-- Check active merges and whether they are mutations
SELECT
    database,
    table,
    elapsed,
    progress,
    num_parts,
    is_mutation,
    formatReadableSize(memory_usage) AS mem
FROM system.merges
ORDER BY elapsed DESC;

-- Check for incomplete mutations
SELECT
    database,
    table,
    mutation_id,
    parts_to_do,
    is_done,
    latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;

-- Check active parts per partition (worst offenders)
SELECT
    database,
    table,
    partition_id,
    count(*) AS parts
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY parts DESC
LIMIT 10;

-- Check insert pressure and early warnings
SELECT event, value
FROM system.events
WHERE event IN ('DelayedInserts', 'RejectedInserts', 'InsertQuery', 'InsertedRows');

# Check disk I/O latency during the incident
iostat -xz 1 5

How to diagnose it

Confirm pool saturation. Query system.metrics for BackgroundMergesAndMutationsPoolTask and BackgroundMergesAndMutationsPoolSize. A sustained ratio at or near 1.0 for more than 10 minutes during active inserts means saturation.
See what occupies the slots. Query system.merges. If is_mutation = 1, mutations are competing directly with merges for threads.
Check mutation backlog. Query system.mutations for is_done = 0. If parts_to_do is flat or failing to decrease, a mutation is stuck or very large.
Correlate with part-count growth. Query system.parts grouped by partition_id. If the max per partition rises while the pool is full, merge throughput is the bottleneck.
Distinguish I/O-bound from CPU/thread-bound stalls. In system.merges, sample progress twice over 60 seconds. If it barely moves and iostat shows high await, the threads are waiting on disk, not compute.
Check for insert-pattern changes. Compare InsertQuery to InsertedRows in system.events. A sudden jump in query count with flat row count means smaller batches, which creates more parts per row and overwhelms the pool.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Background pool utilization (`BackgroundMergesAndMutationsPoolTask / PoolSize`)	Headroom for merge and mutation work	Sustained > 0.9 for > 10 minutes
Active parts per partition	Distance to insert throttling and rejection	Within 50% of configured `parts_to_delay_insert`
DelayedInserts	Earliest warning that merges are falling behind	Counter increasing over successive samples
Merge progress velocity	Whether active merges are completing or stuck	`progress` flat for > 5 minutes
Disk I/O await	Distinguishes thread starvation from I/O wait	> 20 ms on SSD, or trending sharply up
Mutation `parts_to_do`	Mutation backlog consuming pool capacity	Not decreasing over a 30-minute window

Fixes

Kill or pause heavy mutations

If system.merges shows is_mutation = 1 and system.mutations has stalled work, cancel non-critical mutations to free pool capacity.

-- Identify the mutation
SELECT database, table, mutation_id
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time
LIMIT 1;

-- Kill it. Destructive: partial work is discarded.
KILL MUTATION WHERE database = 'db' AND table = 'table' AND mutation_id = '0000000000';

Tradeoff: Killed mutations must be reissued later. On large tables, even valid mutations may take hours; schedule them during low-traffic windows.

Throttle or batch inserts

If the pool is saturated because inserts arrive faster than merges can consolidate, reduce the part creation rate. Increase batch size so each INSERT creates fewer parts, or temporarily slow the ingest pipeline.

Tradeoff: Upstream systems may need to buffer data or tolerate higher latency.

Increase merge concurrency

If the node has idle CPU and disk headroom, raise background_merges_mutations_concurrency_ratio above its default of 2. This increases the number of concurrent merge and mutation tasks. Configuration changes may not affect work already queued.

Tradeoff: More concurrent merges increase I/O and CPU consumption. Do not raise this if disk await is already elevated.

Address I/O saturation

If iostat shows high await and merge progress is frozen, the pool is not starved for threads; it is starved for disk. Reduce competing load:

Temporarily route heavy analytical queries away from the node.
Pause backups or large exports touching the same volumes.
If you use tiered storage, verify that hot-tier disk bandwidth is not the bottleneck.

Tradeoff: Reduced query capacity or added infrastructure cost.

Reclaim space to unblock merges

If disk space is tight, merges cannot allocate temporary output files and will stall. See ClickHouse disk space collapse: why merges need free space and how the spiral starts for emergency space recovery.

Prevention

Monitor the merge-to-insert ratio. Track part creation rate against merge completion rate. If the ratio stays above 1.0 for more than a few minutes, the pool is losing ground.
Avoid mutation storms. Do not issue ALTER UPDATE/DELETE during peak ingest. Mutations are full part rewrites, not row updates, and they monopolize pool threads. See ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates.
Batch inserts. Target thousands of rows per INSERT. Many small inserts create parts faster than any pool configuration can merge.
Size pools to I/O capacity, not just core count. A concurrency ratio of 2 is a starting point. If your storage cannot sustain random I/O from more concurrent merges, raising the ratio will only turn thread starvation into I/O saturation.

How Netdata helps

Correlate BackgroundMergesAndMutationsPoolTask with active part count and DelayedInserts on the same timeline to spot saturation.
Alert when pool utilization crosses 90% alongside positive part-count growth, catching the bottleneck before inserts delay.
Surface disk I/O latency next to merge activity to distinguish thread starvation from I/O waits without manual iostat.
Track mutation backlog and merge duration trends without polling system.mutations.
Visualize insert latency degradation as an early warning, often hours before RejectedInserts appears.

The Netdata solution

ClickHouse monitoring with Netdata

Netdata monitors ClickHouse with per-second metrics and ML anomaly detection. Track merge debt, memory usage, replication lag, Keeper/ZooKeeper saturation, and disk headroom against the host signals that drive them.

See ClickHouse monitoring → Start monitoring free

ClickHouse background pool saturation: when merges and mutations starve

ClickHouse background pool saturation: when merges and mutations starve

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Kill or pause heavy mutations

Throttle or batch inserts

Increase merge concurrency

Address I/O saturation

Reclaim space to unblock merges

Prevention

How Netdata helps

Related guides

ClickHouse monitoring with Netdata