ClickHouse mutation stuck: parts_to_do not decreasing and how to recover
Check system.mutations and find a mutation active for hours. parts_to_do has not moved in thirty minutes and parts are accumulating. Queries return, but insert latency climbs. On replicated tables, each replica processes mutations independently, so a stall on one node creates silent divergence while the rest of the cluster appears healthy.
Mutations rewrite data parts to apply ALTER UPDATE, ALTER DELETE, or projection changes. They run sequentially per table and share the background merge and mutation pool with regular merges. A stalled mutation blocks subsequent mutations for that table and consumes threads needed for merges. The result is merge starvation, insert delays, and eventually Too many parts rejections.
What this means
A stalled mutation creates a linear bottleneck per table because mutations run sequentially. On replicated tables, the stalled replica diverges until the mutation completes or is killed.
While stalled, the mutation occupies the background pool. Merges for that table slow or stop. Inserts create fresh parts that are not merged away, so active part count rises and parts_to_do stays flat.
flowchart TD
A[Mutation starts on table] --> B{parts_to_do decreasing?}
B -->|Yes| C[Normal progress]
B -->|No for >30 min| D[Stalled mutation]
D --> E[Blocks merge pool]
E --> F[Parts accumulate]
F --> G[Insert delays or rejections]
D --> H[Check latest_fail_reason]
H -->|Non-empty| I[Failing retry loop]
H -->|Empty| J[Pool saturated or resource blocked]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Failing mutation on a specific part | latest_fail_reason is non-empty; mutation retries the same parts without progress | SELECT latest_fail_reason, parts_to_do_names FROM system.mutations WHERE is_done = 0 |
| Background pool saturation | system.merges shows many long-running tasks; no free slots for the mutation to proceed | SELECT count(*) FROM system.merges and background pool metrics from system.metrics |
| Replication queue blockage | MUTATE_PART entries in system.replication_queue have high num_tries and exceptions | SELECT * FROM system.replication_queue WHERE type = 'MUTATE_PART' |
| Resource exhaustion | system.merges is empty despite load, or disk free space is critically low | system.disks and MemoryTracking in system.metrics |
| Legitimate slow progress on large table | parts_to_do decreases very slowly over hours; no error present | Sample parts_to_do at 5-minute intervals |
Quick checks
-- Active mutations with failure context
SELECT database, table, mutation_id, command, create_time, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
-- Snapshot background work to see pool contention
SELECT database, table, elapsed, progress, num_parts, is_mutation, result_part_name
FROM system.merges
ORDER BY elapsed DESC;
-- Replication queue for stuck mutation tasks
SELECT database, table, type, create_time, num_tries, last_exception
FROM system.replication_queue
WHERE type = 'MUTATE_PART'
ORDER BY num_tries DESC;
-- Disk headroom for merge and mutation temp output
SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total
FROM system.disks;
-- Server memory pressure
SELECT metric, formatReadableSize(value) AS memory
FROM system.metrics
WHERE metric = 'MemoryTracking';
-- Active part count for the affected table
SELECT count() AS active_parts
FROM system.parts
WHERE database = 'db' AND table = 'tbl' AND active = 1;
How to diagnose it
Confirm the stall. Query
system.mutationswhereis_done = 0. Notecreate_timeandparts_to_do. Wait five minutes and re-query. Ifparts_to_dois unchanged, the mutation is stalled.Check for repeated failures. If
latest_fail_reasonis non-empty, the mutation is failing on a specific part and retrying without progress. Note the parts listed inparts_to_do_names.Inspect the background pool. Query
system.merges. If many entries showis_mutation = 1or very longelapsedtimes, the pool is saturated. If the pool is empty butparts_to_dois flat, the mutation may be blocked on resources or locks.Evaluate replication state on replicated tables. Query
system.replication_queueforMUTATE_PARTentries. Highnum_trieswithlast_exceptionindicates the replica cannot apply the mutation locally, often because a required source part is missing or a fetch is failing.Check resource availability. Query
system.disksto ensure free space is sufficient for temporary merge output. CheckMemoryTrackingto rule out memory pressure killing or starving background tasks.Distinguish slow from stuck. Very large tables can take hours or days to mutate. If
parts_to_dodrops even slightly across your sampling window, the mutation is slow but healthy. Do not kill it.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
system.mutations.parts_to_do | Measures remaining work per mutation | Flat for more than 30 minutes |
system.mutations.latest_fail_reason | Surfaces part-level failures that block progress | Non-empty on any is_done = 0 mutation |
| Active merge/mutation count | Shows whether the background pool is saturated | Sustained high count with flat parts_to_do |
| Active parts per partition | Stalled mutations block merges, so parts accumulate | Steady growth while a mutation is active |
Replication queue num_tries | Replicated mutations may fail silently on one replica | MUTATE_PART entries with num_tries > 5 |
| Disk free space | Merges and mutations need temp space to write new parts | Free space below 20% of total |
Fixes
Failing mutation with latest_fail_reason
When a mutation repeatedly fails, it will not self-heal. Identify the failing parts from parts_to_do_names. On a replicated table, check whether those parts are healthy on another replica. If a part is corrupt and the table is replicated, you may detach the bad part and re-fetch it from a healthy peer. Once the part is repaired or removed, the mutation should resume. If you cannot repair the part quickly, kill the mutation to unblock the queue:
KILL MUTATION WHERE database = 'db' AND table = 'tbl' AND mutation_id = 'id';
Killing leaves the table in a partially mutated state. Issue a replacement mutation only after confirming the root cause is resolved.
Background pool saturation
If system.merges shows long-running tasks consuming the pool, you must decide whether to wait or kill. Killing a non-critical mutation frees a pool slot and allows merges to resume. In-progress work on the current part is lost; already-completed parts remain mutated. If the pool is saturated by merges on a large table, killing the mutation may be the only way to prevent insert delays. After killing, monitor system.merges to confirm merges restart.
Replication queue blockage
For replicated tables, a mutation may stall because the local replica cannot fetch a required part or because a MUTATE_PART entry is stuck. Do not manually delete entries from system.replication_queue. Instead, investigate the source replica. If the replica is permanently diverged, use SYSTEM RESTART REPLICA or re-initialize the replica. These actions force the replica to reconcile its state and re-fetch missing parts. Expect temporary network load during recovery.
Resource exhaustion
If disk space is below the threshold needed for temporary merge output, background writes will halt. Free space immediately by detaching old partitions or adding capacity. If memory is saturated, large mutations may be aborted by the memory tracker. Address the memory pressure first by killing runaway queries or reducing concurrency. Resource fixes are the only durable solution; killing the mutation without fixing the resource only delays the next stall.
Legitimate slow mutation
For tables with billions of rows, mutations are inherently slow. If parts_to_do is decreasing, leave the mutation alone. Monitor part count to ensure merges are not starved. If you need faster completion, the only safe option is to reduce concurrent load on the node so the background pool can dedicate more cycles to the mutation.
Prevention
- Monitor
system.mutationsproactively. A mutation backlog is a leading indicator of merge starvation. - Avoid issuing many sequential
ALTER UPDATEorALTER DELETEstatements on the same table. Batch changes or use lightweight deletes where your version supports them. - Maintain disk free space well above 20%. Merges and mutations need headroom to write output.
- On replicated tables, monitor mutation progress per replica. A stall on one replica creates silent divergence.
- Set alerting on
parts_to_doflatlining for more than 30 minutes and on non-emptylatest_fail_reason.
How Netdata helps
- Correlate flat
parts_to_dowith rising active part count to expose merge starvation caused by stuck mutations. - Alert on background pool saturation and merge stall signals that accompany mutation blockage.
- Track per-table part growth trends to separate mutation-induced accumulation from insert overload.
- Surface disk and memory pressure metrics alongside mutation status to reveal resource root causes.
- Monitor replication lag per replica when mutations run on replicated tables, so a stalled replica does not go unnoticed.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool
- ClickHouse monitoring checklist: the signals every production cluster needs
- ClickHouse monitoring maturity model: from survival to expert







