ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates

Your inserts are slowing down. system.merges shows long-running background tasks with is_mutation = 1, while the active part count climbs toward the parts_to_delay_insert threshold. Write latency rises and DelayedInserts increases even though memory, disk, and CPU are not exhausted. The culprit is usually an application treating ClickHouse as an OLTP store, issuing ALTER TABLE UPDATE or ALTER TABLE DELETE as routine operations.

In ClickHouse, these statements do not perform row-level updates. They create mutations: asynchronous background jobs that rewrite entire data parts. Mutations queue sequentially per table, consume the same background pool that performs merges, and can silently starve the merge process until parts accumulate and inserts begin to reject.

What this means

When you issue ALTER TABLE UPDATE or ALTER TABLE DELETE, ClickHouse inserts a job into system.mutations. That job rewrites every active part in the target table. Multiple mutations stack up and execute one at a time per table, creating an O(mutations x parts) workload. Each mutation occupies a slot in the background merge and mutation pool, which defaults to a concurrency ratio of 2. While mutations run, regular merges are blocked or delayed.

The result is silent merge starvation. Parts that would normally be consolidated continue to accumulate. Query performance degrades as file counts rise. Eventually insert throttling begins, followed by hard rejections when a partition crosses parts_to_throw_insert.

Lightweight DELETE FROM uses a different mechanism based on mask files. It is less disruptive than classic ALTER TABLE DELETE mutations, though deleted rows are physically removed only during subsequent merges. For update-heavy workloads, avoid mutations entirely and use ReplacingMergeTree or CollapsingMergeTree, which handle changes through insert semantics and background deduplication rather than part rewrites.

flowchart TD
    A[Frequent ALTER UPDATE or DELETE] --> B[Mutation queued per table]
    B --> C[Rewrites every active part]
    C --> D[Background merge pool consumed]
    D --> E[Regular merges blocked]
    E --> F[Active part count rises]
    F --> G[Inserts delayed then rejected]

Common causes

CauseWhat it looks likeFirst thing to check
Application using ALTER TABLE UPDATE for incremental changessystem.mutations shows multiple pending mutations on the same table with is_done = 0SELECT ... FROM system.mutations WHERE is_done = 0
Batch purging with ALTER TABLE DELETEparts_to_do decreasing very slowly or stalledsystem.merges for is_mutation = 1 count
ORM or BI tool emitting ALTER statementsFrequent Alter entries in system.query_log from application usersQuery log filtered by query_kind = 'Alter'
One-off corrections queuing behind a large table mutationSmall mutation blocked for hours behind a rewrite of a billion-row tablecreate_time and parts_to_do per mutation
Lack of ReplacingMergeTree or CollapsingMergeTree for mutable dataPlain MergeTree engine with frequent “fix” operationsSHOW CREATE TABLE for engine type

Quick checks

-- Active mutation queue and remaining work
SELECT database, table, mutation_id, command, create_time, is_done, parts_to_do
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
-- Distinguish mutations from regular merges in the background pool
SELECT count(*) AS active_merges,
       countIf(is_mutation = 1) AS mutations,
       countIf(is_mutation = 0) AS regular_merges
FROM system.merges;
-- Check if inserts are already being throttled or rejected
SELECT event, value
FROM system.events
WHERE event IN ('DelayedInserts', 'RejectedInserts');
-- Worst-case part count across all partitions
SELECT value
FROM system.asynchronous_metrics
WHERE metric = 'MaxPartCountForPartition';
-- Per-table active part counts to spot accumulation
SELECT database, table, count(*) AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;
# Recent mutation-related log lines (path varies by installation)
grep -i mutation /var/log/clickhouse-server/*.log | tail -20

How to diagnose it

  1. Confirm mutation backlog. Query system.mutations WHERE is_done = 0. If multiple rows exist for the same table, they are queuing sequentially and each must process every part.
  2. Check pool allocation. Query system.merges. If is_mutation = 1 dominates the active task list while part counts are high, mutations are monopolizing the background pool.
  3. Correlate with part growth. Compare MaxPartCountForPartition over time. A sustained rise while mutations run indicates merges are starved.
  4. Check insert pipeline pressure. Verify DelayedInserts and RejectedInserts counters. Any sustained increase confirms the write path is backing up.
  5. Identify the source. Search system.query_log for query_kind = 'Alter' to find which user or service account is issuing mutations.
  6. Assess replication impact. For ReplicatedMergeTree, compare system.mutations across replicas. If parts_to_do lags on one replica, that node is the bottleneck for cluster-wide completion.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
system.mutations queue depth (is_done = 0)Direct measure of pending mutation workMore than one pending mutation per table sustained
parts_to_do flat or decreasing slowlyIndicates stalled or extremely slow mutationNo progress for longer than 30 minutes
system.merges with is_mutation = 1Shows pool capacity consumed by mutationsAll active background tasks are mutations
MaxPartCountForPartitionPredicts “too many parts” before inserts failGreater than 500 parts per partition
DelayedInserts / RejectedInsertsConfirms merge or mutation backlog affecting writesAny sustained increase
Active background pool tasksMerge and mutation tasks share this poolsystem.merges task count near the concurrency ratio limit with growing parts

Fixes

Kill unnecessary mutations

If a mutation was issued by mistake or is no longer needed, remove it from the queue.

-- Identify the mutation ID
SELECT database, table, mutation_id
FROM system.mutations
WHERE is_done = 0;

-- Kill the mutation
KILL MUTATION WHERE database = 'db' AND table = 'table' AND mutation_id = 'mutation_id';

Warning: Killing a mutation is disruptive and may leave parts in an intermediate state. Test cancellation behavior on your ClickHouse version before using this during an incident.

Switch to ReplacingMergeTree or CollapsingMergeTree

For workloads that require updates or deletes, redesign the table to avoid mutations:

  • ReplacingMergeTree: Insert new rows with updated values and a version column. Background merges retain the highest version per ORDER BY key. Use FINAL or deduplicate at query time.
  • CollapsingMergeTree: Insert cancel rows (sign = -1) and replacement rows (sign = +1). Background merges collapse matching pairs.

Tradeoff: These engines shift effort from write-time mutation to read-time filtering. Queries must handle duplicate rows that have not yet been merged.

Use lightweight DELETE

If the use case is row deletion, prefer DELETE FROM table WHERE ... over ALTER TABLE DELETE. This applies a mask file instead of rewriting entire parts. It is significantly less I/O-intensive, though cleanup still requires subsequent merges.

Collapse mutations into fewer statements

If multiple column updates are needed, batch them into a single ALTER TABLE UPDATE with a compound condition. One mutation rewriting parts once is far better than five sequential mutations each rewriting the same parts.

Increase background pool headroom

If mutations are legitimate and the server has CPU and disk headroom, you can increase merge and mutation concurrency:

SELECT name, value
FROM system.settings
WHERE name = 'background_merges_mutations_concurrency_ratio';

Only increase this if disk I/O and CPU are not saturated. More concurrency amplifies resource usage without solving the root cause of excessive mutations.

Address stuck or failed mutations

If latest_fail_reason is non-empty in system.mutations, the mutation is failing repeatedly. A failed mutation can block all subsequent mutations on that table. Check disk space, part integrity, and permissions. Resolve the underlying failure or kill the mutation to unblock the queue.

Prevention

  • Never use ALTER TABLE UPDATE/DELETE as a routine application pattern. Treat it as a rare administrative tool.
  • Design tables for immutable inserts. Use ReplacingMergeTree, CollapsingMergeTree, or versioned insert logic instead of in-place updates.
  • Batch corrections into single statements. Avoid issuing many small mutations.
  • Monitor system.mutations proactively. Alert on any is_done = 0 mutation that lasts longer than expected for your table size.
  • Audit query logs for ALTER patterns. Application service accounts should not hold privileges to issue ALTER TABLE UPDATE/DELETE.
  • Watch background pool saturation. If background task utilization is consistently near the concurrency limit, investigate whether mutations are the cause before part counts explode.

How Netdata helps

  • Correlates mutation backlog with merge starvation: Netdata collects system.mutations queue depth alongside system.merges activity, surfacing when is_mutation tasks dominate the background pool.
  • Tracks part count growth: Charts MaxPartCountForPartition and per-table active parts to expose accumulation caused by blocked merges.
  • Surfaces insert pipeline pressure: Alerts on DelayedInserts and RejectedInserts to catch downstream write-path impact.
  • Links resource usage: Correlates mutation-driven I/O with disk latency and memory pressure, distinguishing mutation overload from query load.