ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
MEMORY_LIMIT_EXCEEDED errors climb in the query log. Queries that normally finish in seconds now take minutes or are killed outright. The ClickHouse process is near its memory limit, but killing the heaviest query only frees capacity for a moment before another query is killed. If the application retries immediately, pressure never drops. With spill-to-disk enabled, the bottleneck shifts to disk I/O, starving background merges and slowing the whole system.
This is the memory pressure death spiral: a composite failure pattern where memory saturation triggers query termination or spill-to-disk, rising latency provokes retries, and repeated attempts keep memory pinned near the limit. The spiral ends with an OS-level OOM kill if internal tracker and cgroup limits are misaligned, or with a merge crisis if temporary disk files consume all I/O bandwidth.
What this means
ClickHouse enforces memory limits hierarchically: per-query (max_memory_usage), per-user (max_memory_usage_for_user), and server-wide (max_server_memory_usage). When any tracker hits its limit, ClickHouse kills the offending query with exception code 241. The server-wide limit defaults to 90% of physical RAM. Per-query limits live in user profiles; if unset, a query allocates without an individual bound.
Once memory pressure begins, three effects compound:
- Query kills do not free headroom. Surviving queries, cache restoration, or background merges immediately reclaim the freed memory.
- Retries amplify load. An application that reissues a killed query immediately sends fresh allocations into the same constrained pool.
- Spill-to-disk shifts the bottleneck. If spill-to-disk is enabled, queries write temporary files instead of failing. This turns memory pressure into disk I/O pressure that competes with merges and increases latency.
The result is a feedback loop: pressure causes slowdowns and kills, which trigger retries, which restore pressure. Breaking the loop requires identifying whether the root cause is a single runaway query, unbounded concurrency, or an application retry storm.
flowchart TD
A[Workload surge or runaway query] --> B[Memory allocations rise]
B --> C{Near max_server_memory_usage?}
C -->|Yes| D[Queries killed or spilled to disk]
D --> E[Query latency increases]
E --> F[Application retries]
F --> B
D --> G[Disk I/O saturation]
G --> H[Merge throughput drops]
H --> I[Part count rises]
I --> J[Further latency increase]
J --> ECommon causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Runaway query | A single query shows peak_memory_usage orders of magnitude above the norm in system.processes | SELECT query_id, memory_usage, peak_memory_usage, query FROM system.processes ORDER BY memory_usage DESC LIMIT 5 |
| Retry amplification | The same query pattern appears repeatedly in system.query_log with exception_code = 241 | SELECT exception_code, count(), any(exception) FROM system.query_log WHERE type = 'ExceptionWhileProcessing' AND event_time > now() - INTERVAL 10 MINUTE GROUP BY exception_code |
| Unbounded concurrent analytics | Total memory across all running queries approaches 70% of the server limit with no single dominant query | SELECT formatReadableSize(sum(memory_usage)) FROM system.processes |
| Cache over-allocation | MemoryTracking is high despite low query concurrency; mark cache or uncompressed cache consume most RAM | SELECT metric, value FROM system.asynchronous_metrics WHERE metric IN ('MarkCacheBytes', 'UncompressedCacheBytes') |
| Spill-to-disk without I/O headroom | /var/lib/clickhouse/tmp/ grows and disk I/O latency spikes while queries continue rather than fail | ls -lah /var/lib/clickhouse/tmp/ and iostat -xz 1 5 |
| Background merge memory spikes | Memory pressure coincides with long-running merges in system.merges | SELECT database, table, elapsed, formatReadableSize(memory_usage) FROM system.merges ORDER BY memory_usage DESC |
Quick checks
# Check ClickHouse tracked memory vs configured server limit
clickhouse-client -q "SELECT formatReadableSize(value) AS memory_tracking FROM system.metrics WHERE metric = 'MemoryTracking'"
# Check OS RSS and peak to spot divergence from tracked memory
cat /proc/$(pgrep clickhouse-server)/status | grep -E '^(VmRSS|VmPeak|VmSize)'
# Top memory-consuming queries right now
clickhouse-client -q "SELECT query_id, formatReadableSize(memory_usage) AS mem, formatReadableSize(peak_memory_usage) AS peak, substring(query, 1, 120) AS q FROM system.processes WHERE memory_usage > 0 ORDER BY memory_usage DESC LIMIT 10"
# Recent MEMORY_LIMIT_EXCEEDED errors
clickhouse-client -q "SELECT query_id, exception_code, substring(query, 1, 120) AS q FROM system.query_log WHERE type = 'ExceptionWhileProcessing' AND exception_code = 241 AND event_time > now() - INTERVAL 10 MINUTE LIMIT 10"
# Check for spill-to-disk activity in tmp
ls -lah /var/lib/clickhouse/tmp/
# Check concurrent query count and total query memory
clickhouse-client -q "SELECT count() AS queries, formatReadableSize(sum(memory_usage)) AS total_query_mem FROM system.processes"
# Check mark cache size
clickhouse-client -q "SELECT metric, formatReadableSize(value) FROM system.asynchronous_metrics WHERE metric = 'MarkCacheBytes'"
# Check swap usage (swap thrashing makes recovery nearly impossible)
free -h | grep -i swap
How to diagnose it
- Confirm server-wide pressure. Compare
MemoryTrackingfromsystem.metricstomax_server_memory_usage. If the ratio is sustained above 80%, the server is in the danger zone. Also checkMemoryResidentinsystem.asynchronous_metricsbecause RSS can exceed tracked memory due to allocator overhead and untracked allocations. - Identify the memory consumer category. Run the
system.processesquery sorted bymemory_usage. If one query dominates, it is a runaway query. If many moderate queries sum to most of the limit, it is a concurrency problem. - Check for retry amplification. Query
system.query_logfor exception code 241 over the last 10 minutes. If the same query fingerprint appears multiple times with short intervals, the application is retrying immediately. - Check spill-to-disk activity. Inspect
/var/lib/clickhouse/tmp/. Growing files there mean queries are spilling. Correlate with disk I/O metrics (iostator your infrastructure monitoring). High I/O wait during memory pressure confirms the bottleneck has shifted to disk. - Correlate with merge health. Check
system.merges. If merges are running but disk I/O is saturated, merge throughput drops and parts begin to accumulate. This is the secondary death spiral. - Check OS and cgroup limits. In containerized deployments, verify that the cgroup memory limit is not close to ClickHouse’s internal limit. If the container limit is lower than or equal to
max_server_memory_usage, the OS OOM killer can fire before ClickHouse’s circuit breaker engages.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| MemoryTracking / max_server_memory_usage | Measures proximity to ClickHouse’s internal query-kill threshold | Sustained ratio > 80% |
| Peak per-query memory | Identifies runaway queries before they consume the server | Single query > 50% of server limit |
| FailedQuery rate with exception 241 | Direct evidence of the kill-and-retry loop | Any sustained increase over baseline |
| Mark cache hit rate | Memory pressure evicts caches, causing more disk I/O | < 80% during pressure events (from system.events MarkCacheHits and MarkCacheMisses) |
| Disk I/O await | Spill-to-disk and cache misses shift saturation to I/O | > 20 ms on SSD sustained |
| Active merge count and throughput | I/O competition from spills starves merges, causing part accumulation | Merge throughput flat or falling while parts grow |
| OSMemoryAvailable | The OS OOM killer operates independently of ClickHouse trackers | < 5% of total RAM |
| Concurrent query count | Heavy concurrency compounds moderate per-query memory into server pressure | Approaching max_concurrent_queries or 2x typical peak |
Fixes
Kill the runaway query and bound per-query memory
Warning: Killing a query is disruptive. The client receives an error and may retry immediately, so coordinate with the application owner before killing production traffic.
Identify the top consumer in system.processes and kill it:
KILL QUERY WHERE query_id = '<query_id>';
Set max_memory_usage in the user profile to prevent recurrence. A sensible starting point is 10-20% of max_server_memory_usage for ad-hoc users, lower for service accounts. The tradeoff is that legitimate large queries fail instead of completing slowly.
Stop retry amplification
Implement exponential back-off and circuit-breaking at the application layer. A query that failed with code 241 should wait seconds, not milliseconds, before retrying, and should give up after a small number of attempts. The tradeoff is temporary query unavailability, but this prevents the retry loop from pinning the server at 100% memory.
Reduce or remove spill-to-disk pressure
If max_bytes_before_external_group_by or max_bytes_before_external_sort are configured and disk I/O is saturated, you have two choices. Raise the thresholds so fewer queries spill, accepting that more will hit code 241. Or leave them enabled but ensure disk I/O capacity is reserved for merges so spills do not compete on the same device. The tradeoff is between query failure and I/O saturation.
Temporarily lower concurrency
If the root cause is a surge of legitimate concurrent heavy queries, reduce max_concurrent_queries temporarily. This queues or rejects new queries until memory drops. The tradeoff is that some client requests fail or wait.
Reclaim memory from caches
If MarkCacheBytes or UncompressedCacheBytes are consuming most RAM, reduce mark_cache_size or disable the uncompressed cache. This frees memory for query working sets at the cost of more disk seeks and decompression on repeated queries.
Align container and ClickHouse limits
In Kubernetes or containerized deployments, ensure the cgroup memory limit leaves headroom above max_server_memory_usage. This prevents the OS OOM killer from terminating the process before ClickHouse can kill the offending query. If the container limit is too tight, lower the configured server-wide limit to leave that headroom.
Prevention
- Set per-query memory limits for every non-admin user profile. An unbounded query can trigger server-wide pressure.
- Monitor both
MemoryTrackingand OS RSS. Alert on divergence because tracked memory underestimates actual footprint. - Instrument application retry logic with back-off and jitter. Immediate retries on 241 are an anti-pattern.
- Size caches conservatively. Leave RAM for query working memory and background merges.
- Validate spill-to-disk thresholds against I/O capacity before enabling. Spilling is only a safety valve if the disk subsystem can absorb the extra load without starving merges.
- Review queries that use
GLOBAL INor largeJOINs. These can cause unbounded memory growth during aggregation.
How Netdata helps
Netdata correlates ClickHouse MemoryTracking with OS RSS and cgroup memory limits in one chart, surfacing tracker divergence immediately.
Netdata alerts on spikes in exception code 241 and sustained memory saturation without requiring manual system.query_log polling.
Netdata disk I/O latency charts, shown alongside query latency, make it easy to see when spill-to-disk shifts the bottleneck from memory to I/O.
Netdata collects per-query memory from system.processes continuously, so runaway queries are visible before they dominate the server.
Netdata merge pool and active part count charts detect secondary merge starvation caused by I/O competition.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool
- ClickHouse monitoring checklist: the signals every production cluster needs
- ClickHouse monitoring maturity model: from survival to expert
- ClickHouse projections and hidden parts: the part count you can’t see
- ClickHouse small inserts anti-pattern: why single-row inserts melt the merge pool







