ClickHouse slow queries: diagnosis from query_log to plan to fix
P99 latency climbs before averages move. In ClickHouse, tail latency spikes usually mean a subset of queries is hitting cold data, scanning too many parts, or spilling to disk. The goal is to separate plan problems from resource problems fast, then confirm the fix before the next batch of queries arrives.
What this means
Slow queries show up as rising query_duration_ms in system.query_log for QueryFinish events, or rising elapsed in system.processes. ClickHouse latency is sensitive to how many parts a query opens, whether the mark cache is warm, and whether the pipeline stays in memory.
| Pattern | Likely cause |
|---|---|
| High latency, low CPU | I/O wait from too many parts or cold mark cache |
| High latency, high memory | Spill-to-disk during GROUP BY or ORDER BY |
| High latency on one table after restart | Mark cache warmup |
| Broad latency increase across all queries | Resource contention or system-wide cache cold start |
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Missing partition pruning | Latency jumps on one table; read_rows is far larger than the date range implies | system.query_log.read_rows for the slow query vs. a fast baseline |
| High active part count per partition | General degradation across many queries on the same table | Active parts in system.parts for the target partition |
| Cold mark cache or OS page cache | Spike right after restart or failover; improves over minutes to hours | Mark cache hit rate from system.events |
| Spill-to-disk during GROUP BY or ORDER BY | Single query with very high peak_memory_usage and disk I/O rising | system.processes.memory_usage, temp directory growth |
| Resource contention | Broad latency increase; merges and queries compete for disk or RAM | Aggregate memory in system.processes, OS disk metrics, system.merges |
Quick checks
Run these read-only checks to orient yourself during an incident.
-- P99 tail latency trend over the last hour
SELECT
toStartOfFiveMinutes(event_time) AS window,
quantile(0.99)(query_duration_ms / 1000) AS p99_seconds,
quantile(0.50)(query_duration_ms / 1000) AS p50_seconds,
count() AS query_count
FROM system.query_log
WHERE type = 'QueryFinish'
AND event_time > now() - INTERVAL 1 HOUR
AND query_kind = 'Select'
GROUP BY window
ORDER BY window DESC;
-- Live queries running longer than 60 seconds
SELECT
query_id,
user,
elapsed,
formatReadableSize(memory_usage) AS mem_used,
read_rows,
substring(query, 1, 200) AS query_prefix
FROM system.processes
WHERE elapsed > 60
ORDER BY elapsed DESC
LIMIT 20;
-- Top memory consumers among running queries
SELECT
query_id,
user,
formatReadableSize(memory_usage) AS current_mem,
formatReadableSize(peak_memory_usage) AS peak_mem,
elapsed,
read_rows
FROM system.processes
WHERE memory_usage > 0
ORDER BY memory_usage DESC
LIMIT 20;
-- Active part count per partition for the target table
SELECT
database,
table,
partition_id,
count(*) AS parts_in_partition
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY parts_in_partition DESC
LIMIT 20;
-- Mark cache hit rate
SELECT
sumIf(value, event = 'MarkCacheHits') AS hits,
sumIf(value, event = 'MarkCacheMisses') AS misses,
round(hits * 100.0 / (hits + misses + 1), 2) AS hit_rate_pct
FROM system.events
WHERE event IN ('MarkCacheHits', 'MarkCacheMisses');
-- Recent slow queries with read volume
SELECT
query_duration_ms,
read_rows,
read_bytes,
substring(query, 1, 200) AS query_prefix
FROM system.query_log
WHERE type = 'QueryFinish'
AND event_time > now() - INTERVAL 10 MINUTE
ORDER BY query_duration_ms DESC
LIMIT 10;
# Temp directory size for spill detection.
# Path depends on the ClickHouse tmp_path setting; du can be slow on large directories.
du -sh /var/lib/clickhouse/tmp/
How to diagnose it
Confirm scope from
query_log. Filtersystem.query_logfortype = 'QueryFinish'andquery_kind = 'Select'. Compare current P99 against the same window yesterday. If only one query hash or one table is affected, suspect a plan or cache problem. If all queries are slow, suspect resource contention or a system-wide cache cold start.Identify the culprit query. If the query is still running, use
system.processesordered byelapsed DESC. Capturequery_id,read_rows, andmemory_usage. If it already finished, pull the full query text fromsystem.query_logwherequery_duration_msis an outlier.Check partition pruning. Compare
read_rowsin the slow execution against a known fast execution of the same logical query. Ifread_rowsis orders of magnitude larger, the query is likely doing a full scan. UseEXPLAIN indexes = 1 SELECT ...to verify the partition key appears in the filter and that ClickHouse is pruning rather than falling back to a full scan.Check part count for the target table. Query
system.partswithactive = 1grouped bypartition_id. If the relevant partition has hundreds of active parts, every query must open more files and perform more index lookups. This degrades latency even when the plan is correct. See ClickHouse active part count growing for the merge-side fix.Check cache state. Compute the mark cache hit rate from
MarkCacheHitsandMarkCacheMissesinsystem.events. A rate below 80 percent means queries are reading mark files from disk. After a restart, this is expected and recovers with traffic. If it stays low, the working set may exceed cache size or merges may be churning the cache.Check for spill-to-disk. If the query has high
peak_memory_usageand the server is near memory limits, ClickHouse may spill aggregation or sort data to disk. Check the temp directory for growth during the query. Spill saves the query from cancellation but trades memory for disk I/O and latency.Correlate with resource contention. Check
system.mergesfor active merges on the same table. Background merges are I/O-intensive and compete with queries. Check OS disk latency withiostat -xz 1 5. Ifsystem.processesshows many long-running queries but CPU is moderate, queries are likely I/O-bound from cold cache or merge interference.
flowchart TD
A[P99 latency spike] --> B{Scope from query_log}
B -->|One query/table| C[Check plan: EXPLAIN + read_rows]
B -->|All queries| D[Check system-wide resources]
C --> E{Partition pruning ok?}
E -->|No| F[Fix WHERE clause / partition key]
E -->|Yes| G[Check parts count + cache]
G --> H{Parts high or cache cold?}
H -->|Parts high| I[Address merge debt]
H -->|Cache cold| J[Wait warmup / pre-warm marks]
D --> K[Check memory, disk I/O, merges]
K --> L{Spill or saturation?}
L -->|Spill| M[Tune memory or rewrite query]
L -->|Saturation| N[Kill runaway query or throttle load]Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
system.query_log.query_duration_ms P99 | First user-facing indicator of tail degradation | Sustained P99 > 2x baseline for > 15 minutes |
system.processes.elapsed | Live stuck queries | Any query > 10x expected duration or > 30 minutes |
system.processes.memory_usage | Queries that may spill or starve others | Single query > 50% of max_server_memory_usage |
| Active parts per partition | More parts means more files and seeks | > 100 active parts per partition |
| Mark cache hit rate | Cold marks force disk seeks | Sustained < 80% outside of restart windows |
| Disk I/O await | Merges and queries compete for bandwidth | > 20 ms on SSD or > 100 ms on spinning disk |
system.merges activity | Merges explain I/O noise and temporary slowdown | Elapsed > 1 hour with progress stalled |
Fixes
Partition pruning failure
If EXPLAIN indexes = 1 shows the query is scanning all partitions, rewrite the WHERE clause to include the partition key. ClickHouse prunes only when the filter is explicit and matches the partition expression. If the application generates dynamic ranges, ensure the range is pushed into the query rather than applied after a transformation that hides the partition key.
Tradeoff: Adding the partition key filter may require application changes, but it is the only way to avoid full scans on large tables.
High part count
If system.parts shows the target partition has hundreds of active parts, latency will suffer until parts are merged. You can reduce the incoming part creation rate by batching inserts. You can also run OPTIMIZE TABLE, but this is I/O-intensive and can make the incident worse if disk is already saturated. If mutations are consuming merge threads, you can kill non-critical mutations so regular merges can proceed.
Tradeoff: KILL MUTATION aborts work that must be retried later. OPTIMIZE TABLE uses significant I/O. Many small inserts make the problem worse.
Cold cache after restart
If the mark cache is cold and latency is elevated post-restart, the only fix is time and query traffic. If you have a known critical query set, run them once to populate the mark cache before routing full production traffic. Do not restart the server again; that resets the warmup.
Tradeoff: Pre-warming consumes I/O upfront. Schedule it during low-traffic windows.
Spill-to-disk
If a query is spilling because it exceeds memory limits, you have three options: increase the per-query memory limit if headroom exists, rewrite the query to reduce cardinality, or accept the spill latency if the result set is genuinely large.
Tradeoff: Raising memory limits avoids spill but risks server OOM if headroom is overestimated. Rewriting queries requires developer time but is the safest path.
Resource contention during incidents
If broad latency spikes correlate with high disk await or memory near limits, identify the heaviest query in system.processes and kill it:
KILL QUERY WHERE query_id = '...';
Then check whether merges are monopolizing I/O. If merges are the problem, do not kill them; throttle incoming query load temporarily so merges can complete. Use client-side rate limiting or the load balancer, not a server restart.
Tradeoff: Killing a runaway query restores resources immediately but fails the upstream request. Throttling query load is slower but safer.
Prevention
- Monitor part count at the partition level, not just the table level. Alert when any partition crosses 100 active parts.
- Set per-query memory limits (
max_memory_usage) for non-admin users so a single query cannot exhaust the server. - Review query patterns in staging with
EXPLAIN indexes = 1before they reach production. Verify partition pruning and estimateread_rows. - Monitor the merge-to-insert ratio. If part creation rate chronically exceeds merge completion rate, latency will degrade before inserts are rejected.
- Configure TTL and monitor disk space growth so storage pressure does not stall merges and indirectly slow queries.
How Netdata helps
- Correlates ClickHouse query P99 latency with per-process CPU, disk await, and memory on the same timeline to separate plan problems from resource problems.
- Shows
system.query_log-derived latency charts alongsidesystem.partsactive part counts so merge debt is visible as read performance degrades. - Tracks mark cache hit rates and
MEMORY_LIMIT_EXCEEDEDerror counts. - Monitors live query concurrency and background merge activity.
- Alerts on disk I/O saturation and memory pressure before spill-to-disk events dominate latency.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse detached parts piling up: reading system.detached_parts and reclaiming space
- ClickHouse disk space collapse: why merges need free space and how the spiral starts
- ClickHouse disk space monitoring: free_space, unreserved_space, and the 80% target
- ClickHouse distributed DDL stuck: ON CLUSTER queries that never finish
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse cannot connect to ZooKeeper/Keeper: diagnosing the coordination layer
- ClickHouse Keeper latency high: the early warning before sessions expire
- ClickHouse Keeper saturation spiral: too many tables, DDL storms, and cluster freeze







