ClickHouse slow queries: diagnosis from query_log to plan to fix

P99 latency climbs before averages move. In ClickHouse, tail latency spikes usually mean a subset of queries is hitting cold data, scanning too many parts, or spilling to disk. The goal is to separate plan problems from resource problems fast, then confirm the fix before the next batch of queries arrives.

What this means

Slow queries show up as rising query_duration_ms in system.query_log for QueryFinish events, or rising elapsed in system.processes. ClickHouse latency is sensitive to how many parts a query opens, whether the mark cache is warm, and whether the pipeline stays in memory.

PatternLikely cause
High latency, low CPUI/O wait from too many parts or cold mark cache
High latency, high memorySpill-to-disk during GROUP BY or ORDER BY
High latency on one table after restartMark cache warmup
Broad latency increase across all queriesResource contention or system-wide cache cold start

Common causes

CauseWhat it looks likeFirst thing to check
Missing partition pruningLatency jumps on one table; read_rows is far larger than the date range impliessystem.query_log.read_rows for the slow query vs. a fast baseline
High active part count per partitionGeneral degradation across many queries on the same tableActive parts in system.parts for the target partition
Cold mark cache or OS page cacheSpike right after restart or failover; improves over minutes to hoursMark cache hit rate from system.events
Spill-to-disk during GROUP BY or ORDER BYSingle query with very high peak_memory_usage and disk I/O risingsystem.processes.memory_usage, temp directory growth
Resource contentionBroad latency increase; merges and queries compete for disk or RAMAggregate memory in system.processes, OS disk metrics, system.merges

Quick checks

Run these read-only checks to orient yourself during an incident.

-- P99 tail latency trend over the last hour
SELECT
    toStartOfFiveMinutes(event_time) AS window,
    quantile(0.99)(query_duration_ms / 1000) AS p99_seconds,
    quantile(0.50)(query_duration_ms / 1000) AS p50_seconds,
    count() AS query_count
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
  AND query_kind = 'Select'
GROUP BY window
ORDER BY window DESC;
-- Live queries running longer than 60 seconds
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS mem_used,
    read_rows,
    substring(query, 1, 200) AS query_prefix
FROM system.processes
WHERE elapsed > 60
ORDER BY elapsed DESC
LIMIT 20;
-- Top memory consumers among running queries
SELECT
    query_id,
    user,
    formatReadableSize(memory_usage) AS current_mem,
    formatReadableSize(peak_memory_usage) AS peak_mem,
    elapsed,
    read_rows
FROM system.processes
WHERE memory_usage > 0
ORDER BY memory_usage DESC
LIMIT 20;
-- Active part count per partition for the target table
SELECT
    database,
    table,
    partition_id,
    count(*) AS parts_in_partition
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY parts_in_partition DESC
LIMIT 20;
-- Mark cache hit rate
SELECT
    sumIf(value, event = 'MarkCacheHits') AS hits,
    sumIf(value, event = 'MarkCacheMisses') AS misses,
    round(hits * 100.0 / (hits + misses + 1), 2) AS hit_rate_pct
FROM system.events
WHERE event IN ('MarkCacheHits', 'MarkCacheMisses');
-- Recent slow queries with read volume
SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    substring(query, 1, 200) AS query_prefix
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 10 MINUTE
ORDER BY query_duration_ms DESC
LIMIT 10;
# Temp directory size for spill detection.
# Path depends on the ClickHouse tmp_path setting; du can be slow on large directories.
du -sh /var/lib/clickhouse/tmp/

How to diagnose it

  1. Confirm scope from query_log. Filter system.query_log for type = 'QueryFinish' and query_kind = 'Select'. Compare current P99 against the same window yesterday. If only one query hash or one table is affected, suspect a plan or cache problem. If all queries are slow, suspect resource contention or a system-wide cache cold start.

  2. Identify the culprit query. If the query is still running, use system.processes ordered by elapsed DESC. Capture query_id, read_rows, and memory_usage. If it already finished, pull the full query text from system.query_log where query_duration_ms is an outlier.

  3. Check partition pruning. Compare read_rows in the slow execution against a known fast execution of the same logical query. If read_rows is orders of magnitude larger, the query is likely doing a full scan. Use EXPLAIN indexes = 1 SELECT ... to verify the partition key appears in the filter and that ClickHouse is pruning rather than falling back to a full scan.

  4. Check part count for the target table. Query system.parts with active = 1 grouped by partition_id. If the relevant partition has hundreds of active parts, every query must open more files and perform more index lookups. This degrades latency even when the plan is correct. See ClickHouse active part count growing for the merge-side fix.

  5. Check cache state. Compute the mark cache hit rate from MarkCacheHits and MarkCacheMisses in system.events. A rate below 80 percent means queries are reading mark files from disk. After a restart, this is expected and recovers with traffic. If it stays low, the working set may exceed cache size or merges may be churning the cache.

  6. Check for spill-to-disk. If the query has high peak_memory_usage and the server is near memory limits, ClickHouse may spill aggregation or sort data to disk. Check the temp directory for growth during the query. Spill saves the query from cancellation but trades memory for disk I/O and latency.

  7. Correlate with resource contention. Check system.merges for active merges on the same table. Background merges are I/O-intensive and compete with queries. Check OS disk latency with iostat -xz 1 5. If system.processes shows many long-running queries but CPU is moderate, queries are likely I/O-bound from cold cache or merge interference.

flowchart TD
    A[P99 latency spike] --> B{Scope from query_log}
    B -->|One query/table| C[Check plan: EXPLAIN + read_rows]
    B -->|All queries| D[Check system-wide resources]
    C --> E{Partition pruning ok?}
    E -->|No| F[Fix WHERE clause / partition key]
    E -->|Yes| G[Check parts count + cache]
    G --> H{Parts high or cache cold?}
    H -->|Parts high| I[Address merge debt]
    H -->|Cache cold| J[Wait warmup / pre-warm marks]
    D --> K[Check memory, disk I/O, merges]
    K --> L{Spill or saturation?}
    L -->|Spill| M[Tune memory or rewrite query]
    L -->|Saturation| N[Kill runaway query or throttle load]

Metrics and signals to monitor

SignalWhy it mattersWarning sign
system.query_log.query_duration_ms P99First user-facing indicator of tail degradationSustained P99 > 2x baseline for > 15 minutes
system.processes.elapsedLive stuck queriesAny query > 10x expected duration or > 30 minutes
system.processes.memory_usageQueries that may spill or starve othersSingle query > 50% of max_server_memory_usage
Active parts per partitionMore parts means more files and seeks> 100 active parts per partition
Mark cache hit rateCold marks force disk seeksSustained < 80% outside of restart windows
Disk I/O awaitMerges and queries compete for bandwidth> 20 ms on SSD or > 100 ms on spinning disk
system.merges activityMerges explain I/O noise and temporary slowdownElapsed > 1 hour with progress stalled

Fixes

Partition pruning failure

If EXPLAIN indexes = 1 shows the query is scanning all partitions, rewrite the WHERE clause to include the partition key. ClickHouse prunes only when the filter is explicit and matches the partition expression. If the application generates dynamic ranges, ensure the range is pushed into the query rather than applied after a transformation that hides the partition key.

Tradeoff: Adding the partition key filter may require application changes, but it is the only way to avoid full scans on large tables.

High part count

If system.parts shows the target partition has hundreds of active parts, latency will suffer until parts are merged. You can reduce the incoming part creation rate by batching inserts. You can also run OPTIMIZE TABLE, but this is I/O-intensive and can make the incident worse if disk is already saturated. If mutations are consuming merge threads, you can kill non-critical mutations so regular merges can proceed.

Tradeoff: KILL MUTATION aborts work that must be retried later. OPTIMIZE TABLE uses significant I/O. Many small inserts make the problem worse.

Cold cache after restart

If the mark cache is cold and latency is elevated post-restart, the only fix is time and query traffic. If you have a known critical query set, run them once to populate the mark cache before routing full production traffic. Do not restart the server again; that resets the warmup.

Tradeoff: Pre-warming consumes I/O upfront. Schedule it during low-traffic windows.

Spill-to-disk

If a query is spilling because it exceeds memory limits, you have three options: increase the per-query memory limit if headroom exists, rewrite the query to reduce cardinality, or accept the spill latency if the result set is genuinely large.

Tradeoff: Raising memory limits avoids spill but risks server OOM if headroom is overestimated. Rewriting queries requires developer time but is the safest path.

Resource contention during incidents

If broad latency spikes correlate with high disk await or memory near limits, identify the heaviest query in system.processes and kill it:

KILL QUERY WHERE query_id = '...';

Then check whether merges are monopolizing I/O. If merges are the problem, do not kill them; throttle incoming query load temporarily so merges can complete. Use client-side rate limiting or the load balancer, not a server restart.

Tradeoff: Killing a runaway query restores resources immediately but fails the upstream request. Throttling query load is slower but safer.

Prevention

  • Monitor part count at the partition level, not just the table level. Alert when any partition crosses 100 active parts.
  • Set per-query memory limits (max_memory_usage) for non-admin users so a single query cannot exhaust the server.
  • Review query patterns in staging with EXPLAIN indexes = 1 before they reach production. Verify partition pruning and estimate read_rows.
  • Monitor the merge-to-insert ratio. If part creation rate chronically exceeds merge completion rate, latency will degrade before inserts are rejected.
  • Configure TTL and monitor disk space growth so storage pressure does not stall merges and indirectly slow queries.

How Netdata helps

  • Correlates ClickHouse query P99 latency with per-process CPU, disk await, and memory on the same timeline to separate plan problems from resource problems.
  • Shows system.query_log-derived latency charts alongside system.parts active part counts so merge debt is visible as read performance degrades.
  • Tracks mark cache hit rates and MEMORY_LIMIT_EXCEEDED error counts.
  • Monitors live query concurrency and background merge activity.
  • Alerts on disk I/O saturation and memory pressure before spill-to-disk events dominate latency.