ClickHouse query latency P99 spikes: tail latency, hot shards, and cold data

Your ClickHouse cluster shows healthy average query latency, but users report intermittent timeouts. P99 spikes while P50 stays flat. This divergence is tail latency: a subset of queries is dramatically slower. In distributed clusters, the culprit is often a single hot shard or a replica serving stale data. In tiered storage, a query touching cold S3 data can be one to two orders of magnitude slower.

ClickHouse tail latency does not degrade gracefully. A slow shard in a distributed query forces the coordinator to wait, inflating latency for every query that touches it. Cold cache after restart, part accumulation, and background merge I/O all produce the same symptom. The sections below distinguish the causes.

What this means

In ClickHouse, query_duration_ms from system.query_log is the canonical source for latency analysis. P50 represents typical performance; P99 represents user experience under stress. When they diverge, performance is inconsistent. The primary drivers are hot shards in distributed queries, cold data access in tiered storage, and resource contention on CPU, memory, or I/O.

A hot shard occurs when one replica or shard runs slower than its peers. Because distributed queries fan out sub-queries and wait for all shards to return, the slowest shard dictates total latency. Cluster-average metrics mask this entirely. Cold S3-tiered data produces random slow queries 10 to 100 times slower than hot local data, appearing as tail latency without pattern. Mark cache warmup after restart is expected and produces temporary latency elevation for 5 to 30 minutes.

flowchart TD
    A[P99 spike detected] --> B{P50 stable?}
    B -->|Yes| C[Tail latency issue]
    B -->|No| D[Systemic degradation]
    C --> E{Correlate with restart?}
    E -->|Yes| F[Mark cache warmup]
    E -->|No| G{Per-shard skew?}
    G -->|Yes| H[Hot shard or slow replica]
    G -->|No| I[Cold data or contention]
    D --> J[Check CPU memory disk saturation]
    I --> K[Check parts count merges memory]

Common causes

CauseWhat it looks likeFirst thing to check
Hot shard / slow replicaP99 spikes on distributed queries; one shard’s latency far exceeds othersPer-shard system.query_log latency; system.replicas for lag
Cold S3-tiered dataRandom slow queries with no resource saturation; 10-100x latency jumpsQuery patterns hitting remote storage tiers
Mark cache cold after restartElevated latency for all queries post-restart; improves over minutesServer uptime and cache hit rates
Active part accumulationGradual P99 rise; more files per querysystem.parts count per partition
Merge I/O starvationQueries slow while background merges run; disk I/O saturatedsystem.merges progress and disk await
Memory pressure / spill-to-diskQueries killed or extreme latency; temp directory growthMemoryTracking and /var/lib/clickhouse/tmp/
Distributed query amplificationCoordinator memory spikes or extreme latency; network spikes between shardssystem.processes for GLOBAL IN/JOIN patterns

Quick checks

-- Check P50 vs P99 divergence over the last hour
SELECT
    toStartOfFiveMinutes(event_time) AS window,
    quantile(0.50)(query_duration_ms / 1000) AS p50_seconds,
    quantile(0.99)(query_duration_ms / 1000) AS p99_seconds,
    count() AS query_count
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
  AND query_kind = 'Select'
GROUP BY window
ORDER BY window DESC;
-- List currently running queries and wall-clock time
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS mem_used,
    read_rows,
    substring(query, 1, 200) AS query_prefix
FROM system.processes
WHERE elapsed > 60
ORDER BY elapsed DESC
LIMIT 20;
-- Active parts per partition
SELECT
    database,
    table,
    partition_id,
    count() AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 20;
-- Mark cache hit rate
SELECT
    event,
    value
FROM system.events
WHERE event IN ('MarkCacheHits', 'MarkCacheMisses');
-- Memory pressure and tracked usage
SELECT metric, value
FROM system.metrics
WHERE metric = 'MemoryTracking';
-- Replication lag and replica state
SELECT
    database,
    table,
    absolute_delay,
    queue_size,
    is_readonly,
    is_session_expired
FROM system.replicas;
# Disk I/O latency and utilization
iostat -xz 1 5

How to diagnose it

  1. Confirm tail latency. Run the P50/P99 query against system.query_log. If P99 is more than 2x P50 for more than 15 minutes, you have tail latency rather than systemic degradation.
  2. Correlate with restart or deployment. If the spike began immediately after a restart, expect mark cache warmup. Check the Uptime metric in system.metrics and mark cache hit rates. If the hit rate is low and improving, allow 5 to 30 minutes before intervening.
  3. Identify a hot shard. Compare per-shard query latency from each node’s system.query_log. If one shard’s P99 is significantly higher than others, inspect that shard’s replication lag (absolute_delay), background pool utilization, and long-running queries in system.processes.
  4. Check for cold data access. If queries are touching S3-tiered storage, latency can jump 10-100x. Review query patterns for table access against cold volumes. No single system table flags cold reads directly; correlate slow query times with tiered storage configuration and access patterns.
  5. Inspect resource contention. Check active parts per partition. If counts are above 100, queries are opening more files. Check system.merges for I/O competition. Run iostat to confirm disk saturation.
  6. Check for memory pressure and spill-to-disk. If MemoryTracking is near limits and /var/lib/clickhouse/tmp/ is growing, queries are spilling to disk. This turns memory pressure into I/O pressure.
  7. Review distributed query patterns. In system.processes, look for GLOBAL IN, GLOBAL JOIN, or missing shard-key filters. These fan out to all shards and can spike coordinator memory and network, inflating latency.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Query latency P99 vs P50Divergence reveals tail latency issuesP99 > 2x baseline P50 for > 15 min
Per-shard query latencyHot shards drag distributed query timesOne shard P99 » cluster P99
Mark cache hit rateCache misses cause disk seeks for index granulesSustained miss rate with no improvement
Active parts per partitionMore parts means more files and seeks per query> 100 parts per partition
Background merge pool utilizationMerge I/O competes with query I/O> 90% sustained with rising parts
Disk I/O awaitSaturated storage slows reads and merges> 20 ms on SSD, > 100 ms on spinning disk
MemoryTrackingSpill-to-disk or query kills increase latency> 80% of max_server_memory_usage
Replication lag (absolute_delay)Stale replicas or slow catch-up affect shard performance> 120 seconds sustained

Fixes

Hot shard or slow replica

Route read traffic away from the lagging replica if your proxy or driver supports it. On the affected node, identify the cause: runaway query, replication queue buildup, or disk saturation. If a single query is dominating resources, kill it:

KILL QUERY WHERE query_id = '...';

WARNING: Killing queries interrupts clients and can leave partial results in distributed pipelines. Coordinate with application owners when possible.

If replication is stuck, see ClickHouse detached parts piling up and ClickHouse Keeper latency high.

Cold S3-tiered data

If tail latency stems from cold storage access, reduce the working set that hits remote tiers. Tune TTL and move rules to keep hot data on local disks. Warm the cache by running representative queries after restarts or cache evictions. Consider increasing local cache size if the hot working set fits.

Mark cache warmup after restart

This is expected behavior. Do not restart the server again. If possible, pre-warm caches by running a representative query set before routing production traffic. If latency remains elevated beyond 30 minutes, investigate other causes.

Part accumulation and merge contention

If active parts per partition are high, merges are falling behind. Reduce query load temporarily to free I/O for merges, or add disk I/O capacity. Do not restart the server. See ClickHouse active part count growing.

Memory pressure and spill-to-disk

Kill the largest memory consumer from system.processes. Check for retry amplification from clients reissuing killed queries. If spill-to-disk is chronic, raise max_bytes_before_external_group_by to allow more in-memory aggregation, but only as a last resort: it increases memory pressure and OOM risk. Lower max_concurrent_queries temporarily to cap aggregate memory use.

Distributed query amplification

Kill the offending query in system.processes. Review the query for GLOBAL IN or GLOBAL JOIN against non-distributed tables. Add shard-key filters to the WHERE clause. Set distributed_product_mode appropriately to prevent cross-shard broadcast.

Prevention

Monitor per-shard P99 latency, not just cluster-level averages. Keep active parts per partition below 100. Maintain disk I/O headroom so merges do not starve queries. Set per-query memory limits and monitor MemoryTracking against server limits. If you use tiered storage, track which queries touch cold data and size your local cache accordingly. After any restart, allow mark cache warmup before declaring the cluster fully healthy. Configure TTL and move policies to prevent cold data from being accessed in latency-sensitive paths.

How Netdata helps

  • Publishes ClickHouse P99 latency alongside per-second CPU, memory, and disk I/O metrics to expose resource contention.
  • Tracks per-node query latency when monitoring individual shards, surfacing hot shards before they dominate distributed queries.
  • Charts mark cache size, part count trends, and background pool utilization alongside latency.
  • Alerts on disk I/O saturation and memory pressure that precede query latency spikes.