ClickHouse query latency P99 spikes: tail latency, hot shards, and cold data
Your ClickHouse cluster shows healthy average query latency, but users report intermittent timeouts. P99 spikes while P50 stays flat. This divergence is tail latency: a subset of queries is dramatically slower. In distributed clusters, the culprit is often a single hot shard or a replica serving stale data. In tiered storage, a query touching cold S3 data can be one to two orders of magnitude slower.
ClickHouse tail latency does not degrade gracefully. A slow shard in a distributed query forces the coordinator to wait, inflating latency for every query that touches it. Cold cache after restart, part accumulation, and background merge I/O all produce the same symptom. The sections below distinguish the causes.
What this means
In ClickHouse, query_duration_ms from system.query_log is the canonical source for latency analysis. P50 represents typical performance; P99 represents user experience under stress. When they diverge, performance is inconsistent. The primary drivers are hot shards in distributed queries, cold data access in tiered storage, and resource contention on CPU, memory, or I/O.
A hot shard occurs when one replica or shard runs slower than its peers. Because distributed queries fan out sub-queries and wait for all shards to return, the slowest shard dictates total latency. Cluster-average metrics mask this entirely. Cold S3-tiered data produces random slow queries 10 to 100 times slower than hot local data, appearing as tail latency without pattern. Mark cache warmup after restart is expected and produces temporary latency elevation for 5 to 30 minutes.
flowchart TD
A[P99 spike detected] --> B{P50 stable?}
B -->|Yes| C[Tail latency issue]
B -->|No| D[Systemic degradation]
C --> E{Correlate with restart?}
E -->|Yes| F[Mark cache warmup]
E -->|No| G{Per-shard skew?}
G -->|Yes| H[Hot shard or slow replica]
G -->|No| I[Cold data or contention]
D --> J[Check CPU memory disk saturation]
I --> K[Check parts count merges memory]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Hot shard / slow replica | P99 spikes on distributed queries; one shard’s latency far exceeds others | Per-shard system.query_log latency; system.replicas for lag |
| Cold S3-tiered data | Random slow queries with no resource saturation; 10-100x latency jumps | Query patterns hitting remote storage tiers |
| Mark cache cold after restart | Elevated latency for all queries post-restart; improves over minutes | Server uptime and cache hit rates |
| Active part accumulation | Gradual P99 rise; more files per query | system.parts count per partition |
| Merge I/O starvation | Queries slow while background merges run; disk I/O saturated | system.merges progress and disk await |
| Memory pressure / spill-to-disk | Queries killed or extreme latency; temp directory growth | MemoryTracking and /var/lib/clickhouse/tmp/ |
| Distributed query amplification | Coordinator memory spikes or extreme latency; network spikes between shards | system.processes for GLOBAL IN/JOIN patterns |
Quick checks
-- Check P50 vs P99 divergence over the last hour
SELECT
toStartOfFiveMinutes(event_time) AS window,
quantile(0.50)(query_duration_ms / 1000) AS p50_seconds,
quantile(0.99)(query_duration_ms / 1000) AS p99_seconds,
count() AS query_count
FROM system.query_log
WHERE type = 'QueryFinish'
AND event_time > now() - INTERVAL 1 HOUR
AND query_kind = 'Select'
GROUP BY window
ORDER BY window DESC;
-- List currently running queries and wall-clock time
SELECT
query_id,
user,
elapsed,
formatReadableSize(memory_usage) AS mem_used,
read_rows,
substring(query, 1, 200) AS query_prefix
FROM system.processes
WHERE elapsed > 60
ORDER BY elapsed DESC
LIMIT 20;
-- Active parts per partition
SELECT
database,
table,
partition_id,
count() AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 20;
-- Mark cache hit rate
SELECT
event,
value
FROM system.events
WHERE event IN ('MarkCacheHits', 'MarkCacheMisses');
-- Memory pressure and tracked usage
SELECT metric, value
FROM system.metrics
WHERE metric = 'MemoryTracking';
-- Replication lag and replica state
SELECT
database,
table,
absolute_delay,
queue_size,
is_readonly,
is_session_expired
FROM system.replicas;
# Disk I/O latency and utilization
iostat -xz 1 5
How to diagnose it
- Confirm tail latency. Run the P50/P99 query against
system.query_log. If P99 is more than 2x P50 for more than 15 minutes, you have tail latency rather than systemic degradation. - Correlate with restart or deployment. If the spike began immediately after a restart, expect mark cache warmup. Check the
Uptimemetric insystem.metricsand mark cache hit rates. If the hit rate is low and improving, allow 5 to 30 minutes before intervening. - Identify a hot shard. Compare per-shard query latency from each node’s
system.query_log. If one shard’s P99 is significantly higher than others, inspect that shard’s replication lag (absolute_delay), background pool utilization, and long-running queries insystem.processes. - Check for cold data access. If queries are touching S3-tiered storage, latency can jump 10-100x. Review query patterns for table access against cold volumes. No single
systemtable flags cold reads directly; correlate slow query times with tiered storage configuration and access patterns. - Inspect resource contention. Check active parts per partition. If counts are above 100, queries are opening more files. Check
system.mergesfor I/O competition. Runiostatto confirm disk saturation. - Check for memory pressure and spill-to-disk. If
MemoryTrackingis near limits and/var/lib/clickhouse/tmp/is growing, queries are spilling to disk. This turns memory pressure into I/O pressure. - Review distributed query patterns. In
system.processes, look forGLOBAL IN,GLOBAL JOIN, or missing shard-key filters. These fan out to all shards and can spike coordinator memory and network, inflating latency.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Query latency P99 vs P50 | Divergence reveals tail latency issues | P99 > 2x baseline P50 for > 15 min |
| Per-shard query latency | Hot shards drag distributed query times | One shard P99 » cluster P99 |
| Mark cache hit rate | Cache misses cause disk seeks for index granules | Sustained miss rate with no improvement |
| Active parts per partition | More parts means more files and seeks per query | > 100 parts per partition |
| Background merge pool utilization | Merge I/O competes with query I/O | > 90% sustained with rising parts |
| Disk I/O await | Saturated storage slows reads and merges | > 20 ms on SSD, > 100 ms on spinning disk |
| MemoryTracking | Spill-to-disk or query kills increase latency | > 80% of max_server_memory_usage |
Replication lag (absolute_delay) | Stale replicas or slow catch-up affect shard performance | > 120 seconds sustained |
Fixes
Hot shard or slow replica
Route read traffic away from the lagging replica if your proxy or driver supports it. On the affected node, identify the cause: runaway query, replication queue buildup, or disk saturation. If a single query is dominating resources, kill it:
KILL QUERY WHERE query_id = '...';
WARNING: Killing queries interrupts clients and can leave partial results in distributed pipelines. Coordinate with application owners when possible.
If replication is stuck, see ClickHouse detached parts piling up and ClickHouse Keeper latency high.
Cold S3-tiered data
If tail latency stems from cold storage access, reduce the working set that hits remote tiers. Tune TTL and move rules to keep hot data on local disks. Warm the cache by running representative queries after restarts or cache evictions. Consider increasing local cache size if the hot working set fits.
Mark cache warmup after restart
This is expected behavior. Do not restart the server again. If possible, pre-warm caches by running a representative query set before routing production traffic. If latency remains elevated beyond 30 minutes, investigate other causes.
Part accumulation and merge contention
If active parts per partition are high, merges are falling behind. Reduce query load temporarily to free I/O for merges, or add disk I/O capacity. Do not restart the server. See ClickHouse active part count growing.
Memory pressure and spill-to-disk
Kill the largest memory consumer from system.processes. Check for retry amplification from clients reissuing killed queries. If spill-to-disk is chronic, raise max_bytes_before_external_group_by to allow more in-memory aggregation, but only as a last resort: it increases memory pressure and OOM risk. Lower max_concurrent_queries temporarily to cap aggregate memory use.
Distributed query amplification
Kill the offending query in system.processes. Review the query for GLOBAL IN or GLOBAL JOIN against non-distributed tables. Add shard-key filters to the WHERE clause. Set distributed_product_mode appropriately to prevent cross-shard broadcast.
Prevention
Monitor per-shard P99 latency, not just cluster-level averages. Keep active parts per partition below 100. Maintain disk I/O headroom so merges do not starve queries. Set per-query memory limits and monitor MemoryTracking against server limits. If you use tiered storage, track which queries touch cold data and size your local cache accordingly. After any restart, allow mark cache warmup before declaring the cluster fully healthy. Configure TTL and move policies to prevent cold data from being accessed in latency-sensitive paths.
How Netdata helps
- Publishes ClickHouse P99 latency alongside per-second CPU, memory, and disk I/O metrics to expose resource contention.
- Tracks per-node query latency when monitoring individual shards, surfacing hot shards before they dominate distributed queries.
- Charts mark cache size, part count trends, and background pool utilization alongside latency.
- Alerts on disk I/O saturation and memory pressure that precede query latency spikes.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse disk space collapse: why merges need free space and how the spiral starts
- ClickHouse disk space monitoring: free_space, unreserved_space, and the 80% target
- ClickHouse full table scan: partition pruning failures and the primary key
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Keeper latency high: the early warning before sessions expire
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it







