ClickHouse full table scan: partition pruning failures and the primary key
A query that returned in milliseconds yesterday now takes minutes and saturates CPU. Check whether it is reading every row before you add vCPUs or shards. A missing filter on the partition key or a mismatch with the primary key prefix forces ClickHouse to open every part and scan every granule. Symptoms are runaway read_rows counts, tail latency spikes, and sustained CPU decompression that hardware scaling cannot fix.
ClickHouse is not a row-store with dense B-tree indexes. It uses a sparse primary index built from the ORDER BY key and directory-level pruning via the PARTITION BY clause. When these are unused, the engine reads all column files, decompresses every granule, and filters row by row. system.query_log will show read_rows near the total table row count. The fix is in the query shape, not cluster size.
Start by looking for queries where read_rows equals or exceeds the target table’s row count. That ratio is the smoking gun.
What this means
ClickHouse stores data in immutable parts on disk. Each table has two layers of skipping logic.
The partition key (PARTITION BY) determines which parts contain relevant data. If the WHERE clause includes the partition key column, ClickHouse prunes entire directories before opening files. Without it, every active part is a candidate.
Inside each part, the sparse primary index (the ORDER BY key) stores one entry per granule. If the WHERE clause filters on the first column of the ORDER BY key, ClickHouse narrows the granule range. If the filter skips the first column but uses the second, the index cannot narrow the range because the data is sorted lexicographically by the full key prefix. The result is a full granule scan across all selected parts.
A full table scan is a silent performance failure, not a syntax error. It shows up as high read_rows, elevated CPU from decompression, and disk I/O that looks like saturation but is wasted work.
flowchart TD
A[Query arrives] --> B{WHERE has partition key filter?}
B -->|No| C[Open all parts]
B -->|Yes| D[Skip irrelevant partitions]
C --> E{WHERE has primary key prefix filter?}
D --> E
E -->|No| F[Scan all granules in selected parts]
E -->|Yes| G[Skip granules using sparse index]
F --> H[High read_rows, slow latency]
G --> I[Low read_rows, fast latency]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Missing partition key filter | read_rows equals total table rows; latency spikes on time-range queries | SHOW CREATE TABLE to confirm PARTITION BY column |
| Primary key prefix skipped | WHERE filters a later ORDER BY column without filtering the first; index is unused | SHOW CREATE TABLE to confirm ORDER BY key order |
| Function or expression on key column | WHERE wraps the partition or primary key in a function or cast | EXPLAIN indexes = 1 to verify key conditions |
| Data type mismatch | Date compared to DateTime or string literal compared to integer; equality looks correct but types differ | Schema column types versus query literals |
| OR conditions across key columns | WHERE a = 1 OR b = 2 prevents the sparse index from narrowing the granule range | EXPLAIN indexes = 1 shows wide granule scans |
Quick checks
Run these safe, read-only checks during an incident to confirm whether a full table scan is the culprit.
-- Check table engine, partition key, and primary key order
SHOW CREATE TABLE db.table;
-- Find recent queries that read an enormous number of rows
SELECT
query_id,
substring(query, 1, 120) AS query_prefix,
read_rows,
query_duration_ms / 1000 AS duration_sec
FROM system.query_log
WHERE type = 'QueryFinish'
AND event_time > now() - INTERVAL 10 MINUTE
ORDER BY query_duration_ms DESC
LIMIT 10;
-- See currently running heavy queries and their row progress
SELECT
query_id,
elapsed,
read_rows,
substring(query, 1, 120) AS query_prefix
FROM system.processes
WHERE read_rows > 0
ORDER BY read_rows DESC
LIMIT 10;
-- Total rows for comparison against read_rows
SELECT count() FROM db.table;
-- Row count per partition from system.parts (faster on very large tables)
SELECT partition, sum(rows) AS rows
FROM system.parts
WHERE database = 'db' AND table = 'table' AND active
GROUP BY partition
ORDER BY partition;
-- Verify whether ClickHouse uses partition and primary key indexes
EXPLAIN indexes = 1
SELECT * FROM db.table WHERE ...;
How to diagnose it
- Find the slow query in
system.processesorsystem.query_log. Notequery_id,read_rows, and duration. - Compare
read_rowsto the table row count. A ratio above 0.8 usually means a full table or full partition scan. - Run
SHOW CREATE TABLE. Record thePARTITION BYandORDER BYcolumns exactly. - Check the
WHEREclause for the partition key. If it is absent or wrapped in a function, pruning failed. - Check the
WHEREclause against theORDER BYprefix. If the first key column is missing or transformed, the sparse index is unused. - Run
EXPLAIN indexes = 1. If the output shows all parts and most granules are read, the index is not helping. - Verify literal types. A string literal compared to a
DateTimecolumn, or an implicit cast, prevents index matching.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Query latency P99 | Tail latency reveals scans before averages degrade | P99 > 2x baseline for > 15 minutes |
read_rows per query vs table size | Direct evidence of a full table scan | Recurring queries reading > 80% of total rows |
| Long-running queries count | Full scans hold threads and memory | Queries running > 10x expected duration |
| CPU utilization | Full scans decompress every row, driving CPU load | CPU > 80% sustained without query volume increase |
| Active part count per partition | More parts amplify the cost when pruning fails | Per-partition part count > 100 and rising |
Fixes
Add a partition key filter
Rewrite the query to include the raw partition key column in the WHERE clause. Do not wrap it in a function. If the partition key is an expression such as toYYYYMM(event_time), filter on the expression exactly, or on a raw timestamp range that matches the target partition.
Tradeoff: This may require application changes to generate the partition filter.
Align with the primary key prefix
Ensure the WHERE clause filters the first column of the ORDER BY key, then the second, and so on. If your key is (project_id, event_time) but queries only filter on event_time, ClickHouse cannot use the sparse index to skip granules. Reordering the primary key requires recreating the table and re-inserting data.
WARNING: Recreating the table to change the ORDER BY key is destructive. Plan for downtime and a full data migration.
Tradeoff: If the access pattern is fixed, consider whether the table design matches the workload.
Remove functions and casts on key columns
If the WHERE clause transforms the partition or primary key column, move the logic to the literal side. For example, rewrite toDate(event_time) = '2024-01-01' as event_time >= '2024-01-01 00:00:00' AND event_time < '2024-01-02 00:00:00'.
Tradeoff: This may require storing denormalized columns or changing how literals are passed.
Match data types exactly
Ensure the literal in the WHERE clause matches the column type. Comparing a Date to a DateTime literal or an Int64 to a string prevents index matching even when the values appear equivalent.
Tradeoff: Usually a minor query rewrite. If the driver sends strings for all parameters, explicitly cast the literal or adjust driver configuration.
Prevention
Review new queries in CI for PARTITION BY and ORDER BY alignment before they reach production. Monitor the read_rows-to-table-size ratio for recurring query patterns and alert when it jumps. When designing tables, place the most selective and frequently filtered column first in the ORDER BY key, and keep the PARTITION BY expression simple so time-range queries can target it directly. Do not add hardware to fix latency before confirming pruning and index usage are correct.
How Netdata helps
- Use query latency charts alongside CPU and disk read throughput to distinguish a full table scan from connection saturation or queue backlog.
- Monitor long-running query counts to catch scans before they cascade into memory pressure.
- Alert on P99 query latency deviations that precede visible merge or part-count crises.
- Surface disk read throughput spikes that accompany full scans on large tables.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse detached parts piling up: reading system.detached_parts and reclaiming space
- ClickHouse disk space collapse: why merges need free space and how the spiral starts
- ClickHouse disk space monitoring: free_space, unreserved_space, and the 80% target
- ClickHouse distributed DDL stuck: ON CLUSTER queries that never finish
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse cannot connect to ZooKeeper/Keeper: diagnosing the coordination layer
- ClickHouse Keeper latency high: the early warning before sessions expire
- ClickHouse Keeper saturation spiral: too many tables, DDL storms, and cluster freeze







