ClickHouse full table scan: partition pruning failures and the primary key

A query that returned in milliseconds yesterday now takes minutes and saturates CPU. Check whether it is reading every row before you add vCPUs or shards. A missing filter on the partition key or a mismatch with the primary key prefix forces ClickHouse to open every part and scan every granule. Symptoms are runaway read_rows counts, tail latency spikes, and sustained CPU decompression that hardware scaling cannot fix.

ClickHouse is not a row-store with dense B-tree indexes. It uses a sparse primary index built from the ORDER BY key and directory-level pruning via the PARTITION BY clause. When these are unused, the engine reads all column files, decompresses every granule, and filters row by row. system.query_log will show read_rows near the total table row count. The fix is in the query shape, not cluster size.

Start by looking for queries where read_rows equals or exceeds the target table’s row count. That ratio is the smoking gun.

What this means

ClickHouse stores data in immutable parts on disk. Each table has two layers of skipping logic.

The partition key (PARTITION BY) determines which parts contain relevant data. If the WHERE clause includes the partition key column, ClickHouse prunes entire directories before opening files. Without it, every active part is a candidate.

Inside each part, the sparse primary index (the ORDER BY key) stores one entry per granule. If the WHERE clause filters on the first column of the ORDER BY key, ClickHouse narrows the granule range. If the filter skips the first column but uses the second, the index cannot narrow the range because the data is sorted lexicographically by the full key prefix. The result is a full granule scan across all selected parts.

A full table scan is a silent performance failure, not a syntax error. It shows up as high read_rows, elevated CPU from decompression, and disk I/O that looks like saturation but is wasted work.

flowchart TD
    A[Query arrives] --> B{WHERE has partition key filter?}
    B -->|No| C[Open all parts]
    B -->|Yes| D[Skip irrelevant partitions]
    C --> E{WHERE has primary key prefix filter?}
    D --> E
    E -->|No| F[Scan all granules in selected parts]
    E -->|Yes| G[Skip granules using sparse index]
    F --> H[High read_rows, slow latency]
    G --> I[Low read_rows, fast latency]

Common causes

CauseWhat it looks likeFirst thing to check
Missing partition key filterread_rows equals total table rows; latency spikes on time-range queriesSHOW CREATE TABLE to confirm PARTITION BY column
Primary key prefix skippedWHERE filters a later ORDER BY column without filtering the first; index is unusedSHOW CREATE TABLE to confirm ORDER BY key order
Function or expression on key columnWHERE wraps the partition or primary key in a function or castEXPLAIN indexes = 1 to verify key conditions
Data type mismatchDate compared to DateTime or string literal compared to integer; equality looks correct but types differSchema column types versus query literals
OR conditions across key columnsWHERE a = 1 OR b = 2 prevents the sparse index from narrowing the granule rangeEXPLAIN indexes = 1 shows wide granule scans

Quick checks

Run these safe, read-only checks during an incident to confirm whether a full table scan is the culprit.

-- Check table engine, partition key, and primary key order
SHOW CREATE TABLE db.table;
-- Find recent queries that read an enormous number of rows
SELECT
    query_id,
    substring(query, 1, 120) AS query_prefix,
    read_rows,
    query_duration_ms / 1000 AS duration_sec
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 10 MINUTE
ORDER BY query_duration_ms DESC
LIMIT 10;
-- See currently running heavy queries and their row progress
SELECT
    query_id,
    elapsed,
    read_rows,
    substring(query, 1, 120) AS query_prefix
FROM system.processes
WHERE read_rows > 0
ORDER BY read_rows DESC
LIMIT 10;
-- Total rows for comparison against read_rows
SELECT count() FROM db.table;
-- Row count per partition from system.parts (faster on very large tables)
SELECT partition, sum(rows) AS rows
FROM system.parts
WHERE database = 'db' AND table = 'table' AND active
GROUP BY partition
ORDER BY partition;
-- Verify whether ClickHouse uses partition and primary key indexes
EXPLAIN indexes = 1
SELECT * FROM db.table WHERE ...;

How to diagnose it

  1. Find the slow query in system.processes or system.query_log. Note query_id, read_rows, and duration.
  2. Compare read_rows to the table row count. A ratio above 0.8 usually means a full table or full partition scan.
  3. Run SHOW CREATE TABLE. Record the PARTITION BY and ORDER BY columns exactly.
  4. Check the WHERE clause for the partition key. If it is absent or wrapped in a function, pruning failed.
  5. Check the WHERE clause against the ORDER BY prefix. If the first key column is missing or transformed, the sparse index is unused.
  6. Run EXPLAIN indexes = 1. If the output shows all parts and most granules are read, the index is not helping.
  7. Verify literal types. A string literal compared to a DateTime column, or an implicit cast, prevents index matching.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Query latency P99Tail latency reveals scans before averages degradeP99 > 2x baseline for > 15 minutes
read_rows per query vs table sizeDirect evidence of a full table scanRecurring queries reading > 80% of total rows
Long-running queries countFull scans hold threads and memoryQueries running > 10x expected duration
CPU utilizationFull scans decompress every row, driving CPU loadCPU > 80% sustained without query volume increase
Active part count per partitionMore parts amplify the cost when pruning failsPer-partition part count > 100 and rising

Fixes

Add a partition key filter

Rewrite the query to include the raw partition key column in the WHERE clause. Do not wrap it in a function. If the partition key is an expression such as toYYYYMM(event_time), filter on the expression exactly, or on a raw timestamp range that matches the target partition.

Tradeoff: This may require application changes to generate the partition filter.

Align with the primary key prefix

Ensure the WHERE clause filters the first column of the ORDER BY key, then the second, and so on. If your key is (project_id, event_time) but queries only filter on event_time, ClickHouse cannot use the sparse index to skip granules. Reordering the primary key requires recreating the table and re-inserting data.

WARNING: Recreating the table to change the ORDER BY key is destructive. Plan for downtime and a full data migration.

Tradeoff: If the access pattern is fixed, consider whether the table design matches the workload.

Remove functions and casts on key columns

If the WHERE clause transforms the partition or primary key column, move the logic to the literal side. For example, rewrite toDate(event_time) = '2024-01-01' as event_time >= '2024-01-01 00:00:00' AND event_time < '2024-01-02 00:00:00'.

Tradeoff: This may require storing denormalized columns or changing how literals are passed.

Match data types exactly

Ensure the literal in the WHERE clause matches the column type. Comparing a Date to a DateTime literal or an Int64 to a string prevents index matching even when the values appear equivalent.

Tradeoff: Usually a minor query rewrite. If the driver sends strings for all parameters, explicitly cast the literal or adjust driver configuration.

Prevention

Review new queries in CI for PARTITION BY and ORDER BY alignment before they reach production. Monitor the read_rows-to-table-size ratio for recurring query patterns and alert when it jumps. When designing tables, place the most selective and frequently filtered column first in the ORDER BY key, and keep the PARTITION BY expression simple so time-range queries can target it directly. Do not add hardware to fix latency before confirming pruning and index usage are correct.

How Netdata helps

  • Use query latency charts alongside CPU and disk read throughput to distinguish a full table scan from connection saturation or queue backlog.
  • Monitor long-running query counts to catch scans before they cascade into memory pressure.
  • Alert on P99 query latency deviations that precede visible merge or part-count crises.
  • Surface disk read throughput spikes that accompany full scans on large tables.