ClickHouse monitoring checklist: the signals every production cluster needs
ClickHouse failures usually begin as storage-structure debt: immutable parts accumulate faster than merges consolidate them, coordination sessions expire, or disk space drops below the threshold merges need to complete. Query latency degrades only after the crisis is hours old.
This checklist groups monitoring signals into four maturity levels. Level 1 is the minimum viable instrumentation to avoid data loss and unavailability. Each subsequent level adds leading indicators that catch part accumulation, replication divergence, and memory pressure while they are still reversible. The signals are drawn from ClickHouse system tables and OS-level metrics. They apply to single-node, sharded, and replicated setups; replicated tables add ZooKeeper/Keeper signals that belong in Level 2.
flowchart TD
L1["Level 1 - Survival"] --> L2["Level 2 - Operational"]
L2 --> L3["Level 3 - Mature"]
L3 --> L4["Level 4 - Expert"]
L1 --- S1["Liveness, disk space, part count, insert rejects, memory, replica health"]
L2 --- S2["Query latency, merge throughput, ZK latency, error codes, pool saturation"]
L3 --- S3["Per-partition skew, mutation queue, FD usage, cache hits, DDL queue"]
L4 --- S4["Merge duration trends, replica divergence, ZK metadata growth, system.errors"]Level 1 - Survival
These are the minimum signals to detect catastrophic failure before it becomes unrecoverable. Start here.
Server liveness via
/pingandSELECT 1. The/pingendpoint returnsOk.with HTTP 200 while the HTTP handler thread is alive, but it stops responding during metadata reload on startup and does not exercise the query pipeline. Pair it withSELECT 1for readiness probes, especially in containerized deployments.Disk free space from
system.disks. Queryfree_spaceandunreserved_spaceper configured disk. Keep operational usage below 80-85%. Ensure free space remains large enough for merges to rewrite the largest active part; operators commonly reserve 2x that size as a safety margin, though this varies by workload. Below this threshold, halted merges cause parts to pile up and consume more metadata space.Active part count via
system.partsandMaxPartCountForPartition. Every INSERT creates immutable parts; background merges consolidate them. If parts accumulate faster than merges complete, query performance degrades and inserts are eventually rejected. Querysystem.partswithactive = 1grouped bydatabase, table, partition_id:
SELECT database, table, partition_id, count() AS cnt
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
Alert when any partition exceeds 50% of parts_to_delay_insert, and watch MaxPartCountForPartition in system.asynchronous_metrics for server-wide trending.
Insert rejections via
system.query_logandsystem.events. Track insert failures whereexception_code = 252(TOO_MANY_PARTS) insystem.query_log. Correlate withDelayedInsertsinsystem.events, which fires earlier when the delay threshold is crossed. Any sustained non-zero rate of rejected inserts is a data-loss scenario.Server memory via
MemoryTrackinginsystem.metrics. Compare this tracked allocation againstmax_server_memory_usage, which typically defaults to 90% of physical RAM. Alert when tracked memory exceeds 80% of the limit. Because ClickHouse’s allocator and mmap overhead are not fully tracked, also monitor OS-level RSS viaMemoryResidentinsystem.asynchronous_metricsor/proc/<pid>/status. RSS can exceed the tracked value by a significant margin.Replica health via
system.replicas. For replicated tables,is_readonly = 1means the replica cannot accept writes, andis_session_expired = 1indicates a lost ZooKeeper/Keeper session. Brief flickers during coordination leader elections are normal; sustained states lasting minutes indicate a partition from the coordination service or a configuration issue. Checklast_queue_update_exceptionfor repeated errors:
SELECT database, table, is_readonly, is_session_expired, last_queue_update_exception
FROM system.replicas
WHERE is_readonly = 1 OR is_session_expired = 1
- Failed query rate via
FailedQueryinsystem.events. Compute the ratio of failed queries to total queries. A sustained error rate above 1% warrants investigation. Insystem.query_log, filter ontype = 'ExceptionWhileProcessing'and pay special attention to exception code 252 (too many parts) and code 241 (memory limit exceeded).
Level 2 - Operational
Add these signals to understand service quality, throughput trends, and background subsystem health.
Query latency distribution from
system.query_log. Track p50, p95, and p99 withquantileoverquery_duration_ms. Tail latency rises before averages. Alert when p99 exceeds 2x your rolling baseline for more than 15 minutes. Exclude system and monitoring queries by filtering onis_initial_query = 1.Insert latency from
system.query_log. Rising insert latency precedesDelayedInsertsby minutes to hours. Sustained p99 insert latency above 5x baseline signals the write pipeline is backing up. For replicated tables, ZooKeeper latency sits in the critical path and directly adds to insert duration.Merge throughput from
system.part_logandsystem.merges. Track completed merges per minute and bytes merged per second fromsystem.part_logwhereevent_type = 'MergeParts'. If merge throughput chronically falls below part creation rate, the system is heading for a part-count crisis. Complement this with live merge progress fromsystem.merges.Background pool utilization from
system.metrics. Track active tasks viaBackgroundMergesAndMutationsPoolTaskandBackgroundFetchesPoolTaskinsystem.metrics, and compare against your configured pool sizes. Sustained saturation means merges, fetches, or mutations are queuing instead of executing.Replication queue depth from
system.replicasandsystem.replication_queue. Watchqueue_size,absolute_delayin seconds, andlog_max_index - log_pointerfor entries behind. Insystem.replication_queue, entries withnum_tries > 5and non-emptylast_exceptionare stuck and will not self-resolve without intervention:
SELECT database, table, replica_name, node_name, num_tries, last_exception
FROM system.replication_queue
WHERE num_tries > 5 AND last_exception != ''
ZooKeeper/Keeper latency from
system.events. TrackZooKeeperWaitMicroseconds. Healthy coordination latency stays below 10 ms; sustained latency above 30% of the session timeout risks session expiry and replica readonly states. For ClickHouse Keeper, also use the Prometheus endpoint on port 9181.Disk I/O latency from the OS. Use
iostator/proc/diskstats. Latency above 20 ms on SSD indicates merges and queries are competing for bandwidth. This is a leading indicator for merge starvation even when the merge pool is not fully utilized.Exception code breakdown from
system.query_log. Group byexception_codeto distinguish between part accumulation (252), memory limits (241), distributed timeouts (159), and coordination failures (999). This prevents you from treating a memory crisis as a query-tuning problem.
Level 3 - Mature
These signals expose skew, hidden bottlenecks, and resource exhaustion before they degrade user-visible performance.
Per-partition part count from
system.parts. The insert delay and reject thresholds apply per partition, not per table. A table with 100 partitions can hide a single partition nearing its limit. Group bydatabase, table, partition_idand monitor the maximum per partition. Remember that projections create additional hidden parts that count toward limits.Part creation rate versus merge completion rate. Derive part creation from
system.part_loginsert events and merge completion fromMergePartsevents. A sustained ratio above 1.0 guarantees a future emergency. This is the single best leading indicator for the part accumulation cascade.Mutation queue depth from
system.mutations. Running mutations rewrite every part and block merges. Alert onis_done = 0withparts_to_doflat for more than 30 minutes. If a mutation is monopolizing the pool and starving merges, useKILL MUTATIONonly after verifying the operation is non-critical and understanding the rollback implications.Per-query memory from
system.processes. Identify runaway queries before they trigger server-wide limits. Single queries using more than 50% ofmax_server_memory_usageshould be flagged. Sort bymemory_usagedescending during incidents to find the culprit:
SELECT query_id, user, query, memory_usage
FROM system.processes
ORDER BY memory_usage DESC
File descriptor usage versus limit. Each active part opens one file per column plus metadata, and merge operations temporarily open all source and target files simultaneously. Monitor
/proc/<pid>/fdagainst thenofilelimit. Production deployments should set limits to at least 100000; alert when usage exceeds 70%.Mark cache hit rate from
system.events. Compute the ratio ofMarkCacheHitsto total hits plus misses. Declining rates below 80% indicate insufficient cache size or working-set growth, which increases disk seeks and query latency.Distributed DDL queue from
system.distributed_ddl_queue. Entries stuck in a non-Finished state for more than 5 minutes cause schema drift across the cluster. This is especially critical during deployments and schema migrations:
SELECT entry, status, query
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
Level 4 - Expert
These signals support deep diagnosis, trend prediction, and root-cause extraction after repeated major incidents.
Merge duration trends from
system.part_log. Track P99 merge duration over time. Increasing merge times indicate growing data volume, I/O contention, or scheduling pathologies 1-3 days before they become critical. This catches the merge debt spiral earlier than part count alone.Cross-replica consistency checks. Periodically compare row counts, total bytes, and part counts across replicas using
system.parts. Silent divergence produces zero standard alerts because the replication queue can be empty while data is missing. This is the only reliable way to detect disk corruption or accidental partial DDL on a single replica.ZooKeeper znode and watch count trends. Growth in znodes and watches predicts coordination overload before latency spikes appear. Each replicated table contributes to this metadata footprint, and unbounded growth from excessive DDL or part metadata can push the coordination service into a choke spiral.
system.errorsfor internal failures. Some errors increment during background operations and never surface insystem.query_log. Monitor this table for codes that are not benign to catch internal subsystem failures, though be aware that a small number of codes may increment during normal execution.System table growth.
system.query_log,system.part_log, andsystem.text_logare MergeTree tables that grow unbounded without TTL. On high-QPS systems they can consume significant disk space and even trigger too-many-parts conditions on themselves.Inactive part cleanup rate. Monitor the size and count of non-active parts in
system.parts. Slow cleanup indicates filesystem pressure or blocked background threads, which wastes disk space and inflates file descriptor usage.Asynchronous insert buffer depth. If using async inserts, monitor buffered data that is acknowledged to the client but not yet persisted to parts. Buffer flush failures are a distinct failure mode from synchronous insert rejection.
How Netdata helps
- Correlates ClickHouse internal signals with OS-level metrics (RSS vs
MemoryTracking,iowait, TCP retransmissions, FD usage) on the same timeline. - Collects
system.metrics,system.events,system.asynchronous_metrics, andsystem.partswithout manual query configuration. - Visualizes part-count growth rates alongside merge throughput to surface merge-to-insert ratio trends before part limits are reached.
- Supports composite alerts combining pool utilization, part count derivatives, and disk space to distinguish sustained pressure from transient batch loads.







