ClickHouse monitoring checklist: the signals every production cluster needs

ClickHouse failures usually begin as storage-structure debt: immutable parts accumulate faster than merges consolidate them, coordination sessions expire, or disk space drops below the threshold merges need to complete. Query latency degrades only after the crisis is hours old.

This checklist groups monitoring signals into four maturity levels. Level 1 is the minimum viable instrumentation to avoid data loss and unavailability. Each subsequent level adds leading indicators that catch part accumulation, replication divergence, and memory pressure while they are still reversible. The signals are drawn from ClickHouse system tables and OS-level metrics. They apply to single-node, sharded, and replicated setups; replicated tables add ZooKeeper/Keeper signals that belong in Level 2.

flowchart TD
    L1["Level 1 - Survival"] --> L2["Level 2 - Operational"]
    L2 --> L3["Level 3 - Mature"]
    L3 --> L4["Level 4 - Expert"]

    L1 --- S1["Liveness, disk space, part count, insert rejects, memory, replica health"]
    L2 --- S2["Query latency, merge throughput, ZK latency, error codes, pool saturation"]
    L3 --- S3["Per-partition skew, mutation queue, FD usage, cache hits, DDL queue"]
    L4 --- S4["Merge duration trends, replica divergence, ZK metadata growth, system.errors"]

Level 1 - Survival

These are the minimum signals to detect catastrophic failure before it becomes unrecoverable. Start here.

  • Server liveness via /ping and SELECT 1. The /ping endpoint returns Ok. with HTTP 200 while the HTTP handler thread is alive, but it stops responding during metadata reload on startup and does not exercise the query pipeline. Pair it with SELECT 1 for readiness probes, especially in containerized deployments.

  • Disk free space from system.disks. Query free_space and unreserved_space per configured disk. Keep operational usage below 80-85%. Ensure free space remains large enough for merges to rewrite the largest active part; operators commonly reserve 2x that size as a safety margin, though this varies by workload. Below this threshold, halted merges cause parts to pile up and consume more metadata space.

  • Active part count via system.parts and MaxPartCountForPartition. Every INSERT creates immutable parts; background merges consolidate them. If parts accumulate faster than merges complete, query performance degrades and inserts are eventually rejected. Query system.parts with active = 1 grouped by database, table, partition_id:

SELECT database, table, partition_id, count() AS cnt
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id

Alert when any partition exceeds 50% of parts_to_delay_insert, and watch MaxPartCountForPartition in system.asynchronous_metrics for server-wide trending.

  • Insert rejections via system.query_log and system.events. Track insert failures where exception_code = 252 (TOO_MANY_PARTS) in system.query_log. Correlate with DelayedInserts in system.events, which fires earlier when the delay threshold is crossed. Any sustained non-zero rate of rejected inserts is a data-loss scenario.

  • Server memory via MemoryTracking in system.metrics. Compare this tracked allocation against max_server_memory_usage, which typically defaults to 90% of physical RAM. Alert when tracked memory exceeds 80% of the limit. Because ClickHouse’s allocator and mmap overhead are not fully tracked, also monitor OS-level RSS via MemoryResident in system.asynchronous_metrics or /proc/<pid>/status. RSS can exceed the tracked value by a significant margin.

  • Replica health via system.replicas. For replicated tables, is_readonly = 1 means the replica cannot accept writes, and is_session_expired = 1 indicates a lost ZooKeeper/Keeper session. Brief flickers during coordination leader elections are normal; sustained states lasting minutes indicate a partition from the coordination service or a configuration issue. Check last_queue_update_exception for repeated errors:

SELECT database, table, is_readonly, is_session_expired, last_queue_update_exception
FROM system.replicas
WHERE is_readonly = 1 OR is_session_expired = 1
  • Failed query rate via FailedQuery in system.events. Compute the ratio of failed queries to total queries. A sustained error rate above 1% warrants investigation. In system.query_log, filter on type = 'ExceptionWhileProcessing' and pay special attention to exception code 252 (too many parts) and code 241 (memory limit exceeded).

Level 2 - Operational

Add these signals to understand service quality, throughput trends, and background subsystem health.

  • Query latency distribution from system.query_log. Track p50, p95, and p99 with quantile over query_duration_ms. Tail latency rises before averages. Alert when p99 exceeds 2x your rolling baseline for more than 15 minutes. Exclude system and monitoring queries by filtering on is_initial_query = 1.

  • Insert latency from system.query_log. Rising insert latency precedes DelayedInserts by minutes to hours. Sustained p99 insert latency above 5x baseline signals the write pipeline is backing up. For replicated tables, ZooKeeper latency sits in the critical path and directly adds to insert duration.

  • Merge throughput from system.part_log and system.merges. Track completed merges per minute and bytes merged per second from system.part_log where event_type = 'MergeParts'. If merge throughput chronically falls below part creation rate, the system is heading for a part-count crisis. Complement this with live merge progress from system.merges.

  • Background pool utilization from system.metrics. Track active tasks via BackgroundMergesAndMutationsPoolTask and BackgroundFetchesPoolTask in system.metrics, and compare against your configured pool sizes. Sustained saturation means merges, fetches, or mutations are queuing instead of executing.

  • Replication queue depth from system.replicas and system.replication_queue. Watch queue_size, absolute_delay in seconds, and log_max_index - log_pointer for entries behind. In system.replication_queue, entries with num_tries > 5 and non-empty last_exception are stuck and will not self-resolve without intervention:

SELECT database, table, replica_name, node_name, num_tries, last_exception
FROM system.replication_queue
WHERE num_tries > 5 AND last_exception != ''
  • ZooKeeper/Keeper latency from system.events. Track ZooKeeperWaitMicroseconds. Healthy coordination latency stays below 10 ms; sustained latency above 30% of the session timeout risks session expiry and replica readonly states. For ClickHouse Keeper, also use the Prometheus endpoint on port 9181.

  • Disk I/O latency from the OS. Use iostat or /proc/diskstats. Latency above 20 ms on SSD indicates merges and queries are competing for bandwidth. This is a leading indicator for merge starvation even when the merge pool is not fully utilized.

  • Exception code breakdown from system.query_log. Group by exception_code to distinguish between part accumulation (252), memory limits (241), distributed timeouts (159), and coordination failures (999). This prevents you from treating a memory crisis as a query-tuning problem.

Level 3 - Mature

These signals expose skew, hidden bottlenecks, and resource exhaustion before they degrade user-visible performance.

  • Per-partition part count from system.parts. The insert delay and reject thresholds apply per partition, not per table. A table with 100 partitions can hide a single partition nearing its limit. Group by database, table, partition_id and monitor the maximum per partition. Remember that projections create additional hidden parts that count toward limits.

  • Part creation rate versus merge completion rate. Derive part creation from system.part_log insert events and merge completion from MergeParts events. A sustained ratio above 1.0 guarantees a future emergency. This is the single best leading indicator for the part accumulation cascade.

  • Mutation queue depth from system.mutations. Running mutations rewrite every part and block merges. Alert on is_done = 0 with parts_to_do flat for more than 30 minutes. If a mutation is monopolizing the pool and starving merges, use KILL MUTATION only after verifying the operation is non-critical and understanding the rollback implications.

  • Per-query memory from system.processes. Identify runaway queries before they trigger server-wide limits. Single queries using more than 50% of max_server_memory_usage should be flagged. Sort by memory_usage descending during incidents to find the culprit:

SELECT query_id, user, query, memory_usage
FROM system.processes
ORDER BY memory_usage DESC
  • File descriptor usage versus limit. Each active part opens one file per column plus metadata, and merge operations temporarily open all source and target files simultaneously. Monitor /proc/<pid>/fd against the nofile limit. Production deployments should set limits to at least 100000; alert when usage exceeds 70%.

  • Mark cache hit rate from system.events. Compute the ratio of MarkCacheHits to total hits plus misses. Declining rates below 80% indicate insufficient cache size or working-set growth, which increases disk seeks and query latency.

  • Distributed DDL queue from system.distributed_ddl_queue. Entries stuck in a non-Finished state for more than 5 minutes cause schema drift across the cluster. This is especially critical during deployments and schema migrations:

SELECT entry, status, query
FROM system.distributed_ddl_queue
WHERE status != 'Finished'

Level 4 - Expert

These signals support deep diagnosis, trend prediction, and root-cause extraction after repeated major incidents.

  • Merge duration trends from system.part_log. Track P99 merge duration over time. Increasing merge times indicate growing data volume, I/O contention, or scheduling pathologies 1-3 days before they become critical. This catches the merge debt spiral earlier than part count alone.

  • Cross-replica consistency checks. Periodically compare row counts, total bytes, and part counts across replicas using system.parts. Silent divergence produces zero standard alerts because the replication queue can be empty while data is missing. This is the only reliable way to detect disk corruption or accidental partial DDL on a single replica.

  • ZooKeeper znode and watch count trends. Growth in znodes and watches predicts coordination overload before latency spikes appear. Each replicated table contributes to this metadata footprint, and unbounded growth from excessive DDL or part metadata can push the coordination service into a choke spiral.

  • system.errors for internal failures. Some errors increment during background operations and never surface in system.query_log. Monitor this table for codes that are not benign to catch internal subsystem failures, though be aware that a small number of codes may increment during normal execution.

  • System table growth. system.query_log, system.part_log, and system.text_log are MergeTree tables that grow unbounded without TTL. On high-QPS systems they can consume significant disk space and even trigger too-many-parts conditions on themselves.

  • Inactive part cleanup rate. Monitor the size and count of non-active parts in system.parts. Slow cleanup indicates filesystem pressure or blocked background threads, which wastes disk space and inflates file descriptor usage.

  • Asynchronous insert buffer depth. If using async inserts, monitor buffered data that is acknowledged to the client but not yet persisted to parts. Buffer flush failures are a distinct failure mode from synchronous insert rejection.

How Netdata helps

  • Correlates ClickHouse internal signals with OS-level metrics (RSS vs MemoryTracking, iowait, TCP retransmissions, FD usage) on the same timeline.
  • Collects system.metrics, system.events, system.asynchronous_metrics, and system.parts without manual query configuration.
  • Visualizes part-count growth rates alongside merge throughput to surface merge-to-insert ratio trends before part limits are reached.
  • Supports composite alerts combining pool utilization, part count derivatives, and disk space to distinguish sustained pressure from transient batch loads.