ClickHouse monitoring maturity model: from survival to expert
Most production ClickHouse incidents are not mysterious. They are predictable storage-structure or coordination failures that better monitoring would have surfaced hours earlier. If you are running ClickHouse at scale, you need to know whether your observability is actually catching the failure modes that matter, or just proving that the process is running.
This maturity model is a diagnostic mirror, not a trophy case. Use it to audit your dashboards, tune alert noise, and decide what to instrument next. Each level builds on the last. Skipping levels leaves predictable gaps: teams with beautiful query-latency dashboards still get surprised by merge death spirals because they never instrumented part counts per partition.
The four levels move from “is it alive?” to “when will it break and why?”.
flowchart LR
D[Level 1 Survival
Uptime and catastrophic failure] --> C[Level 2 Operational
Service quality and trends]
C --> B[Level 3 Mature
Proactive identification]
B --> A[Level 4 Expert
Trend prediction and root cause]Level 1 — Survival
The bare minimum is knowing if the server is up and whether it is rejecting writes.
Monitor HTTP ping on port 8123 or TCP liveness on port 9000. Complement this with SELECT 1, which exercises the query pipeline and catches more degradation than ping alone. Track the process via standard host monitoring.
Watch disk space on data volumes and keep usage below 90%. ClickHouse does not handle disk-full gracefully. Merges need temporary headroom to write combined parts before deleting sources, so free space is an operational dependency, not just a capacity metric.
From system.events, alert on RejectedInserts. Any sustained non-zero value means inserts are being dropped because a partition has exceeded parts_to_throw_insert (default 3000). This is the single most common production failure and it is binary: if the counter is increasing, you are losing data.
Use MaxPartCountForPartition from system.asynchronous_metrics as a single-number proxy for part pressure across the server. For replicated setups, monitor is_session_expired in system.replicas. A non-zero value means the replica has lost its coordination service session and cannot participate in replication.
Finally, track FailedQuery against total Query from system.events. This level is intentionally minimal. It ensures uptime and catches catastrophic write rejection.
Level 2 — Operational
At this level, the team knows service quality and can see trends developing before they become pages.
Monitor active part count per table and per partition from system.parts where active = 1. The critical thresholds are per-partition: parts_to_delay_insert (default 1000) and parts_to_throw_insert (default 3000). A table with 50 partitions and 50 parts each is healthy; one partition at 290 parts is a crisis. Track DelayedInserts from system.events as the early warning.
Watch merge activity via system.merges. If parts are elevated and zero merges are running while inserts are active, the background merge pool is stuck or starved. Track background pool utilization from system.metrics using metrics such as BackgroundMergesAndMutationsPoolTask.
Track memory from two angles: MemoryTracking in system.metrics and MemoryResident in system.asynchronous_metrics. Divergence between them indicates allocator fragmentation or untracked allocations.
Measure query latency P99 and insert latency from system.query_log. For replication, track absolute_delay in system.replicas for wall-clock lag, and is_readonly for write availability. Monitor system.disks for free_space and unreserved_space, not just host-level disk metrics.
Break down query errors by exception code from system.query_log. Codes 252 (TOO_MANY_PARTS) and 241 (MEMORY_LIMIT_EXCEEDED) should never be ignored. This level correlates write health, merge health, and query performance into a coherent operational picture.
Level 3 — Mature
Mature monitoring spots silent killers before they cause user-visible impact.
Move from table-level aggregates to per-partition part counts. A partition with ten times the part count of its peers indicates skewed ingestion or a stuck merge. Monitor the mutation queue in system.mutations where is_done = 0. Track parts_to_do over time; if it is flat for more than 30 minutes, the mutation is stuck. Mutations monopolize merge pool threads and silently block part consolidation.
Inspect system.replication_queue for stuck entries. Any task with num_tries > 5 and a non-empty last_exception is not going to self-resolve. Break down queue entries by type (GET_PART, MERGE_PARTS, MUTATE_PART) to identify whether the bottleneck is network fetches or local merge pressure.
Track per-query memory consumption from system.processes to catch runaway queries before they trigger server-wide limits. Monitor file descriptor usage via OpenFileForRead and OpenFileForWrite in system.metrics, and compare against the process nofile limit. Each part opens files per column; wide tables exhaust limits quickly.
For replicated clusters, monitor ZooKeeper or ClickHouse Keeper connection health via system.zookeeper_connection or by querying system.zookeeper. Watch system.distributed_ddl_queue for schema changes that are not reaching Finished on all nodes. Track mark cache hit rates from MarkCacheHits and MarkCacheMisses in system.events. Monitor system.detached_parts for parts removed due to corruption or failed fetches.
Level 4 — Expert
Expert monitoring answers why an incident happened and predicts the next one.
Analyze per-partition skew and part-count growth rates. If the derivative of new parts exceeds the merge completion rate for more than a few observation windows, a rejection event is predictable hours in advance. Track merge duration trends from system.part_log. Rising P99 merge duration is a leading indicator of I/O contention or data skew by one to three days.
Monitor spill-to-disk activity by watching /var/lib/clickhouse/tmp/ for temporary files created during large aggregations or ORDER BY operations. Track jemalloc statistics from system.asynchronous_metrics such as jemalloc.allocated, jemalloc.active, and jemalloc.retained. When resident memory far exceeds MemoryTracking, allocator fragmentation is the culprit. SYSTEM JEMALLOC PURGE can force dirty page release.
Measure distributed query performance per shard to isolate slow replicas that cluster-wide averages hide. Track cache hit rates for both mark cache and uncompressed cache. Monitor ZooKeeper znode count and watch count trending; rapid growth predicts coordination overload before sessions start expiring.
Use system.part_log to reconstruct merge and mutation scheduling pathologies after incidents. Inspect system.errors for internal subsystem failures that never appear in query logs. Correlate signal changes with deployment times. Run automated CHECK TABLE verification to detect silent data divergence between replicas. Analyze system.session_log for authentication anomalies.
This level turns monitoring from an alerting system into a diagnostic instrument.
How Netdata helps
Netdata collects ClickHouse internal metrics from system.metrics, system.events, and system.asynchronous_metrics automatically. It correlates database signals with host-level metrics (disk I/O latency, memory RSS, file descriptor usage, CPU iowait) in unified charts, so you can see whether a merge slowdown is a ClickHouse scheduling issue or disk saturation.
Anomaly detection on RejectedInserts, DelayedInserts, and replication lag can fire before query latency degrades. Per-process memory tracking and query-level attribution help isolate runaway queries without repeatedly polling system.processes. In distributed deployments, per-node and per-shard views reveal skew that cluster-wide averages obscure.







