How ClickHouse actually works in production: a mental model for operators

ClickHouse is a column-oriented OLAP database built around the MergeTree family of table engines. It does not fail like a transactional database. Its failure modes are distinctive and almost always originate in the storage layer rather than the query layer. In production, most incidents begin as storage-structure debt: immutable data parts accumulate faster than background processes can consolidate them. By the time queries slow down or inserts fail, the underlying merge crisis has been developing for hours.

You cannot operate ClickHouse safely without internalizing three ideas: the part lifecycle, the background thread pools, and how replication coordinates through Keeper. These abstractions explain why a healthy server can suddenly reject inserts, why memory pressure cascades into disk I/O saturation, and why a slow coordination service can freeze an entire cluster while local queries continue to return results.

This article gives you the minimal operational model needed to run ClickHouse in production without surprise. It is not a tuning guide. It is the foundation that makes tuning guides comprehensible.

What it is and why it matters

The central abstraction is the MergeTree engine. Every INSERT creates one or more immutable data parts on disk. Each part is a self-contained directory containing column files, index files, checksums, and metadata. Once written, a part never changes. ClickHouse continuously runs background merges that combine smaller parts into larger ones. This is the “Merge” in MergeTree.

Fewer parts means fewer file descriptors, faster queries with fewer seeks, and smaller index overhead. When parts accumulate faster than merges consolidate them, the system degrades and eventually refuses inserts entirely. This “too many parts” failure is the single most common ClickHouse production incident. Understanding this lifecycle is the foundation for every operational decision, from partition design to hardware sizing.

How it works

Part lifecycle and columnar storage

Every INSERT into a MergeTree table creates new immutable data parts. Each part is a directory on disk containing one .bin file per column for compressed column data, .mrk3 mark files storing granule offsets, a primary.idx sparse primary index built over 8192-row granules, and supporting metadata including columns.txt, count.txt, and checksums.txt. Parts are sorted by the table’s ORDER BY key.

Because parts are immutable, UPDATE and DELETE operations cannot modify rows in place. Instead, ClickHouse implements them as asynchronous mutations that read entire existing parts, apply the change, and write new replacement parts. This design favors fast analytical reads at the cost of significant write amplification for mutations.

Background thread pools

ClickHouse isolates background work into distinct thread pools configured at the server level. The background merge and mutation pool is the workhorse. It is controlled by background_merges_mutations_concurrency_ratio (default: 2). This pool continuously evaluates which parts to merge based on size and age, combining smaller parts into larger ones. There is no fixed merge schedule and no exposed “merge queue depth” metric. You infer pressure by watching active part counts and pool utilization.

Separate pools handle moves between storage tiers, fetches from peer replicas in replicated setups, and scheduled housekeeping such as distributed sends and buffer flushes. These pools are finite and shared across all tables on the server. A single table with a heavy mutation backlog can starve every other table of merge capacity.

ReplicatedMergeTree and Keeper coordination

Replication relies on a coordination service. While earlier deployments used ZooKeeper, current production clusters commonly run ClickHouse Keeper, a Raft-based replacement. Each replicated table maintains a replication queue in the coordination service that tracks what each replica must do: execute a merge, fetch a part from a peer, or apply a mutation. The leader replica schedules merges and creates log entries in Keeper.

Replicas communicate directly to transfer parts, typically over the interserver port. If a replica loses its session to the coordination service, it becomes read-only. If the coordination service itself becomes slow or unavailable, replicas stop converging, replicated inserts may fail, and distributed DDL operations hang. The coordination layer is a single point of failure for the entire replicated cluster.

flowchart TD
    Insert([INSERT]) -->|creates| Part[New immutable part]
    Part -->|adds to| Active[Active parts]
    Active -->|fed to| Pool[Merge pool]
    Pool -->|writes| Merged[Merged part]
    Merged -->|replaces| Active
    Replicated[Replicated table] -->|session| Keeper[Keeper or ZK]
    Keeper -->|assigns| RepQueue[Replication queue]
    RepQueue -->|triggers| FetchPool[Fetch pool]
    FetchPool -->|downloads| Part

Resource competition

ClickHouse performance is determined by how these subsystems compete for shared resources.

ResourceContention pattern
Disk I/OReads from queries and writes from merges and inserts compete constantly. Merges are write-heavy sequential I/O; queries are read-heavy.
MemoryQuery execution buffers, GROUP BY and JOIN hash tables, the mark cache, uncompressed block cache, and background merge operations all share the same pool.
CPUDecompression, expression evaluation, and aggregation consume cycles for both queries and merges. Compression codecs trade CPU for reduced I/O.
File descriptorsEach part opens multiple files (one per column plus metadata). High part counts exhaust fd limits, and merges temporarily open all source and target files simultaneously.
Keeper sessionsReplicated tables maintain sessions to the coordination service. Session loss triggers read-only replica states. Overhead scales with the number of replicated tables.
NetworkReplication traffic between replicas, distributed query fanout, and inter-shard shuffles compete for bandwidth.

Where it shows up in production

These abstractions produce distinctive failure archetypes in production.

Merge debt spiral. When inserts arrive faster than the background merge pool can consolidate them, part count per partition climbs. Each merge must evaluate more candidates and open more files, so merge cost increases as the backlog grows. ClickHouse eventually enforces parts_to_delay_insert (default 1000) by artificially slowing inserts, then enforces parts_to_throw_insert (default 3000) by rejecting them entirely. This is a gradual onset failure that takes minutes to hours to mature.

Silent mutation blocking. An ALTER UPDATE or DELETE command creates a mutation that must rewrite every part in the table. Mutations share the merge pool and coordination paths with regular merges. While a heavy mutation runs, merges stall silently. Part count grows invisibly because system.merges shows activity (the mutation) even though regular consolidation has stopped. Operators often discover this only when insert delays begin.

Replication lag cascade. A network partition or Keeper latency spike isolates a replica. When connectivity returns, the replica must fetch all missing parts to catch up. If the backlog is large, the background fetch pool and disk I/O saturate. Large fetches compete with queries and merges on the recovering replica, slowing catch-up and allowing the queue to grow further.

Keeper choke. The coordination service stores metadata for every replicated table and part. Too many tables, excessive DDL, or slow transaction log I/O on Keeper nodes drive up operation latency. ClickHouse sessions begin timing out, causing replicas to enter read-only state. A reconnection storm after session expiry can amplify the load, creating a positive feedback loop. During this event, local queries may continue to work perfectly even though cluster-wide writes have halted.

Disk space collapse. Merges require temporary extra space to read all source parts and write the merged result before atomically replacing the old parts. When disk utilization crosses roughly 85 to 90 percent, ClickHouse may refuse to start new merges. Without merges running, small parts accumulate, consuming additional metadata space and accelerating the disk fill rate. TTL expiration also depends on merges, so old data persists longer than expected.

Tradeoffs and common misuses

  • Batched inserts. ClickHouse prefers batches of 1000 or more rows per INSERT. Many small inserts create excessive parts and overwhelm the merge pool.
  • Partition granularity. Over-partitioning (for example, by hour instead of by day) multiplies part counts across the same data volume. The insert throttle limits are per-partition, but metadata overhead scales with partition count.
  • Mutation abuse. Using frequent ALTER UPDATE or DELETE treats ClickHouse like an OLTP database. Each mutation rewrites every part in the table. On large tables, a single mutation can monopolize the merge pool for hours.
  • Replication overhead. Replicated tables add Keeper dependency and replication queue management. Non-replicated tables avoid this coordination tax but sacrifice high availability.
  • Pool sizing without I/O headroom. Increasing background_merges_mutations_concurrency_ratio when disk bandwidth is already saturated will not clear merge backlogs. The bottleneck must be identified before scaling the pool.

Signals to watch in production

SignalWhy it mattersWarning sign
Active part count per partitionThe distance from the “too many parts” cliff; limits apply per-partition, not per-tableCount exceeds 50% of parts_to_delay_insert (default 1000)
Background pool utilizationWhether merges, mutations, and fetches have capacity to complete workSustained utilization above 90% with growing part count
Merge activity and progressConfirms the merge subsystem is actually completing workNo merges running while parts are elevated; progress stuck across samples
Mutation statusMutations block merges and consume pool threads indefinitelyis_done = 0 with parts_to_do flat or not decreasing
Keeper connection healthCoordination loss makes replicated tables read-onlyis_expired = 1 in system.zookeeper_connection or connection test failure
Replication queue depthMeasures replica divergence and risk of stale dataQueue growing with entries showing num_tries > 5 and last_exception non-empty
Disk space utilizationMerges require temporary headroom to write output before deleting sourcesFree space below 20% or unreserved_space approaching zero
Memory tracking vs. RSSClickHouse-tracked memory underestimates actual RSS due to allocator overheadMemoryTracking diverges significantly from MemoryResident while approaching physical limits

How Netdata helps

  • Correlate active part count growth with background pool utilization and disk I/O latency on the same timeline to identify whether merges are CPU-bound, I/O-bound, or coordination-bound.
  • Track MemoryResident alongside MemoryTracking to catch allocator fragmentation and untracked memory growth before the OOM killer intervenes.
  • Monitor Keeper session state and ZooKeeper operation latency alongside replication queue depth to distinguish coordination failures from storage failures.
  • Alert on MaxPartCountForPartition from system.asynchronous_metrics to catch hotspot partitions before they trigger insert throttling.
  • Cross-reference query latency spikes with merge activity and mark cache hit rates to determine if slowdowns are caused by query load or background merge I/O competition.