ClickHouse Keeper saturation spiral: too many tables, DDL storms, and cluster freeze

INSERTs fail. Replicated tables flip read-only. ON CLUSTER DDL hangs. Yet curl http://localhost:8123/ping still returns "Ok." and SELECT 1 still works. This is Keeper saturation: the coordination layer is choking while liveness probes give false confidence.

The spiral starts when ZooKeeper or ClickHouse Keeper cannot keep up with metadata load. Every replicated table registers znodes and watches. Every DDL operation adds more. Coordination latency climbs until heartbeat traffic cannot complete within the negotiated session timeout; sessions expire, replicas become read-only, and writes fail. Reconnecting nodes then trigger a thundering herd.

Healthy operation latency stays below 10 ms P99. Past 10 ms the ensemble is under stress; past 100 ms it is severely degraded. Sessions expire when the client cannot exchange heartbeats within the session timeout window, which becomes likely once latency is sustained in the hundreds of milliseconds or when packet loss and GC pauses block the heartbeat path. The first response is always the same: stop all DDL.
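
As a rough illustration of these thresholds, the Python sketch below computes a nearest-rank P99 from a batch of latency samples and maps it to the three bands described above. The function names and the nearest-rank percentile method are assumptions made for the example; how the samples are collected depends on your monitoring stack.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies (ms)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) computed in integer arithmetic, as a 1-based rank
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def classify_latency(p99_ms):
    """Map a P99 latency to the bands used in this guide."""
    if p99_ms < 10:
        return "healthy"
    if p99_ms <= 100:
        return "under stress"
    return "severely degraded"
```

Calling classify_latency(p99(samples)) on each scrape interval gives a single label that can drive an alert or a dashboard color.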

flowchart TD
    A[Metadata explosion or DDL storm] --> B[ZK znode count grows]
    B --> C[Transaction log I/O saturated]
    C --> D[Severe latency spike blocks heartbeats]
    D --> E[Session timeouts]
    E --> F[Replicas read-only]
    F --> G[Cluster-wide write failure]
    G --> H[Reconnection storm]
    H --> C

What this means

ClickHouse relies on a coordination service for replicated tables, distributed DDL, and replication queues. When that service saturates, the database does not crash. It loses the ability to mutate replicated state. Replicas with expired sessions stop accepting writes. Distributed DDL tasks stall. Non-replicated tables keep working, which makes the outage scope confusing.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Metadata explosion from too many replicated tables | system.zookeeper queries under /clickhouse/tables return huge child counts; latency climbs even with low query load | Count replicated tables and znodes under the configured root path |
| DDL storm | Many CREATE, ALTER, or DROP operations running concurrently; system.distributed_ddl_queue has unfinished entries | Query system.query_log for recent query_kind IN ('Alter', 'Create', 'Drop') |
| ZK transaction log disk bottleneck | Keeper/ZooKeeper disk I/O is saturated while CPU is low; mntr shows high avg latency | Disk latency on the volume hosting the transaction log |
| Post-outage reconnection thundering herd | ZK recovered but ClickHouse nodes reconnect simultaneously, driving latency back up | Connection count spikes on ZK ports after a recent leader election |
| JVM GC pauses (external ZooKeeper only) | Latency spikes correlate with GC log entries; not applicable to ClickHouse Keeper | JVM heap usage and GC logs on ZooKeeper nodes |

Quick checks

Run these read-only probes to confirm a coordination saturation event.

-- Test live ZK connectivity from ClickHouse
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;
-- Check session health and expiration
SELECT host, port, is_expired, session_uptime_elapsed_seconds, session_timeout_ms
FROM system.zookeeper_connection;
-- Find read-only or session-expired replicas
SELECT database, table, is_leader, is_readonly, is_session_expired, total_replicas, active_replicas
FROM system.replicas;
-- Inspect replication queue for stuck entries
SELECT database, table, type, create_time, last_attempt_time, num_tries, last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;
-- Check distributed DDL status where system.distributed_ddl_queue is available
SELECT entry, host, status, exception_text, query_create_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time DESC;
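# Probe ClickHouse Keeper directly (4lw commands on port 9181; external ZooKeeper defaults to 2181)
echo ruok | nc localhost 9181   # a healthy node answers imok
echo mntr | nc localhost 9181   # avg latency, outstanding requests, znode and watch counts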
-- ClickHouse-side ZK event counters
SELECT event, value FROM system.events WHERE event LIKE 'ZooKeeper%';
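
To read the mntr numbers from a script rather than a terminal, something like the following Python sketch could work. It assumes the usual mntr layout of one key and one value per line separated by whitespace, and that the four-letter-word command is allowed on the node. fetch_four_letter and parse_mntr are illustrative names, not part of any ClickHouse tooling.

```python
import socket


def fetch_four_letter(host, port, command, timeout=5.0):
    """Send a four-letter-word command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def _coerce(value):
    """Return an int or float when the value is numeric, else the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_mntr(text):
    """Turn mntr output into a dict, skipping lines without a key and value."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            metrics[parts[0]] = _coerce(parts[1].strip())
    return metrics
```

For example, parse_mntr(fetch_four_letter("localhost", 9181, "mntr")) returns a dict in which keys such as zk_avg_latency and zk_outstanding_requests can be compared against your own thresholds.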

How to diagnose it

  1. Confirm liveness is a false signal. Verify that curl http://localhost:8123/ping returns "Ok." and that local SELECT queries execute. If the server is alive but replicated inserts fail, you are looking at coordination failure, not a process crash.
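  2. Check coordination latency. Run SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1 and time it. This measures the full query round trip rather than a single Keeper operation, so a response well under 100 ms is normal; a multi-second response or a timeout confirms saturation.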
  3. Inspect replica sessions. Query system.replicas for is_readonly = 1 or is_session_expired = 1. If these are widespread, the cluster has already lost its sessions.
  4. Identify the metadata load. Count recent DDL in system.query_log. Check system.zookeeper for znode counts under the configured table root. A sudden jump implicates a DDL storm or table proliferation.
  5. Examine queues. system.replication_queue entries with high num_tries indicate replicas are retrying failed operations. system.distributed_ddl_queue entries stuck in a non-Finished status confirm DDL is backing up.
  6. Probe Keeper or ZooKeeper directly. Use mntr to read average latency, outstanding requests, and open file descriptor count on the coordination nodes. If disk latency on the transaction log volume is high, storage is the bottleneck.
  7. Look for the thundering herd. If the event started after a ZK restart or leader election, check whether ClickHouse nodes are reconnecting simultaneously. system.zookeeper_connection session uptimes that cluster around the same second suggest a herd.
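
The herd check in step 7 can be approximated in a few lines. This Python sketch takes the session_uptime_elapsed_seconds values collected from each node and asks whether most of them fall inside one short window. The 5-second window and the 80% share are hypothetical defaults chosen for the example, not values ClickHouse defines.

```python
def largest_cluster(uptimes_s, window_s=5):
    """Size of the largest group of uptimes lying within window_s of each other."""
    ordered = sorted(uptimes_s)
    best = 0
    start = 0
    for end, value in enumerate(ordered):
        # Shrink the window from the left until it spans at most window_s.
        while value - ordered[start] > window_s:
            start += 1
        best = max(best, end - start + 1)
    return best


def looks_like_herd(uptimes_s, window_s=5, min_share=0.8):
    """True when at least min_share of the sessions started in one window."""
    if len(uptimes_s) < 2:
        return False
    return largest_cluster(uptimes_s, window_s) / len(uptimes_s) >= min_share
```

Uptimes such as 42, 43, 43, 44, 45 and 9000 seconds would be flagged, because five of the six sessions were established within a few seconds of each other.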

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| ZK/Keeper operation latency P99 | Predicts session timeouts before they happen | P99 > 10 ms sustained; approaching 100 ms is critical |
| Replica is_readonly state | Direct indicator that a replica has stopped accepting writes | Any sustained is_readonly = 1 for more than 5 minutes |
| Replica is_session_expired state | Shows the replica has lost its coordination session | Any non-zero value is abnormal |
| Replication queue depth | Measures how far replicas are falling behind | Growing steadily over 15 minutes |
| Distributed DDL queue status | Reveals schema changes stuck across the cluster | Entries in non-Finished status longer than 5 minutes |
| ZooKeeperUserExceptions / ZooKeeperHardwareExceptions rate | Client-side view of coordination failures | Any sustained increase from baseline |
| ZK mntr average latency | Ensemble-level health independent of ClickHouse metrics | Average latency rising toward the session timeout threshold |
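
Several of these warning signs depend on a condition being sustained rather than momentary. One way to express that, as a minimal Python sketch: given time-ordered samples of a boolean condition (for instance "is_readonly = 1 on this replica"), report whether the most recent unbroken run has lasted long enough. The held_for name and the sample format are assumptions for illustration.

```python
def held_for(samples, min_duration_s):
    """samples: list of (unix_timestamp, condition_is_true), oldest first.

    Returns True when the condition has been true without interruption
    from some sample up to the latest one, spanning at least min_duration_s.
    """
    run_start = None
    for timestamp, flag in samples:
        if flag:
            if run_start is None:
                run_start = timestamp  # a new unbroken run begins here
        else:
            run_start = None  # any false sample resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= min_duration_s
```

With one sample per minute, held_for(samples, 300) matches the 5-minute rule in the table, and a single healthy sample in between resets the clock.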

Fixes

Stop all DDL immediately

If a DDL storm is active, halt every CREATE, ALTER, and DROP operation. DDL creates znodes, which deepens saturation. Wait for coordination latency to drop and the distributed DDL queue to drain before resuming schema changes. Do this first in an active spiral.

Reduce metadata pressure

If the root cause is thousands of replicated tables, reduce the znode count. Drop unused replicated tables or consolidate them. This is not a quick fix: dropping replicated tables also touches the coordination service. Spread the work across maintenance windows.

Relieve ZK disk I/O

Move the ZooKeeper transaction log or ClickHouse Keeper log and snapshot storage to a dedicated SSD or NVMe volume. Do not share the disk with ClickHouse data directories. The transaction log is written synchronously, so spinning disks or contended volumes are a common bottleneck. If you are using external ZooKeeper on a shared node, isolate it.

Break the thundering herd

If ClickHouse nodes are hammering a recovering coordination service with reconnections, do not restart ClickHouse processes. Restarts generate fresh session registration and watch storms, which amplify load. Let existing sessions settle. If individual nodes are stuck with expired sessions and the coordination service is already healthy, a rolling restart of ClickHouse may force clean reconnection, but only after latency has returned to normal. Rolling restarts are disruptive; schedule them carefully.

Address JVM GC (external ZooKeeper only)

If you run ZooKeeper rather than ClickHouse Keeper, check JVM heap usage and GC logs. GC pauses directly manifest as latency spikes. Tune heap size or consider migrating to ClickHouse Keeper, which is not JVM-based and eliminates this failure mode.

Prevention

  • Limit replicated table sprawl. Every replicated table creates permanent metadata overhead in the coordination service. Use non-replicated MergeTree where high availability is not required.
  • Serialize DDL. Issue schema changes one at a time and let each finish before sending the next. DDL queues are processed sequentially per node, so a burst of commands creates a metadata backlog that saturates Keeper.
  • Monitor coordination latency as a first-class metric. Alert on P99 > 10 ms: the 10 ms to 100 ms range is the early-warning zone, and alerting there leaves time to act before latency reaches the hundreds of milliseconds at which sessions begin expiring.
  • Give Keeper dedicated fast disks. The transaction log is written synchronously, so storage contention on the coordination node directly adds latency to every replicated operation.
  • Right-size the ensemble. Plan coordination capacity for peak znode and watch count, not average load, to absorb table growth and DDL bursts.
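
The "serialize DDL" advice above could be enforced by a small wrapper in whatever tool issues schema changes. The Python sketch below is one possible shape, not a ClickHouse API: execute and unfinished_count are hypothetical callables you would supply, for example one that runs a statement through your client and one that counts rows in system.distributed_ddl_queue with a non-Finished status.

```python
import time


def run_ddl_serially(statements, execute, unfinished_count,
                     poll_s=1.0, max_wait_s=600.0, sleep=time.sleep):
    """Run DDL statements one at a time, waiting for the queue to drain
    after each one before sending the next.

    execute(stmt) and unfinished_count() are caller-supplied (hypothetical).
    Raises TimeoutError if the queue is still busy after max_wait_s.
    """
    completed = []
    for stmt in statements:
        execute(stmt)
        waited = 0.0
        while unfinished_count() > 0:
            if waited >= max_wait_s:
                raise TimeoutError(f"DDL queue did not drain after: {stmt}")
            sleep(poll_s)
            waited += poll_s
        completed.append(stmt)
    return completed
```

Stopping at the first timeout is deliberate: if the queue is not draining, sending more DDL is exactly what feeds the spiral.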

How Netdata helps

  • Correlate ZooKeeperWaitMicroseconds from system.events with replica is_readonly state to distinguish a coordination saturation spiral from a transient network blip.
  • Cross-reference system.replication_queue depth against distributed DDL queue status to identify whether a DDL storm is driving the outage.
  • Track disk I/O on Keeper nodes separately from ClickHouse data volumes to catch transaction-log bottlenecks before they propagate into session timeouts.
  • Alert on is_session_expired from system.replicas as an early indicator that the spiral has begun and writes are at risk.
  • Monitor inter-server network metrics alongside ZK latency to rule out network partition when replicas lose connectivity.