ClickHouse Keeper saturation spiral: too many tables, DDL storms, and cluster freeze
INSERTs fail. Replicated tables flip read-only. ON CLUSTER DDL hangs. Meanwhile, curl http://localhost:8123/ping still returns Ok. (with its trailing period) and SELECT 1 still succeeds. This is Keeper saturation: the coordination layer is choking while liveness probes give false confidence.
The spiral starts when ZooKeeper or ClickHouse Keeper cannot keep up with metadata load. Every replicated table registers znodes and watches. Every DDL operation adds more. Coordination latency climbs until heartbeat traffic cannot complete within the negotiated session timeout; sessions expire, replicas become read-only, and writes fail. Reconnecting nodes then trigger a thundering herd.
Healthy operation latency stays below 10 ms P99. Past 10 ms the ensemble is under stress; past 100 ms it is severely degraded. Sessions expire when the client cannot exchange heartbeats within the session timeout window, which becomes likely once latency is sustained in the hundreds of milliseconds or when packet loss and GC pauses block the heartbeat path. The first response is always the same: stop all DDL.
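The latency bands above can be turned into a small triage helper. This is a minimal sketch rather than an official tool: classify_keeper_latency is a hypothetical name, and the input is assumed to be a latency sample in milliseconds that you already collect (P99 where available).

```shell
# Classify one coordination-latency sample, given in milliseconds.
# Bands follow the text: below 10 ms healthy, 10 to 100 ms stressed,
# above 100 ms severely degraded.
# classify_keeper_latency is a hypothetical helper name.
classify_keeper_latency() {
    awk -v ms="$1" 'BEGIN {
        ms = ms + 0                      # force numeric comparison
        if (ms < 10)        print "healthy"
        else if (ms <= 100) print "stressed"
        else                print "severely_degraded"
    }'
}
```

For example, classify_keeper_latency 42.5 prints stressed. Anything other than healthy during a DDL rollout is a reasonable cue to pause schema changes.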
flowchart TD
A[Metadata explosion or DDL storm] --> B[ZK znode count grows]
B --> C[Transaction log I/O saturated]
C --> D[Severe latency spike blocks heartbeats]
D --> E[Session timeouts]
E --> F[Replicas read-only]
F --> G[Cluster-wide write failure]
G --> H[Reconnection storm]
H --> C
What this means
ClickHouse relies on a coordination service for replicated tables, distributed DDL, and replication queues. When that service saturates, the database does not crash. It loses the ability to mutate replicated state. Replicas with expired sessions stop accepting writes. Distributed DDL tasks stall. Non-replicated tables keep working, which makes the outage scope confusing.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Metadata explosion from too many replicated tables | system.zookeeper queries under /clickhouse/tables return huge child counts; latency climbs even with low query load | Count replicated tables and znodes under the configured root path |
| DDL storm | Many CREATE, ALTER, or DROP operations running concurrently; system.distributed_ddl_queue has unfinished entries | Query system.query_log for recent query_kind IN ('Alter', 'Create', 'Drop') |
| ZK transaction log disk bottleneck | Keeper/ZooKeeper disk I/O is saturated while CPU is low; mntr shows high avg latency | Disk latency on the volume hosting the transaction log |
| Post-outage reconnection thundering herd | ZK recovered but ClickHouse nodes reconnect simultaneously, driving latency back up | Connection count spikes on ZK ports after a recent leader election |
| JVM GC pauses (external ZooKeeper only) | Latency spikes correlate with GC log entries; not applicable to ClickHouse Keeper | JVM heap usage and GC logs on ZooKeeper nodes |
Quick checks
Run these read-only probes to confirm a coordination saturation event.
-- Test live ZK connectivity from ClickHouse
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;
-- Check session health and expiration
SELECT host, port, is_expired, session_uptime_elapsed_seconds, session_timeout_ms
FROM system.zookeeper_connection;
-- Find read-only or session-expired replicas
SELECT database, table, is_leader, is_readonly, is_session_expired, total_replicas, active_replicas
FROM system.replicas;
-- Inspect replication queue for stuck entries
SELECT database, table, type, create_time, last_attempt_time, num_tries, last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;
-- Check distributed DDL status where system.distributed_ddl_queue is available
SELECT entry, host_name, status, exception_text, query_create_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time DESC;
# Probe ClickHouse Keeper directly (4lw commands on port 9181)
echo ruok | nc localhost 9181
echo mntr | nc localhost 9181
-- ClickHouse-side ZK event counters
SELECT event, value FROM system.events WHERE event LIKE 'ZooKeeper%';
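The mntr probe prints many key/value lines, and only a few matter during a saturation event. The sketch below filters them; mntr_summary is a hypothetical helper name, and it assumes the usual one-pair-per-line, whitespace-separated mntr layout. Fields your Keeper or ZooKeeper version does not report are simply omitted from the output.

```shell
# Extract saturation-relevant fields from `mntr` output read on stdin.
# Assumes one "key<TAB>value" pair per line.
# mntr_summary is a hypothetical helper name.
mntr_summary() {
    awk '
        $1 == "zk_avg_latency"                { print "avg_latency=" $2 }
        $1 == "zk_outstanding_requests"       { print "outstanding=" $2 }
        $1 == "zk_znode_count"                { print "znodes=" $2 }
        $1 == "zk_watch_count"                { print "watches=" $2 }
        $1 == "zk_open_file_descriptor_count" { print "open_fds=" $2 }
    '
}

# Usage (needs a reachable Keeper; port 9181 as in the probe above):
#   echo mntr | nc localhost 9181 | mntr_summary
```

Run it against each coordination node in turn so that one slow member stands out from the rest.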
How to diagnose it
- Confirm liveness is a false signal. Verify that curl http://localhost:8123/ping returns Ok. and that local SELECT queries execute. If the server is alive but replicated inserts fail, you are looking at coordination failure, not a process crash.
- Check coordination latency. Run SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1 and time it. Sub-100 ms is normal; multi-second response or timeout confirms saturation.
- Inspect replica sessions. Query system.replicas for is_readonly = 1 or is_session_expired = 1. If these are widespread, the cluster has already lost its sessions.
- Identify the metadata load. Count recent DDL in system.query_log. Check system.zookeeper for znode counts under the configured table root. A sudden jump implicates a DDL storm or table proliferation.
- Examine queues. system.replication_queue entries with high num_tries indicate followers are retrying failed operations. system.distributed_ddl_queue entries stuck in a non-Finished status confirm DDL is backing up.
- Probe Keeper or ZooKeeper directly. Use mntr to read average latency, outstanding requests, and open file descriptor count on the coordination nodes. If disk latency on the transaction log volume is high, storage is the bottleneck.
- Look for the thundering herd. If the event started after a ZK restart or leader election, check whether ClickHouse nodes are reconnecting simultaneously. system.zookeeper_connection session uptimes that cluster around the same second suggest a herd.
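The thundering-herd check in the last step can be sketched as a comparison of session uptimes gathered from each node. This is an illustration under stated assumptions: detect_reconnect_herd is a hypothetical helper, the input is one "host uptime_seconds" line per ClickHouse node (for example collected from session_uptime_elapsed_seconds), and the 5-second default window is an arbitrary starting point, not a documented threshold.

```shell
# Read "host<whitespace>session_uptime_seconds" lines on stdin, one per node.
# Print "herd" when all sessions started within WINDOW seconds of each other,
# otherwise "no_herd". WINDOW is $1 and defaults to 5 (an assumed value).
# detect_reconnect_herd is a hypothetical helper name.
detect_reconnect_herd() {
    awk -v window="${1:-5}" '
        NF >= 2 {
            up = $2 + 0
            if (n == 0 || up < min) min = up
            if (n == 0 || up > max) max = up
            n++
        }
        END {
            # A single node cannot form a herd, so require at least two samples.
            if (n >= 2 && (max - min) <= (window + 0)) print "herd"
            else print "no_herd"
        }
    '
}
```

Requiring every session to fall inside the window is deliberately strict. On a large cluster you may prefer to flag a herd when only a majority of sessions share a start time.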
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ZK/Keeper operation latency P99 | Predicts session timeouts before they happen | P99 > 10 ms sustained; approaching 100 ms is critical |
| Replica is_readonly state | Direct indicator that a replica has stopped accepting writes | Any sustained is_readonly = 1 for more than 5 minutes |
| Replica is_session_expired state | Shows the replica has lost coordination session | Any non-zero value is abnormal |
| Replication queue depth | Measures how far followers are falling behind | Growing steadily over 15 minutes |
| Distributed DDL queue status | Reveals schema changes stuck across the cluster | Entries in non-Finished status longer than 5 minutes |
| ZooKeeperUserExceptions / ZooKeeperHardwareExceptions rate | Client-side view of coordination failures | Any sustained increase from baseline |
| ZK mntr average latency | Ensemble-level health independent of ClickHouse metrics | Average latency rising toward the session timeout threshold |
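The "growing steadily" warning sign for replication queue depth in the table above can be approximated from periodic samples. The sketch below is one possible reading of that rule, not a Netdata or ClickHouse feature: queue_depth_trend is a hypothetical helper, and the samples are assumed to be integers taken at a fixed interval (for example one per minute over 15 minutes), oldest first.

```shell
# Read queue-depth samples on stdin, oldest first, one integer per line.
# Print "growing" when depth never decreases and ends higher than it started,
# otherwise "not_growing".
# queue_depth_trend is a hypothetical helper name.
queue_depth_trend() {
    awk '
        NF == 0 { next }                 # skip blank lines
        {
            d = $1 + 0
            if (n == 0) first = d
            else if (d < prev) dropped = 1
            prev = d
            n++
        }
        END {
            if (n >= 2 && !dropped && prev > first) print "growing"
            else print "not_growing"
        }
    '
}
```

A queue that spikes and then drains prints not_growing, which matches the intent of alerting only on sustained backlog.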
Fixes
Stop all DDL immediately
If a DDL storm is active, halt every CREATE, ALTER, and DROP operation. DDL creates znodes, which deepens saturation. Wait for coordination latency to drop and the distributed DDL queue to drain before resuming schema changes. Do this first in an active spiral.
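The resume condition described here (latency back down and the DDL queue drained) can be expressed as a simple gate. This is a sketch under assumptions: ddl_may_resume is a hypothetical helper, the 10 ms ceiling reuses this guide's healthy bound, and the two inputs are values you collect yourself, such as average latency from mntr and a count of system.distributed_ddl_queue entries whose status is not Finished.

```shell
# Decide whether schema changes may resume.
#   $1: current coordination latency in milliseconds
#   $2: number of distributed DDL queue entries not yet Finished
# Prints "resume" only when latency is below 10 ms and nothing is pending.
# ddl_may_resume is a hypothetical helper name.
ddl_may_resume() {
    awk -v ms="$1" -v pending="$2" 'BEGIN {
        if ((ms + 0) < 10 && (pending + 0) == 0) print "resume"
        else print "hold"
    }'
}
```

For example, ddl_may_resume 4 3 prints hold because three entries are still pending, even though latency has recovered.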
Reduce metadata pressure
If the root cause is thousands of replicated tables, reduce the znode count. Drop unused replicated tables or consolidate them. This is not a quick fix: dropping replicated tables also touches the coordination service. Spread the work across maintenance windows.
Relieve ZK disk I/O
Move the ZooKeeper transaction log or ClickHouse Keeper log and snapshot storage to a dedicated SSD or NVMe volume. Do not share the disk with ClickHouse data directories. The transaction log is written synchronously, so spinning disks or contended volumes are a common bottleneck. If you are using external ZooKeeper on a shared node, isolate it.
Break the thundering herd
If ClickHouse nodes are hammering a recovering coordination service with reconnections, do not restart ClickHouse processes. Restarts generate fresh session registration and watch storms, which amplify load. Let existing sessions settle. If individual nodes are stuck with expired sessions and the coordination service is already healthy, a rolling restart of ClickHouse may force clean reconnection, but only after latency has returned to normal. Rolling restarts are disruptive; schedule them carefully.
Address JVM GC (external ZooKeeper only)
If you run ZooKeeper rather than ClickHouse Keeper, check JVM heap usage and GC logs. GC pauses directly manifest as latency spikes. Tune heap size or consider migrating to ClickHouse Keeper, which is not JVM-based and eliminates this failure mode.
Prevention
- Limit replicated table sprawl. Every replicated table creates permanent metadata overhead in the coordination service. Use non-replicated MergeTree where high availability is not required.
- Serialize DDL. Run schema changes one at a time instead of in bursts. DDL queues are processed sequentially per node, so a burst of commands creates a metadata backlog that saturates Keeper.
- Monitor coordination latency as a first-class metric. Alert on P99 > 10 ms: the 10 ms to 100 ms range is the stress zone that precedes session expiry, so it is the window in which intervention is still cheap.
- Give Keeper dedicated fast disks. The transaction log is written synchronously, so storage contention on the coordination node directly adds latency to every replicated operation.
- Right-size the ensemble. Plan coordination capacity for peak znode and watch count, not average load, to absorb table growth and DDL bursts.
How Netdata helps
- Correlate ZooKeeperWaitMicroseconds from system.events with replica is_readonly state to distinguish a coordination saturation spiral from a transient network blip.
- Cross-reference system.replication_queue depth against distributed DDL queue status to identify whether a DDL storm is driving the outage.
- Track disk I/O on Keeper nodes separately from ClickHouse data volumes to catch transaction-log bottlenecks before they propagate into session timeouts.
- Alert on is_session_expired from system.replicas as an early indicator that the spiral has begun and writes are at risk.
- Monitor inter-server network metrics alongside ZK latency to rule out network partition when replicas lose connectivity.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse cannot connect to ZooKeeper/Keeper: diagnosing the coordination layer
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion







