ClickHouse Keeper saturation spiral: too many tables, DDL storms, and cluster freeze

INSERTs fail. Replicated tables flip read-only. ON CLUSTER DDL hangs. Yet `curl http://localhost:8123/ping` still returns `Ok.` and `SELECT 1` still works. This is Keeper saturation: the coordination layer is choking while liveness probes give false confidence.

The spiral starts when ZooKeeper or ClickHouse Keeper cannot keep up with metadata load. Every replicated table registers znodes and watches. Every DDL operation adds more. Coordination latency climbs until heartbeat traffic cannot complete within the negotiated session timeout; sessions expire, replicas become read-only, and writes fail. Reconnecting nodes then trigger a thundering herd.

Healthy operation latency stays below 10 ms P99. Past 10 ms the ensemble is under stress; past 100 ms it is severely degraded. Sessions expire when the client cannot exchange heartbeats within the session timeout window, which becomes likely once latency is sustained in the hundreds of milliseconds or when packet loss and GC pauses block the heartbeat path. The first response is always the same: stop all DDL.
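
The latency bands above can be encoded as a small classifier for alert routing. This is a minimal sketch in Python: the function name and the state labels are hypothetical, and only the 10 ms and 100 ms boundaries come from this guide.

```python
def classify_keeper_latency(p99_ms: float) -> str:
    """Map a P99 coordination latency (milliseconds) to a coarse health state.

    Boundaries follow this guide: up to 10 ms is healthy, past 10 ms the
    ensemble is under stress, past 100 ms it is severely degraded.
    The returned labels are illustrative, not a ClickHouse or Keeper API.
    """
    if p99_ms < 0:
        raise ValueError("latency cannot be negative")
    if p99_ms > 100:
        return "severely_degraded"
    if p99_ms > 10:
        return "under_stress"
    return "healthy"
```

A reading of exactly 10 ms is treated as healthy here; shift the comparison if you prefer to alert on the boundary itself.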

flowchart TD
    A[Metadata explosion or DDL storm] --> B[ZK znode count grows]
    B --> C[Transaction log I/O saturated]
    C --> D[Severe latency spike blocks heartbeats]
    D --> E[Session timeouts]
    E --> F[Replicas read-only]
    F --> G[Cluster-wide write failure]
    G --> H[Reconnection storm]
    H --> C

What this means

ClickHouse relies on a coordination service for replicated tables, distributed DDL, and replication queues. When that service saturates, the database does not crash. It loses the ability to mutate replicated state. Replicas with expired sessions stop accepting writes. Distributed DDL tasks stall. Non-replicated tables keep working, which makes the outage scope confusing.

Common causes

Cause	What it looks like	First thing to check
Metadata explosion from too many replicated tables	`system.zookeeper` queries under `/clickhouse/tables` return huge child counts; latency climbs even with low query load	Count replicated tables and znodes under the configured root path
DDL storm	Many `CREATE`, `ALTER`, or `DROP` operations running concurrently; `system.distributed_ddl_queue` has unfinished entries	Query `system.query_log` for recent `query_kind IN ('Alter', 'Create', 'Drop')`
ZK transaction log disk bottleneck	Keeper/ZooKeeper disk I/O is saturated while CPU is low; `mntr` shows high avg latency	Disk latency on the volume hosting the transaction log
Post-outage reconnection thundering herd	ZK recovered but ClickHouse nodes reconnect simultaneously, driving latency back up	Connection count spikes on ZK ports after a recent leader election
JVM GC pauses (external ZooKeeper only)	Latency spikes correlate with GC log entries; not applicable to ClickHouse Keeper	JVM heap usage and GC logs on ZooKeeper nodes

Quick checks

Run these read-only probes to confirm a coordination saturation event.

-- Test live ZK connectivity from ClickHouse
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;

-- Check session health and expiration
SELECT host, port, is_expired, session_uptime_elapsed_seconds, session_timeout_ms
FROM system.zookeeper_connection;

-- Find read-only or session-expired replicas
SELECT database, table, is_leader, is_readonly, is_session_expired, total_replicas, active_replicas
FROM system.replicas;

-- Inspect replication queue for stuck entries
SELECT database, table, type, create_time, last_attempt_time, num_tries, last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;

-- Check distributed DDL status where system.distributed_ddl_queue is available
-- (recent releases name the column host; older releases expose host_name)
SELECT entry, host, status, exception_text, query_create_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time DESC;

# Probe ClickHouse Keeper directly (4lw commands on port 9181)
echo ruok | nc localhost 9181
echo mntr | nc localhost 9181

-- ClickHouse-side ZK event counters
SELECT event, value FROM system.events WHERE event LIKE 'ZooKeeper%';
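
The `mntr` reply is a list of tab-separated key and value lines, which makes it easy to turn into a dictionary for scripted checks. The sketch below assumes Keeper listens on port 9181 as in the probe above and that the server closes the connection after replying; the function names and the sample text are illustrative, not captured from a real server.

```python
import socket


def send_four_letter_word(command: str, host: str = "localhost",
                          port: int = 9181, timeout_s: float = 5.0) -> str:
    """Send a four-letter-word command (for example "mntr") and return the reply."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # assumed: server closes the socket after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text: str) -> dict:
    """Parse "key<TAB>value" lines; numeric values become int or float."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # skip blank or unexpected lines
        key, value = key.strip(), value.strip()
        try:
            metrics[key] = int(value)
        except ValueError:
            try:
                metrics[key] = float(value)
            except ValueError:
                metrics[key] = value  # keep non-numeric values as text
    return metrics


# Illustrative sample only, not real server output.
SAMPLE = ("zk_avg_latency\t12\n"
          "zk_outstanding_requests\t340\n"
          "zk_server_state\tleader\n"
          "zk_znode_count\t1250000\n")
metrics = parse_mntr(SAMPLE)
```

Against a live node, `parse_mntr(send_four_letter_word("mntr"))` would yield the same structure, with `zk_avg_latency` and `zk_outstanding_requests` being the first values to look at.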

How to diagnose it

Confirm liveness is a false signal. Verify that `curl http://localhost:8123/ping` returns `Ok.` and that local SELECT queries execute. If the server is alive but replicated inserts fail, you are looking at coordination failure, not a process crash.
Check coordination latency. Run SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1 and time it. A response under 100 ms is normal for this query; a multi-second response or a timeout confirms saturation.
Inspect replica sessions. Query system.replicas for is_readonly = 1 or is_session_expired = 1. If these are widespread, the cluster has already lost its sessions.
Identify the metadata load. Count recent DDL in system.query_log. Check system.zookeeper for znode counts under the configured table root. A sudden jump implicates a DDL storm or table proliferation.
Examine queues. system.replication_queue entries with high num_tries indicate that replicas are retrying failed operations. system.distributed_ddl_queue entries stuck in a non-Finished status confirm DDL is backing up.
Probe Keeper or ZooKeeper directly. Use mntr to read average latency, outstanding requests, and open file descriptor count on the coordination nodes. If disk latency on the transaction log volume is high, storage is the bottleneck.
Look for the thundering herd. If the event started after a ZK restart or leader election, check whether ClickHouse nodes are reconnecting simultaneously. system.zookeeper_connection session uptimes that cluster around the same second suggest a herd.
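
The last step, spotting session uptimes that cluster around the same second, can be checked mechanically once you have collected `session_uptime_elapsed_seconds` from every ClickHouse node. The following is a sketch under stated assumptions: the function name, the 5-second window, and the 80% fraction are hypothetical tuning choices, not values from ClickHouse.

```python
import math


def looks_like_reconnect_herd(session_uptimes_s, window_s=5, min_fraction=0.8):
    """Return True if most sessions were established within window_s of each other.

    session_uptimes_s holds one session_uptime_elapsed_seconds value per node,
    sampled at roughly the same moment. window_s and min_fraction are
    illustrative knobs.
    """
    uptimes = sorted(session_uptimes_s)
    if len(uptimes) < 2:
        return False
    needed = max(2, math.ceil(len(uptimes) * min_fraction))
    right = 0
    for left in range(len(uptimes)):
        # Advance right past every uptime within window_s of uptimes[left].
        while right < len(uptimes) and uptimes[right] - uptimes[left] <= window_s:
            right += 1
        if right - left >= needed:
            return True
    return False
```

For example, uptimes of 120, 121, 122, and 123 seconds alongside one long-lived session would be flagged, while uptimes spread across hours would not.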

Metrics and signals to monitor

Signal	Why it matters	Warning sign
ZK/Keeper operation latency P99	Predicts session timeouts before they happen	P99 > 10 ms sustained; approaching 100 ms is critical
Replica `is_readonly` state	Direct indicator that a replica has stopped accepting writes	Any sustained `is_readonly = 1` for more than 5 minutes
Replica `is_session_expired` state	Shows the replica has lost coordination session	Any non-zero value is abnormal
Replication queue depth	Measures how far replicas are falling behind	Growing steadily over 15 minutes
Distributed DDL queue status	Reveals schema changes stuck across the cluster	Entries in non-Finished status longer than 5 minutes
`ZooKeeperUserExceptions` / `ZooKeeperHardwareExceptions` rate	Client-side view of coordination failures	Any sustained increase from baseline
ZK `mntr` average latency	Ensemble-level health independent of ClickHouse metrics	Average latency rising toward the session timeout threshold
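
Two of the warning signs in the table are time-based: a condition that holds for more than 5 minutes, and a queue that grows steadily over 15 minutes. A minimal sketch of both checks follows, assuming samples arrive as `(unix_timestamp, value)` pairs sorted by time; the function names and the definition of "steadily" (never decreasing and ending higher) are assumptions made for illustration.

```python
def sustained_for(samples, predicate, min_duration_s=300):
    """True if predicate(value) held for every trailing sample spanning min_duration_s."""
    if not samples or not predicate(samples[-1][1]):
        return False
    start_ts = samples[-1][0]
    # Walk backwards until the condition stops holding.
    for ts, value in reversed(samples):
        if not predicate(value):
            break
        start_ts = ts
    return samples[-1][0] - start_ts >= min_duration_s


def is_growing_steadily(samples, window_s=900):
    """True if depth never drops inside the trailing window and ends higher than it starts."""
    if not samples:
        return False
    latest_ts = samples[-1][0]
    recent = [depth for ts, depth in samples if latest_ts - ts <= window_s]
    if len(recent) < 3:
        return False  # too few points to call it a trend
    never_drops = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    return never_drops and recent[-1] > recent[0]
```

The same `sustained_for` helper covers both the `is_readonly` row and the distributed DDL row by changing the predicate.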

Fixes

Stop all DDL immediately

If a DDL storm is active, halt every CREATE, ALTER, and DROP operation. DDL creates znodes, which deepens saturation. Wait for coordination latency to drop and the distributed DDL queue to drain before resuming schema changes. Do this first in an active spiral.
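
The resume condition described above (latency has dropped and the distributed DDL queue has drained) can be expressed as an explicit gate. This is a sketch, not a ClickHouse feature: `read_state` is a hypothetical callable you would implement against your own metrics, the 10 ms ceiling reuses the healthy threshold from this guide, and requiring three consecutive healthy readings is an assumption added to avoid resuming on a single good sample.

```python
import time


def safe_to_resume_ddl(p99_latency_ms, unfinished_ddl_entries, latency_ceiling_ms=10.0):
    """True when latency is back under the ceiling and no DDL entries are pending."""
    return p99_latency_ms < latency_ceiling_ms and unfinished_ddl_entries == 0


def wait_until_safe(read_state, checks_required=3, interval_s=30.0,
                    sleep=time.sleep, max_checks=1000):
    """Poll read_state() until it reports a healthy state checks_required times in a row.

    read_state() must return (p99_latency_ms, unfinished_ddl_entries).
    Returns False if max_checks polls pass without a long enough healthy streak.
    """
    streak = 0
    for _ in range(max_checks):
        p99_ms, unfinished = read_state()
        streak = streak + 1 if safe_to_resume_ddl(p99_ms, unfinished) else 0
        if streak >= checks_required:
            return True
        sleep(interval_s)
    return False
```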

Reduce metadata pressure

If the root cause is thousands of replicated tables, reduce the znode count. Drop unused replicated tables or consolidate them. This is not a quick fix: dropping replicated tables also touches the coordination service. Spread the work across maintenance windows.
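
Spreading the cleanup across maintenance windows amounts to splitting the list of tables into bounded batches. A small sketch, assuming you already have the list of unused tables; the function name and the per-window limit are hypothetical, and the right batch size depends on how your coordination service responds to each drop.

```python
def plan_drop_batches(tables, per_window):
    """Split tables into consecutive batches of at most per_window entries."""
    if per_window < 1:
        raise ValueError("per_window must be at least 1")
    return [tables[i:i + per_window] for i in range(0, len(tables), per_window)]
```

Each returned batch would be handled in its own window, with coordination latency checked before moving to the next one.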

Relieve ZK disk I/O

Move the ZooKeeper transaction log or ClickHouse Keeper log and snapshot storage to a dedicated SSD or NVMe volume. Do not share the disk with ClickHouse data directories. The transaction log is written synchronously, so spinning disks or contended volumes are a common bottleneck. If you are using external ZooKeeper on a shared node, isolate it.

Break the thundering herd

If ClickHouse nodes are hammering a recovering coordination service with reconnections, do not restart ClickHouse processes. Restarts generate fresh session registration and watch storms, which amplify load. Let existing sessions settle. If individual nodes are stuck with expired sessions and the coordination service is already healthy, a rolling restart of ClickHouse may force clean reconnection, but only after latency has returned to normal. Rolling restarts are disruptive; schedule them carefully.
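
If a rolling restart does become necessary, the rule above (one node at a time, and only while latency is normal) can be sketched as a gated loop. Everything here is hypothetical scaffolding: `restart`, `latency_is_normal`, and `wait` are callables you would supply, and the per-node wait limit is an assumed safety valve.

```python
def rolling_restart(nodes, restart, latency_is_normal, wait, max_waits_per_node=10):
    """Restart nodes one at a time, pausing whenever coordination latency is not normal.

    Returns the list of nodes that were restarted. Stops early, without touching
    the remaining nodes, if latency stays abnormal for max_waits_per_node waits.
    """
    restarted = []
    for node in nodes:
        waits = 0
        while not latency_is_normal():
            if waits >= max_waits_per_node:
                return restarted  # abort: the coordination service never settled
            wait()
            waits += 1
        restart(node)
        restarted.append(node)
    return restarted
```

Aborting rather than pressing on keeps the restart from adding reconnection load to a coordination service that is still struggling.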

Address JVM GC (external ZooKeeper only)

If you run ZooKeeper rather than ClickHouse Keeper, check JVM heap usage and GC logs. GC pauses directly manifest as latency spikes. Tune heap size or consider migrating to ClickHouse Keeper, which is not JVM-based and eliminates this failure mode.

Prevention

Limit replicated table sprawl. Every replicated table creates permanent metadata overhead in the coordination service. Use non-replicated MergeTree where high availability is not required.
Serialize DDL. Issue schema changes one at a time rather than in bursts. DDL queues are processed sequentially per node, so a burst of commands creates a metadata backlog that saturates Keeper.
Monitor coordination latency as a first-class metric. Alert on P99 > 10 ms: the 10 ms to 100 ms range is the warning zone, and acting there leaves time before latency reaches the sustained hundreds of milliseconds at which sessions begin expiring.
Give Keeper dedicated fast disks. The transaction log is written synchronously, so storage contention on the coordination node directly adds latency to every replicated operation.
Right-size the ensemble. Plan coordination capacity for peak znode and watch count, not average load, to absorb table growth and DDL bursts.
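
Serializing DDL can be enforced in deployment tooling by submitting one statement and waiting for it to finish before sending the next. The sketch below shows the control flow only: `submit`, `is_finished`, and `wait` are hypothetical callables (for instance, `is_finished` could check for non-Finished rows in `system.distributed_ddl_queue`), and the poll limit is an assumption.

```python
def run_ddl_serially(statements, submit, is_finished, wait, max_polls=100):
    """Run DDL statements strictly one at a time.

    submit(statement) returns a handle; is_finished(handle) reports completion.
    Raises TimeoutError if a statement is still unfinished after max_polls checks,
    leaving the remaining statements unsubmitted.
    """
    completed = []
    for statement in statements:
        handle = submit(statement)
        polls = 0
        while not is_finished(handle):
            polls += 1
            if polls >= max_polls:
                raise TimeoutError(f"DDL did not finish: {statement!r}")
            wait()
        completed.append(statement)
    return completed
```

Stopping on the first statement that fails to finish keeps a stalled change from being followed by a backlog of further schema operations.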

How Netdata helps

Correlate ZooKeeperWaitMicroseconds from system.events with replica is_readonly state to distinguish a coordination saturation spiral from a transient network blip.
Cross-reference system.replication_queue depth against distributed DDL queue status to identify whether a DDL storm is driving the outage.
Track disk I/O on Keeper nodes separately from ClickHouse data volumes to catch transaction-log bottlenecks before they propagate into session timeouts.
Alert on is_session_expired from system.replicas as an early indicator that the spiral has begun and writes are at risk.
Monitor inter-server network metrics alongside ZK latency to rule out network partition when replicas lose connectivity.
