ClickHouse Keeper latency high: the early warning before sessions expire
INSERTs to replicated tables slow down, ON CLUSTER DDL hangs, and the replication queue grows on replicas. SELECT 1 and HTTP /ping stay healthy, and non-replicated tables are fine. The culprit is usually the coordination service, not ClickHouse itself.
Rising ZooKeeper or ClickHouse Keeper operation latency is a leading indicator. Replicated inserts, replication log updates, and distributed DDL all round-trip through Keeper. Because Keeper writes its transaction log synchronously, disk I/O on the Keeper node is the most common bottleneck. A degraded-but-connected coordination service is worse than a hard partition: it silently slows every replicated operation until sessions start expiring and replicas flip to readonly. This article explains how to read the early signals, isolate the cause, and fix it before sessions expire.
flowchart TD
A[Keeper/ZK transaction log fsync slow] --> B[Operation latency rises]
B --> C[Replicated insert and DDL round-trips slow]
C --> D[Replication queue grows]
C --> E[Insert latency rises]
B --> F[Session timeout risk]
F --> G[Replica session expires]
G --> H[Replicas become readonly]
H --> I[Replicated writes fail]
What this means
“Keeper latency high” means the round-trip time for operations against ZooKeeper or ClickHouse Keeper is elevated from ClickHouse’s perspective. Baseline latency varies by network, but a local ensemble typically responds in single-digit milliseconds. When operation latency nears the negotiated session timeout, heartbeats can time out. Once a session expires, the coordination service removes the replica’s ephemeral nodes, the replica flips to readonly, and it must re-register before accepting writes.
The damage happens in stages. First, replicated inserts and DDL slow down because they wait on Keeper. Then replicas fall behind because pulling and committing replication queue entries also goes through Keeper. If the session is lost, the replica becomes readonly until it reconnects and re-establishes its ephemeral nodes. During this window, writes to affected replicated tables can fail or require retries, even though the ClickHouse process appears healthy.
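To make "nears the negotiated session timeout" concrete, the sketch below expresses an observed Keeper latency as a percentage of the session timeout. The function name is hypothetical, and the choice of inputs (for example a worst-case latency sampled from mntr and session_timeout_ms from system.zookeeper_connection) is an assumption, not an established rule.

```shell
# timeout_used_pct LATENCY_MS SESSION_TIMEOUT_MS
# Prints the latency as a percentage of the session timeout, one decimal place.
# Hypothetical helper: both inputs are values you sampled yourself.
timeout_used_pct() {
  awk -v lat="$1" -v timeout="$2" 'BEGIN {
    # Guard against a missing or zero timeout instead of dividing by zero.
    if (timeout + 0 <= 0) { print "n/a"; exit }
    printf "%.1f\n", (lat * 100) / timeout
  }'
}

# Usage: a 1500 ms worst-case latency against a 30000 ms session timeout
#   timeout_used_pct 1500 30000
```

A value that keeps climbing across samples is the pattern to watch; the percentage at which to alert depends on your own baseline.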
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Keeper transaction log disk bottleneck | Latency spikes correlate with high disk await on the Keeper node; mntr shows elevated average latency. | `echo mntr \| nc -w 2 <keeper-host> 2181` |
| Too many replicated tables or watches | zk_znode_count and zk_watch_count are high or growing fast; latency rises with table count. | Count replicated tables and compare with historical znode and watch counts from mntr. |
| DDL or metadata storm | ON CLUSTER operations hang; system.distributed_ddl_queue shows unfinished entries; latency spikes during schema changes. | system.distributed_ddl_queue status and the rate of new DDL in system.query_log. |
| Network degradation short of partition | Keeper is reachable but RTT is high or retransmits are present; ClickHouse-side wait grows faster than server-side latency. | ss -i or netstat -s for retransmits; compare RTT between ClickHouse and Keeper hosts. |
| JVM GC pauses on external ZooKeeper | Regular latency spikes on ZooKeeper but not on ClickHouse Keeper; GC logs show long pauses. | ZooKeeper JVM GC logs and heap usage. This does not apply to ClickHouse Keeper. |
Quick checks
Run these read-only checks to confirm the symptom and narrow the cause.
# Test basic Keeper/ZK responsiveness
echo ruok | nc -w 2 <keeper-host> 2181
# Check Keeper server-side metrics; look for avg latency, znode count, watch count
echo mntr | nc -w 2 <keeper-host> 2181
# For built-in ClickHouse Keeper on port 9181
echo mntr | nc -w 2 localhost 9181
-- Check ClickHouse session health and negotiated timeout
SELECT name, host, port, is_expired, session_uptime_elapsed_seconds, session_timeout_ms
FROM system.zookeeper_connection;
-- Check replica session state
SELECT database, table, is_readonly, is_session_expired, total_replicas, active_replicas
FROM system.replicas
WHERE engine LIKE '%Replicated%';
-- Check cumulative ZooKeeper wait time from ClickHouse perspective
SELECT event, value
FROM system.events
WHERE event LIKE 'ZooKeeper%';
-- Check replication queue growth
SELECT database, table, queue_size, absolute_delay
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;
-- Check for DDL adding metadata pressure
SELECT entry, query, status, exception_text
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time DESC;
# Check disk I/O on the Keeper host; high await points to a transaction log bottleneck
iostat -xz 1 5
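The mntr and iostat outputs above are long, so a small filter can help during an incident. The following is a minimal sketch with hypothetical function names. It assumes mntr prints one key and value per line and that your sysstat version labels the write-latency column w_await; older versions may only print a combined await column, in which case the second helper prints nothing.

```shell
# Reads `mntr` output on stdin and prints only the fields this article uses.
# The key names are standard mntr keys; verify them against your version.
keeper_key_metrics() {
  awk '
    $1 == "zk_avg_latency" ||
    $1 == "zk_max_latency" ||
    $1 == "zk_outstanding_requests" ||
    $1 == "zk_znode_count" ||
    $1 == "zk_watch_count" { print $1 "=" $2 }
  '
}

# high_w_await THRESHOLD_MS
# Reads `iostat -x` output on stdin and prints "device w_await" for every
# device row whose w_await exceeds the threshold. The column is located by
# its header name because its position differs between sysstat versions.
high_w_await() {
  awk -v thr="$1" '
    $1 == "Device" || $1 == "Device:" {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "w_await") col = i
      next
    }
    col > 0 && NF >= col && ($col + 0) > (thr + 0) { print $1, $col }
  '
}

# Usage (host, port, and the 10 ms threshold are placeholders):
#   echo mntr | nc -w 2 <keeper-host> 2181 | keeper_key_metrics
#   iostat -xz 1 5 | high_w_await 10
```

Locating the column by header name avoids hard-coding a field position, at the cost of silently printing nothing when the header is absent.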
How to diagnose it
- Confirm latency from both sides. Compare `ZooKeeperWaitMicroseconds` in `system.events` with the average latency reported by Keeper’s `mntr` command. If the server-side latency is low but ClickHouse wait is high, suspect the network path. If both are high, the bottleneck is on the Keeper node.
- Inspect session health. Query `system.zookeeper_connection`. If `is_expired = 1`, sessions are already dropping. Check `system.replicas` for `is_session_expired` appearing across multiple tables.
- Correlate with replica state. Query `system.replicas` for `is_readonly` and `is_session_expired`. If these appear on multiple tables simultaneously, the coordination service is the common factor.
- Examine Keeper server metrics. From `mntr`, watch `zk_avg_latency`, `zk_znode_count`, `zk_watch_count`, and any queue of outstanding requests. Rising znode or watch counts with rising latency point to metadata overload.
- Check the disk under the transaction log. On the Keeper host, use `iostat` to measure `await` on the volume that holds the transaction log. Sustained high `await` on the log volume is the smoking gun. Also check that the log directory is not filling.
- Look for metadata churn. Check `system.distributed_ddl_queue` for stuck entries and `system.query_log` for recent DDL. A large number of replicated tables, rapid schema changes, or a thundering herd of reconnecting nodes can all raise load.
- Rule out network issues. Measure RTT and retransmits between ClickHouse and Keeper. If latency is high only from certain ClickHouse nodes, check their network paths.
- Choose the fix based on the dominant cause: disk bottleneck, metadata overload, network degradation, or session timeout mismatch.
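The first step in the list, comparing server-side and ClickHouse-side latency, can be sketched as a small decision helper. The function name and the single shared threshold are assumptions for illustration. The two inputs are not measured the same way (one is Keeper's own request latency, the other is time ClickHouse spent waiting), so treat the result as a hint about where to look, not a verdict.

```shell
# classify_latency SERVER_AVG_MS CLIENT_AVG_MS THRESHOLD_MS
# SERVER_AVG_MS: average latency reported by Keeper (e.g. zk_avg_latency).
# CLIENT_AVG_MS: average wait per operation as seen by ClickHouse.
# THRESHOLD_MS:  hypothetical cutoff for "high"; derive it from your baseline.
classify_latency() {
  awk -v server="$1" -v client="$2" -v thr="$3" 'BEGIN {
    if (client + 0 < thr + 0)       print "ok"            # ClickHouse is not waiting long
    else if (server + 0 >= thr + 0) print "keeper-node"   # both sides high
    else                            print "network-path"  # only the client side is high
  }'
}

# Usage:
#   classify_latency 2 120 50
```

Using awk for the comparison keeps the helper working when a server reports fractional milliseconds, which plain shell integer tests would reject.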
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Keeper average operation latency | Direct measure of coordination health. | Sustained upward trend, or approaching session_timeout_ms. |
| `ZooKeeperWaitMicroseconds` rate | ClickHouse-side time spent waiting on Keeper. | Sustained upward trend or step change. |
| `is_expired` from `system.zookeeper_connection` / `is_session_expired` from `system.replicas` | Indicates session flapping before hard failures. | is_expired = 1 or is_session_expired = 1. |
| Replication queue depth | Replicas fall behind when coordination slows. | queue_size growing for > 15 minutes. |
| Insert latency P99 on replicated tables | Replicated inserts include Keeper round-trips. | P99 > 2x baseline without insert rate change. |
| `system.distributed_ddl_queue` status | DDL stalls when Keeper is slow. | Entries stuck in non-Finished state. |
| ZK znode and watch count | Metadata overhead drives load. | Rapid growth or unusually high absolute values. |
| Keeper transaction log disk I/O wait | Synchronous tx log makes disk the usual bottleneck. | await elevated on the transaction log volume. |
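Because ZooKeeperWaitMicroseconds in system.events is a cumulative counter, the useful signal is its change between two samples. The sketch below divides that change by the change in ZooKeeperTransactions over the same interval to approximate the average ClickHouse-side wait per Keeper operation. The function name is hypothetical, and pairing these two counters is one reasonable approach, not the only one.

```shell
# client_avg_wait_ms WAIT1_US TXN1 WAIT2_US TXN2
# WAIT*_US: ZooKeeperWaitMicroseconds at sample 1 and sample 2.
# TXN*:     ZooKeeperTransactions at the same two moments.
# Prints the average wait per operation in milliseconds, or "n/a" when no
# operations completed or a counter went backwards (e.g. server restart).
client_avg_wait_ms() {
  awk -v w1="$1" -v t1="$2" -v w2="$3" -v t2="$4" 'BEGIN {
    dw = w2 - w1
    dt = t2 - t1
    if (dt <= 0 || dw < 0) { print "n/a"; exit }
    printf "%.2f\n", dw / dt / 1000
  }'
}

# Usage: sample both counters twice, some seconds apart, for example with
#   clickhouse-client --query "SELECT value FROM system.events WHERE event = 'ZooKeeperWaitMicroseconds'"
# then pass the four numbers:
#   client_avg_wait_ms 1000000 100 4000000 1100
```

Sampling at a fixed interval and plotting this value gives the rate-style view the table calls for, rather than a total that only ever grows.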
Fixes
Keeper transaction log disk bottleneck
Move the ZooKeeper transaction log to a dedicated, low-latency disk, preferably NVMe, and keep it separate from both ClickHouse data and application logs. ZooKeeper fsyncs every write before responding, so spinning disks, shared volumes, or exhausted SSDs directly raise operation latency. For built-in ClickHouse Keeper, place the Keeper log directory (log_storage_path) on fast storage and avoid sharing it with the ClickHouse data volume.
Tradeoffs: Changing dataLogDir for external ZooKeeper or the Keeper log path for built-in Keeper requires a restart of the coordination node. Schedule this after stabilizing the cluster, and never restart multiple Keeper nodes at once if it risks quorum loss.
Metadata overload from tables, watches, or DDL
Pause non-essential DDL, especially ON CLUSTER operations, until latency recovers. Reduce the number of replicated tables where possible by consolidating tables or using non-replicated engines for transient data. If you use a large replicated_deduplication_window, review whether it is causing excessive znode growth.
Tradeoffs: Pausing DDL delays schema changes. Moving tables to non-replicated engines removes redundancy for that data, so losing a node can mean losing those tables.
Network path degradation
Fix routing, MTU mismatches, or packet loss between ClickHouse and Keeper. Keep the Keeper ensemble in the same low-latency network as the ClickHouse cluster. High RTT or retransmits amplify the effect of every synchronous Keeper operation.
Tradeoffs: Network changes carry their own risk and may require coordination with network or cloud infrastructure teams.
Session timeout tuning as a temporary buffer
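If you need immediate relief while fixing the root cause, increase the session timeout. ClickHouse requests a timeout through session_timeout_ms in the zookeeper section of its server configuration, and the coordination service can cap the negotiated value (for example, maxSessionTimeout in ZooKeeper), so confirm the effective value in system.zookeeper_connection after the change. This reduces session flapping but does not fix the underlying latency problem.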
Tradeoff: Longer timeouts mask coordination degradation and prolong stale reads during true partitions. Treat this as a temporary bridge, not a fix.
Prevention
- Monitor Keeper operation latency as a leading indicator, not just process liveness. Liveness checks miss the degraded-but-connected state.
- Keep the Keeper transaction log on dedicated fast storage with enough headroom. Watch the log directory for growth and the underlying disk for latency.
- Limit replicated table sprawl. Each replicated table adds znodes and watches; excessive table counts are a common root cause of Keeper saturation.
- Gate DDL during incidents. A DDL storm on an already slow Keeper can push it over the edge.
- Set alerts on the rate of `ZooKeeperWaitMicroseconds`, state changes in `system.zookeeper_connection`, and the derivative of replication queue depth.
- Establish baselines during low-load windows using `mntr` and `system.zookeeper_connection` so you can spot deviations early.
How Netdata helps
- Correlate ClickHouse `ZooKeeperWaitMicroseconds` with host disk `await` on Keeper nodes to isolate transaction log disk saturation.
- Track `is_expired` from `system.zookeeper_connection` against `is_readonly` and `is_session_expired` from `system.replicas`.
- Alert on insert latency P99 and `DelayedInserts` before `RejectedInserts` appear.
- Plot replication queue depth derivatives and distributed DDL queue status alongside query error rates to separate coordination issues from query issues.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse distributed DDL stuck: ON CLUSTER queries that never finish
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse cannot connect to ZooKeeper/Keeper: diagnosing the coordination layer
- ClickHouse Keeper saturation spiral: too many tables, DDL storms, and cluster freeze
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly







