ClickHouse distributed DDL stuck: ON CLUSTER queries that never finish

ON CLUSTER DDL is not atomic. The initiator writes a task into a shared queue in ZooKeeper or ClickHouse Keeper under /clickhouse/task_queue/ddl/. Each node pulls entries in order, executes the DDL locally, and reports status back. One slow, down, or misconfigured node can leave the entry unfinished forever, blocking all later DDL on that node and creating schema drift that is invisible until a replica rejects an insert or a subsequent ALTER fails.

The client timeout (distributed_ddl_task_timeout, default 180 seconds) is independent of background execution. Long operations such as MATERIALIZE COLUMN on large tables often time out in the client while merges or mutations continue. Distinguishing a genuine hang from a slow but valid operation is the first step to avoiding harmful intervention.

What this means

system.distributed_ddl_queue exposes the per-host state of the shared queue. A status of Finished means that node completed the task. Any other status, or a missing status from an unreachable node, leaves the entry incomplete. Nodes process the queue sequentially, so one stuck entry blocks everything after it on that node.

flowchart TD
    A[DDL hangs or times out] --> B[Check distributed_ddl_queue]
    B --> C{Status != Finished?}
    C -->|No| D[Check client timeout]
    C -->|Yes| E[Find unfinished host]
    E --> F[Check exception_text]
    F --> G{Schema or ZK issue?}
    G -->|Schema| H[Reconcile across replicas]
    G -->|ZK/Node| I[Fix session or node]
    H --> J[Retry DDL]
    I --> J

Common causes

CauseWhat it looks likeFirst thing to check
Down or unreachable nodeOne host never reports status; entry age grows while others show Finishedsystem.distributed_ddl_queue filtered by entry and host_name
ZooKeeper or Keeper session lossis_session_expired = 1 on the replica; multiple DDL entries stallsystem.replicas and system.zookeeper_connection
Incompatible schemaexception_text shows mismatched columns, missing table, or type conflictSHOW CREATE TABLE output diffed across replicas
Long-running DDL exceeding timeoutStatus is Finished on most hosts but the entry age exceeds 180 secondssystem.merges or system.mutations on the slow host
Orphaned entry from vanished originatorVery old entry with no active host processing itquery_create_time against the current cluster member list

Quick checks

-- List unfinished DDL entries
SELECT entry, query, status, exception_text, query_create_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time DESC;
-- Map an unfinished entry to specific hosts
SELECT entry, host_name, status, exception_code, exception_text
FROM system.distributed_ddl_queue
WHERE entry = '<stuck_entry_id>';
-- Check coordination service connectivity
SELECT name, host, port, is_expired, session_timeout_ms
FROM system.zookeeper_connection;
# Test external ZooKeeper responsiveness
echo "ruok" | nc <zookeeper-host> 2181
# Test ClickHouse Keeper responsiveness
echo ruok | nc localhost 9181
-- Check replica session and readonly state
SELECT database, table, is_readonly, is_session_expired
FROM system.replicas
WHERE engine LIKE '%Replicated%';
-- Verify the node can still query ZK
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;
# Check node liveness from outside ClickHouse
nc -zv <stuck-host> 9000

How to diagnose it

  1. Query system.distributed_ddl_queue for entries where status != 'Finished'. Sort by query_create_time and focus on the oldest stuck entry.
  2. Filter the queue by that entry ID to see per-host status. Identify which host or hosts are not Finished.
  3. Read exception_text on any host reporting an error. This usually tells you whether the failure is schema-related, a resource issue, or a timeout.
  4. If the exception mentions missing tables or column mismatches, collect SHOW CREATE TABLE db.table from at least two replicas and diff the output.
  5. If there is no exception and the host simply lacks a status, check whether the node is reachable on TCP port 9000 and HTTP port 8123.
  6. On the stuck host, query system.zookeeper_connection. If the session is expired, test ZK or Keeper responsiveness from that host with nc or the system.zookeeper probe.
  7. Check system.replicas on the stuck host for is_readonly = 1 or is_session_expired = 1. A read-only replica cannot execute DDL that requires write access to replicated tables.
  8. If the entry age is high but exception_text is empty and the host appears healthy, verify whether the DDL operation is long-running. Check system.merges or system.mutations on the affected host to see if a background operation is still progressing.
  9. For entries older than hours or days where the originating node is gone, consider whether the entry is orphaned and blocking the queue.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
system.distributed_ddl_queue statusDirect view of per-node DDL completionAny row with status != 'Finished' sustained > 5 minutes
ZooKeeper or Keeper operation latencyDDL queue state lives in the coordination service; latency spikes stall propagationZooKeeperWaitMicroseconds trending up, or is_expired = 1
Replica session expiryExpired sessions break replication and DDL coordinationis_session_expired = 1 on any replica
Replica readonly stateRead-only replicas skip or fail DDL that needs writesis_readonly = 1 sustained
Background pool utilizationDDL such as MATERIALIZE COLUMN competes for merge threadsMerge pool > 90% while DDL entry age grows
distributed_ddl_task_timeoutShort timeouts hide legitimate long operationsFrequent client timeouts with no host error

Fixes

Down or unreachable node

Restore network connectivity or restart the ClickHouse process on the affected host. If the node is permanently decommissioned, remove it from the cluster definition and reconcile any queue entries that expect it.

Warning: Restarting the process interrupts queries on that host and triggers cache warmup.

ZooKeeper or Keeper coordination failure

Stop issuing new DDL immediately to avoid compounding the queue backlog. Check ZK ensemble health with echo mntr | nc <host> 2181 or the Keeper metrics endpoint. If the coordination service is healthy but ClickHouse holds expired sessions, restart ClickHouse on the affected node to force reconnection.

Warning: If ZK itself is saturated, address the storage or network bottleneck first. Restarting ClickHouse nodes during a ZK overload will make the situation worse by generating a reconnection storm.

Incompatible schema on a replica

Compare SHOW CREATE TABLE db.table across replicas. If one host has an extra column, missing index, or different type, align the schema manually on the outlier node using targeted ALTER TABLE statements. Do not reissue the original ON CLUSTER command until all nodes are in a consistent state, or you risk creating another stuck entry.

Warning: Manual schema changes bypass the distributed queue and require careful ordering.

Long-running DDL outliving the timeout

Increase distributed_ddl_task_timeout to match the expected duration of large operations. Verify progress on the slow host by checking system.merges or system.mutations rather than relying on the client timeout. If the operation is legitimate but slow, let it finish.

Warning: Longer timeouts mean the client holds the connection and the queue entry remains visible for an extended period.

Orphaned entries

If an entry from a crashed or removed originator blocks the queue and the operation is clearly abandoned, manual cleanup of the corresponding znode under /clickhouse/task_queue/ddl/ in ZooKeeper may be required.

Warning: This is dangerous. Confirm the entry is not still active on any host, and back up coordination service state before deleting anything. Editing metadata directly can corrupt cluster state if done incorrectly.

Prevention

  • Poll system.distributed_ddl_queue after every ON CLUSTER deployment. Do not treat DDL as fire-and-forget in automation.
  • Keep ZooKeeper or Keeper latency below 10 ms and avoid running DDL during maintenance windows that could trigger leader elections.
  • After restarting a node or recovering from a network partition, verify system.replicas shows no expired sessions before issuing schema changes.
  • Set distributed_ddl_task_timeout based on the largest expected DDL operation, not the default 180 seconds.
  • Maintain identical table definitions across replicas. Drift often starts when a previous DDL failed silently on one node and no one checked.

How Netdata helps

  • Correlate stuck DDL entries with ZooKeeper latency spikes across the same time window to isolate coordination bottlenecks.
  • Alert on replica session expiry and readonly state, which are leading indicators that DDL will fail.
  • Track background merge pool saturation that can prolong DDL execution and trigger false timeouts.
  • Correlate DDL timing with cluster-wide insert latency or replication queue growth to detect compound pressure events.