ClickHouse unauthorized DROP TABLE: auditing DDL and privilege anomalies

A production table disappears. An application returns “table does not exist.” In ClickHouse, DROP TABLE removes the table definition and MergeTree data parts from disk immediately. On replicated tables, a single ON CLUSTER command propagates through the coordination service and can erase the table everywhere before you intervene. There is no native undo.

This guide is for finding out what happened, determining blast radius, and closing the gaps.

What this means

DROP TABLE is not a slow failure. It removes the table definition and data parts from disk immediately. ReplicatedMergeTree propagates the drop through the coordination service, so a single command can replicate across the cluster before you can intervene. There is no DDL rollback log.

The most common cause is an over-privileged default user. In many legacy deployments, the default account has no password and broad grants. Applications, analysts, or automation that connect as default can execute destructive DDL. Cluster-wide ON CLUSTER commands amplify the risk because they execute on every reachable node.

Common causes

CauseWhat it looks likeFirst thing to check
Over-privileged default userDROP executed by user default from an unexpected hostsystem.query_log filtered by user = 'default' and query_kind = 'Drop'
Compromised service accountDROP issued from a client host outside the known fleetclient_hostname and query timestamp in system.query_log
Human error via client toolAnalyst connected to production instead of staginguser and client_hostname at the exact event time
Application or migration bugUnintended DROP during a deploymentDDL queries from deployment hosts just before the incident, or failed migration queries in system.query_log
Malicious insider or lateral movementLarge SELECT exports followed by DROP from the same sessionCorrelate high read_bytes with DDL in system.query_log

Quick checks

Run these immediately. They are read-only.

-- Destructive DDL in the last hour by user and source host
SELECT
    event_time,
    user,
    client_hostname,
    query_kind,
    substring(query, 1, 200) AS query_prefix
FROM system.query_log
WHERE query_kind IN ('Alter', 'Create', 'Drop', 'Truncate')
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC;
-- Cluster DDL that did not finish on all replicas
SELECT
    entry,
    query,
    status,
    exception_text,
    query_create_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time DESC;
-- Privilege exceptions that may indicate probing
SELECT
    event_time,
    user,
    query,
    exception
FROM system.query_log
WHERE exception LIKE '%ACCESS_DENIED%'
  AND event_time > now() - INTERVAL 1 HOUR;
-- Failed login attempts
SELECT
    event_time,
    user,
    client_address,
    auth_type,
    failure_reason,
    type
FROM system.session_log
WHERE type = 'LoginFailure'
  AND event_time > now() - INTERVAL 1 HOUR;
# Verify network exposure
ss -tlnp | grep clickhouse

How to diagnose it

  1. Confirm the event. Query system.query_log for query_kind = 'Drop' around the incident window. Verify whether the query finished (type = 'QueryFinish'). A successful DROP with no exception means the data is gone.
  2. Determine if it was cluster-wide. If the query included ON CLUSTER, inspect system.distributed_ddl_queue. Entries with status != 'Finished' mean some replicas did not execute the drop, leaving schema drift. If all hosts show Finished, the drop propagated everywhere.
  3. Identify the actor. Capture user, client_hostname, and the exact query text from system.query_log. If the user is default, the over-privileged default account is the likely root cause.
  4. Check for reconnaissance. In the minutes before the DROP, look for large SELECT queries from the same user or host with abnormally high read_bytes or result_bytes. Extraction before destruction is a common attack pattern.
  5. Audit privilege grants. Review which accounts hold DDL privileges. Flag service accounts with broader grants than their workload requires.
  6. Inspect replication health. If the table was replicated and the DROP was not uniform, compare system.replicas across nodes. A replica that missed the DDL may still hold the table data, while others do not.
flowchart TD
    A[Table missing or schema changed] --> B{Query system.query_log}
    B -->|Drop found| C[Record user, host, time]
    B -->|No local entry| D{Check system.distributed_ddl_queue}
    D -->|ON CLUSTER drop| E[Check per-replica status]
    D -->|No entry| F[Check system.session_log]
    C --> G[Audit grants for that user]
    E --> H[Schema drift or full propagation]
    G --> I[Revoke excess privileges]
    F --> J[Investigate auth anomalies]
    H --> K[Restore or resync replica]

Metrics and signals to monitor

SignalWhy it mattersWarning sign
system.query_log DDL rateEarly signal of unauthorized schema changesAny Drop or Truncate from non-admin users
system.distributed_ddl_queue statusReveals silent schema drift after cluster DDLstatus != 'Finished' sustained for more than 5 minutes
system.session_log login failuresIndicates brute force or credential misuseMore than 10 failures per minute from a single source
system.query_log read_bytes per userFlags data exfiltration that may precede destructionUnexpected user scanning large volumes followed by DDL

Fixes

Immediate containment

  • Revoke DDL privileges from the affected account immediately. If the session is active, terminate any running queries and rotate the password.
  • Rotate credentials for any service account that issued the DROP.
  • If the event came from an unexpected host, verify ClickHouse is bound to internal interfaces only. ss -tlnp | grep clickhouse should not show public-facing listeners unless explicitly required.

Recovery

  • Restore from backup. ClickHouse has no native point-in-time recovery for DDL. If you have filesystem backups or object-storage snapshots, recreate the table definition and reload the data.
  • If the table was ReplicatedMergeTree and some replicas missed the DROP due to a failed distributed DDL, use SYSTEM RESTART REPLICA to force a re-check against ZooKeeper state. If a replica diverged and lost parts, SYSTEM RESTORE REPLICA re-initializes it from other replicas. Be aware that this is destructive to local divergent state.
  • If no backup exists and one replica still holds the table data, that replica is your recovery source. Do not restart or resync it until you have extracted the data.

Privilege lockdown

  • Do not run applications or ad-hoc tools as the default user. Create named accounts with restricted grants.
  • Restrict the default user to localhost-only connections, or disable it entirely if your deployment allows.
  • Apply role-based access control. Limit ALTER, DROP, CREATE, and TRUNCATE to dedicated administrative accounts.
  • Bind ClickHouse to internal network interfaces via listen_host to reduce exposure.

Guardrails

  • Set max_table_size_to_drop to require an explicit configuration override before large production tables can be dropped. This prevents accidental drops from standard clients.
  • Review distributed_ddl_task_timeout. The default is 180 seconds. If your cluster is large or nodes are under load, short timeouts can leave DDL half-applied, creating silent schema drift.

Prevention

  • Audit before events. Periodically query system.query_log for DDL from unexpected users or hosts. A five-minute audit run once per shift catches drift early.
  • Least privilege. Service accounts should have read or append access only. DDL should require a dedicated admin role.
  • Cluster DDL monitoring. Alert on system.distributed_ddl_queue entries that do not reach Finished. A stuck DDL entry blocks everything queued behind it.
  • Session forensics. Ensure session_log is enabled in the server configuration. Without it, you cannot reconstruct which client addresses connected or failed to authenticate.

How Netdata helps

  • Correlate query latency spikes or error jumps with DDL events using system.query_log trends.
  • Flag anomalous sequences such as a large SELECT followed by DROP from the same session.
  • Alert on replication health shifts (is_readonly, is_session_expired) after cluster DDL to catch skipped nodes.
  • Watch system.query_log growth for unusual volume from attack patterns or runaway automation.