ClickHouse unauthorized DROP TABLE: auditing DDL and privilege anomalies
A production table disappears. An application returns “table does not exist.” In ClickHouse, DROP TABLE removes the table definition and MergeTree data parts from disk immediately. On replicated tables, a single ON CLUSTER command propagates through the coordination service and can erase the table everywhere before you intervene. There is no native undo.
This guide is for finding out what happened, determining blast radius, and closing the gaps.
What this means
DROP TABLE is not a slow failure. It removes the table definition and data parts from disk immediately. ReplicatedMergeTree propagates the drop through the coordination service, so a single command can replicate across the cluster before you can intervene. There is no DDL rollback log.
The most common cause is an over-privileged default user. In many legacy deployments, the default account has no password and broad grants. Applications, analysts, or automation that connect as default can execute destructive DDL. Cluster-wide ON CLUSTER commands amplify the risk because they execute on every reachable node.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
Over-privileged default user | DROP executed by user default from an unexpected host | system.query_log filtered by user = 'default' and query_kind = 'Drop' |
| Compromised service account | DROP issued from a client host outside the known fleet | client_hostname and query timestamp in system.query_log |
| Human error via client tool | Analyst connected to production instead of staging | user and client_hostname at the exact event time |
| Application or migration bug | Unintended DROP during a deployment | DDL queries from deployment hosts just before the incident, or failed migration queries in system.query_log |
| Malicious insider or lateral movement | Large SELECT exports followed by DROP from the same session | Correlate high read_bytes with DDL in system.query_log |
Quick checks
Run these immediately. They are read-only.
-- Destructive DDL in the last hour by user and source host
SELECT
event_time,
user,
client_hostname,
query_kind,
substring(query, 1, 200) AS query_prefix
FROM system.query_log
WHERE query_kind IN ('Alter', 'Create', 'Drop', 'Truncate')
AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC;
-- Cluster DDL that did not finish on all replicas
SELECT
entry,
query,
status,
exception_text,
query_create_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time DESC;
-- Privilege exceptions that may indicate probing
SELECT
event_time,
user,
query,
exception
FROM system.query_log
WHERE exception LIKE '%ACCESS_DENIED%'
AND event_time > now() - INTERVAL 1 HOUR;
-- Failed login attempts
SELECT
event_time,
user,
client_address,
auth_type,
failure_reason,
type
FROM system.session_log
WHERE type = 'LoginFailure'
AND event_time > now() - INTERVAL 1 HOUR;
# Verify network exposure
ss -tlnp | grep clickhouse
How to diagnose it
- Confirm the event. Query
system.query_logforquery_kind = 'Drop'around the incident window. Verify whether the query finished (type = 'QueryFinish'). A successful DROP with no exception means the data is gone. - Determine if it was cluster-wide. If the query included
ON CLUSTER, inspectsystem.distributed_ddl_queue. Entries withstatus != 'Finished'mean some replicas did not execute the drop, leaving schema drift. If all hosts showFinished, the drop propagated everywhere. - Identify the actor. Capture
user,client_hostname, and the exact query text fromsystem.query_log. If the user isdefault, the over-privileged default account is the likely root cause. - Check for reconnaissance. In the minutes before the DROP, look for large
SELECTqueries from the same user or host with abnormally highread_bytesorresult_bytes. Extraction before destruction is a common attack pattern. - Audit privilege grants. Review which accounts hold DDL privileges. Flag service accounts with broader grants than their workload requires.
- Inspect replication health. If the table was replicated and the DROP was not uniform, compare
system.replicasacross nodes. A replica that missed the DDL may still hold the table data, while others do not.
flowchart TD
A[Table missing or schema changed] --> B{Query system.query_log}
B -->|Drop found| C[Record user, host, time]
B -->|No local entry| D{Check system.distributed_ddl_queue}
D -->|ON CLUSTER drop| E[Check per-replica status]
D -->|No entry| F[Check system.session_log]
C --> G[Audit grants for that user]
E --> H[Schema drift or full propagation]
G --> I[Revoke excess privileges]
F --> J[Investigate auth anomalies]
H --> K[Restore or resync replica]Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
system.query_log DDL rate | Early signal of unauthorized schema changes | Any Drop or Truncate from non-admin users |
system.distributed_ddl_queue status | Reveals silent schema drift after cluster DDL | status != 'Finished' sustained for more than 5 minutes |
system.session_log login failures | Indicates brute force or credential misuse | More than 10 failures per minute from a single source |
system.query_log read_bytes per user | Flags data exfiltration that may precede destruction | Unexpected user scanning large volumes followed by DDL |
Fixes
Immediate containment
- Revoke DDL privileges from the affected account immediately. If the session is active, terminate any running queries and rotate the password.
- Rotate credentials for any service account that issued the DROP.
- If the event came from an unexpected host, verify ClickHouse is bound to internal interfaces only.
ss -tlnp | grep clickhouseshould not show public-facing listeners unless explicitly required.
Recovery
- Restore from backup. ClickHouse has no native point-in-time recovery for DDL. If you have filesystem backups or object-storage snapshots, recreate the table definition and reload the data.
- If the table was ReplicatedMergeTree and some replicas missed the DROP due to a failed distributed DDL, use
SYSTEM RESTART REPLICAto force a re-check against ZooKeeper state. If a replica diverged and lost parts,SYSTEM RESTORE REPLICAre-initializes it from other replicas. Be aware that this is destructive to local divergent state. - If no backup exists and one replica still holds the table data, that replica is your recovery source. Do not restart or resync it until you have extracted the data.
Privilege lockdown
- Do not run applications or ad-hoc tools as the
defaultuser. Create named accounts with restricted grants. - Restrict the
defaultuser to localhost-only connections, or disable it entirely if your deployment allows. - Apply role-based access control. Limit
ALTER,DROP,CREATE, andTRUNCATEto dedicated administrative accounts. - Bind ClickHouse to internal network interfaces via
listen_hostto reduce exposure.
Guardrails
- Set
max_table_size_to_dropto require an explicit configuration override before large production tables can be dropped. This prevents accidental drops from standard clients. - Review
distributed_ddl_task_timeout. The default is 180 seconds. If your cluster is large or nodes are under load, short timeouts can leave DDL half-applied, creating silent schema drift.
Prevention
- Audit before events. Periodically query
system.query_logfor DDL from unexpected users or hosts. A five-minute audit run once per shift catches drift early. - Least privilege. Service accounts should have read or append access only. DDL should require a dedicated admin role.
- Cluster DDL monitoring. Alert on
system.distributed_ddl_queueentries that do not reachFinished. A stuck DDL entry blocks everything queued behind it. - Session forensics. Ensure
session_logis enabled in the server configuration. Without it, you cannot reconstruct which client addresses connected or failed to authenticate.
How Netdata helps
- Correlate query latency spikes or error jumps with DDL events using
system.query_logtrends. - Flag anomalous sequences such as a large
SELECTfollowed byDROPfrom the same session. - Alert on replication health shifts (
is_readonly,is_session_expired) after cluster DDL to catch skipped nodes. - Watch
system.query_loggrowth for unusual volume from attack patterns or runaway automation.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse background pool saturation: when merges and mutations starve
- ClickHouse mark cache and uncompressed cache: reading low hit rates
- ClickHouse client connections climbing: TCP 9000, HTTP 8123, and connection leaks
- ClickHouse checksum mismatch and broken parts: detecting data corruption
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse detached parts piling up: reading system.detached_parts and reclaiming space
- ClickHouse disk space collapse: why merges need free space and how the spiral starts
- ClickHouse disk space monitoring: free_space, unreserved_space, and the 80% target
- ClickHouse distributed DDL stuck: ON CLUSTER queries that never finish







