$ guides / clickhouse / clickhouse-system-tables-unbounded-growth ▌

Operations Guides

ClickHouse system log tables eating disk: query_log, part_log, and TTL

You are investigating disk growth on a ClickHouse node. User tables look reasonable, but system.query_log or system.part_log are consuming tens or hundreds of gigabytes. These tables use MergeTree-family engines and are subject to the same part-count and merge pressure limits as production tables. Without a TTL rule, data accumulates forever. During incidents with high query error rates or retry storms, query_log can expand fast enough to threaten disk capacity and stall logging.

ClickHouse flushes telemetry in batches to system.query_log, system.part_log, system.trace_log, and system.text_log. Each flush creates a data part. Without TTL, old parts are never dropped. On high-QPS clusters, frequent flushes create many parts, and every failed query is logged, so error spikes accelerate growth. Eventually the system tables become the largest disk consumers. If part counts cross parts_to_delay_insert or parts_to_throw_insert, new log inserts are delayed or rejected.

What this means

System log tables behave exactly like MergeTree tables: inserts create active parts, background merges consolidate them, and part-count limits apply. No TTL means unbounded disk growth.

The risk is not just space. When system.query_log accumulates merge debt, it competes with production tables for background pool threads and disk I/O. If it crosses its part limit, new log entries are rejected. You lose observability exactly when you need it.

Common causes

Cause	What it looks like	First thing to check
No TTL on system log tables	`system.query_log` or `system.part_log` rank among top disk consumers; oldest parts span months	`SELECT ... FROM system.parts WHERE database = 'system'` for size and part counts
High-error incident	`query_log` disk usage spikes after an outage; error rate correlates with growth	`FailedQuery` in `system.events` around the incident window
High merge or insert activity	`part_log` is large on write-heavy clusters with frequent merges or mutations	Merge throughput and `part_log` event volume
Merge starvation on system tables	`system.query_log` part count climbs toward `parts_to_throw_insert`	Active parts per partition for system tables

Quick checks

-- System tables consuming disk
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    count() AS active_parts
FROM system.parts
WHERE active = 1 AND database = 'system'
GROUP BY database, table
ORDER BY size DESC;

-- Per-partition part counts for system tables
SELECT
    table,
    partition_id,
    count() AS parts_in_partition
FROM system.parts
WHERE active = 1 AND database = 'system'
GROUP BY table, partition_id
ORDER BY parts_in_partition DESC
LIMIT 10;

-- Query log volume over the last 24 hours
SELECT
    toStartOfHour(event_time) AS hour,
    count() AS queries,
    countIf(type = 'ExceptionWhileProcessing') AS errors
FROM system.query_log
WHERE event_time > now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour DESC;

-- Inspect engine definition for TTL and partitioning
SELECT engine_full FROM system.tables WHERE database = 'system' AND name = 'query_log';

-- Insert limits hit on any table
SELECT event, value
FROM system.events
WHERE event IN ('RejectedInserts', 'DelayedInserts');

-- Active merges on system tables
SELECT
    database,
    table,
    elapsed,
    progress
FROM system.merges
WHERE database = 'system';

How to diagnose it

Measure system table footprint. Run the system.parts size query. If system.query_log, system.part_log, trace_log, or text_log exceed your retention budget, they are the primary consumers.
Check part density. High active part counts in system.query_log indicate merge pressure.
Correlate with workload. A spike in ExceptionWhileProcessing or overall query volume explains sudden query_log growth.
Verify TTL. SELECT engine_full FROM system.tables or SHOW CREATE TABLE system.query_log reveals whether a TTL clause exists. Absence confirms unbounded growth.
Assess merge health. Query system.merges for the system database. If merges are scarce while parts grow, the table will hit limits regardless of TTL.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
System table disk size (`system.parts.bytes_on_disk`)	Direct measure of unbounded growth	Any system table larger than your largest production table
Active parts in `system.query_log`	Predicts too-many-parts on the log table itself	Part count trending toward your `parts_to_delay_insert` threshold
Failed query rate	Failed queries are still logged; errors accelerate `query_log` growth	Sudden spike in `ExceptionWhileProcessing` entries
`system.part_log` size	High merge activity generates many part events	Steady growth on write-heavy clusters
`DelayedInserts` / `RejectedInserts`	System tables can trigger these limits internally	Nonzero deltas with system tables near part limits

Fixes

Configure TTL on system tables

Set a TTL expression on system.query_log, system.part_log, system.trace_log, and system.text_log. Default installations usually partition these tables by toYYYYMMDD(event_date), so a TTL on event_date drops entire partitions efficiently during merges.

For an existing table, apply TTL with ALTER TABLE:

ALTER TABLE system.query_log MODIFY TTL event_date + INTERVAL 7 DAY;

For tables created from server configuration, set the <ttl> element inside the log table’s configuration block:

<query_log>
    <ttl>event_date + INTERVAL 30 DAY</ttl>
</query_log>

Configuration changes do not alter tables that already exist. Apply ALTER TABLE to existing system tables, or drop and recreate the table after changing its configuration.

Reclaim space from an overgrown system table

TTL runs during merges. If a system table already has hundreds of active parts and merges are starved, TTL alone may not reclaim space quickly.

For immediate relief, drop old partitions directly after verifying partition IDs in system.parts:

ALTER TABLE system.query_log DROP PARTITION ID '20240601';

Warning: this permanently deletes data for that partition. Verify the partition ID and backup any critical forensic data first.

If merges are proceeding but too slowly, reduce incoming query volume or error rates to slow part creation while TTL catches up.

Balance retention against forensics

Shorter TTL saves disk and reduces merge debt. Longer TTL aids incident investigation. query_log is usually the most valuable retention target. trace_log and text_log are often the largest relative to their utility, so consider shorter retention for verbose logs.

Ensure time-based partitioning

Default system log tables are partitioned by toYYYYMMDD(event_date). If the engine definition shows a non-time partition key, TTL drops become expensive because they must delete row ranges inside parts instead of dropping whole partitions. Keep time-based partitioning when customizing system log engine definitions.

Prevention

Set TTL on all system MergeTree tables before taking production traffic.
Include system.query_log and system.part_log in disk capacity planning.
Monitor the system database size with the same urgency as production databases.
During incident response, expect query_log to expand rapidly and verify it has disk and part-count headroom.

How Netdata helps

Correlates ClickHouse disk usage with table-level part counts to surface when system.query_log or system.part_log become top consumers.
Tracks RejectedInserts and DelayedInserts to catch system tables approaching part limits.
Monitors query error rates to warn that query_log is about to balloon.
Visualizes active part counts per table, including system tables, to spot merge debt before it blocks inserts.
Alerts on disk space utilization with awareness of merge headroom requirements.

The Netdata solution

ClickHouse monitoring with Netdata

Netdata monitors ClickHouse with per-second metrics and ML anomaly detection. Track merge debt, memory usage, replication lag, Keeper/ZooKeeper saturation, and disk headroom against the host signals that drive them.

See ClickHouse monitoring → Start monitoring free

ClickHouse system log tables eating disk: query_log, part_log, and TTL

ClickHouse system log tables eating disk: query_log, part_log, and TTL

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Configure TTL on system tables

Reclaim space from an overgrown system table

Balance retention against forensics

Ensure time-based partitioning

Prevention

How Netdata helps

Related guides

ClickHouse monitoring with Netdata