ClickHouse detached parts piling up: reading system.detached_parts and reclaiming space

Disk usage is climbing and system.disks shows unreserved_space shrinking, yet active part counts in system.parts look normal. Queries are not failing, merges appear healthy, and there is no insert rejection storm. The hidden consumer is often a growing pile of detached parts: directories that ClickHouse removed from the active dataset but left on the filesystem. Unlike active parts, detached parts do not participate in queries or merges, and ClickHouse never deletes them automatically. Standard monitoring that only watches system.parts with active = 1 misses them entirely. This article explains how to read system.detached_parts, interpret the reason column, decide whether to reattach or discard the data, and prevent silent recurrence.

What this means

In ClickHouse, MergeTree tables store data in immutable parts. When ClickHouse detects a checksum mismatch, a failed replication fetch, or merge corruption, it can detach the part: move the directory out of the active dataset into the detached/ folder under the table’s data path. Operators can also trigger this manually with ALTER TABLE ... DETACH PARTITION or DETACH PART. Once detached, the part is excluded from the active set. It will not be read by SELECT, merged by background processes, or counted in system.parts where active = 1.

Because system.parts only surfaces active parts, a table can report a healthy part count while its detached/ directory grows by gigabytes or terabytes. This creates a dangerous gap between logical table size and actual disk consumption. The server will not clean up these directories automatically, so they persist until an operator explicitly reattaches them with ATTACH or removes them with DROP DETACHED PART. On replicated setups, detached parts can accumulate on a single replica after fetch failures, leaving that replica with a disk footprint far larger than its peers even though replication appears healthy from the queue.

flowchart TD
    A[Active part in system.parts] --> B{Corruption, fetch failure, or manual DETACH}
    B --> C[Part moved to detached/]
    C --> D[Not visible in system.parts active=1]
    D --> E[Continues consuming disk]
    E --> F[Operator action: ATTACH or DROP DETACHED PART]

Common causes

CauseWhat it looks likeFirst thing to check
Checksum or corruption failuresreason references checksum, corrupt, or broken part; server logs contain Checksum or Broken partsystem.detached_parts filtered by reason and recent modification time
Failed replication fetchessystem.replication_queue shows last_exception for fetches; ReplicatedPartFailedFetches is increasingsystem.replication_queue where num_tries > 0 and type relates to part fetches
Manual DETACH PARTITION left behindreason is empty or notes manual detach; system.query_log shows recent ALTER TABLE ... DETACH DDLsystem.detached_parts sorted by modification time correlated with system.query_log DDL events

Quick checks

Run these safe, read-only probes to measure the leak and identify its source.

-- List the largest detached parts with reasons
SELECT
    database,
    table,
    name,
    reason,
    formatReadableSize(bytes_on_disk) AS size,
    modification_time
FROM system.detached_parts
ORDER BY bytes_on_disk DESC
LIMIT 20;
-- Total disk consumed by all detached parts
SELECT
    formatReadableSize(sum(bytes_on_disk)) AS total_detached
FROM system.detached_parts;
-- Detached parts grouped by table
SELECT
    database,
    table,
    count() AS detached_count,
    formatReadableSize(sum(bytes_on_disk)) AS detached_size
FROM system.detached_parts
GROUP BY database, table
ORDER BY detached_size DESC;
-- Compare detached footprint to active data on the same table
SELECT
    d.database,
    d.table,
    formatReadableSize(sum(d.bytes_on_disk)) AS detached_size,
    formatReadableSize(sum(p.bytes_on_disk)) AS active_size
FROM system.detached_parts d
LEFT JOIN system.parts p ON d.database = p.database AND d.table = p.table AND p.active = 1
GROUP BY d.database, d.table
ORDER BY detached_size DESC
LIMIT 10;
-- Check replication queue for fetch failures that may precede detachment
SELECT
    database,
    table,
    type,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 10;
-- Find recent manual DETACH operations
SELECT
    event_time,
    user,
    query
FROM system.query_log
WHERE query_kind = 'Alter'
  AND query LIKE '%DETACH%'
  AND event_time > now() - INTERVAL 7 DAY
ORDER BY event_time DESC
LIMIT 10;
-- Search server logs for corruption indicators around detach time
SELECT
    event_time,
    logger_name,
    message
FROM system.text_log
WHERE (message LIKE '%Checksum%' OR message LIKE '%Broken part%')
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY event_time DESC
LIMIT 20;

How to diagnose it

  1. Quantify the hidden consumption. Run sum(bytes_on_disk) against system.detached_parts. Compare this to the active data size from system.parts where active = 1. If detached parts represent a large fraction of total disk usage, they are the primary driver of space pressure.

  2. Read the reason column. Non-empty reasons referencing checksum, corrupt, or fetch errors point to automatic detachment after integrity or replication failures. An empty or generic reason often signals a manual DETACH PARTITION.

  3. Correlate timestamps. Match modification_time from system.detached_parts against recent incidents. A cluster of detaches at the same time as replication network issues or a maintenance window narrows the root cause quickly.

  4. Check replication health for fetch-related detaches. If the reason involves a failed fetch, examine system.replication_queue on the affected replica. Look for entries with high num_tries and non-empty last_exception. Verify that the source replica is healthy and that inter-server network paths are clear. Fixing replication allows the system to fetch a replacement part from a healthy peer, making the detached copy redundant.

  5. Validate disk headroom before reattaching. Use system.disks to check free_space and unreserved_space. Reattaching a large detached part requires enough room to bring it back into the active set and potentially merge it. If unreserved_space is near zero, reattaching can push the server into a disk-space collapse where merges halt.

  6. Decide recovery or removal. For healthy parts detached manually or after transient errors, recovery with ATTACH restores the data. For corrupted parts, parts superseded by successful replication, or intentionally discarded data, removal with DROP DETACHED PART is correct. Never drop a detached part unless you have confirmed it is expendable or recoverable from another replica.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
system.detached_parts total sizeReveals disk consumption invisible to system.parts with active = 1Size increasing while active part count remains stable
system.detached_parts reasonDistinguishes automatic detachment from operator actionNon-empty reason referencing checksum, corrupt, or fetch errors
system.replication_queue exceptionsFailed fetches and merges can trigger automatic detachmentEntries with num_tries > 0 and non-empty last_exception
system.disks unreserved_spaceDetermines whether there is room to reattach parts or complete mergesunreserved_space trending toward zero
ReplicatedPartFailedFetchesDirect indicator of replication fetch stress that precedes detached partsCounter increasing relative to baseline
system.query_log DETACH DDLManual detaches not followed by cleanup leave permanent disk usageDETACH queries without subsequent ATTACH or DROP DETACHED PART

Fixes

Reattach healthy detached parts

If the part was detached manually or after a transient replication issue and the data is still valid, reattach it to bring the bytes back under MergeTree management.

ALTER TABLE database.table ATTACH PART 'part_name';
-- or for a full partition:
ALTER TABLE database.table ATTACH PARTITION tuple_value;

Before reattaching, verify that the reason column does not indicate corruption and that system.replication_queue does not already plan to replace the part with a fresh fetch. The tradeoff is that reattached parts re-enter the merge pipeline, which temporarily increases I/O and part count.

Remove unneeded or corrupted detached parts

If the part is corrupted, redundant after successful replication recovery, or intentionally discarded, remove it:

ALTER TABLE database.table DROP DETACHED PART 'part_name';

Warning: This action is destructive and irreversible. It frees disk immediately, but the data is gone permanently. Only run this after confirming the data exists on another replica or is truly expendable. For corrupted parts, run CHECK TABLE database.table on the active dataset first to ensure no additional damage exists.

Address replication fetch failures

When detached parts stem from replica sync issues, fix the replication path. Check system.replicas for is_readonly or is_session_expired, verify network connectivity between replicas on the inter-server port, and inspect source replica disk health. Once replication resumes, the replica may fetch merged replacements automatically, rendering the detached parts obsolete. You can then drop the obsolete detached parts safely.

Clean up after manual partition detaches

Operator-initiated DETACH PARTITION during incident response is a common source of orphaned detached parts. Review system.query_log for the original detach event, confirm with the team whether the data is still needed, and either reattach or drop accordingly. Establish a policy that every manual detach must be paired with a scheduled cleanup or reattachment ticket.

Prevention

  • Monitor system.detached_parts aggregate size actively. Active-part monitoring alone hides detached consumption.
  • Alert on replication fetch failures and queue exceptions. Failed fetches are a primary trigger for automatic detachment.
  • Require scheduled cleanup with every manual DETACH PARTITION. Operator detaches without follow-up are a common source of orphaned data.
  • Validate table health with CHECK TABLE periodically. Proactive checksum verification catches corruption before ClickHouse auto-detaches parts.
  • Track system.disks unreserved_space trends. Detached parts accelerate the path to a disk-space collapse where merges cannot complete.

How Netdata helps

  • Netdata collects OS disk metrics and system.disks.unreserved_space, surfacing space pressure even when active-part counts look healthy.
  • SQL metric collection from system.detached_parts exposes hidden disk consumers alongside standard active-part charts.
  • Correlation of disk spikes with replication error events or RejectedInserts distinguishes detached-part accumulation from merge death spirals.
  • Per-second disk monitoring catches rapid detached-part growth during replication incidents or batch DETACH operations.