ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly

You finish a large batch query, open monitoring, and see ClickHouse MemoryTracking drop by 40 GB while MemoryResident barely moves. Or you watch RSS climb for hours after a restart while MemoryTracking tracks the rise steadily. Neither pattern indicates a leak. The gap between ClickHouse’s internal ledger and OS resident set size is normal: the server accounts for memory synchronously while jemalloc retains pages for reuse.

This article explains the mechanism behind the divergence, when the gap is expected, and which metric to trust for which operational decision.

What it is and why it matters

MemoryTracking lives in system.metrics. It is the sum of all allocations passing through ClickHouse’s hierarchical MemoryTracker: server-level totals, per-user aggregates, and per-query working sets. Every GROUP BY hash table, JOIN buffer, mark cache entry, merge scratch buffer, and decompression block that ClickHouse explicitly manages is added and subtracted synchronously on alloc and free.

MemoryResident lives in system.asynchronous_metrics. It is the RSS of the clickhouse-server process sampled in the background. This is what the Linux OOM killer watches and what container runtimes enforce. Because it includes allocator-retained pages, resident pages of memory-mapped files, and library overhead that ClickHouse never tracks, it is usually equal to or larger than the internally accounted figure.

The two metrics serve different operational purposes. MemoryTracking tells you which query or subsystem is consuming space and whether a single operation is about to hit a per-query or server-wide limit. MemoryResident tells you whether the process is about to be killed by the kernel. Treating MemoryTracking as the single source of truth routinely underestimates true footprint. Treating a post-query RSS plateau as a leak wastes hours profiling a healthy allocator.

How it works

ClickHouse maintains a tree of MemoryTracker objects. When a query executes, each tracked allocation and free is recorded against the query’s node, which rolls up to the user and server totals. This gives precise attribution: system.processes.memory_usage shows the current allocation, and system.processes.peak_memory_usage shows the high-water mark.

However, ClickHouse does not return freed pages to the operating system immediately. The server uses jemalloc, which holds deallocated memory in thread-local caches and dirty-page bins for reuse. When a query frees 30 GB of hash tables, those pages are deducted from MemoryTracking immediately because the tracker records the logical free. The pages remain in the process resident set, available for the next allocation. This is the primary driver of the gap after large queries finish.

jemalloc organizes memory into arenas and size classes. Freed large allocations often sit in dirty or muzzy states until the allocator triggers a background purge or the process faces memory pressure. You may observe RSS drop suddenly during a system-level memory shortage even though ClickHouse workload and MemoryTracking remain unchanged.

Untracked allocations widen the divergence further. Memory-mapped files, certain third-party library buffers, and jemalloc’s own metadata sit outside the tracked hierarchy. The approximate relationship is:

MemoryResident ≈ MemoryTracking + jemalloc retained/dirty pages + untracked allocations

This is why RSS can exceed MemoryTracking by a wide margin on a warm node, and why the delta can spike right after a query ends even though no new memory was allocated.
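The relationship is simple arithmetic, and a worked example can make it concrete. The sketch below is illustrative only: memory_gap is a hypothetical helper, and the byte counts are made-up readings standing in for values you would fetch from system.asynchronous_metrics (MemoryResident) and system.metrics (MemoryTracking).

```python
GIB = 1024 ** 3


def memory_gap(resident_bytes: int, tracked_bytes: int) -> dict:
    """Return the resident-minus-tracked gap and its share of RSS.

    Hypothetical helper: both inputs are plain byte counts that the
    caller has already read from the ClickHouse system tables.
    """
    gap = resident_bytes - tracked_bytes
    share = gap / resident_bytes if resident_bytes > 0 else 0.0
    return {"gap_bytes": gap, "gap_share_of_rss": share}


# Made-up readings taken shortly after a large query finished:
# RSS is still 70 GiB, but the internal ledger has dropped to 30 GiB.
result = memory_gap(70 * GIB, 30 * GIB)
print(result["gap_bytes"] // GIB, "GiB gap,",
      f"{result['gap_share_of_rss']:.0%} of RSS")
```

The helper cannot say how the gap splits between allocator-retained pages and untracked allocations. It only gives you the total to watch over time.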

The server-level MemoryTracker enforces max_server_memory_usage. When the sum of tracked allocations approaches this limit, new queries may receive MEMORY_LIMIT_EXCEEDED errors even if the OS reports available RAM. ClickHouse prefers to fail queries rather than invite the OOM killer. The gap between tracked and resident memory determines how much headroom actually exists before the kernel intervenes.
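To see how the two limits interact, it can help to compute both headrooms side by side. This is a minimal sketch under stated assumptions: headroom is a hypothetical helper, the readings are invented, the server limit is assumed to be the default 90% of RAM, and the kernel limit is assumed to be physical RAM (substitute the cgroup limit in a container).

```python
GIB = 1024 ** 3


def headroom(tracked: int, resident: int,
             server_limit: int, kernel_limit: int) -> dict:
    """Compare distance to the ClickHouse limit with distance to the OS limit.

    tracked      -- MemoryTracking, in bytes
    resident     -- MemoryResident, in bytes
    server_limit -- effective max_server_memory_usage, in bytes
    kernel_limit -- physical RAM or cgroup memory limit, in bytes
    """
    tracked_headroom = server_limit - tracked
    resident_headroom = kernel_limit - resident
    # The smaller headroom is the constraint that will be hit first.
    binding = "tracked" if tracked_headroom <= resident_headroom else "resident"
    return {
        "tracked_headroom": tracked_headroom,
        "resident_headroom": resident_headroom,
        "binding": binding,
    }


# Invented example: 128 GiB host, default server limit of 90% of RAM.
ram = 128 * GIB
report = headroom(tracked=100 * GIB, resident=120 * GIB,
                  server_limit=ram * 9 // 10, kernel_limit=ram)
print(report["binding"], report["resident_headroom"] // GIB, "GiB to the kernel limit")
```

In this invented case the tracked figure still has more than 15 GiB of room, yet RSS is only 8 GiB from the kernel limit, so the resident side is the one to alert on.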

Caches are another intentional, tracked consumer. The mark cache and uncompressed cache are accounted inside MemoryTracking. On a warm analytical node, a large portion of MemoryTracking is simply these caches doing their job. You can see their current sizes through the MarkCacheBytes and UncompressedCacheBytes metrics. Depending on the ClickHouse version, these are exposed in system.asynchronous_metrics or system.metrics, so check both tables on your build.

flowchart TD
    RSS["MemoryResident<br/>RSS"]
    subgraph Tracked ["Accounted in MemoryTracking"]
        MT["Query buffers<br/>caches<br/>merges"]
    end
    subgraph Gap ["The gap"]
        JEM["jemalloc retained/dirty pages"]
        UNTR["Untracked mmap<br/>and library overhead"]
    end
    Tracked --> RSS
    Gap --> RSS

Where it shows up in production

Post-query plateau. A batch job with a heavy aggregation allocates tens of gigabytes of hash tables. MemoryTracking rises with the query and falls when it ends. RSS rises too, but it stays elevated for minutes afterward. If you only watch RSS, you think the memory leaked. If you only watch MemoryTracking, you think the node has plenty of headroom. The truth is that jemalloc retained the pages for reuse and will likely surrender them only under sustained pressure or when the allocator decides to purge.

Steady-state cache warmup. After startup, ClickHouse populates the mark cache and uncompressed cache lazily as queries touch parts. MemoryTracking climbs for hours as caches warm. RSS climbs with it. This is expected behavior. The plateau you reach is the working set, not a leak waiting to happen.

Containerized deployments. The mismatch becomes dangerous when the cgroup memory limit is compared against the wrong metric. If your orchestrator kills the pod based on RSS but your alert threshold is set on MemoryTracking, you will miss the approaching OOM. If you tune max_server_memory_usage based on RSS without understanding how much of it is cache, you may starve legitimate cache space and degrade query performance.

Gap growth with flat MemoryTracking. If the delta between resident and tracked memory grows steadily while MemoryTracking stays flat and cache sizes are stable, suspect untracked allocations or allocator fragmentation. Check for large anonymous mappings or investigate whether third-party libraries are allocating outside the tracked hierarchy.
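One way to turn "gap grows while MemoryTracking stays flat" into a check is to fit a straight line to sampled values. The sketch below is a rough heuristic, not a diagnostic: the function names, the thresholds, and the sample series are all hypothetical, samples are assumed to be evenly spaced, and cache sizes are not considered and would need a separate check.

```python
def slope(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def gap_is_growing(resident, tracked, min_gap_slope, max_tracked_slope):
    """True when the gap trends upward while tracked memory stays flat.

    resident, tracked -- equally long lists of samples in the same unit
    min_gap_slope     -- gap growth per sample that counts as "growing"
    max_tracked_slope -- largest tracked drift still considered "flat"
    """
    if len(resident) != len(tracked):
        raise ValueError("resident and tracked must have the same length")
    gaps = [r - t for r, t in zip(resident, tracked)]
    return slope(gaps) > min_gap_slope and abs(slope(tracked)) <= max_tracked_slope


# Invented hourly samples in GiB.
suspicious = gap_is_growing([60, 62, 64, 66, 68], [50, 50, 50, 50, 50], 0.5, 0.1)
healthy = gap_is_growing([60, 60, 61, 60, 60], [50, 50, 50, 50, 50], 0.5, 0.1)
print(suspicious, healthy)
```

A longer window makes the slope less sensitive to the step changes that follow individual large queries.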

If you need to test whether RSS is held by tracked caches, you can drop them. This degrades query performance until caches refill, so run only during low traffic or in staging:

-- Degrades performance until caches rebuild. Use with caution.
SYSTEM DROP MARK CACHE;
SYSTEM DROP UNCOMPRESSED CACHE;

If MemoryTracking falls significantly, a large share of the tracked footprint was cache. If MemoryResident stays well above MemoryTracking afterward, the remaining difference is allocator-retained or untracked memory.
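The reading of this experiment can be written down as a small decision helper. This is a sketch under assumptions: interpret_cache_drop is a hypothetical function, the 10% cutoff for "significant" is an arbitrary choice, and the before/after numbers are invented.

```python
def interpret_cache_drop(tracked_before, tracked_after,
                         resident_before, resident_after,
                         significant=0.10):
    """Summarize what a cache drop revealed about where memory sits.

    All four readings are byte counts (or any consistent unit).
    `significant` is the fraction treated as a meaningful change;
    10% is an arbitrary illustrative default.
    """
    findings = []
    tracked_drop = tracked_before - tracked_after
    if tracked_before > 0 and tracked_drop / tracked_before >= significant:
        findings.append("tracked caches held a significant share of MemoryTracking")
    gap_after = resident_after - tracked_after
    if resident_after > 0 and gap_after / resident_after >= significant:
        findings.append("remaining gap is allocator-retained or untracked memory")
    # RSS often barely moves because freed cache pages stay with the allocator.
    if resident_before - resident_after < tracked_drop:
        findings.append("RSS fell less than MemoryTracking: pages were retained")
    return findings


# Invented readings in GiB: tracked 40 -> 25, resident 60 -> 58.
findings = interpret_cache_drop(40, 25, 60, 58)
for line in findings:
    print("-", line)
```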

Common misuses and misreadings

Using MemoryTracking alone for OOM avoidance. Because RSS includes untracked and retained pages, it can exceed MemoryTracking by a significant margin. The OOM killer acts on RSS, not on the internal ledger.

Treating RSS persistence after query completion as a memory leak. jemalloc holds pages for reuse. The memory is available to the process for subsequent allocations. A true leak shows as both MemoryTracking and RSS growing without bound under stable load.

Assuming MemoryTracking and MemoryResident should match. They are designed to diverge. The operational question is whether the divergence is stable or growing without bound.

Panicking over negative MemoryTracking spikes. Rapid deallocations can briefly drive the signed Int64 counter negative before the accounting catches up. This is cosmetic. It does not indicate corruption and it self-corrects.

Signals to watch in production

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| MemoryTracking | Internal ledger of all ClickHouse-accounted allocations | Sustained growth without corresponding query or cache load increase |
| MemoryResident | RSS visible to the OOM killer and cgroup enforcer | Approaching physical RAM limit or container memory cap |
| Resident minus Tracking gap | jemalloc fragmentation or untracked allocations | Persistent growth of the gap without corresponding cache growth |
| MarkCacheBytes | Tracked index cache; large on warm nodes is normal | Unexpected shrinkage may indicate memory pressure eviction |
| UncompressedCacheBytes | Tracked decompressed block cache if enabled | Zero when enabled may mean pressure; large size is expected |
| Peak query memory | Per-query attribution in system.processes | Single query using more than 50% of max_server_memory_usage |
| max_server_memory_usage headroom | Server limit, defaulting to 90% of physical RAM | Tracked memory staying above 80% of this limit during peak |

The system.processes table exposes both memory_usage and peak_memory_usage for running queries. A query whose peak approaches the per-query limit may still be well under the server limit, but a query that uses more than half of max_server_memory_usage can starve concurrent workloads. Watch for single-query dominance in the per-process breakdown.
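The single-query dominance check is easy to automate once you have the per-query peaks. A minimal sketch, assuming you have already pulled (query_id, peak_memory_usage) pairs from system.processes; dominant_queries and the sample data are hypothetical, and the 50% fraction mirrors the rule of thumb above.

```python
GIB = 1024 ** 3


def dominant_queries(queries, server_limit, fraction=0.5):
    """Return the ids of queries whose peak exceeds a share of the server limit.

    queries      -- iterable of (query_id, peak_memory_usage_bytes) pairs
    server_limit -- effective max_server_memory_usage in bytes
    fraction     -- share of the limit that counts as dominance
    """
    cutoff = server_limit * fraction
    return [query_id for query_id, peak in queries if peak > cutoff]


# Invented snapshot of running queries against a 100 GiB server limit.
running = [("q1", 60 * GIB), ("q2", 10 * GIB), ("q3", 50 * GIB)]
flagged = dominant_queries(running, server_limit=100 * GIB)
print(flagged)
```

The comparison is strict, so a query sitting exactly at the cutoff is not flagged. Loosen or tighten the fraction to match how much concurrency the node is expected to carry.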

Pull the core counters side by side with:

-- Check tracked memory
SELECT value, formatReadableSize(value) AS readable
FROM system.metrics
WHERE metric = 'MemoryTracking';

-- Check resident memory
SELECT metric, value, formatReadableSize(value) AS readable
FROM system.asynchronous_metrics
WHERE metric IN ('MemoryResident', 'MemoryVirtual');

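-- Find recent heavy queries by peak tracked memory.
-- system.query_log stores the query's peak in memory_usage
-- (peak_memory_usage exists only in system.processes).
SELECT query_id, formatReadableSize(memory_usage) AS peak
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'QueryFinish'
ORDER BY memory_usage DESC
LIMIT 10;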

Or check the OS view directly:

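# OS-level process memory. pgrep -x matches the kernel process name exactly, and that name
# is truncated to 15 characters, so the pattern is usually "clickhouse-serv". Adjust if yours differs.
grep -E 'VmRSS|VmSize|VmPeak' "/proc/$(pgrep -x clickhouse-serv | head -1)/status"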

How Netdata helps

Netdata charts MemoryTracking, MemoryResident, and the gap on the same timeline. Correlate RSS steps with per-query peaks from system.processes to distinguish allocator retention from runaway queries. Track cache sizes alongside total memory to distinguish legitimate cache growth from unexpected allocation. Alert on OS RSS approaching physical or cgroup limits independently of internal tracked counters.