ClickHouse Too many simultaneous queries: max_concurrent_queries and query storms

You run a query and ClickHouse returns “Too many simultaneous queries.” New connections either queue or fail outright. Queries that completed in seconds yesterday now time out. The server is not down, but it is not usable.

ClickHouse is optimized for fewer, heavier analytical queries. As concurrency rises, CPU, memory, and I/O contention increase non-linearly. A small spike can become a storm because each query is greedy.

The hard ceiling is max_concurrent_queries. The code default is 0 (unlimited), and the shipped configuration may override this. Production deployments often set it to 100. Once the running query count hits that limit, new queries are rejected or queued. The most common triggers are client retry amplification, a dashboard “refresh all” burst, or a runaway batch job opening many parallel connections.

flowchart TD
    A[Dashboard refresh-all or batch job] --> B[Concurrent query count spikes]
    B --> C[Slots fill toward max_concurrent_queries]
    C --> D[ClickHouse degrades non-linearly]
    D --> E[Latency rises sharply]
    E --> F[Client retries amplify load]
    F --> C

What this means

When concurrent queries reach max_concurrent_queries, ClickHouse stops accepting new work. Depending on configuration and client protocol, new queries may queue briefly or fail immediately with the “Too many simultaneous queries” error.

Each query can consume multiple threads, large memory buffers, and significant disk I/O. Unlike OLTP databases that handle hundreds of lightweight transactions, ClickHouse degrades non-linearly as concurrency increases. A query that uses one slot and 2 GB at low load may hold that slot ten times longer under contention, turning a temporary spike into a sustained pile-up.

Query storms create feedback loops. A client receives an error, retries immediately, and now even more queries compete for the same slots. Background merges and replication fetches share the same CPU, memory, and I/O pools, so query saturation starves the storage engine and worsens the spiral.
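
The pile-up follows from Little's Law: average concurrency equals arrival rate times average time in the system. The sketch below is a rough illustration only. The arrival rate, durations, and limit are assumed numbers chosen to match the "ten times longer" example above, not measurements from a real server.

```python
def slots_in_use(arrival_rate_per_sec: float, avg_duration_sec: float) -> float:
    """Little's Law: average concurrent queries = arrival rate * average duration."""
    return arrival_rate_per_sec * avg_duration_sec


# Hypothetical workload: 5 queries/s arriving against a limit of 100 slots.
LIMIT = 100
normal = slots_in_use(5, 2.0)      # 2 s per query at low load
contended = slots_in_use(5, 20.0)  # the same queries held ten times longer

print(f"normal load: {normal:.0f} slots, under contention: {contended:.0f} slots")
print("limit reached" if contended >= LIMIT else "headroom remains")
```

The arrival rate never changed in this example. Only the hold time grew, which is why a latency increase alone can exhaust the slots.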

Common causes

Cause	What it looks like	First thing to check
Client retry amplification	Identical queries resubmitted rapidly from the same user after rejections	`system.processes` for repeating query patterns and `user`
Dashboard refresh-all	Sudden burst of SELECTs from BI tools or monitoring dashboards	`system.processes` filtered by `client_hostname` and `user`
Runaway batch job	ETL or analytics job opens many parallel connections	`system.processes` for long-running queries from batch service accounts
Connection pool overshoot	Client pool size exceeds `max_concurrent_queries`, causing systematic collisions	Client-side pool configuration versus the server limit
Latency pile-up	Slow queries hold slots longer, reducing effective throughput and causing backpressure	`system.query_log` for rising `query_duration_ms` at constant concurrency

Quick checks

Run these read-only checks to assess the current concurrency state.

-- Check running and preempted queries against the limit
SELECT metric, value
FROM system.metrics
WHERE metric IN ('Query', 'QueryPreempted');

-- Inspect live queries to find the heaviest consumers
SELECT
    query_id,
    user,
    client_hostname,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 200) AS query_prefix
FROM system.processes
ORDER BY elapsed DESC
LIMIT 20;

-- Check cumulative failed query counters
SELECT event, value
FROM system.events
WHERE event IN ('FailedQuery', 'FailedSelectQuery', 'FailedInsertQuery');

-- Measure recent tail latency from finished queries
SELECT
    quantile(0.99)(query_duration_ms / 1000) AS p99_sec,
    count() AS query_count
FROM system.query_log
WHERE type = 'QueryFinish'
  AND is_initial_query = 1
  AND event_time > now() - INTERVAL 10 MINUTE;

-- Check server-level memory pressure
SELECT metric, value, formatReadableSize(value) AS readable
FROM system.metrics
WHERE metric = 'MemoryTracking';

-- Count active query execution threads
SELECT value FROM system.metrics WHERE metric = 'QueryThread';

# Check OS-level memory to catch untracked RSS pressure
pid=$(pidof clickhouse-server) && grep -E 'VmRSS|VmSize' /proc/$pid/status

How to diagnose it

Confirm you are at the ceiling. Compare the Query metric from system.metrics against max_concurrent_queries. If they are close, you are at the hard limit.
Identify who is consuming slots. Query system.processes ordered by elapsed or memory_usage. Look for clusters from the same user, client_hostname, or with similar query prefixes. This reveals whether the load is legitimate traffic or a misbehaving client.
Check for immediate query failures. Look at system.events for FailedQuery. If the counter is increasing while concurrency is pinned at the limit, the server is actively rejecting work.
Correlate with resource saturation. Check MemoryTracking in system.metrics. If it is approaching max_server_memory_usage, the system is killing or throttling queries, increasing slot hold time and worsening the storm. Check QueryThread to see if execution threads are saturated.
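Determine if retry amplification is occurring. In system.query_log, look for bursts of identical failed queries with type IN ('ExceptionBeforeStart', 'ExceptionWhileProcessing') followed by rapid re-execution from the same client host. Queries rejected at the concurrency limit fail before execution starts, so they normally appear as ExceptionBeforeStart with exception_code = 202 (TOO_MANY_SIMULTANEOUS_QUERIES). This pattern confirms a retry loop.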
Review latency trends. Query system.query_log for P99 query_duration_ms over the last hour. If latency is rising faster than the concurrency increase, ClickHouse has entered the non-linear degradation zone.
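
The retry-loop check in the steps above can be sketched as a small post-processing function. This is a minimal illustration, assuming the failed-query rows have already been exported from system.query_log into dictionaries. The field names `client_hostname`, `query`, `event_time`, and `failed`, the 10-second window, and the three-failure threshold are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_retry_loops(rows, window=timedelta(seconds=10), min_failures=3):
    """Return (client_hostname, query) pairs that failed at least
    `min_failures` times within any interval of length `window`."""
    failures = defaultdict(list)
    for row in rows:
        if row["failed"]:
            failures[(row["client_hostname"], row["query"])].append(row["event_time"])

    suspects = []
    for key, times in failures.items():
        times.sort()
        # Compare each failure with the one (min_failures - 1) positions later.
        for i in range(len(times) - min_failures + 1):
            if times[i + min_failures - 1] - times[i] <= window:
                suspects.append(key)
                break
    return suspects


# Hypothetical sample rows, not real log data.
t0 = datetime(2024, 1, 1, 12, 0, 0)
sample = [
    {"client_hostname": "bi-01", "query": "SELECT 1", "event_time": t0, "failed": True},
    {"client_hostname": "bi-01", "query": "SELECT 1", "event_time": t0 + timedelta(seconds=1), "failed": True},
    {"client_hostname": "bi-01", "query": "SELECT 1", "event_time": t0 + timedelta(seconds=2), "failed": True},
    {"client_hostname": "etl-01", "query": "SELECT 2", "event_time": t0, "failed": True},
    {"client_hostname": "etl-01", "query": "SELECT 2", "event_time": t0 + timedelta(minutes=5), "failed": True},
]
print(find_retry_loops(sample))
```

Here only the first host is flagged: it fails three times within two seconds, while the second host's two failures are five minutes apart.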

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`Query` count from `system.metrics`	Active concurrency versus the hard ceiling	Sustained value near `max_concurrent_queries`
`QueryPreempted` from `system.metrics`	Queries paused and waiting because of the `priority` setting	Non-zero value indicates lower-priority queries are being held back
`FailedQuery` rate from `system.events`	Counts failed queries of all kinds, including rejections at the limit	Sudden spike correlating with concurrency peaks
Query latency P99 from `system.query_log`	Non-linear degradation under load	P99 rising faster than the concurrency increase
`MemoryTracking` from `system.metrics`	Memory pressure from concurrent heavy queries	Value approaching `max_server_memory_usage`
`QueryThread` from `system.metrics`	Active execution threads	Sustained high count indicating CPU saturation
Insert latency from `system.query_log`	Write pipeline pressure from concurrency contention	Rising insert times before hard rejections appear

Fixes

Immediate relief

Kill the heaviest running queries to free slots. Use query_id from system.processes.

-- WARNING: Killing queries is disruptive to the target workload.
KILL QUERY WHERE query_id = '...';

If a specific batch job or dashboard user is responsible, pause or disable that client before the retry loop restarts.

Client retry amplification

Add exponential backoff to client retry logic. Immediate retries against a saturated server compound the problem. If you control the client, reduce its connection pool size and keep it well below max_concurrent_queries.
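
A minimal sketch of exponential backoff with full jitter is shown below. It is not tied to any particular ClickHouse driver: `TooManyQueries` is a hypothetical stand-in for whatever exception your client raises on the "Too many simultaneous queries" error, and `run_query` is any callable you supply. The base delay, cap, and retry count are illustrative defaults.

```python
import random
import time


class TooManyQueries(Exception):
    """Hypothetical stand-in for the driver's 'Too many simultaneous queries' error."""


def backoff_delays(max_retries=5, base=0.5, cap=30.0, rng=random.random):
    """Yield full-jitter delays: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def run_with_backoff(run_query, max_retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call run_query(); on TooManyQueries wait a jittered delay, then retry."""
    for delay in backoff_delays(max_retries, base, cap):
        try:
            return run_query()
        except TooManyQueries:
            sleep(delay)
    # Final attempt after the last delay; any error now propagates to the caller.
    return run_query()


# Show one possible delay schedule (values differ on every run).
print([round(d, 2) for d in backoff_delays()])
```

The jitter matters as much as the exponential growth: without it, clients that were rejected together retry together and hit the limit again in lockstep.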

Dashboard and BI query bursts

Stagger refresh intervals across panels. Route dashboard reads to pre-aggregated tables where possible, or increase client-side cache TTLs to avoid hammering the same expensive queries simultaneously.
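
Staggering can be as simple as giving each panel a fixed offset within the refresh interval. The sketch below computes evenly spaced offsets. The panel count and interval are assumed values, and whether your BI tool lets you set a per-panel offset is something to verify for that tool.

```python
def staggered_offsets(panel_count: int, refresh_interval_sec: float) -> list[float]:
    """Spread panel refreshes evenly across one refresh interval."""
    if panel_count <= 0:
        return []
    step = refresh_interval_sec / panel_count
    return [i * step for i in range(panel_count)]


# Six hypothetical panels on a 60-second refresh: one query every 10 s
# instead of six queries at the same instant.
print(staggered_offsets(6, 60))
```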

Raise the limit (with caution)

If the ceiling is genuinely too low, increase max_concurrent_queries. Only do this if CPU, memory, and I/O metrics show headroom. Raising the limit without headroom pushes ClickHouse deeper into non-linear degradation, turning a query storm into an OOM kill or memory pressure death spiral.

Reduce per-query resource consumption

Set max_memory_usage per user or profile to prevent individual queries from monopolizing memory and holding slots indefinitely. For large aggregations, enable spill-to-disk with max_bytes_before_external_group_by and max_bytes_before_external_sort; this trades memory for I/O and can prevent slot hoarding.

Prevention

Size connection pools below the ceiling. Ensure all client connection pools sum to less than max_concurrent_queries.
Monitor P99 latency as an early warning. Rising latency typically precedes hard rejections, often by minutes to hours.
Set per-query timeouts and memory limits. Prevent a single heavy query from occupying a slot indefinitely.
Alert on sustained high Query count. A threshold at 80% of max_concurrent_queries gives you time to react before failure.
Review batch job scheduling. Separate large ETL windows from peak BI dashboard hours.
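
The pool-sizing and alerting rules above reduce to simple arithmetic that can be checked in a deployment script. The following is a sketch under assumed values: the pool names and sizes are hypothetical, and the limit of 100 is only the example figure used earlier in this guide.

```python
def pool_budget(pool_sizes: dict[str, int], max_concurrent_queries: int,
                alert_fraction: float = 0.8) -> dict:
    """Check that all client pools together stay below the server limit
    and compute the Query-count alert threshold."""
    total = sum(pool_sizes.values())
    return {
        "total_pool_connections": total,
        "fits_under_limit": total < max_concurrent_queries,
        "alert_threshold": int(max_concurrent_queries * alert_fraction),
    }


# Hypothetical client pools sharing one ClickHouse server.
pools = {"dashboards": 30, "etl": 40, "adhoc": 20}
print(pool_budget(pools, max_concurrent_queries=100))
```

If `fits_under_limit` is false, the clients can collectively exceed the ceiling even when each one behaves correctly on its own.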

How Netdata helps

Correlates Query count with host CPU, memory, and disk latency to confirm whether concurrency is the bottleneck.
Alerts on sustained query counts approaching max_concurrent_queries before clients see rejections.
Tracks FailedQuery rate spikes to detect the onset of query storms.
Visualizes query latency P99 degradation, revealing non-linear saturation before the hard limit is reached.
Surfaces heavy queries by user and host during storms using system.processes dimensions.
