ClickHouse Too many simultaneous queries: max_concurrent_queries and query storms

You run a query and ClickHouse returns “Too many simultaneous queries.” New connections either queue or fail outright. Queries that completed in seconds yesterday now time out. The server is not down, but it is not usable.

ClickHouse is optimized for fewer, heavier analytical queries. As concurrency rises, CPU, memory, and I/O contention increase non-linearly. A small spike can become a storm because each query, by default, tries to use every core it can get.

The hard ceiling is max_concurrent_queries. The code default is 0 (unlimited), and the shipped configuration may override this. Production deployments often set it to 100. Once the running query count hits that limit, new queries are rejected or queued. The most common triggers are client retry amplification, a dashboard “refresh all” burst, or a runaway batch job opening many parallel connections.

flowchart TD
    A[Dashboard refresh-all or batch job] --> B[Concurrent query count spikes]
    B --> C[Slots fill toward max_concurrent_queries]
    C --> D[ClickHouse degrades non-linearly]
    D --> E[Latency rises sharply]
    E --> F[Client retries amplify load]
    F --> C

What this means

When the number of running queries reaches max_concurrent_queries, ClickHouse stops accepting new work. Depending on configuration (notably the queue_max_wait_ms setting) and client protocol, a new query either waits briefly for a free slot or fails immediately with the “Too many simultaneous queries” error.

Each query can consume multiple threads, large memory buffers, and significant disk I/O. Unlike OLTP databases that handle hundreds of lightweight transactions, ClickHouse degrades non-linearly as concurrency increases. A query that uses one slot and 2 GB at low load may hold that slot ten times longer under contention, turning a temporary spike into a sustained pile-up.
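The pile-up arithmetic can be sketched with Little's law (mean concurrency equals arrival rate times mean time in the system). This is a back-of-the-envelope illustration, not a model of ClickHouse internals; the arrival rate of 20 queries per second and the limit of 100 are assumed values chosen to match the figures used in this guide.

```python
def required_slots(arrival_rate_qps: float, mean_duration_s: float) -> float:
    """Little's law: mean concurrency = arrival rate x mean time in system."""
    return arrival_rate_qps * mean_duration_s


LIMIT = 100  # assumed max_concurrent_queries, for illustration only

# Same arrival rate, but each query holds its slot ten times longer under contention.
for label, duration_s in [("low load", 2.0), ("under contention", 20.0)]:
    slots = required_slots(arrival_rate_qps=20.0, mean_duration_s=duration_s)
    status = "over the limit" if slots > LIMIT else "within the limit"
    print(f"{label}: ~{slots:.0f} concurrent queries, {status}")
```

With the arrival rate unchanged, a tenfold increase in slot hold time raises the required slot count tenfold, which is why a latency problem shows up as a concurrency problem.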

Query storms create feedback loops. A client receives an error, retries immediately, and now even more queries compete for the same slots. Background merges and replication fetches share the same CPU, memory, and I/O pools, so query saturation starves the storage engine and worsens the spiral.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Client retry amplification | Identical queries resubmitted rapidly from the same user after rejections | system.processes for repeating query patterns and user |
| Dashboard refresh-all | Sudden burst of SELECTs from BI tools or monitoring dashboards | system.processes filtered by client_hostname and user |
| Runaway batch job | ETL or analytics job opens many parallel connections | system.processes for long-running queries from batch service accounts |
| Connection pool overshoot | Client pool size exceeds max_concurrent_queries, causing systematic collisions | Client-side pool configuration versus the server limit |
| Latency pile-up | Slow queries hold slots longer, reducing effective throughput and causing backpressure | system.query_log for rising query_duration_ms at constant concurrency |

Quick checks

Run these read-only checks to assess the current concurrency state.

-- Check running and preempted queries against the limit
SELECT metric, value
FROM system.metrics
WHERE metric IN ('Query', 'QueryPreempted');

-- Inspect live queries to find the heaviest consumers
SELECT
    query_id,
    user,
    client_hostname,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 200) AS query_prefix
FROM system.processes
ORDER BY elapsed DESC
LIMIT 20;

-- Check cumulative failed query counters
SELECT event, value
FROM system.events
WHERE event IN ('FailedQuery', 'FailedSelectQuery', 'FailedInsertQuery');

-- Measure recent tail latency from finished queries
SELECT
    quantile(0.99)(query_duration_ms / 1000) AS p99_sec,
    count() AS query_count
FROM system.query_log
WHERE type = 'QueryFinish'
  AND is_initial_query = 1
  AND event_time > now() - INTERVAL 10 MINUTE;

-- Check server-level memory pressure
SELECT metric, value, formatReadableSize(value) AS readable
FROM system.metrics
WHERE metric = 'MemoryTracking';

-- Count active query execution threads
SELECT value FROM system.metrics WHERE metric = 'QueryThread';

# Check OS-level memory to catch untracked RSS pressure
pid=$(pidof clickhouse-server) && grep -E 'VmRSS|VmSize' /proc/$pid/status

How to diagnose it

  1. Confirm you are at the ceiling. Compare the Query metric from system.metrics against max_concurrent_queries. If they are close, you are at the hard limit.

  2. Identify who is consuming slots. Query system.processes ordered by elapsed or memory_usage. Look for clusters from the same user, client_hostname, or with similar query prefixes. This reveals whether the load is legitimate traffic or a misbehaving client.

  3. Check for immediate query failures. Look at system.events for FailedQuery. If the counter is increasing while concurrency is pinned at the limit, the server is actively rejecting work.

  4. Correlate with resource saturation. Check MemoryTracking in system.metrics. If it is approaching max_server_memory_usage, ClickHouse starts failing queries with memory-limit errors, which adds failures and retries on top of the concurrency problem and worsens the storm. Check QueryThread to see whether execution threads are saturated.

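  5. Determine if retry amplification is occurring. In system.query_log, look for bursts of identical failed queries followed by rapid re-execution from the same client host. Queries rejected before execution begins are logged with type = 'ExceptionBeforeStart', while queries that fail mid-run are logged with type = 'ExceptionWhileProcessing', so check both. This pattern confirms a retry loop.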

  6. Review latency trends. Query system.query_log for P99 query_duration_ms over the last hour. If latency is rising faster than the concurrency increase, ClickHouse has entered the non-linear degradation zone.
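The retry-loop check in step 5 can be sketched offline, assuming you have exported failed-query rows from system.query_log. The tuple layout (timestamp in seconds, client host, query fingerprint) and the function name are hypothetical; a fingerprint such as the normalized_query_hash column would be one way to group identical queries. The window and repeat thresholds are arbitrary starting points.

```python
from collections import defaultdict


def find_retry_bursts(rows, window_s=10.0, min_repeats=5):
    """Return {(client, query_hash): peak failures inside any window_s span}.

    rows: iterable of (timestamp_seconds, client_host, query_hash) tuples for
    failed queries. Only keys whose peak reaches min_repeats are returned.
    """
    by_key = defaultdict(list)
    for ts, client, query_hash in rows:
        by_key[(client, query_hash)].append(ts)

    bursts = {}
    for key, stamps in by_key.items():
        stamps.sort()
        start = 0
        peak = 0
        # Sliding window over sorted timestamps.
        for end, ts in enumerate(stamps):
            while ts - stamps[start] > window_s:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= min_repeats:
            bursts[key] = peak
    return bursts
```

A client that fails the same query six times in five seconds would be reported; a batch job that fails once every few minutes would not.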

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Query count from system.metrics | Active concurrency versus the hard ceiling | Sustained value near max_concurrent_queries |
| QueryPreempted from system.metrics | Queries paused and waiting because of the priority setting | Non-zero value indicates lower-priority queries are being held back |
| FailedQuery rate from system.events | Counts all failed queries, rejections included | Sudden spike correlating with concurrency peaks |
| Query latency P99 from system.query_log | Non-linear degradation under load | P99 rising faster than the concurrency increase |
| MemoryTracking from system.metrics | Memory pressure from concurrent heavy queries | Value approaching max_server_memory_usage |
| QueryThread from system.metrics | Active execution threads | Sustained high count indicating CPU saturation |
| Insert latency from system.query_log | Write pipeline pressure from concurrency contention | Rising insert times before hard rejections appear |

Fixes

Immediate relief

Kill the heaviest running queries to free slots. Use query_id from system.processes.

-- WARNING: Killing queries is disruptive to the target workload.
KILL QUERY WHERE query_id = '...';

If a specific batch job or dashboard user is responsible, pause or disable that client before the retry loop restarts.

Client retry amplification

Add exponential backoff to client retry logic. Immediate retries against a saturated server compound the problem. If you control the client, reduce its connection pool size and keep it well below max_concurrent_queries.
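A minimal sketch of capped exponential backoff with full jitter follows. ServerBusy is a hypothetical stand-in for whatever exception your driver raises on “Too many simultaneous queries”; the base delay, cap, and attempt count are assumed values to tune for your workload.

```python
import random
import time


class ServerBusy(Exception):
    """Hypothetical stand-in for the driver's 'Too many simultaneous queries' error."""


def backoff_delay(attempt, base_s=0.5, cap_s=30.0, rng=random.random):
    """Full jitter: a random delay between 0 and min(cap_s, base_s * 2**attempt)."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def run_with_backoff(run_query, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call run_query(), sleeping with jittered backoff between busy rejections."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except ServerBusy:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            sleep(backoff_delay(attempt, rng=rng))
```

The jitter matters as much as the exponent: without it, clients rejected at the same moment retry at the same moment and recreate the spike.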

Dashboard and BI query bursts

Stagger refresh intervals across panels. Route dashboard reads to pre-aggregated tables where possible, or increase client-side cache TTLs to avoid hammering the same expensive queries simultaneously.
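Staggering can be as simple as giving each panel a fixed offset within the refresh interval. The helper below is an illustrative sketch; the function name is hypothetical, and how you apply the offsets depends on what scheduling controls your BI tool exposes.

```python
def staggered_offsets(panel_count, interval_s):
    """Spread panel refreshes evenly across one refresh interval.

    Returns one start offset (in seconds) per panel; panel i refreshes at
    offset[i], offset[i] + interval_s, offset[i] + 2 * interval_s, and so on.
    """
    if panel_count <= 0:
        return []
    step = interval_s / panel_count
    return [round(i * step, 3) for i in range(panel_count)]


# Example: 4 panels on a 60-second refresh start 15 seconds apart.
print(staggered_offsets(4, 60))
```

If each panel query finishes within one step, the dashboard holds roughly one slot at a time instead of one slot per panel.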

Raise the limit (with caution)

If the ceiling is genuinely too low, increase max_concurrent_queries. Only do this if CPU, memory, and I/O metrics show headroom. Raising the limit without headroom pushes ClickHouse deeper into non-linear degradation, turning a query storm into an OOM kill or memory pressure death spiral.

Reduce per-query resource consumption

Set max_memory_usage per user or profile to prevent individual queries from monopolizing memory and holding slots indefinitely. For large aggregations, enable spill-to-disk with max_bytes_before_external_group_by and max_bytes_before_external_sort; this trades memory for I/O and can prevent slot hoarding.
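As an illustration, the same settings can be attached to an individual request over the ClickHouse HTTP interface, which accepts settings as URL parameters. The helper name, the base URL, and the choice to spill at half of the memory limit are assumptions made for this sketch; whether a user may override these settings depends on the constraints in its profile.

```python
from urllib.parse import urlencode


def query_url(base_url, sql, max_memory_bytes):
    """Build an HTTP-interface URL that caps memory and enables spill-to-disk."""
    settings = {
        "max_memory_usage": max_memory_bytes,
        # Assumed starting point: spill once half of the per-query limit is used.
        "max_bytes_before_external_group_by": max_memory_bytes // 2,
        "max_bytes_before_external_sort": max_memory_bytes // 2,
    }
    return f"{base_url}/?{urlencode({'query': sql, **settings})}"


print(query_url("http://localhost:8123", "SELECT 1", 4_000_000_000))
```

Setting the limits in a user profile is usually preferable for shared workloads, since it does not rely on every client remembering to pass them.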

Prevention

  • Size connection pools below the ceiling. Ensure all client connection pools sum to less than max_concurrent_queries.
  • Monitor P99 latency as an early warning. Rising latency precedes hard rejections by minutes to hours.
  • Set per-query timeouts and memory limits. Prevent a single heavy query from occupying a slot indefinitely.
  • Alert on sustained high Query count. A threshold at 80% of max_concurrent_queries gives you time to react before failure.
  • Review batch job scheduling. Separate large ETL windows from peak BI dashboard hours.
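The pool-sizing and alert-threshold rules above reduce to simple arithmetic, sketched here. The pool names and sizes are made-up examples, and the function is a hypothetical helper rather than part of any ClickHouse or Netdata tooling.

```python
def pool_budget(pools, max_concurrent_queries, alert_fraction=0.8):
    """Check that client pools fit under the server limit and derive an alert level.

    pools: mapping of client name -> maximum connection pool size.
    """
    total = sum(pools.values())
    return {
        "total_pool_size": total,
        "fits_under_limit": total < max_concurrent_queries,
        "alert_threshold": round(max_concurrent_queries * alert_fraction),
    }


# Example with assumed pool sizes against a limit of 100.
print(pool_budget({"dashboards": 40, "etl": 30, "api": 40}, 100))
```

In this example the pools sum to 110, so the clients can collectively exceed the limit even when each one looks reasonable on its own.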

How Netdata helps

  • Correlates Query count with host CPU, memory, and disk latency to confirm whether concurrency is the bottleneck.
  • Alerts on sustained query counts approaching max_concurrent_queries before clients see rejections.
  • Tracks FailedQuery rate spikes to detect the onset of query storms.
  • Visualizes query latency P99 degradation, revealing non-linear saturation before the hard limit is reached.
  • Surfaces heavy queries by user and host during storms using system.processes dimensions.