ClickHouse client connections climbing: TCP 9000, HTTP 8123, and connection leaks
A climbing TCPConnection or HTTPConnection count on a ClickHouse node matters because every open socket consumes a file descriptor. Growth that is uncorrelated with query throughput usually indicates a connection leak or pool misconfiguration rather than healthy concurrency. If the count approaches max_connections, the server rejects new client connections. If the Linux nofile limit is reached first, queries and merges fail with “too many open files.”
Unlike query concurrency, which is bounded by max_concurrent_queries, the connection count includes idle keep-alive sockets, health-check probes, interserver replication traffic, and short-lived retry attempts. A sustained upward trend while query throughput is flat is the signature of a leak. A step jump often points to a deploy changing a client pool, a load balancer configuration, or a retry storm after an upstream error such as TOO_MANY_PARTS or MEMORY_LIMIT_EXCEEDED.
The diagnostic goal is to separate connection pressure from real query pressure, identify the protocol responsible, and find the client or probe causing the growth.
What this means
ClickHouse exposes separate metrics for native TCP (port 9000), HTTP (port 8123), interserver replication (port 9009), and optional MySQL/PostgreSQL protocol connections in system.metrics. max_connections defaults to 4096 and caps the total number of client connections the server will accept. Each accepted connection consumes a handler thread, so very high connection counts can starve query execution even when CPU and memory are idle.
Because every socket is a file descriptor, connection growth also drives the process-wide /proc/<pid>/fd count. If a connection leak outruns the Linux nofile limit, the failure mode changes from connection rejection to file-open failures inside queries, merges, and replication fetches.
flowchart TD
A[Connection count climbing] --> B{Growth correlates with query load?}
B -->|Yes| C[Concurrent query pressure or retry storm]
B -->|No| D{Which protocol?}
D -->|TCP 9000| E[Native client pool leak or too many pools]
D -->|HTTP 8123| F[Keep-alive probe flood or scraper]
D -->|Interserver 9009| G[Replication catch-up or distributed query amplification]
C --> H[Check system.processes, query errors, and throughput]
E --> I[Audit client pool size and idle timeout]
F --> J[Audit health checks and monitoring clients]
G --> K[Check replication queue and distributed query plans]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Client connection leak | TCPConnection or HTTPConnection grows steadily while query throughput and system.processes are flat; often concentrated in one application user | Client pool config, idle timeout, and whether connections are explicitly closed |
| Oversized connection pool | Step jump in TCPConnection after a deploy; count stays high but active queries do not increase | Client pool max size and the number of client instances multiplied by pool max |
| Retry storm | Spiky connection count aligned with spikes in FailedQuery or insert rejections; many short-lived connections | Query error rate and exception codes; client timeout and backoff configuration |
| Load balancer or monitoring probe flood | HTTPConnection grows without corresponding queries; many lightweight requests from the same source | Health check frequency and endpoint; monitoring scrapers running tight loops |
| Interserver traffic surge | InterserverConnection high while client connections are normal; replication lag or large distributed queries | system.replication_queue depth and distributed query fan-out patterns |
| File descriptor limit too low | Connection counts are moderate but queries or merges fail with open-file errors; FD usage near the limit | /proc/<pid>/limits and current /proc/<pid>/fd count |
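
The oversized-pool check in the table comes down to arithmetic: the worst-case connection count for a service is its instance count multiplied by the pool max. The sketch below shows that calculation as a share of max_connections. The function names and the example numbers (40 instances, pool max of 50) are hypothetical and only illustrate the idea.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: worst-case connections for one service,
# and that number as a percentage of max_connections.

# worst_case_connections INSTANCES POOL_MAX
worst_case_connections() {
  local instances=$1 pool_max=$2
  echo $(( instances * pool_max ))
}

# budget_pct INSTANCES POOL_MAX MAX_CONNECTIONS
# Prints an integer percentage, rounded down.
budget_pct() {
  local total
  total=$(worst_case_connections "$1" "$2")
  echo $(( total * 100 / $3 ))
}
```

For example, `budget_pct 40 50 4096` prints 48: a single service could occupy almost half of the default limit if every pool filled up.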
Quick checks
-- Active connection counts by protocol
SELECT metric, value
FROM system.metrics
WHERE metric IN (
'TCPConnection',
'HTTPConnection',
'InterserverConnection',
'MySQLConnection',
'PostgreSQLConnection'
);
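-- Configured connection and concurrency limits. These are server-level settings,
-- so they are read from system.server_settings (present in recent releases;
-- on older versions, check config.xml instead).
SELECT name, value
FROM system.server_settings
WHERE name IN ('max_connections', 'max_concurrent_queries');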
# Open file descriptor count and process limit
PID=$(pgrep -f clickhouse-server | head -n 1)
test -n "$PID" || { echo "clickhouse-server PID not found"; exit 1; }
echo "open fds: $(ls -1 /proc/$PID/fd 2>/dev/null | wc -l)"
grep "Max open files" /proc/$PID/limits
-- Live query concurrency and long-running queries
SELECT
count(*) AS running_queries,
countIf(elapsed > 60) AS long_running
FROM system.processes;
-- Recent query throughput and failures
SELECT
countIf(type = 'QueryFinish') AS finished,
countIf(type = 'ExceptionWhileProcessing') AS failed
FROM system.query_log
WHERE event_time > now() - INTERVAL 5 MINUTE;
# Listening ports and interfaces
ss -tlnp | grep clickhouse
-- Recent client hosts to identify a source
SELECT
client_hostname,
count() AS queries
FROM system.query_log
WHERE event_time > now() - INTERVAL 5 MINUTE
AND is_initial_query = 1
GROUP BY client_hostname
ORDER BY queries DESC
LIMIT 10;
How to diagnose it
- Quantify the climb. Snapshot TCPConnection, HTTPConnection, and InterserverConnection from system.metrics over several minutes. A sustained upward slope on one protocol while the others are flat points to a specific client class.
- Compare to limits. Check max_connections and max_concurrent_queries. If connections are near max_connections, new client attempts will be rejected. If running queries are near max_concurrent_queries, the issue is query concurrency, not a connection leak.
- Correlate with query load. Compare connection count to running query count from system.processes and insert/select throughput from system.query_log. Leaks show climbing connections without climbing queries.
- Check file descriptor headroom. Count /proc/<pid>/fd and compare to the Max open files limit. If FD usage is rising with connections, the nofile limit may be the first cliff.
- Find the responsible clients. Query system.query_log for client_hostname and user over the growth window. A single host or service user dominating requests indicates the source.
- Look for retry drivers. Check system.query_log for ExceptionWhileProcessing and system.events for error counters. Spikes in TOO_MANY_PARTS, MEMORY_LIMIT_EXCEEDED, or TIMEOUT_EXCEEDED often trigger client retries that open new connections.
- Inspect external probes. Verify that load balancer health checks point to /ping and do not run heavy queries in tight loops. Monitoring scrapers should not open a new persistent connection per sample without closing the previous one.
- Investigate interserver growth. If InterserverConnection is high, check system.replication_queue for stuck or retrying entries and review distributed queries for missing shard-key filters or GLOBAL IN amplification.
- Verify OS and container limits. Ensure systemd LimitNOFILE, Docker/containerd ulimits, and Kubernetes pod limits are set high enough for production ClickHouse. The default 1024 is far too low.
- Capture a baseline after fix. Once the source is identified, snapshot the same metrics to confirm the derivative returns to zero.
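
Quantifying the climb can be as simple as taking two samples of one metric and converting the difference to a per-minute slope. The sketch below assumes clickhouse-client is installed and can reach the server with default credentials. The function names are hypothetical, and the slope uses integer arithmetic, so small changes over long intervals round toward zero.

```shell
#!/usr/bin/env bash
# Sketch: estimate how fast a connection metric is climbing.

# read_metric NAME: current value of one row in system.metrics.
# Assumption: clickhouse-client connects with default host and credentials.
read_metric() {
  clickhouse-client --query "SELECT value FROM system.metrics WHERE metric = '$1'"
}

# slope_per_min FIRST SECOND INTERVAL_SECONDS
# Prints the change per minute as an integer.
slope_per_min() {
  echo $(( ($2 - $1) * 60 / $3 ))
}

# sample_slope NAME INTERVAL_SECONDS: take two samples and print the slope.
sample_slope() {
  local first second
  first=$(read_metric "$1") || return 1
  sleep "$2"
  second=$(read_metric "$1") || return 1
  slope_per_min "$first" "$second" "$2"
}
```

Running `sample_slope TCPConnection 300` alongside the same call for HTTPConnection shows which protocol is growing. A slope that stays positive across several samples while query throughput is flat matches the leak pattern described above.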
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| TCPConnection | Native client load and leak indicator | > 80% of max_connections sustained |
| HTTPConnection | HTTP client and probe load | > 80% of max_connections or uncorrelated with query load |
| InterserverConnection | Replication and distributed query traffic | Sustained high while client connections are low |
| Process open FD count | Tracks file handles including sockets | > 70% of the process nofile limit |
| Query (running queries) | Distinguishes concurrency from idle connections | Approaches or exceeds max_concurrent_queries |
| Failed query rate | Reveals retry storms and upstream errors | > 1% of total queries sustained |
| Query latency P99 | Captures impact of handler pool starvation | > 2x baseline for more than 15 minutes |
Fixes
Client connection leaks and oversized pools
Reduce the client pool max size and ensure connections are explicitly closed after use, including error paths. Set client idle timeouts so stale keep-alive sockets do not accumulate. If you have many service instances, reduce per-instance pool size rather than increasing the global pool. Tradeoff: smaller pools limit burst throughput; address that with larger insert batches rather than more connections.
Retry storms
Do not respond to transient errors by opening more connections. Add exponential backoff and circuit breakers on the client side, and fix the root cause in ClickHouse. Common drivers are TOO_MANY_PARTS, MEMORY_LIMIT_EXCEEDED, and TIMEOUT_EXCEEDED. Tradeoff: backoff increases perceived client latency but prevents connection exhaustion and amplification.
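
As an illustration of client-side backoff, the sketch below wraps an arbitrary command in capped exponential backoff. The function names and parameters are hypothetical, and real client libraries usually provide their own retry policy. Random jitter, which production clients commonly add, is left out so the delays stay deterministic.

```shell
#!/usr/bin/env bash
# Sketch: capped exponential backoff around any command.

# backoff_delay ATTEMPT BASE CAP
# Prints BASE * 2^(ATTEMPT-1) seconds, never more than CAP.
backoff_delay() {
  local attempt=$1 base=$2 cap=$3 delay=$2 i=1
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
  [ "$delay" -gt "$cap" ] && delay=$cap
  echo "$delay"
}

# retry_with_backoff MAX_ATTEMPTS BASE CAP COMMAND [ARGS...]
# Returns 0 on the first success, 1 after MAX_ATTEMPTS failures.
retry_with_backoff() {
  local max=$1 base=$2 cap=$3 attempt=1
  shift 3
  while true; do
    "$@" && return 0
    [ "$attempt" -ge "$max" ] && return 1
    sleep "$(backoff_delay "$attempt" "$base" "$cap")"
    attempt=$(( attempt + 1 ))
  done
}
```

With a base of 1 second and a cap of 30, the waits are 1, 2, 4, 8, 16, then 30 seconds. The bounded attempt count matters as much as the delay: a client that gives up stops adding connections while the server recovers.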
Health-check and monitoring floods
Configure load balancers and orchestration probes to use GET /ping on port 8123. Reduce probe frequency to the minimum required. Bind monitoring scrapers to internal interfaces and avoid opening a new connection per sample. Tradeoff: slower probes take longer to detect a dead process, but they avoid consuming handler threads.
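
A lightweight probe can be sketched with curl. The wrapper name `ch_ping` and the 2-second timeout are assumptions for illustration. Each call opens one connection, issues GET /ping, and closes it, so the probe interval directly determines how much connection churn it adds.

```shell
#!/usr/bin/env bash
# Sketch: liveness probe against the ClickHouse HTTP interface.

# ch_ping HOST [PORT]
# Succeeds only if GET /ping returns a success status within 2 seconds.
ch_ping() {
  local host=$1 port=${2:-8123}
  curl -fsS --max-time 2 "http://${host}:${port}/ping" >/dev/null
}
```

A scheduler or orchestrator could call `ch_ping clickhouse-host` and treat a non-zero exit status as a failed check.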
Interserver traffic
For replication catch-up, relieve the bottleneck on the source replica (disk I/O, network, merge pool) so fetches complete and close connections. For distributed query amplification, enforce shard-key filters in WHERE clauses, avoid large GLOBAL IN subqueries, and review distributed_product_mode. See the related guide on distributed query amplification.
File descriptor pressure
Raise the process nofile limit to at least 100000 in production. In systemd, set LimitNOFILE. In containers, raise host and runtime ulimits. Do not rely on high limits alone; alert on the ratio of open FDs to the limit so leaks are still visible. Tradeoff: a very high limit can delay detection of a severe leak.
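
The FD-to-limit ratio can be computed from the same /proc files used in the quick checks. This is a minimal sketch with hypothetical function names. It is Linux-only, reads the soft limit, and needs permission to list the target process's fd directory.

```shell
#!/usr/bin/env bash
# Sketch: open-FD usage as a percentage of the soft nofile limit.

# fd_pct OPEN LIMIT
# Prints an integer percentage, rounded down.
fd_pct() {
  echo $(( $1 * 100 / $2 ))
}

# fd_usage_pct PID: read both numbers from /proc (Linux only).
# Field 4 of the "Max open files" line is the soft limit.
fd_usage_pct() {
  local pid=$1 open limit
  open=$(ls -1 "/proc/$pid/fd" | wc -l) || return 1
  limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits") || return 1
  fd_pct "$open" "$limit"
}
```

For example, `fd_pct 35000 100000` prints 35. Alerting on this percentage rather than on the raw count keeps a leak visible after the limit has been raised.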
Prevention
- Alert on TCPConnection / max_connections and HTTPConnection / max_connections above 0.7.
- Alert on process FD usage above 50% of the nofile limit.
- Enforce client idle timeouts and pool caps in application configuration standards.
- Use /ping for liveness probes and keep monitoring queries lightweight.
- Monitor failed query rate and insert rejections so retry storms are caught at the source.
- Document per-service connection budgets and review them after any client-side deploy.
- Keep nofile at 100000 or higher on production nodes and containers.
Netdata correlation
Netdata correlates TCPConnection, HTTPConnection, InterserverConnection, running query count, and process FD usage on the same timeline to distinguish leaks from load. Derivative alerts on connection metrics detect climbing counts even when absolute values are below max_connections. Query latency, failed query rate, and insert rejection events overlay to identify the upstream error driving a retry storm. Netdata tracks per-process file descriptor usage and OS limits without manual /proc checks, and alerts on connection count and FD ratios approaching configured limits.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse mark cache and uncompressed cache: reading low hit rates
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse detached parts piling up: reading system.detached_parts and reclaiming space
- ClickHouse disk space collapse: why merges need free space and how the spiral starts
- ClickHouse disk space monitoring: free_space, unreserved_space, and the 80% target
- ClickHouse distributed DDL stuck: ON CLUSTER queries that never finish
- ClickHouse distributed query amplification: one coordinator, many shard subqueries
- ClickHouse full table scan: partition pruning failures and the primary key
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble







