ClickHouse client connections climbing: TCP 9000, HTTP 8123, and connection leaks

A climbing TCPConnection or HTTPConnection count on a ClickHouse node matters because every open socket consumes a file descriptor and counts against max_connections. When growth is uncorrelated with query throughput, it is usually a connection leak or pool misconfiguration rather than healthy concurrency. If the count approaches max_connections, the server rejects new client connections. If the Linux nofile limit is reached first, queries and merges fail with “too many open files.”

Unlike query concurrency, which is bounded by max_concurrent_queries, the connection count includes idle keep-alive sockets, health-check probes, interserver replication traffic, and short-lived retry attempts. A sustained upward trend while query throughput is flat is the signature of a leak. A step jump often points to a deploy changing a client pool, a load balancer configuration, or a retry storm after an upstream error such as TOO_MANY_PARTS or MEMORY_LIMIT_EXCEEDED.

The diagnostic goal is to separate connection pressure from real query pressure, identify the protocol responsible, and find the client or probe causing the growth.

What this means

ClickHouse exposes separate metrics for native TCP (port 9000), HTTP (port 8123), interserver replication (port 9009), and optional MySQL/PostgreSQL protocol connections in system.metrics. max_connections defaults to 4096 and caps the total number of client connections the server will accept. Each accepted connection consumes a handler thread, so very high connection counts can starve query execution even when CPU and memory are idle.

Because every socket is a file descriptor, connection growth also drives the process-wide /proc/<pid>/fd count. If a connection leak outruns the Linux nofile limit, the failure mode changes from connection rejection to file-open failures inside queries, merges, and replication fetches.
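As a rough illustration of the two failure modes above, the sketch below compares the remaining headroom under each limit. The helper name first_cliff is hypothetical, and it assumes every new connection costs exactly one connection slot and one file descriptor, which ignores descriptors opened by parts, merges, and logs.

```shell
# Hypothetical helper: report which limit a growing connection count hits first.
# Args: current connections, max_connections, open fds, nofile limit.
first_cliff() (
    conns=$1; max_conns=$2; fds=$3; nofile=$4
    conn_headroom=$((max_conns - conns))
    fd_headroom=$((nofile - fds))
    if [ "$fd_headroom" -lt "$conn_headroom" ]; then
        echo "nofile after $fd_headroom more connections"
    else
        echo "max_connections after $conn_headroom more connections"
    fi
)
```

For example, first_cliff 1500 4096 60000 61000 reports the nofile limit, because only 1000 descriptors remain while 2596 connection slots are still free.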

flowchart TD
    A[Connection count climbing] --> B{Growth correlates with query load?}
    B -->|Yes| C[Concurrent query pressure or retry storm]
    B -->|No| D{Which protocol?}
    D -->|TCP 9000| E[Native client pool leak or too many pools]
    D -->|HTTP 8123| F[Keep-alive probe flood or scraper]
    D -->|Interserver 9009| G[Replication catch-up or distributed query amplification]
    C --> H[Check system.processes, query errors, and throughput]
    E --> I[Audit client pool size and idle timeout]
    F --> J[Audit health checks and monitoring clients]
    G --> K[Check replication queue and distributed query plans]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Client connection leak | TCPConnection or HTTPConnection grows steadily while query throughput and system.processes are flat; often concentrated in one application user | Client pool config, idle timeout, and whether connections are explicitly closed |
| Oversized connection pool | Step jump in TCPConnection after a deploy; count stays high but active queries do not increase | Client pool max size and the number of client instances multiplied by pool max |
| Retry storm | Spiky connection count aligned with spikes in FailedQuery or insert rejections; many short-lived connections | Query error rate and exception codes; client timeout and backoff configuration |
| Load balancer or monitoring probe flood | HTTPConnection grows without corresponding queries; many lightweight requests from the same source | Health check frequency and endpoint; monitoring scrapers running tight loops |
| Interserver traffic surge | InterserverConnection high while client connections are normal; replication lag or large distributed queries | system.replication_queue depth and distributed query fan-out patterns |
| File descriptor limit too low | Connection counts are moderate but queries or merges fail with open-file errors; FD usage near the limit | /proc/<pid>/limits and current /proc/<pid>/fd count |

Quick checks

-- Active connection counts by protocol
SELECT metric, value
FROM system.metrics
WHERE metric IN (
    'TCPConnection',
    'HTTPConnection',
    'InterserverConnection',
    'MySQLConnection',
    'PostgreSQLConnection'
);

-- Configured connection and concurrency limits
-- (server-level settings are listed in system.server_settings, not system.settings)
SELECT name, value
FROM system.server_settings
WHERE name IN ('max_connections', 'max_concurrent_queries');

# Open file descriptor count and process limit
# Run as root or the clickhouse user. pgrep -f matches any command line containing
# the string, so confirm the PID with: ps -p "$PID" -o pid,args
PID=$(pgrep -f clickhouse-server | head -n 1)
test -n "$PID" || { echo "clickhouse-server PID not found"; exit 1; }
echo "open fds: $(ls -1 /proc/$PID/fd 2>/dev/null | wc -l)"
grep "Max open files" /proc/$PID/limits

-- Live query concurrency and long-running queries
SELECT
    count(*) AS running_queries,
    countIf(elapsed > 60) AS long_running
FROM system.processes;

-- Recent query throughput and failures
-- (failures include queries rejected before execution started)
SELECT
    countIf(type = 'QueryFinish') AS finished,
    countIf(type IN ('ExceptionBeforeStart', 'ExceptionWhileProcessing')) AS failed
FROM system.query_log
WHERE event_time > now() - INTERVAL 5 MINUTE;

# Listening ports and interfaces
ss -tlnp | grep clickhouse

-- Recent client hosts to identify a source
SELECT
    client_hostname,
    count() AS queries
FROM system.query_log
WHERE event_time > now() - INTERVAL 5 MINUTE
  AND is_initial_query = 1
GROUP BY client_hostname
ORDER BY queries DESC
LIMIT 10;

How to diagnose it

  1. Quantify the climb. Snapshot TCPConnection, HTTPConnection, and InterserverConnection from system.metrics over several minutes. A sustained upward slope on one protocol while the others are flat points to a specific client class.
  2. Compare to limits. Check max_connections and max_concurrent_queries. If connections are near max_connections, new client attempts will be rejected. If running queries are near max_concurrent_queries, the issue is query concurrency, not a connection leak.
  3. Correlate with query load. Compare connection count to running query count from system.processes and insert/select throughput from system.query_log. Leaks show climbing connections without climbing queries.
  4. Check file descriptor headroom. Count /proc/<pid>/fd and compare to the Max open files limit. If FD usage is rising with connections, the nofile limit may be the first cliff.
  5. Find the responsible clients. Query system.query_log for client_hostname and user over the growth window. A single host or service user dominating requests indicates the source. Connections that never issue a query do not appear in system.query_log, so for idle leaks also inspect established sockets on the server ports with ss.
  6. Look for retry drivers. Check system.query_log for ExceptionBeforeStart and ExceptionWhileProcessing entries and system.errors for per-code error counts. Spikes in TOO_MANY_PARTS, MEMORY_LIMIT_EXCEEDED, or TIMEOUT_EXCEEDED often trigger client retries that open new connections.
  7. Inspect external probes. Verify that load balancer health checks point to /ping and do not run heavy queries in tight loops. Monitoring scrapers should not open a new persistent connection per sample without closing the previous one.
  8. Investigate interserver growth. If InterserverConnection is high, check system.replication_queue for stuck or retrying entries and review distributed queries for missing shard-key filters or GLOBAL IN amplification.
  9. Verify OS and container limits. Ensure systemd LimitNOFILE, Docker/containerd ulimits, and the limits inherited by Kubernetes pods are set high enough for production ClickHouse. A soft limit of 1024, a common Linux default, is far too low.
  10. Capture a baseline after the fix. Once the source is identified and fixed, snapshot the same metrics to confirm the connection growth rate returns to zero.
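Steps 1 and 10 both come down to measuring a slope between snapshots. The sketch below is one way to do that, assuming clickhouse-client is installed and can reach the server with default credentials. The function names sample_metric and conn_slope are hypothetical.

```shell
# Fetch the current value of one connection metric (needs a reachable server).
sample_metric() {
    clickhouse-client --query "SELECT value FROM system.metrics WHERE metric = '$1'"
}

# Pure arithmetic: connections gained per minute between two snapshots.
# Args: first value, last value, seconds elapsed between them.
conn_slope() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) * 60 / s }'
}

# Example session (not executed here):
#   first=$(sample_metric TCPConnection); sleep 300
#   last=$(sample_metric TCPConnection)
#   conn_slope "$first" "$last" 300
```

A slope that stays positive across several windows while the running query count is flat matches the leak pattern described in step 3. After a fix, the same measurement should settle near 0.0.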

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| TCPConnection | Native client load and leak indicator | > 80% of max_connections sustained |
| HTTPConnection | HTTP client and probe load | > 80% of max_connections or uncorrelated with query load |
| InterserverConnection | Replication and distributed query traffic | Sustained high while client connections are low |
| Process open FD count | Tracks file handles including sockets | > 70% of the process nofile limit |
| Query (running queries) | Distinguishes concurrency from idle connections | Approaches or exceeds max_concurrent_queries |
| Failed query rate | Reveals retry storms and upstream errors | > 1% of total queries sustained |
| Query latency P99 | Captures impact of handler pool starvation | > 2x baseline for more than 15 minutes |

Fixes

Client connection leaks and oversized pools

Reduce the client pool max size and ensure connections are explicitly closed after use, including error paths. Set client idle timeouts so stale keep-alive sockets do not accumulate. If you have many service instances, reduce per-instance pool size rather than increasing the global pool. Tradeoff: smaller pools limit burst throughput; address that with larger insert batches rather than more connections.
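The per-instance advice follows from simple multiplication: the worst case is every instance filling its pool at once. The sketch below works that out with a hypothetical helper, pool_budget. The instance counts and pool sizes in the example are illustrative, not recommendations.

```shell
# Hypothetical helper: worst-case connections if every instance fills its pool.
# Args: number of client instances, pool max per instance, max_connections.
pool_budget() (
    instances=$1; pool_max=$2; max_conns=$3
    worst_case=$((instances * pool_max))
    pct=$((worst_case * 100 / max_conns))   # integer percent, rounded down
    echo "worst case $worst_case connections, ${pct}% of max_connections"
)
```

For example, pool_budget 40 50 4096 gives 2000 connections (48%), while scaling the same service to 120 instances gives 6000 (146%), which exceeds the server limit without any leak.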

Retry storms

Do not respond to transient errors by opening more connections. Add exponential backoff and circuit breakers on the client side, and fix the root cause in ClickHouse. Common drivers are TOO_MANY_PARTS, MEMORY_LIMIT_EXCEEDED, and TIMEOUT_EXCEEDED. Tradeoff: backoff increases perceived client latency but prevents connection exhaustion and amplification.
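A minimal sketch of capped exponential backoff is shown below, assuming a client that can be wrapped as a shell command. The names backoff_delay and retry_with_backoff are hypothetical. Jitter is omitted for brevity, and the attempt cap is only a crude stand-in for a real circuit breaker.

```shell
# Delay in seconds before retry number $1 (0-based): min(cap, base * 2^attempt).
backoff_delay() (
    attempt=$1; base=$2; cap=$3
    delay=$base
    i=0
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
    echo "$delay"
)

# Run a command up to max_attempts times, sleeping between failed attempts.
# Args: max_attempts, base delay, delay cap, then the command and its arguments.
retry_with_backoff() {
    max_attempts=$1; base=$2; cap=$3
    shift 3
    attempt=0
    while :; do
        "$@" && return 0
        attempt=$((attempt + 1))
        if [ "$attempt" -ge "$max_attempts" ]; then return 1; fi
        sleep "$(backoff_delay $((attempt - 1)) "$base" "$cap")"
    done
}
```

With a base of 1 second and a cap of 30, the waits grow as 1, 2, 4, 8, 16, then hold at 30. In a real client, reuse the existing connection for the retry where possible, so that each attempt does not open a new socket.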

Health-check and monitoring floods

Configure load balancers and orchestration probes to use GET /ping on port 8123. Reduce probe frequency to the minimum required. Bind monitoring scrapers to internal interfaces and avoid opening a new connection per sample. Tradeoff: slower probes take longer to detect a dead process, but they avoid consuming handler threads.
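As an illustration of a lightweight liveness check, the sketch below calls /ping with curl and a short timeout. The function name ch_ping and the 2-second timeout are assumptions. The expected body "Ok." is what the ClickHouse HTTP interface returns from this endpoint.

```shell
# Hypothetical liveness probe: succeeds only if /ping answers "Ok." in time.
# Args: host (default 127.0.0.1), HTTP port (default 8123).
ch_ping() {
    host=${1:-127.0.0.1}; port=${2:-8123}
    # -f fails on HTTP errors, --max-time bounds how long the socket stays open.
    body=$(curl -fsS --max-time 2 "http://$host:$port/ping" 2>/dev/null) || return 1
    [ "$body" = "Ok." ]
}
```

Each call opens one short-lived connection and closes it, so the probe interval directly sets how many connections per minute the probe contributes.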

Interserver traffic

For replication catch-up, relieve the bottleneck on the source replica (disk I/O, network, merge pool) so fetches complete and close connections. For distributed query amplification, enforce shard-key filters in WHERE clauses, avoid large GLOBAL IN subqueries, and review distributed_product_mode. See the related guide on distributed query amplification.
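To see whether replication entries are retrying rather than completing, a query along the following lines may help. This is a sketch: the function names are hypothetical, the default threshold of 10 tries is arbitrary, and it assumes clickhouse-client can reach the server with default credentials.

```shell
# Build the query text separately so it can be reviewed before running.
# Arg: minimum num_tries to report (default 10, an arbitrary starting point).
stuck_replication_sql() {
    cat <<EOF
SELECT database, table, type, num_tries, last_exception
FROM system.replication_queue
WHERE num_tries > ${1:-10}
ORDER BY num_tries DESC
LIMIT 20
EOF
}

# Run the query against the local server.
stuck_replication_entries() {
    clickhouse-client --query "$(stuck_replication_sql "${1:-10}")"
}
```

Entries with a high num_tries and a repeating last_exception are the ones most likely to be holding or reopening interserver connections.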

File descriptor pressure

Raise the process nofile limit to at least 100000 in production. In systemd, set LimitNOFILE. In containers, raise host and runtime ulimits. Do not rely on high limits alone; alert on the ratio of open FDs to the limit so leaks are still visible. Tradeoff: a very high limit can delay detection of a severe leak.
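For systemd hosts, one way to set LimitNOFILE is a drop-in file, sketched below. The helper name write_nofile_dropin and the file name limits.conf are hypothetical, and the unit name clickhouse-server in the usage note is an assumption about how the service is installed.

```shell
# Hypothetical helper: write a systemd drop-in that raises the nofile limit.
# Args: drop-in directory, limit (default 100000).
write_nofile_dropin() (
    dir=$1; limit=${2:-100000}
    mkdir -p "$dir" || exit 1
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/limits.conf"
)

# Example (as root, not executed here):
#   write_nofile_dropin /etc/systemd/system/clickhouse-server.service.d 100000
#   systemctl daemon-reload && systemctl restart clickhouse-server
```

After the restart, the "Max open files" line in /proc/<pid>/limits should show the new value. If it does not, a container runtime or another unit setting is likely overriding it.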

Prevention

  • Alert on TCPConnection / max_connections and HTTPConnection / max_connections above 0.7.
  • Alert on process FD usage above 50% of the nofile limit.
  • Enforce client idle timeouts and pool caps in application configuration standards.
  • Use /ping for liveness probes and keep monitoring queries lightweight.
  • Monitor failed query rate and insert rejections so retry storms are caught at the source.
  • Document per-service connection budgets and review them after any client-side deploy.
  • Keep nofile at 100000 or higher on production nodes and containers.
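The ratio alerts in this list can be prototyped before they are wired into a monitoring system. The sketch below uses a hypothetical helper, ratio_exceeds, which only does the comparison. Fetching the inputs is left to whatever collector is already in place.

```shell
# Exit status 0 (alert) when value / limit is above threshold, 1 otherwise.
# Args: current value, configured limit, threshold as a fraction (for example 0.7).
ratio_exceeds() {
    awk -v v="$1" -v l="$2" -v t="$3" 'BEGIN { exit !(l > 0 && v / l > t) }'
}

# Example wiring (not executed here; assumes clickhouse-client access):
#   conns=$(clickhouse-client --query "SELECT value FROM system.metrics WHERE metric = 'TCPConnection'")
#   ratio_exceeds "$conns" 4096 0.7 && echo "TCPConnection above 70% of max_connections"
```

For example, 3000 connections against a limit of 4096 is about 0.73 and triggers the 0.7 alert, while 2000 (about 0.49) does not.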

Netdata correlation

Netdata correlates TCPConnection, HTTPConnection, InterserverConnection, running query count, and process FD usage on the same timeline to distinguish leaks from load. Derivative alerts on connection metrics detect climbing counts even when absolute values are below max_connections. Query latency, failed query rate, and insert rejection events overlay to identify the upstream error driving a retry storm. Netdata tracks per-process file descriptor usage and OS limits without manual /proc checks, and alerts on connection count and FD ratios approaching configured limits.