Collector CPU and TSDB write-queue saturation: the capacity signals
When a network monitoring collector saturates, the first visible symptom is rarely high collector CPU. It is traffic charts showing a decline during a traffic spike, an SNMP poll cycle drifting past its configured interval, or unexplained gaps in flow data. The degradation sits one to three subsystems downstream of the actual bottleneck, which is why collector-side incidents are frequently misdiagnosed.
This reference covers the capacity signals that precede data loss. The signals are organized by the data path through the collector: NIC receive, kernel socket buffer, parser and aggregator threads, TSDB write queue, and disk. Each stage has its own saturation signature, degradation curve, and leading indicators.
Use this as a checklist for capacity monitoring on flow collectors, SNMP pollers, syslog receivers, trap receivers, and the TSDB backing them. The vendor-specific section covers OpenTelemetry Collector, Prometheus remote_write, VictoriaMetrics, Telegraf, and ntopng/nProbe.
The saturation path
Collector saturation is a cascade, not a single event. A packet arrives at the NIC ring buffer, traverses the kernel socket buffer, reaches the parser and aggregator, enters the TSDB write queue, and is flushed to disk. A bottleneck at any stage backs up everything upstream of it. Each stage requires a different remediation: NIC ring buffer tuning, socket buffer sizing, parser optimization, queue capacity increases, or disk IOPS.
flowchart TD
A["NIC ring buffer
rx_missed_errors"] --> B["UDP socket buffer
Udp_RcvbufErrors"]
B --> C["Parser / aggregator
per-core %soft, %user"]
C --> D["TSDB write queue
depth vs capacity"]
D --> E["Disk / storage
free space, iostat await"]
E -->|backpressure| D
D -->|backpressure| C
C -->|slow drain| B

Downward arrows represent data movement. Upward arrows represent backpressure. When disk I/O saturates, the TSDB write queue grows. When the queue grows, the parser blocks on enqueue. When the parser blocks, the socket buffer drains slowly. When the socket buffer overflows, the kernel drops packets silently and Udp_RcvbufErrors increments. Every signal below sits at one of these five stages.
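Because backpressure travels upstream, the most downstream stage that shows saturation is usually the best first suspect. The following is a minimal Python sketch of that triage rule. The stage names and the `deepest_saturated_stage` helper are illustrative assumptions, not part of any collector's API, and how each stage is judged "saturated" is left to the signals described in the rest of this reference.

```python
from typing import Dict, Optional

# Stages in data-path order, upstream first (names are illustrative).
STAGES = ["nic", "socket_buffer", "parser", "write_queue", "disk"]


def deepest_saturated_stage(saturated: Dict[str, bool]) -> Optional[str]:
    """Return the most downstream saturated stage, or None if none is saturated.

    Backpressure propagates upstream, so symptoms at the socket buffer or
    parser may only be consequences of a stage further down the path.
    """
    for stage in reversed(STAGES):
        if saturated.get(stage, False):
            return stage
    return None


# Socket drops, a blocked parser, and a growing queue, but a healthy disk:
observed = {"socket_buffer": True, "parser": True, "write_queue": True, "disk": False}
print(deepest_saturated_stage(observed))  # write_queue
```

This is a starting point for investigation, not a proof: two stages can saturate independently at the same time.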
Collector CPU signals
Aggregate CPU is the least useful signal on a multi-core collector. A 16-core machine with one core pinned at 100% from Receive Side Scaling (RSS) funneling shows roughly 6% aggregate. The bottleneck is real, but the aggregate hides it. Always read per-core utilization.
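As a rough illustration of reading per-core softirq share without mpstat, the sketch below diffs two snapshots of /proc/stat content. It assumes the standard Linux field order on each cpuN line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice). The function names and the two-core sample snapshots are made up for the example.

```python
from typing import Dict, List


def parse_proc_stat(text: str) -> Dict[str, List[int]]:
    """Map 'cpuN' to its list of jiffy counters from /proc/stat content."""
    cores: Dict[str, List[int]] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate 'cpu' line; keep only per-core 'cpu0', 'cpu1', ...
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            cores[parts[0]] = [int(v) for v in parts[1:]]
    return cores


def per_core_softirq_pct(before: str, after: str) -> Dict[str, float]:
    """Percentage of each core's time spent in softirq between two snapshots."""
    start, end = parse_proc_stat(before), parse_proc_stat(after)
    result: Dict[str, float] = {}
    for core, end_vals in end.items():
        deltas = [e - s for s, e in zip(start[core], end_vals)]
        # Sum the first 8 fields only: guest time is already counted in user/nice.
        total = sum(deltas[:8])
        result[core] = 100.0 * deltas[6] / total if total else 0.0  # index 6 = softirq
    return result


# Synthetic snapshots: cpu0 spends the whole interval in softirq, cpu1 is idle.
BEFORE = "cpu  0 0 0 0 0 0 0 0 0 0\ncpu0 0 0 0 0 0 0 0 0 0 0\ncpu1 0 0 0 0 0 0 0 0 0 0\n"
AFTER = "cpu  0 0 0 100 0 0 100 0 0 0\ncpu0 0 0 0 0 0 0 100 0 0 0\ncpu1 0 0 0 100 0 0 0 0 0 0\n"
print(per_core_softirq_pct(BEFORE, AFTER))
```

On a live host the two strings would come from reading /proc/stat twice, a second or more apart.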
| Signal | What it means | Where to read it | Threshold |
|---|---|---|---|
| Per-core %soft (softirq) | Kernel packet processing on a specific core. One core at 100% with others idle indicates RSS funneling all interrupts to one CPU. | mpstat -P ALL 1 5 | Any core at 100% sustained with others idle |
| Per-core %sys | Kernel overhead, often context switching or packet processing. | mpstat -P ALL 1 5 | Rising alongside %soft |
| Per-core %user on collector process | Parser or aggregator bottleneck in user space. Regex-heavy per-record processing is a common cause. | top -H -p $(pgrep -d, <collector>) | Sustained high %user on parser threads |
| Load average (1-min) | Sustained oversubscription. | /proc/loadavg | > 0.7 x core count sustained |
| NET_RX / NET_TX softirq rate | Kernel receive and transmit processing load. | watch -n1 'cat /proc/softirqs' | Rate proportional to incoming packet rate |
| IRQ distribution across cores | Verifies RSS distributes packet interrupts across cores rather than funneling to one. | grep <iface> /proc/interrupts | Single core receiving most interrupts |
Collector CPU at > 90% sustained for more than 5 minutes means data loss is imminent. Above 70% sustained warrants investigation. High %soft on the NIC-receive core is expected during high packet rates; high %soft on unrelated cores suggests RSS misconfiguration or a driver issue. Use ethtool -S <iface> to check rx_missed_errors, which indicates that the NIC hardware ring buffer itself is dropping packets before the kernel can process them.
Correlate CPU signals with UDP socket buffer drops and TSDB write queue depth. A rising flow receive rate combined with rising per-core %soft and rising Udp_RcvbufErrors unambiguously identifies collector-side overload.
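The Udp_RcvbufErrors counter can be read from /proc/net/snmp, where a header line names the fields and the following line carries the values. Below is a minimal sketch of parsing that file's content and computing the drop delta between two reads. The `udp_counters` and `rcvbuf_drops_between` names and the sample text are illustrative assumptions.

```python
from typing import Dict


def udp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the 'Udp:' header/value line pair from /proc/net/snmp content."""
    # 'UdpLite:' lines do not match the 'Udp:' prefix, so they are ignored.
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("expected a Udp: header line and a Udp: value line")
    header, values = rows[0], rows[1]
    return dict(zip(header[1:], (int(v) for v in values[1:])))


def rcvbuf_drops_between(earlier: str, later: str) -> int:
    """Datagrams dropped for lack of socket buffer space between two reads."""
    return udp_counters(later)["RcvbufErrors"] - udp_counters(earlier)["RcvbufErrors"]


# Synthetic sample in the /proc/net/snmp layout.
SAMPLE = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1000 2 7 500 7 0\n"
    "UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "UdpLite: 0 0 0 0 0 0\n"
)
print(udp_counters(SAMPLE)["RcvbufErrors"])  # 7
```

Any nonzero delta between consecutive reads is worth correlating with the per-core %soft and receive-rate signals above.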
TSDB write-queue and disk signals
The TSDB write queue buffers between the parser and storage. When it grows, data is produced faster than it can be persisted. Disk fills are cliff events: the TSDB stops accepting writes and data is lost. Write-queue growth is more gradual but ends the same way.
| Signal | What it means | Where to read it | Threshold |
|---|---|---|---|
| Disk free percentage | Approaching the cliff where TSDB stops accepting writes. | df -h /var/lib/<tsdb> | < 20% TICKET, < 10% PAGE |
| TSDB write queue depth | Backlog of unwritten samples. Growing without bound means the TSDB cannot keep up with ingestion. | Collector stats endpoint or vendor metric | > 2x rolling 1-hour average TICKET, unbounded growth PAGE |
| iostat %util | Disk busy percentage during write bursts. | iostat -xz 1 5 | Approaching 100% sustained |
| iostat await | Average I/O latency. Rising await means disk performance is degrading under load. | iostat -xz 1 5 | > 20ms is a leading indicator |
| Series cardinality | Number of distinct time series the TSDB tracks. The silent inflation driver. A single /24 subnet added to a flow collector can add tens of thousands of new series. | TSDB introspection or stats endpoint | Growth > 5%/week is concerning |
| Disk fill rate | Bytes written per day. Accelerating fill rate without a known cause often points to cardinality inflation. | df over time, or TSDB ingest metrics | Accelerating beyond 7-day trend |
Two production gotchas: first, logs on the same volume as the TSDB have caused outages; use separate volumes. Second, some TSDBs (Prometheus, VictoriaMetrics) compact periodically, causing disk I/O spikes that look like saturation but are normal. Correlate compaction windows with I/O spikes before alerting.
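The write-queue thresholds in the table above can be expressed as a small classifier. This is a sketch under stated assumptions: "unbounded growth" is approximated here as every sample in the window being larger than the one before it, which is a hypothetical definition chosen for illustration, and the `classify_queue_depth` name is invented.

```python
from statistics import mean
from typing import Sequence


def classify_queue_depth(last_hour: Sequence[float], current: float) -> str:
    """Classify the current queue depth against a rolling 1-hour window.

    Returns "PAGE", "TICKET", or "OK".
    """
    if not last_hour:
        return "OK"  # no baseline yet, nothing to compare against
    # Assumption: strictly increasing depth across the whole window, continuing
    # into the current sample, stands in for "unbounded growth".
    strictly_growing = (
        len(last_hour) >= 2
        and all(b > a for a, b in zip(last_hour, last_hour[1:]))
        and current > last_hour[-1]
    )
    if strictly_growing:
        return "PAGE"
    if current > 2 * mean(last_hour):
        return "TICKET"
    return "OK"


print(classify_queue_depth([100, 110, 90, 100], 250))   # TICKET
print(classify_queue_depth([100, 200, 300, 400], 500))  # PAGE
```

Suppressing the result during known compaction windows is left to the caller.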
Leading indicators and runway estimation
Each contested resource has a degradation curve and a runway. The curve tells you how failure arrives: cliff or gradual. The runway tells you how long you have before impact.
| Resource | Leading indicator | Degradation curve | Headroom target |
|---|---|---|---|
| Collector CPU (parser/aggregator) | Parser throughput (records/sec) vs incoming rate (packets/sec x records/packet). Per-core %soft rising. | Gradual then cliff. Parser slows, queue grows, latency rises, eventually buffer drops begin. | Parser capacity > 2x peak incoming rate. CPU < 60%. |
| TSDB write queue | Queue depth trending up. Write latency rising. | Gradual then cliff. Latency increases, then backpressure, then drops. | Queue depth at baseline. No enqueue failures. |
| Disk space | Free bytes trending down. Fill rate accelerating. | Cliff at 0% free. TSDB stops accepting writes. | > 30% free. 7+ days runway at current growth rate. |
| TSDB cardinality | Series count trending up. Distinct label values increasing. TSDB process memory rising. | Gradual inflation then cliff on memory exhaustion or query performance collapse. | Growth rate < 1%/week sustained. No unexpected jumps. |
| Worker thread pool | Worker queue depth growing. Processing latency rising. Timeout rate per minute. | Soft saturation then cliff. Latency rises, timeouts appear, false device down alerts cascade. | Utilization < 50% of worker capacity. Queue depth < 25% of max. Timeout rate 0%. |
| UDP socket buffer | Udp_RcvbufErrors incrementing. ss -lun -m showing Recv-Q approaching buffer limit. | Cliff. Once the buffer overflows, every additional packet is dropped. | Buffer sized to absorb burst. 16 MB+ for high-pps collectors. |
Runway estimation formulas:
- Disk: days_to_fill = free_bytes / bytes_per_day_trend, where the trend is calculated over the past 7 days to account for weekday/weekend variation.
- TSDB cardinality: multiply the disk trend by the cardinality growth rate. If cardinality is growing at 5% per week, expect disk consumption to accelerate proportionally.
- Poller workers: time_to_saturation = current_capacity * (1 - current_utilization) / recent_growth_rate. Always apply a 50% safety margin.
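The disk and poller formulas above can be sketched in Python as follows. Two assumptions are made here and are not stated in the formulas themselves: the 7-day trend is taken as a simple average daily increase across daily usage samples, and the 50% safety margin is interpreted as halving the estimated runway. Function names are illustrative.

```python
from typing import Sequence


def days_to_fill(free_bytes: float, daily_used_bytes: Sequence[float]) -> float:
    """free_bytes / bytes_per_day_trend, using daily 'used bytes' samples.

    Eight samples cover seven daily intervals. The trend is the average
    increase per interval (an assumed, deliberately simple estimator).
    """
    if len(daily_used_bytes) < 2:
        raise ValueError("need at least two samples to compute a trend")
    trend = (daily_used_bytes[-1] - daily_used_bytes[0]) / (len(daily_used_bytes) - 1)
    if trend <= 0:
        return float("inf")  # usage is flat or shrinking
    return free_bytes / trend


def time_to_saturation(
    current_capacity: float,
    current_utilization: float,
    recent_growth_rate: float,
    safety_margin: float = 0.5,
) -> float:
    """Remaining headroom divided by growth rate, reduced by the safety margin.

    The result is in the time unit of recent_growth_rate (for example weeks,
    if growth is expressed in devices per week).
    """
    if recent_growth_rate <= 0:
        return float("inf")
    headroom = current_capacity * (1 - current_utilization)
    return headroom / recent_growth_rate * (1 - safety_margin)


GB = 10**9
used = [i * 10 * GB for i in range(8)]      # growing 10 GB per day
print(days_to_fill(700 * GB, used))         # 70.0
print(time_to_saturation(1000, 0.5, 25))    # 10.0
```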
Vendor-specific queue-depth signals
The capacity signals above apply to any collector. The metrics below are specific to the major open-source stacks.
OpenTelemetry Collector. The critical queue metrics are otelcol_exporter_queue_size (current queue depth, to be compared against otelcol_exporter_queue_capacity), otelcol_exporter_enqueue_failed_spans (increments when the export queue rejects data because it is full), and otelcol_processor_refused_spans (data refused by the memory_limiter; should be zero in steady state). Scale up when the queue is sustained above 60-70% of capacity. Scale down when it is consistently below 20%. A separate signal, otelcol_exporter_send_failed_spans, increments on permanent export failures such as HTTP 4xx or connection refused; adding replicas does not fix this.
The memory_limiter processor historically refuses data with a retryable error when the soft limit is breached, relying on upstream receivers to re-queue. This creates a feedback loop with bounded queues that can amplify data loss. A drop instead of refuse mode is tracked but . Pair the memory_limiter with GOMEMLIMIT set to 80-90% of the container or host memory limit to give the Go runtime proactive GC headroom.
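The scale-up and scale-down guidance for the exporter queue can be sketched as a decision over a window of samples. This is an illustration only: the `scaling_hint` name, the choice of 0.70 as the upper bound of the 60-70% range, and the rule that every sample in the window must cross the threshold (as a stand-in for "sustained") are assumptions.

```python
from typing import Sequence, Tuple


def scaling_hint(
    samples: Sequence[Tuple[float, float]],
    scale_up_at: float = 0.70,
    scale_down_at: float = 0.20,
) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' for a window of samples.

    Each sample is (queue_size, queue_capacity) taken at one scrape.
    """
    if not samples:
        return "hold"
    ratios = [size / capacity for size, capacity in samples]
    # "Sustained" is approximated as: every sample in the window qualifies.
    if all(r > scale_up_at for r in ratios):
        return "scale_up"
    if all(r < scale_down_at for r in ratios):
        return "scale_down"
    return "hold"


print(scaling_hint([(800, 1000), (900, 1000)]))  # scale_up
print(scaling_hint([(800, 1000), (100, 1000)]))  # hold
```

A rising send-failure counter should be ruled out first, since adding replicas does not help with permanent export failures.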
Prometheus remote_write. All backpressure knobs live under queue_config. The default capacity is 10,000 samples per shard. The default max_samples_per_send is 2,000. The default max_shards is 50. The key diagnostic metric is prometheus_remote_storage_samples_pending, which tracks samples waiting in the shard queue. Lag between prometheus_remote_storage_queue_highest_sent_timestamp_seconds and prometheus_remote_storage_highest_timestamp_in_seconds signals queue backlog.
A recent Prometheus commit tightened remote_write resharding logic to prevent deadlocks. Older configs with manually inflated capacity values may behave differently now. Validate that per-shard memory remains reasonable.
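Two quick calculations help when reviewing a remote_write configuration: the backlog lag implied by the two timestamp metrics, and the upper bound on samples that can sit in shard queues. The sketch below assumes the metric values have already been fetched as numbers; the function names are illustrative, and no bytes-per-sample figure is assumed, so the second function returns a sample count rather than memory.

```python
def remote_write_lag_seconds(highest_timestamp_in: float, highest_sent: float) -> float:
    """Seconds by which the newest sent sample trails the newest ingested one.

    Inputs correspond to prometheus_remote_storage_highest_timestamp_in_seconds
    and prometheus_remote_storage_queue_highest_sent_timestamp_seconds.
    """
    return max(0.0, highest_timestamp_in - highest_sent)


def max_buffered_samples(capacity_per_shard: int, shards: int) -> int:
    """Upper bound on samples held in shard queues: capacity times shard count."""
    return capacity_per_shard * shards


print(remote_write_lag_seconds(1_700_000_060, 1_700_000_000))  # 60.0
print(max_buffered_samples(10_000, 4))                         # 40000
```

A lag that keeps growing while the shard count is already at its maximum suggests the remote endpoint, not the queue configuration, is the limit.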
VictoriaMetrics. The primary signal is vmagent_remotewrite_pending_data_bytes: bytes scraped but not yet sent to the remote write target. Connection saturation is diagnosed with max(rate(vm_rpc_send_duration_seconds_total{}[1m])) by(addr). A value of 0.9 seconds means the connection is more than 90% saturated. Under sustained write load, vminsert nodes can become CPU- or network-saturated between themselves and vmstorage nodes, causing the remote write client to fall behind without explicit queue-full errors.
Telegraf. Telegraf drops metrics when its internal buffer reaches metric_buffer_limit. The log message “Metric buffer limit exceeded” appears when the buffer cannot drain faster than data arrives. The default metric_buffer_limit is 10,000. When multiple InfluxDB outputs are configured and at least one is reachable, Telegraf resets the buffer size counter even if a failed output still has queued data. This means internal_buffer_size can mislead operators into believing the buffer is draining when one output is permanently backed up.
ntopng/nProbe. The critical socket receive buffer warning appears when /proc/sys/net/core/rmem_max is below 8,388,608 bytes (8 MB). The log message instructs operators to increase it. The Linux default net.core.rmem_max is often 212992 bytes, which is inadequate for high-pps flow collectors. Production deployments should target 16 MB or higher.
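The rmem_max thresholds above translate into a simple check. The sketch takes the value as an integer so it can be tested anywhere; on a Linux host the value would come from reading /proc/sys/net/core/rmem_max. The `rmem_max_status` name and the status strings are illustrative.

```python
WARNING_BYTES = 8 * 1024 * 1024   # 8,388,608: below this, the warning appears
TARGET_BYTES = 16 * 1024 * 1024   # production target for high-pps collectors


def rmem_max_status(rmem_max_bytes: int) -> str:
    """Classify a net.core.rmem_max value against the 8 MB and 16 MB marks."""
    if rmem_max_bytes < WARNING_BYTES:
        return "below_warning_threshold"
    if rmem_max_bytes < TARGET_BYTES:
        return "below_production_target"
    return "ok"


print(rmem_max_status(212992))            # below_warning_threshold
print(rmem_max_status(16 * 1024 * 1024))  # ok
```

Raising rmem_max only lifts the ceiling; the collector still has to request a larger receive buffer on its socket for the change to take effect.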
How Netdata helps
Netdata monitors the full saturation path end to end, correlating signals across stages that operators normally check in isolation:
- Per-core CPU breakdown including %soft (softirq), %sys, and %user, so RSS funneling to a single core is visible without custom mpstat dashboards.
- UDP socket buffer drops via kernel counters (Udp_RcvbufErrors), with anomaly detection that surfaces the first nonzero increment rather than waiting for a fixed threshold.
- Disk utilization, free space, and I/O latency (await, %util) on the TSDB volume, with fill-rate trending for runway estimation.
- Collector process metrics when Netdata runs alongside the NPM stack, exposing parser thread CPU and write-queue depth where the vendor exposes them.
- Cross-signal correlation between rising flow receive rate, rising UDP buffer drops, and rising TSDB write queue depth, which together unambiguously identify collector-side saturation versus exporter-side or network-side issues.
Related guides
- ARP cache staleness: when IP-to-MAC mapping goes bad
- Asymmetric routing: why your path and latency measurements lie
- Audit log gaps: detecting syslog/trap tampering or loss
- BGP flapping: why a peer keeps resetting and how to find the cause
- BGP NOTIFICATION and Cease messages: what each subcode is telling you
- BGP RIB and FIB growth: monitoring route-table size before it bites
- BGP route leak and hijack: the detection signals and alerts that matter
- BGP session Established but stale: detecting silent route loss
- Correlating cloud VPC flow logs with on-prem NetFlow
- Cold-start topology: why your map is incomplete after a collector restart
- Locating endpoints behind NAT and wireless: the positioning problem
- Stale FDB/MAC tables: why endpoint location is wrong







