Udp_RcvbufErrors: tuning kernel receive buffers for flow, trap, and syslog collectors

Udp_RcvbufErrors is incrementing on your flow collector. Flow charts show traffic declining during what is actually a traffic spike. The kernel is receiving datagrams from exporters, but the socket receive buffer is full, so it drops them silently. No application-level counter moves. No error log fires. The dashboards trend down while real traffic trends up.

Flow collectors (NetFlow v5/v9, IPFIX, sFlow), SNMP trap receivers (UDP 162), and syslog receivers (UDP 514) all depend on UDP socket buffers. When the buffer overflows, the kernel discards the datagram and increments Udp_RcvbufErrors: the RcvbufErrors field on the Udp: line of /proc/net/snmp, which nstat reports as UdpRcvbufErrors. The application never sees the datagram.

The common Linux default net.core.rmem_max of 4,194,304 bytes (4 MB) is the starting point for most incidents at this layer. Production flow collectors typically need 16 MB or more. Very high-pps sFlow collectors may need 33 MB. But raising the ceiling alone is not always the fix: the application must request a larger buffer via SO_RCVBUF, the parser must drain it fast enough, and the global UDP memory ceiling (net.ipv4.udp_mem) can impose a separate limit.

What this means

When a UDP datagram arrives, the kernel attempts to place it in the destination socket’s receive buffer. If the buffer is full because the application has not read from it fast enough, the kernel drops the datagram and increments Udp_RcvbufErrors. The counter is system-wide across all UDP sockets. It does not tell you which socket, which port, or which exporter was affected.
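To make the counter concrete, here is a minimal Python sketch that reads it the same way the shell checks do. It assumes the usual /proc/net/snmp layout, where each protocol has a header line of field names followed by a line of values. The function names are illustrative, not part of any tool.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp text as a name -> int dict."""
    # Each protocol appears as two lines sharing a prefix: names, then values.
    # "UdpLite:" lines are not matched because the prefix includes the colon.
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp: header/value pair found")
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())
```

On a Linux collector, read_udp_counters()["RcvbufErrors"] should match the value nstat reports as UdpRcvbufErrors.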

Two layers of drops exist, and they have different fixes:

  • NIC ring buffer drops happen at the hardware level, before the packet reaches the socket layer. Check /proc/net/dev RX drop columns and ethtool -S <iface> for counters like rx_missed_errors.
  • Socket buffer drops happen after the NIC has accepted the packet, at the kernel-to-application delivery boundary. Check Udp_RcvbufErrors and the Recv-Q column of ss -lun -m.

Both must be monitored. If only one is rising, it narrows the problem. If both are rising, the entire receive path is saturated.
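The two-layer comparison can be sketched in Python as follows. This assumes the standard /proc/net/dev layout (two header lines, then one line per interface with the RX drop counter as the fourth receive column). The function names and the returned labels are illustrative.

```python
def parse_rx_drops(net_dev_text):
    """Map interface name -> RX drop counter from /proc/net/dev text."""
    drops = {}
    for line in net_dev_text.splitlines()[2:]:    # skip the two header lines
        if ":" not in line:
            continue
        iface, fields = line.split(":", 1)
        values = fields.split()
        # Receive columns: bytes packets errs drop fifo frame compressed multicast
        drops[iface.strip()] = int(values[3])
    return drops


def saturated_layer(nic_drop_delta, rcvbuf_error_delta):
    """Classify which layer is dropping, given counter increases over an interval."""
    if nic_drop_delta > 0 and rcvbuf_error_delta > 0:
        return "both"      # entire receive path is saturated
    if nic_drop_delta > 0:
        return "nic"
    if rcvbuf_error_delta > 0:
        return "socket"
    return "none"
```

Feed saturated_layer the differences between two samples taken a fixed interval apart, not the absolute counter values.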

flowchart TD
    A[UdpRcvbufErrors incrementing] --> B{NIC RX drops rising too?}
    B -- Yes --> C[Fix NIC ring buffer and RSS first]
    B -- No --> D[Problem is at socket layer]
    D --> E{ss -m: Recv-Q near limit?}
    E -- No --> F[Check udp_mem global pressure]
    E -- Yes --> G{Collector CPU pattern?}
    G -- One core at 100% --> H[RSS misconfiguration]
    G -- System-wide high --> I[Parser or TSDB bottleneck]
    G -- Low or normal --> J[Undersized rmem_max]
    J --> K[Raise rmem_max + rmem_default]
    K --> L[Verify app sets SO_RCVBUF]
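
For readers who prefer code to diagrams, the flowchart can be written as a small Python function. This is only a restatement of the decision tree above. The cpu_pattern strings are hypothetical labels chosen for this sketch.

```python
def next_step(nic_drops_rising, recvq_near_limit, cpu_pattern=None):
    """Walk the decision tree and return the recommended next action."""
    if nic_drops_rising:
        return "Fix NIC ring buffer and RSS first"
    # From here on, the problem is at the socket layer.
    if not recvq_near_limit:
        return "Check udp_mem global pressure"
    by_cpu_pattern = {
        "one_core_100": "RSS misconfiguration",
        "system_wide_high": "Parser or TSDB bottleneck",
        "low_or_normal": ("Undersized rmem_max: raise rmem_max + rmem_default, "
                          "then verify app sets SO_RCVBUF"),
    }
    if cpu_pattern not in by_cpu_pattern:
        raise ValueError("cpu_pattern must be one of: " + ", ".join(by_cpu_pattern))
    return by_cpu_pattern[cpu_pattern]
```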

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Undersized rmem_max | Drops proportional to incoming packet rate; ss -m shows Recv-Q at limit | sysctl net.core.rmem_max |
| Slow consumer (parser or TSDB write blocked) | Drops during bursts; collector CPU not fully utilized (I/O bound) | Collector write queue depth or parser stats |
| RSS misconfiguration | One CPU core pinned at 100% while others are idle; drops during high pps | grep <iface> /proc/interrupts |
| Global UDP memory pressure | Drops continue even after raising rmem_max and SO_RCVBUF | cat /proc/sys/net/ipv4/udp_mem |
| Application not requesting larger buffer | rmem_max raised but ss -m shows buffer still at old size | getsockopt return value or app config |

Quick checks

All read-only and safe to run on a production collector:

# System-wide UdpRcvbufErrors counter
nstat -az UdpRcvbufErrors

# Same data via /proc/net/snmp (RcvbufErrors column)
cat /proc/net/snmp | grep '^Udp:'

# Current socket buffer fill for a flow listener on port 2055
ss -lun '( sport = :2055 )' -m

# Current rmem_max and rmem_default
sysctl net.core.rmem_max net.core.rmem_default

# Global UDP memory pressure limits (min pressure max, in pages)
cat /proc/sys/net/ipv4/udp_mem

# NIC RX drops (happen before socket layer)
cat /proc/net/dev

# Detailed NIC drop counters (names vary by driver, e.g. rx_missed_errors)
ethtool -S eth0 | grep -iE 'drop|miss'

# Per-core CPU utilization and softirq distribution
mpstat -P ALL 1 5

# IRQ distribution for the NIC
cat /proc/interrupts | grep eth0

# Kernel packet processing backpressure
cat /proc/net/softnet_stat
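
The softnet_stat output is hexadecimal and unlabeled, so it is easy to misread. The sketch below decodes the first three columns, which on Linux are packets processed, packets dropped because the per-CPU backlog was full, and time_squeeze (the softirq budget ran out with work remaining). Later columns vary by kernel version and are ignored here. The function name is illustrative.

```python
def parse_softnet_stat(text):
    """Decode the first three hex columns of each /proc/net/softnet_stat row."""
    rows = []
    for index, line in enumerate(text.splitlines()):
        if not line.strip():
            continue
        cols = [int(value, 16) for value in line.split()]
        rows.append({
            "row": index,              # one row per online CPU, in file order
            "processed": cols[0],
            "dropped": cols[1],
            "time_squeeze": cols[2],
        })
    return rows
```

A dropped or time_squeeze value that keeps growing between samples suggests the kernel is falling behind before datagrams ever reach the socket buffer.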

How to diagnose it

  1. Confirm the counter is actively incrementing. Run nstat -az UdpRcvbufErrors twice, 30 seconds apart. The second value should be higher if drops are ongoing. A historically nonzero value that is not growing may represent a past incident already resolved.

  2. Check whether NIC-level drops are also rising. Read /proc/net/dev and ethtool -S <iface> for rx_missed_errors. If NIC drops are rising alongside Udp_RcvbufErrors, fix the NIC ring buffer and RSS first. The socket buffer overflow is a downstream symptom of packets arriving faster than the kernel can process them at all.

  3. Inspect the listener socket’s current buffer state. Run ss -lun '( sport = :2055 )' -m (replace 2055 with 6343 for sFlow, 4739 for IPFIX, 162 for traps, 514 for syslog). If Recv-Q is near the buffer limit, the application is not draining fast enough.

  4. Check the effective buffer size. The ss -m output shows the actual receive buffer allocated. If you raised rmem_max but the socket still shows the old size, the application has not called setsockopt(SO_RCVBUF) with the larger value, or it was started before the sysctl change. Already-running sockets do not pick up a new rmem_max automatically.

  5. Examine CPU utilization per core. Run mpstat -P ALL 1 5. A single core at 100% in the %soft column indicates RSS is funneling all packet processing to one CPU. System-wide high CPU indicates a parser or TSDB bottleneck.

  6. Check global UDP memory pressure. If drops persist after raising rmem_max and verifying SO_RCVBUF, read /proc/sys/net/ipv4/udp_mem. This sets the global UDP memory ceiling across all sockets (format: min pressure max, in pages). Compare it with current usage, which is the mem field on the UDP line of /proc/net/sockstat, also in pages. If aggregate UDP memory exceeds the pressure threshold, the kernel drops packets even when individual socket buffers have room.

  7. Compare device-side export counts against collector inbound rate. On a Cisco device, snmpget -v2c -c <community> <device> .1.3.6.1.4.1.9.9.387.1.4.4 returns cnfESPktsExported. If the device exported significantly more than the collector received, the gap is silent loss in transit or at the socket buffer. This is the only reliable end-to-end loss detection method.
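
Steps 3 and 4 rely on reading the skmem field that ss -m prints under each socket. A minimal Python sketch for extracting it follows. It assumes the skmem:(r...,rb...,...) format described in the ss man page, where r is receive memory currently allocated, rb is the receive buffer limit, and d (present in newer iproute2 versions) is the per-socket drop count. Function names are illustrative.

```python
import re

SKMEM_RE = re.compile(r"skmem:\(([^)]*)\)")


def parse_skmem(ss_output):
    """Return one dict per skmem:(...) group found in `ss -m` output."""
    sockets = []
    for match in SKMEM_RE.finditer(ss_output):
        fields = {}
        for item in match.group(1).split(","):
            m = re.fullmatch(r"([a-z]+)(\d+)", item.strip())
            if m:
                fields[m.group(1)] = int(m.group(2))
        sockets.append(fields)
    return sockets


def fill_ratio(skmem):
    """Fraction of the receive buffer limit (rb) currently allocated (r)."""
    limit = skmem.get("rb", 0)
    return skmem.get("r", 0) / limit if limit else 0.0
```

If rb still shows the old size after a sysctl change, the socket was created before the change or the application never requested more.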

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| UdpRcvbufErrors | The only direct kernel signal for socket buffer drops | Any nonzero increment in production |
| ss -m Recv-Q | Shows real-time buffer fill per socket | Recv-Q approaching buffer limit |
| UdpInDatagrams | Total UDP datagrams received, for computing drop ratio | Drop ratio > 0.001 (0.1%) |
| NIC RX drops (/proc/net/dev) | Drops at hardware layer, before socket | Any nonzero RX drop rate on flow-ingress NIC |
| Per-core CPU %soft | Indicates RSS distribution problems | Single core at 100% while others idle |
| Collector write queue depth | Slow consumer backing up the buffer | Queue growing without bound |
| Flow packets received rate | Incoming load on the collector | Spike correlated with drop spike |
| udp_mem utilization | Global UDP memory pressure | Aggregate near pressure threshold |
| Flow inbound vs device exported | End-to-end loss detection | Inbound significantly less than exported |
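
The drop ratio in the table can be computed from two counter samples. The sketch below makes one assumption worth stating: it treats arrivals as delivered datagrams (the InDatagrams increase) plus buffer drops (the RcvbufErrors increase), on the understanding that dropped datagrams are not counted in InDatagrams. It also ignores counter wrap. The dict keys follow the /proc/net/snmp field names, and the function name is illustrative.

```python
def drop_ratio(before, after):
    """Share of arriving UDP datagrams lost to full socket buffers in the interval."""
    dropped = after["RcvbufErrors"] - before["RcvbufErrors"]
    delivered = after["InDatagrams"] - before["InDatagrams"]
    arrived = dropped + delivered       # assumption: drops are not in InDatagrams
    return dropped / arrived if arrived > 0 else 0.0


# Example: 5 drops against 4995 delivered datagrams is a 0.1% loss ratio.
earlier = {"RcvbufErrors": 10, "InDatagrams": 1000}
later = {"RcvbufErrors": 15, "InDatagrams": 5995}
ratio = drop_ratio(earlier, later)
```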

Fixes

Raise rmem_max and rmem_default

The immediate fix for an undersized ceiling:

# Runtime change (takes effect for new sockets immediately)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=8388608

# Persistent configuration
cat >> /etc/sysctl.d/99-udp-collector.conf << 'EOF'
net.core.rmem_max = 16777216
net.core.rmem_default = 8388608
EOF
sysctl --system

Start with 16 MB for rmem_max and 8 MB for rmem_default. For very high-volume sFlow collectors, 33 MB may be necessary. Already-running sockets do not pick up the new rmem_max automatically. The collector process must restart or re-bind its listener socket for the new ceiling to take effect.
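
One rough way to reason about these sizes is to ask how long a stall in the collector a given buffer can absorb. The Python sketch below is a back-of-the-envelope estimate, not a kernel formula: the kernel charges each datagram its full in-memory size rather than just its payload, so treat the result as an upper bound. The overhead_bytes parameter and the example traffic figures are hypothetical.

```python
def stall_absorbed_seconds(buffer_bytes, packets_per_second,
                           avg_datagram_bytes, overhead_bytes=0):
    """Estimate how long the buffer can fill before drops start, if nothing is read."""
    # overhead_bytes is a hypothetical per-datagram allowance for kernel bookkeeping.
    bytes_per_second = packets_per_second * (avg_datagram_bytes + overhead_bytes)
    return buffer_bytes / bytes_per_second


# Hypothetical load: 10,000 datagrams per second averaging 1,600 bytes each.
seconds = stall_absorbed_seconds(16_000_000, 10_000, 1_600)
```

If the collector routinely pauses longer than this estimate (for example during a TSDB flush), a bigger buffer only delays the drops; the consumer needs fixing.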

Verify the application sets SO_RCVBUF

Raising rmem_max sets the ceiling, but the application must explicitly request a larger buffer with setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)). A request above rmem_max is not rejected; it is silently capped at rmem_max. The kernel then doubles the value to leave room for bookkeeping overhead, so getsockopt() returns twice the granted size. This doubling is documented in socket(7) and is normal behavior.
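
A minimal Python sketch of the request-then-verify pattern follows. It uses only the standard socket module. The function names are illustrative, and the "capped" heuristic in the comment relies on the Linux doubling behavior described in socket(7), so it does not apply on other platforms.

```python
import socket


def request_rcvbuf(sock, requested_bytes):
    """Ask for a receive buffer size and return what the kernel reports back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def open_udp_listener(port, requested_bytes, host="0.0.0.0"):
    """Create a UDP socket, size its buffer, bind it, and report the effective size."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    effective = request_rcvbuf(sock, requested_bytes)
    sock.bind((host, port))
    # On Linux, effective == 2 * requested_bytes when the request fit under
    # rmem_max; a smaller value suggests the request was capped.
    return sock, effective
```

Logging the effective value at startup gives a record to compare against ss -m output later.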

If the application uses SO_RCVBUFFORCE (requires CAP_NET_ADMIN or root), it can exceed rmem_max. Some hardened or containerized builds disable SO_RCVBUFFORCE, causing the application to fall back to the unprivileged SO_RCVBUF path silently. Check the application documentation for how it configures receive buffers. For rsyslog’s imudp module, the rcvbufSize parameter controls this. If rsyslog drops privileges before opening the socket, the unprivileged SO_RCVBUF call may be capped at rmem_max.

Raise udp_mem under global pressure

If UdpRcvbufErrors persists after raising rmem_max and verifying SO_RCVBUF, the system may be hitting the global UDP memory ceiling. Read /proc/sys/net/ipv4/udp_mem (three values, min pressure max, in pages of typically 4 KB each). If aggregate UDP memory is near the pressure value, raise the max field proportionally. Raising rmem_max alone lets more sockets request large buffers, which increases aggregate kernel memory use. Under udp_mem pressure, the kernel drops packets aggressively even when an individual socket is within its own limit. The fix is to raise rmem_max and the udp_mem max field together.
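
To judge how close the host is to the global limit, compare current usage against the thresholds. The sketch below assumes current UDP memory is the mem field on the UDP: line of /proc/net/sockstat, reported in pages like udp_mem. Function names are illustrative.

```python
def parse_udp_mem(udp_mem_text):
    """Parse 'min pressure max' (pages) from /proc/sys/net/ipv4/udp_mem."""
    low, pressure, high = (int(value) for value in udp_mem_text.split())
    return {"min": low, "pressure": pressure, "max": high}


def parse_udp_pages_in_use(sockstat_text):
    """Return the 'mem' value (pages) from the UDP: line of /proc/net/sockstat."""
    for line in sockstat_text.splitlines():
        if line.startswith("UDP:"):          # does not match "UDPLITE:"
            tokens = line.split()
            return int(tokens[tokens.index("mem") + 1])
    raise ValueError("no UDP: line found")


def udp_mem_headroom(sockstat_text, udp_mem_text):
    """Express current UDP memory as a fraction of the pressure and max thresholds."""
    limits = parse_udp_mem(udp_mem_text)
    used = parse_udp_pages_in_use(sockstat_text)
    return {
        "used_pages": used,
        "pressure_fraction": used / limits["pressure"],
        "max_fraction": used / limits["max"],
    }
```

Sampling this during a traffic spike is more informative than a single idle reading.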

Fix RSS distribution

If one CPU core is at 100% in %soft while others are idle, RSS is funneling all flow traffic to a single core. Verify IRQ distribution with cat /proc/interrupts | grep <iface>. The fix is platform-specific. Some NICs require ethtool -X to set the RSS indirection table. Others need IRQ affinity adjustments via /proc/irq/<n>/smp_affinity. The goal is to distribute receive interrupts across multiple cores so no single core becomes the bottleneck.
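
A quick way to quantify the imbalance is to sum the NIC's interrupt counts per CPU. The Python sketch below assumes the usual /proc/interrupts layout: a header row of CPU names, then rows of "IRQ:" followed by one count per CPU and a description. It matches the interface by substring, so eth0 would also match a name like eth01. The function name is illustrative.

```python
def irq_share_per_cpu(interrupts_text, iface):
    """Return each CPU's share of interrupts for rows mentioning iface."""
    lines = interrupts_text.splitlines()
    cpu_count = len(lines[0].split())            # header: CPU0 CPU1 ...
    totals = [0] * cpu_count
    for line in lines[1:]:
        if iface not in line:
            continue
        fields = line.split()
        counts = fields[1:1 + cpu_count]         # fields[0] is the "NN:" label
        for cpu, value in enumerate(counts):
            totals[cpu] += int(value)
    grand_total = sum(totals)
    return [t / grand_total if grand_total else 0.0 for t in totals]
```

A share close to 1.0 on a single CPU is consistent with the single-core %soft saturation described above.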

Fix the consumer

If collector CPU is system-wide high (not just one core), the bottleneck is the parser or the TSDB write path, not the buffer size. Raising rmem_max buys time by absorbing bursts but does not fix the throughput problem. Identify whether the parser is CPU-bound (heavy regex on every record) or I/O-bound (TSDB write queue blocking the ingestion thread). Common fixes: simplify parsing logic, batch TSDB writes, move log files to a separate volume from the TSDB, or scale the collector horizontally.
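
As an illustration of the "batch TSDB writes" fix, here is a minimal Python sketch of a size-and-age batcher. It is a generic pattern, not the method of any particular collector. The write_batch callable stands in for whatever bulk-write call your TSDB client provides, and the default thresholds are arbitrary.

```python
import time


class BatchWriter:
    """Collect records and hand them to write_batch in groups."""

    def __init__(self, write_batch, max_records=500, max_age_seconds=1.0,
                 clock=time.monotonic):
        self._write_batch = write_batch      # hypothetical callable: list -> None
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer = []
        self._first_added_at = None

    def add(self, record):
        """Queue one record; flush when the batch is full or old enough."""
        if not self._buffer:
            self._first_added_at = self._clock()
        self._buffer.append(record)
        too_many = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._first_added_at >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Write out whatever is buffered, if anything."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._write_batch(batch)
```

The age check only runs when a record is added, so a real collector would also call flush() on a timer and at shutdown.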

Prevention

  • Set rmem_max and rmem_default before deploying collectors. Apply the sysctl configuration as part of host provisioning, not as incident response. 16 MB is a safe baseline; 33 MB for high-volume sFlow.
  • Monitor UdpRcvbufErrors continuously. Any nonzero increment is abnormal in production. Alert on it directly, not on a derived threshold.
  • Verify SO_RCVBUF after every collector restart. Confirm the effective buffer size with ss -lun -m. Configuration changes during upgrades can silently reset buffer settings.
  • In Kubernetes, apply sysctls inside the pod network namespace. Each pod has its own network namespace. Changing rmem_max on the host node does not affect pod containers unless the setting is applied inside the pod (privileged init container or DaemonSet). CNI plugins vary in whether they inherit host sysctls, so verify empirically.
  • On Azure AKS, the default rmem_max is 1,048,576 bytes (1 MB). This is insufficient for any moderately busy collector. Use linuxOSConfig in the Node Pool API to raise netCoreRmemMax and netCoreRmemDefault before deploying UDP-based collectors.
  • Separate the TSDB volume from log storage. Log growth on the same volume as the TSDB has caused collector outages when disk fills.
  • Monitor per-core CPU. RSS misconfiguration is invisible in aggregate CPU utilization. Track per-core %soft to catch single-core saturation before it causes drops.

How Netdata helps

  • Netdata collects UdpRcvbufErrors from /proc/net/snmp natively, with per-second resolution. Alert on any nonzero increment without manual instrumentation.
  • The ipv4 collector exposes the full UDP SNMP table, including UdpInDatagrams, UdpRcvbufErrors, and UdpInErrors. Correlating receive rate against drop rate gives you the loss ratio directly.
  • Per-core CPU metrics are collected by default, making RSS misconfiguration visible as one core at 100% while others are idle.
  • NIC RX and TX drop counters from /proc/net/dev and ethtool -S are collected natively. Correlating NIC drops against socket buffer drops narrows the problem to the correct layer.
  • If Netdata is your syslog or trap receiver, the same UdpRcvbufErrors counter applies. Netdata monitors its own ingestion health.
  • Disk space and I/O metrics on the collector host help detect TSDB write bottlenecks before they back up the UDP receive buffer.