Microbursts: catching sub-second congestion that minute averages hide

Your switches are dropping packets. The utilization charts say everything is fine. Interface counters show moderate load, error rates are clean, and no congestion alert has fired. But applications report retransmissions, latency spikes, and intermittent connectivity. The problem appeared and cleared between polls.

A microburst is a short, intense spike of traffic that fills a switch egress queue faster than the queue can drain. The burst may last 50 milliseconds or less, but during that window the queue overflows and packets are tail-dropped. By the time your SNMP poller arrives 60 or 300 seconds later, the burst is over, the queue has drained, and interface utilization has been averaged down to an unremarkable number.

What microbursts are and why minute averages hide them

Standard network monitoring is architecturally blind to this failure mode. SNMP polling at 30-, 60-, or 300-second intervals measures average throughput over the poll window. A 50 ms burst at line rate occupies 0.05 s of a 60 s window, so it adds less than 0.1 percentage points to the 60-second average utilization. NetFlow and IPFIX aggregate packets into flow records using cache timers that can run to minutes, so the burst shape is blurred into a total byte count. Interface byte counters track what passed through the link, not whether the egress queue was momentarily full.
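The averaging arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name and the 30% baseline load are illustrative assumptions, not measurements from any real interface.

```python
def average_utilization(burst_ms: float, window_s: float,
                        burst_util: float = 1.0,
                        baseline_util: float = 0.0) -> float:
    """Mean utilization (0..1) over a poll window that contains one burst.

    burst_util is the utilization during the burst (1.0 = line rate);
    baseline_util is the utilization for the rest of the window.
    """
    burst_s = burst_ms / 1000.0
    if burst_s > window_s:
        raise ValueError("burst cannot be longer than the poll window")
    busy = burst_util * burst_s
    idle = baseline_util * (window_s - burst_s)
    return (busy + idle) / window_s


# Contribution of a 50 ms line-rate burst on an otherwise idle link.
burst_only = average_utilization(50, 60)

# Same burst on top of an assumed 30% baseline, at three poll intervals.
for window_s in (1, 60, 300):
    avg = average_utilization(50, window_s, baseline_util=0.30)
    print(f"{window_s:>3}s window: {avg:.2%} average utilization")
```

Even a one-second window dilutes a 50 ms burst heavily, which is why shorter polling helps with longer bursts but does not replace queue-level telemetry.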

A microburst is defined not by its total volume but by its arrival rate relative to the egress queue’s drain rate. If traffic arrives at 100 Gbps on an interface that forwards at 25 Gbps, the queue must absorb the difference for the duration of the burst. Switch ASICs have finite buffer memory, often shared across ports. When the buffer fills, additional incoming packets are dropped at the tail of the queue.
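To put numbers on the arrival-versus-drain relationship, the sketch below computes how many bytes a queue would have to absorb and how quickly a buffer fills. The 16 MB buffer size is a hypothetical figure chosen for illustration, not a specification of any particular ASIC, and the model ignores shared-buffer allocation policies.

```python
def excess_bytes(arrival_gbps: float, drain_gbps: float, burst_ms: float) -> float:
    """Bytes the queue must absorb while arrival exceeds drain."""
    excess_bps = max(arrival_gbps - drain_gbps, 0.0) * 1e9
    return excess_bps * (burst_ms / 1000.0) / 8.0


def time_to_fill_ms(buffer_bytes: float, arrival_gbps: float, drain_gbps: float) -> float:
    """Milliseconds until a buffer of the given size overflows."""
    excess_bytes_per_s = (arrival_gbps - drain_gbps) * 1e9 / 8.0
    if excess_bytes_per_s <= 0:
        return float("inf")  # the queue drains at least as fast as it fills
    return buffer_bytes / excess_bytes_per_s * 1000.0


# 100 Gbps arriving at a 25 Gbps egress for 50 ms.
needed = excess_bytes(100, 25, 50)
# Hypothetical 16 MB of buffer available to this queue.
fill_ms = time_to_fill_ms(16e6, 100, 25)

print(f"buffer needed to absorb the burst: {needed / 1e6:.1f} MB")
print(f"time until a 16 MB buffer overflows: {fill_ms:.2f} ms")
```

Under these assumptions the buffer overflows within the first couple of milliseconds, and the rest of the burst is tail-dropped.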

The bottleneck is the queue buffer, not the physical link. An interface can drop packets due to buffer exhaustion without ever reaching 100% link utilization on a per-second basis, let alone on a per-minute average. Microbursts produce discards even when five-minute average utilization is low, and standard SNMP polling cannot distinguish a burst-induced discard from any other source of output drops.

flowchart TD
    A["Sub-second traffic burst"] --> B["Egress queue saturates"]
    B --> C["Tail drops begin (ifOutDiscards)"]
    C --> D["Burst ends, queue drains"]
    D --> E["SNMP poll arrives 60-300s later"]
    E --> F["Utilization averaged across burst and idle period"]
    F --> G["Chart shows moderate load"]
    C --> H["Discards persist in counter"]
    G --> I["Drops at moderate utilization"]
    H --> I
    I --> J["Correlate with sub-second queue telemetry to confirm"]

The ifOutDiscards counter increments during the burst and stays incremented, so the evidence that packet loss occurred is captured in SNMP polling. What is invisible is the cause: the momentary queue saturation that produced the drops. Without sub-second telemetry, the operator sees “interface dropping packets at 30% utilization” and cannot determine whether the cause is a microburst, a physical-layer issue, or a QoS misconfiguration.

Interface utilization metrics (ifHCInOctets and ifHCOutOctets against ifHighSpeed) measure bytes that successfully traversed the interface. They do not count packets dropped at the egress queue because the buffer was full. Dropped packets never become bytes in the utilization calculation, which is why utilization charts cannot detect microbursts.
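A rough sketch of that calculation, and of flagging the "drops at moderate utilization" signature, is shown below. The Sample class, the 50% ceiling, and the counter values are illustrative assumptions. ifHighSpeed is reported in units of 1,000,000 bits per second, and ifOutDiscards is a 32-bit counter, so its delta is taken modulo 2^32 to survive a wrap.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float            # poll timestamp in seconds
    out_octets: int     # ifHCOutOctets (64-bit counter)
    out_discards: int   # ifOutDiscards (32-bit counter)


def utilization(prev: Sample, curr: Sample, if_high_speed: int) -> float:
    """Average egress utilization (0..1) between two polls."""
    dt = curr.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    bits = (curr.out_octets - prev.out_octets) * 8
    return bits / (dt * if_high_speed * 1_000_000)


def discard_delta(prev: Sample, curr: Sample) -> int:
    """Discards between two polls, tolerating one 32-bit counter wrap."""
    return (curr.out_discards - prev.out_discards) % 2**32


def drops_at_moderate_load(prev: Sample, curr: Sample, if_high_speed: int,
                           util_ceiling: float = 0.5) -> bool:
    """True when discards grew although average utilization stayed low."""
    return (discard_delta(prev, curr) > 0
            and utilization(prev, curr, if_high_speed) < util_ceiling)


# Made-up 60-second poll pair on a 10 Gbps interface (ifHighSpeed = 10000).
prev = Sample(t=0.0, out_octets=0, out_discards=4_294_967_000)
curr = Sample(t=60.0, out_octets=22_500_000_000, out_discards=904)

util = utilization(prev, curr, 10_000)
drops = discard_delta(prev, curr)
flagged = drops_at_moderate_load(prev, curr, 10_000)
print(f"utilization={util:.0%} discards={drops} flagged={flagged}")
```

A flag like this only says that the pattern is consistent with a microburst; confirming the cause still requires queue-level telemetry.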

On interfaces with QoS enabled, output discards may be concentrated in specific queue classes rather than distributed evenly. A port-level ifOutDiscards counter shows the aggregate but hides which queue is absorbing the loss. Vendor-specific QoS MIBs (such as CISCO-CLASS-BASED-QOS-MIB) expose per-queue drop counters, essential for identifying whether a specific traffic class is causing the burst or bearing the loss.

A secondary effect: microburst-induced traffic spikes can overwhelm flow telemetry collection. Flow exporters may drop records during microbursts because their export buffers overflow. One-minute polling for critical flow exporters is preferable to five-minute, which may miss burst-related drops entirely. The burst causes data-plane packet loss and can also cause monitoring-plane telemetry loss, creating a gap in the forensic record at the exact moment when visibility matters most.

Where microbursts show up in production

Microbursts are workload-dependent. They cluster around traffic patterns that produce many-to-one fan-in or synchronized transmission across multiple senders.

| Pattern | What happens | Where to look |
| --- | --- | --- |
| TCP incast / fan-in | Many senders transmit simultaneously to one receiver (storage, map-reduce shuffle, scatter-gather queries) | Receiver-facing switch port, egress queue on the receiver’s Top-of-Rack switch |
| AI/ML collective operations | All-reduce or all-to-all gradient synchronization across GPUs produces synchronized bursts across multiple interconnects | GPU cluster fabric, inter-rack links |
| ECMP hash collisions | Multiple elephant flows hash to the same path in an ECMP group, oversubscribing one member link | Member interfaces of ECMP bundles, per-flow distribution analysis |
| Scheduled batch jobs | Cron-triggered log shipping, metrics export, or backup windows produce synchronized traffic across many hosts | Links carrying backup or replication traffic during scheduled windows |
| Storage replication | Snapshot transfers and replication streams saturate links in bursts as data is flushed | Storage network interfaces, inter-switch links on the storage fabric |

The common thread is temporal synchronization. When multiple independent senders transmit at the same instant, their aggregate arrival rate at a shared egress queue exceeds the queue’s drain rate even though each individual flow is modest. Each flow looks small and the link looks underutilized, but the instantaneous aggregate at the queue is the problem.
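The gap between average and instantaneous load can be illustrated with a small fan-in model. This is a worst-case sketch that assumes every sender bursts at exactly the same instant; the sender count, burst rate, and duty cycle are made-up values.

```python
def fan_in_load(senders: int, burst_gbps: float, duty_cycle: float,
                egress_gbps: float) -> tuple[float, float]:
    """Return (average utilization, peak arrival/drain ratio) for an egress port.

    duty_cycle is the fraction of time each sender spends bursting.
    The peak ratio assumes all senders burst simultaneously.
    """
    peak_ratio = senders * burst_gbps / egress_gbps
    average_util = peak_ratio * duty_cycle
    return average_util, peak_ratio


# 40 senders, each bursting at 1 Gbps for 5% of the time, into a 25 Gbps port.
avg_util, peak = fan_in_load(senders=40, burst_gbps=1.0,
                             duty_cycle=0.05, egress_gbps=25.0)
print(f"average utilization: {avg_util:.0%}")
print(f"synchronized peak:   {peak:.0%} of drain rate")
```

In this example the port averages under a tenth of its capacity, yet a synchronized burst arrives at well over the drain rate and the excess has to sit in the buffer or be dropped.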

Detection methods and their tradeoffs

Detecting microbursts requires moving away from poll-based average measurement toward either direct queue monitoring or high-frequency sampling. Each approach has a distinct cost profile.

Per-second SNMP polling on critical interfaces. One-second polling narrows the averaging window enough to catch bursts lasting several hundred milliseconds. The tradeoff is collector and device load: per-second polling of ifHCInOctets, ifHCOutOctets, ifOutDiscards, and ifInDiscards across many interfaces increases SNMP agent CPU and risks scheduler fall-behind. Apply selectively to high-value or known-burst-prone interfaces, not fleet-wide.

Hardware queue and buffer telemetry. Modern switch ASICs expose queue occupancy and buffer utilization through vendor-specific mechanisms rather than standard MIBs:

  • Cisco Nexus 9000 microburst monitoring captures burst events using queuing policy-maps with rise and fall thresholds configured in bytes. The show queuing burst-detect command displays recorded bursts. The clear queuing burst-detect command resets the record buffer and destroys forensic evidence, so use it with caution. The feature is not available on all platforms within the Nexus 9000 family, and it stores a limited number of burst records before the oldest entries are overwritten.
  • Arista LANZ (Latency Analyzer) generates event-driven exports when queue thresholds are exceeded, providing timestamped records of each burst event without continuous polling.
  • sFlow Broadcom BST extension (enterprise 4413, formats 1 and 2) exports peak buffer utilization statistics at the device and port level. Values represent peak buffer occupancy since the last export, designed for microburst trend analysis.

Hardware telemetry is the most direct detection method because it measures the actual resource that overflows. The tradeoff is vendor lock-in: each mechanism is ASIC and platform specific, and coverage across a multi-vendor estate requires normalization.

Streaming telemetry (gNMI). On platforms that support it, gNMI subscription to queue depth or buffer utilization counters pushes metrics at sub-second intervals without poll requests. This eliminates the SNMP averaging problem for the counters it covers.

Packet capture with burst detection. The ntop n2disk tool provides dedicated microburst detection:

# Detect microbursts on a 100 Mbit/s link at 90% threshold
# over a 10ms (10000 usec) window using PF_RING ZC capture
n2disk -i zc:eth1 -o /storage/ \
  --uburst-detection \
  --uburst-link-speed 100 \
  --uburst-threshold 90 \
  --uburst-win-size 10000 \
  --uburst-log /var/tmp/n2disk/uburst.log

This requires hardware-timestamped capture (PF_RING ZC or equivalent). Port mirroring via SPAN introduces buffering delays that alter burst timing; network TAPs are preferred for accurate sub-millisecond analysis.

Threshold configuration. When configuring rise and fall thresholds for hardware burst detection, set the fall-threshold to approximately 20% of the rise-threshold value. Setting the fall-threshold equal to or above the rise-threshold produces jitter and back-to-back burst records during normal queue drain behavior.
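The effect of the fall-threshold can be seen in a simplified hysteresis model. This sketch is not the switch's actual implementation: it assumes a burst record opens when queue depth reaches the rise-threshold and closes when depth falls below the fall-threshold, and the depth samples are invented for illustration.

```python
def detect_bursts(samples, rise, fall):
    """Return (start, end) burst records from (time, queue_depth) samples.

    A record opens when depth >= rise and closes when depth < fall.
    An open record at the end of the data gets end=None.
    """
    records = []
    start = None
    for t, depth in samples:
        if start is None and depth >= rise:
            start = t
        elif start is not None and depth < fall:
            records.append((start, t))
            start = None
    if start is not None:
        records.append((start, None))
    return records


# Invented queue depths (arbitrary units) that oscillate while draining.
depths = [0, 120, 95, 110, 90, 105, 60, 30, 10, 0]
samples = list(enumerate(depths))

wide = detect_bursts(samples, rise=100, fall=20)     # fall at 20% of rise
narrow = detect_bursts(samples, rise=100, fall=100)  # fall equal to rise

print("fall=20: ", wide)
print("fall=100:", narrow)
```

With the wide gap the whole episode is recorded once; with the fall-threshold equal to the rise-threshold the same episode is split into several back-to-back records, which also consumes the limited record buffer faster.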

Storage cost. High-frequency telemetry increases data volume significantly compared to minute-level polling. One approach documented in research and education networks is adaptive monitoring: saving only pattern-change summaries rather than continuous high-frequency counters, reducing storage requirements while preserving burst detection capability.
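One way to picture the idea of storing pattern changes instead of every sample is the sketch below. It is an illustration only, not the method used in the research and education network work: it starts a new segment whenever a value moves more than a tolerance away from the first value of the current segment, and the tolerance and sample series are assumptions.

```python
def summarize_changes(values, tolerance):
    """Collapse a series into segments that stay within tolerance of their first value.

    Each segment keeps its index range plus min and max, so a short spike
    survives as its own segment instead of being averaged away.
    """
    segments = []
    start = 0
    for i in range(1, len(values) + 1):
        at_end = i == len(values)
        if at_end or abs(values[i] - values[start]) > tolerance:
            chunk = values[start:i]
            segments.append({
                "start": start,
                "end": i - 1,
                "min": min(chunk),
                "max": max(chunk),
            })
            start = i
    return segments


# 42 made-up per-interval utilization samples (%) with a two-sample spike.
series = [30] * 20 + [98, 97] + [31] * 20
segments = summarize_changes(series, tolerance=5)

for seg in segments:
    print(seg)
print(f"{len(series)} samples stored as {len(segments)} segments")
```

The steady periods collapse to one record each while the spike is preserved with its peak value, which is the property that matters for burst detection.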

Signals to watch

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| ifOutDiscards on critical interfaces | Primary SNMP-visible evidence that queue overflow occurred. Discards persist in the counter after the burst. | Discards incrementing on an interface whose utilization chart shows moderate load |
| Per-queue drop counters (vendor QoS MIBs) | Port-level discards hide which queue absorbed the loss. Per-queue counters identify the traffic class causing or suffering the burst. | One queue class accumulating all drops while others remain clean |
| ifHCOutOctets at per-second granularity | Narrows the averaging window enough to catch bursts lasting several hundred milliseconds. | Brief utilization spikes near line rate between periods of moderate load |
| Device-side flow exporter drops | Microbursts can cause flow export buffer overflow, creating telemetry gaps at the moment of the event. | Exporter drop rate spiking during high-traffic windows |
| Udp_RcvbufErrors on flow collectors | A burst of flow records exported during or after a microburst can overwhelm the collector’s UDP receive buffer, causing silent telemetry loss. | Buffer error counter incrementing in correlation with traffic spikes |
| Hardware buffer utilization (LANZ, BST, Nexus burst-detect) | Direct measurement of queue occupancy rather than inference from byte counters. The only signal that proves the queue was the bottleneck. | Buffer utilization peaks correlating with discard events |
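For the collector-side signal, the sketch below reads the UDP receive-buffer error counter, assuming the flow collector runs on Linux, where it is exposed as the RcvbufErrors field on the Udp line of /proc/net/snmp. The function name is hypothetical and the sample text uses made-up values.

```python
def udp_rcvbuf_errors(proc_net_snmp_text: str) -> int:
    """Extract Udp RcvbufErrors from the contents of /proc/net/snmp.

    The file holds pairs of lines per protocol: a header line with field
    names followed by a line with the matching values.
    """
    header = None
    for line in proc_net_snmp_text.splitlines():
        if not line.startswith("Udp:"):   # also skips the "UdpLite:" lines
            continue
        fields = line.split()[1:]
        if header is None:
            header = fields
        else:
            return int(dict(zip(header, fields))["RcvbufErrors"])
    raise ValueError("no Udp header/value pair found")


# Illustrative excerpt with invented counter values.
SAMPLE = """\
Ip: Forwarding DefaultTTL
Ip: 1 64
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
Udp: 918273 12 3456 800000 3456 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
UdpLite: 0 0 0 0 0 0 0 0 0
"""

errors = udp_rcvbuf_errors(SAMPLE)
print(f"Udp RcvbufErrors: {errors}")
```

On a live collector the input would come from reading /proc/net/snmp directly. The counter is cumulative since boot, so what matters is the difference between two readings taken around a traffic spike.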

How Netdata helps

Netdata collects interface metrics at per-second granularity by default. ifOutDiscards increments and per-second utilization appear on the same timeline, making the microburst signature visible: drops at moderate average utilization.

Per-core CPU and softirq metrics help identify whether a flow collector is dropping packets during burst windows. Rising Udp_RcvbufErrors on a Netdata-monitored collector during a traffic spike is a direct indicator of burst-induced telemetry loss.

Custom metrics from vendor APIs or CLI scraping (streaming telemetry, LANZ events, Nexus burst-detect records) can be integrated alongside SNMP metrics into a unified per-second timeline, bridging standard polling and hardware-specific burst detection.