Interface discards with low utilization: diagnosing ifInDiscards/ifOutDiscards
ifOutDiscards is climbing on a critical uplink. Utilization sits at 35%. No CRC errors, no input errors, no physical-layer alarms. The link is up and passing traffic, but something is silently dropping packets, and your averaged utilization metrics are not telling you why.
The gap between what the counters show and what the silicon is doing comes down to two things: the averaging window on utilization, and the fact that discards are decided per buffer queue, not per link. A single queue can overflow while the link as a whole still has capacity to spare.
The IF-MIB gives you the signals you need. The challenge is knowing what those counters actually measure, where their blind spots are, and which vendor-specific counters fill the gaps.
What this means
ifInDiscards (OID .1.3.6.1.2.1.2.2.1.13) and ifOutDiscards (.1.3.6.1.2.1.2.2.1.19) are IF-MIB counters defined in RFC 2863. They count packets that the interface chose to discard even though no errors were detected; RFC 2863 cites freeing buffer space as one possible reason. They are distinct from ifInErrors (.1.3.6.1.2.1.2.2.1.14) and ifOutErrors (.1.3.6.1.2.1.2.2.1.20), which count packets that could not be delivered or transmitted because of errors, such as CRC failures, runts, or alignment errors on the receive side.
Both discard counters are Counter32, so they wrap back to zero after reaching 2^32 - 1. Before trusting any rate calculation, check ifCounterDiscontinuityTime (.1.3.6.1.2.1.31.1.1.1.19) to rule out counter resets from device reboots, interface flaps, or manual counter clears. A naive differencing algorithm that does not handle wrap or discontinuity will produce phantom discard spikes that look exactly like real ones.
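A minimal Python sketch of wrap- and reset-aware differencing, assuming you already have the raw counter value and the ifCounterDiscontinuityTime value from two consecutive polls. The function names are illustrative, not part of any library. The discontinuity check matters because modulo arithmetic alone would turn a reset (say, 1000 falling to 5) into a phantom spike of nearly 2^32 discards.

```python
from typing import Optional

COUNTER32_MODULUS = 2 ** 32


def counter_delta(prev: int, curr: int, prev_disc: int, curr_disc: int,
                  modulus: int = COUNTER32_MODULUS) -> Optional[int]:
    """Increase of an SNMP counter between two polls, or None if untrustworthy.

    prev/curr are raw counter values; prev_disc/curr_disc are the
    ifCounterDiscontinuityTime values read in the same two polls.
    """
    if curr_disc != prev_disc:
        # The counter was re-initialized between polls (reboot, flap, clear),
        # so the difference mixes two unrelated counter lifetimes.
        return None
    # Modulo arithmetic absorbs a single wrap: if curr < prev, the counter
    # passed 2**32 - 1 and restarted at 0. Two wraps in one interval are
    # undetectable, so the poll interval must be shorter than the wrap time.
    return (curr - prev) % modulus


def discard_rate(prev: int, curr: int, prev_disc: int, curr_disc: int,
                 interval_s: float) -> Optional[float]:
    """Discards per second over one polling interval, or None if invalid."""
    delta = counter_delta(prev, curr, prev_disc, curr_disc)
    if delta is None or interval_s <= 0:
        return None
    return delta / interval_s


# A Counter32 that moved from 4294967290 to 10 between polls wrapped once:
# counter_delta(4294967290, 10, 0, 0) -> 16 real discards.
```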
The core diagnostic problem is temporal resolution. Utilization is computed as 8 * (delta octets) / (delta time * ifHighSpeed * 1,000,000) using the 64-bit HC counters (ifHCInOctets at .1.3.6.1.2.1.31.1.1.1.6, ifHCOutOctets at .1.3.6.1.2.1.31.1.1.1.10); the 1,000,000 factor is needed because ifHighSpeed reports speed in units of 1,000,000 bits per second. But this is an average over the polling interval or the device load interval. On Cisco IOS/IOS-XE, the default load interval is 300 seconds and is adjustable from 30 to 600 seconds in 30-second increments. A microburst that fills the egress buffer for 50 milliseconds is invisible in a 30-second average, let alone a 5-minute one. A 10G egress interface fed by four independent 10G sources can exhaust its queue in milliseconds if those flows arrive simultaneously, even though the 30-second average never exceeds 40% utilization.
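The utilization formula and the dilution effect can be sketched in Python. This is illustrative: averaged_view_of_burst is a hypothetical helper that models the plain mean an SNMP octet delta produces over a polling window, assuming the link runs at exactly 100% during the burst and at a constant background level otherwise. Device load-interval statistics may use a different averaging method.

```python
def utilization_pct(delta_octets: int, interval_s: float,
                    if_high_speed_mbps: int) -> float:
    """Average utilization over one interval, in percent.

    ifHighSpeed is in units of 1,000,000 bit/s, hence the conversion.
    """
    link_bps = if_high_speed_mbps * 1_000_000
    return 100.0 * 8 * delta_octets / (interval_s * link_bps)


def averaged_view_of_burst(burst_ms: float, window_s: float,
                           background_pct: float) -> float:
    """Utilization a window-averaged counter reports when the link runs at
    100% for burst_ms and at background_pct for the rest of the window."""
    burst_s = burst_ms / 1000.0
    busy_pct_seconds = burst_s * 100.0 + (window_s - burst_s) * background_pct
    return busy_pct_seconds / window_s
```

With a 35% background, a 50 ms line-rate burst lifts the 30-second average to about 35.11% and the 300-second average to about 35.01%, well inside normal noise.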
That is the signature pattern: once counter artifacts, physical-layer errors, and single-queue QoS drops are ruled out, discards at low sustained utilization almost always mean microbursts.
```mermaid
flowchart TD
A["Discards incrementing on interface"] --> B{"ifCounterDiscontinuityTime changed?"}
B -- "Yes" --> C["Counter reset: spike may be calculation artifact"]
B -- "No" --> D{"Utilization sustained above 80%?"}
D -- "Yes" --> E["Capacity exhaustion: link is saturated"]
D -- "No" --> F{"ifInErrors or ifOutErrors also rising?"}
F -- "Yes" --> G["Physical-layer fault: cable, SFP, optics"]
F -- "No" --> H{"Single queue class affected?"}
H -- "Yes" --> I["QoS buffer threshold or policer drop"]
H -- "No" --> J["Microburst: sub-second spike invisible to averaged counters"]
```
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Microburst congestion | ifOutDiscards rising, utilization under 70%, no errors | Per-queue drop counters via vendor QoS MIBs or CLI |
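| Speed or duplex mismatch | Duplex: CRC errors or late collisions on one end. Speed step-down: egress discards on the slower port | Speed and duplex negotiation on both ends; where the path steps from a faster to a slower interface |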
| QoS buffer threshold | Discards concentrated in one queue class | Per-queue stats via vendor-specific commands |
| Input policer or ACL | ifInDiscards rising, no corresponding output drops | Applied policies and ACL hit counters |
| Counter wrap or discontinuity | Sudden massive spike with no traffic correlation | ifCounterDiscontinuityTime and sysUpTime |
| Undersized device buffer | Discards correlate with aggregate traffic, not one flow | Platform buffer allocation settings |
Quick checks
```shell
# Poll discard counters via SNMP
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.13      # ifInDiscards
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.19      # ifOutDiscards
# Check for counter discontinuity (should be 0 or unchanged between polls)
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.19   # ifCounterDiscontinuityTime
# Check for an agent restart (sysUpTime should only increase between polls)
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.1.3.0          # sysUpTime
# Poll error counters to distinguish drops from physical faults
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.14      # ifInErrors
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.20      # ifOutErrors
# Check utilization using 64-bit HC counters
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.6    # ifHCInOctets
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.10   # ifHCOutOctets
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.15   # ifHighSpeed (Mbps)
# Cisco IOS/IOS-XE: drop and queue lines from the interface summary
ssh <device> 'show interface <iface> | include drop|queue|buffer'
# Cisco Catalyst 9000: per-queue enqueue and drop counters
ssh <device> 'show platform hardware fed active qos queue stats interface <iface>'
```
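If you feed these walks into a script, a small parser keeps the per-interface values keyed by ifIndex. This sketch assumes net-snmp's default output line format (for example IF-MIB::ifOutDiscards.3 = Counter32: 1742, or the numeric-OID form printed with -On) and a column from a single-index table such as ifTable or ifXTable; other tools or output options will need a different pattern. parse_snmpwalk is a hypothetical helper name.

```python
import re
from typing import Dict

# "<oid>.<ifIndex> = <Type>: <value>", as printed by net-snmp's snmpwalk.
_LINE = re.compile(
    r"^(?P<oid>\S+)\.(?P<index>\d+) = (?P<type>[A-Za-z0-9]+): (?P<value>.*)$")
# Timeticks values look like "(12345) 0:02:03.45"; keep the raw tick count.
_TICKS = re.compile(r"^\((\d+)\)")


def parse_snmpwalk(text: str) -> Dict[int, int]:
    """Map ifIndex -> integer value for one walked column."""
    values: Dict[int, int] = {}
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if not m:
            continue  # blank lines, wrapped strings, error messages
        raw = m.group("value").strip()
        index = int(m.group("index"))
        if m.group("type").lower() == "timeticks":
            ticks = _TICKS.match(raw)
            if ticks:
                values[index] = int(ticks.group(1))
            continue
        try:
            values[index] = int(raw)  # Counter32, Counter64, Gauge32, INTEGER
        except ValueError:
            continue  # non-numeric types such as STRING are ignored
    return values
```

In practice you would pass it the stdout of subprocess.run(["snmpwalk", ...], capture_output=True, text=True) for each column, then join the resulting dictionaries on ifIndex.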
How to diagnose it
Rule out counter artifacts. Poll ifCounterDiscontinuityTime. If it changed since the last poll, the discard spike may be a calculation artifact from a counter reset, not real drops. Also check sysUpTime (.1.3.6.1.2.1.1.3.0) to confirm the device did not reboot.
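One way to apply the sysUpTime check, sketched in Python under the assumption that you store sysUpTime.0 with every poll: treat a decrease as an agent restart and skip that interval; otherwise use the sysUpTime difference as the rate denominator so collector scheduling jitter does not distort the rate. That second point is a design choice for this sketch, not something the MIB requires, and the function names are hypothetical.

```python
from typing import Optional

TICKS_PER_SECOND = 100  # sysUpTime is TimeTicks: hundredths of a second


def rebooted_between(prev_uptime: int, curr_uptime: int) -> bool:
    """True if sysUpTime went backwards between two polls.

    A decrease means the SNMP agent re-initialized (typically a reboot).
    sysUpTime also wraps after about 497 days (2**32 ticks); this sketch
    treats that rare case as a restart too, which only costs one interval.
    """
    return curr_uptime < prev_uptime


def agent_interval_seconds(prev_uptime: int, curr_uptime: int) -> Optional[float]:
    """Elapsed time between polls as measured by the agent, or None after a restart."""
    if rebooted_between(prev_uptime, curr_uptime):
        return None
    return (curr_uptime - prev_uptime) / TICKS_PER_SECOND
```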
Distinguish discards from errors. Poll ifInErrors and ifOutErrors alongside the discard counters. If errors are also rising, the problem is physical-layer: cable, SFP, dirty fiber, or duplex mismatch. Focus on the physical path, not the buffer queue.
Confirm the utilization gap. Compute utilization from ifHCInOctets, ifHCOutOctets, and ifHighSpeed over a short polling interval. If utilization is genuinely low (under 70%) and discards are rising, you are looking at microbursts, QoS policy drops, or an input policer. If utilization is actually high (sustained above 80%), the link is saturated, and any low reading you saw earlier was an artifact of a long load interval.
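These checks, together with the queue question from the flowchart, can be condensed into a triage helper. This is a hypothetical function that mirrors the flowchart's order and its 80% threshold; deciding what counts as "sustained" or "rising" is left to the caller.

```python
def triage(discontinuity_changed: bool, sustained_utilization_pct: float,
           errors_rising: bool, single_queue_dominant: bool) -> str:
    """Walk the decision flowchart in order and return the likely cause."""
    if discontinuity_changed:
        return "counter reset: spike may be a calculation artifact"
    if sustained_utilization_pct > 80.0:
        return "capacity exhaustion: link is saturated"
    if errors_rising:
        return "physical-layer fault: cable, SFP, optics"
    if single_queue_dominant:
        return "QoS buffer threshold or policer drop"
    return "microburst: sub-second spike invisible to averaged counters"
```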
Shorten the load interval. On Cisco, set load-interval 30 on the affected interface in interface configuration mode. This is non-disruptive (it affects statistics only, not forwarding) and tightens the averaging window from the 300-second default to 30 seconds. It will not reveal millisecond-scale bursts, but it catches multi-second spikes that a 5-minute interval hides.
Inspect per-queue drops. The port-level ifOutDiscards counter aggregates all queue classes, but on modern ASICs buffers are allocated per queue, not per port. Use vendor-specific commands to see which queue is dropping:
- Cisco Catalyst 9000: show platform hardware fed active qos queue stats interface <iface> reports per-queue enqueue and drop counters for each threshold (TH0, TH1, TH2).
- Arista EOS: show queue-monitor length (LANZ) provides real-time egress queue depth and is enabled by default on supported platforms.
On Cisco IOS/IOS-XE, ifInDiscards counts only “No Buffer Drops,” while the legacy counter locIfInputQueueDrops equals “Queue Limit Drops + No Buffer Drops.” So ifInDiscards covers only part of the input queue pressure the device actually experienced, and your SNMP ifInDiscards value may undercount the real input queue problem. For output, locIfOutputQueueDrops equals ifOutDiscards.
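Per-queue counters are easier to act on as shares of the port total. The sketch below assumes you have already extracted per-queue drop deltas for one interval (for example from the Catalyst 9000 command above; the parsing is not shown, and the queue names are hypothetical). The 80% share threshold is an assumption standing in for "most port-level drops."

```python
from typing import Dict, Optional, Tuple


def dominant_queue(drops_by_queue: Dict[str, int],
                   share_threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Return (queue, share) if one queue accounts for at least
    share_threshold of all drops in the interval, else None."""
    total = sum(drops_by_queue.values())
    if total == 0:
        return None  # nothing dropped, nothing to attribute
    queue, drops = max(drops_by_queue.items(), key=lambda item: item[1])
    share = drops / total
    return (queue, share) if share >= share_threshold else None
```

A dominant queue points toward the QoS branch of the flowchart; drops spread evenly across queues point back toward aggregate buffer exhaustion or microbursts.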
Capture the burst. Embedded Packet Capture (EPC) on Cisco is unsuitable for microburst analysis because its capture throughput is capped well below line rate. Use a TX-only SPAN of the affected interface instead, collected while drops are actively incrementing. The SPAN destination port must run at least as fast as the source port; otherwise the SPAN session introduces its own drops and the capture misses part of the burst.
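Once a SPAN capture exists, bursts show up when you bin bytes into short intervals instead of averaging them. A minimal sketch, assuming you have exported (timestamp in seconds, frame length in bytes) pairs from the capture with whatever tool you use; bursty_bins is a hypothetical name. Frame lengths exclude preamble and inter-frame gap, so the percentages slightly understate true wire occupancy, and the capture host's timestamp precision limits how short a bin is meaningful.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bursty_bins(packets: Iterable[Tuple[float, int]], line_rate_bps: float,
                bin_ms: float = 1.0,
                threshold: float = 0.9) -> List[Tuple[float, float]]:
    """Return (bin_start_s, pct_of_line_rate) for every bin whose throughput
    reaches threshold * line rate."""
    bin_s = bin_ms / 1000.0
    bytes_per_bin: Dict[int, int] = defaultdict(int)
    for timestamp, length in packets:
        bytes_per_bin[int(timestamp // bin_s)] += length
    hot = []
    for idx in sorted(bytes_per_bin):
        pct = 100.0 * bytes_per_bin[idx] * 8 / (bin_s * line_rate_bps)
        if pct >= threshold * 100.0:
            hot.append((idx * bin_s, pct))
    return hot
```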
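Check for speed mismatch. When traffic arrives on a 10G interface and leaves on a 1G interface (or at any faster-to-slower step-down in the path), the slower egress port drops whenever a burst arrives faster than it can drain. This is independent of overall average load. Verify interface speed and duplex on both ends of each link, and identify where the path steps down in speed.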
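A back-of-the-envelope calculation shows why a step-down drops traffic so quickly: the queue grows at the difference between the arrival and drain rates. The 1 MB buffer used below is a hypothetical figure; real per-port or per-queue allocations vary by platform and QoS configuration.

```python
def time_to_fill_buffer_ms(buffer_bytes: int, ingress_bps: float,
                           egress_bps: float) -> float:
    """Milliseconds until a queue of buffer_bytes overflows when traffic
    arrives at ingress_bps and drains at egress_bps. Returns infinity
    when the queue drains at least as fast as it fills."""
    excess_bps = ingress_bps - egress_bps
    if excess_bps <= 0:
        return float("inf")
    return buffer_bytes * 8 / excess_bps * 1000.0
```

With 1 MB, a 10G-to-1G step-down overflows in about 0.89 ms, and four 10G senders converging on one 10G port (30 Gbps of excess) overflow it in about 0.27 ms.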
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ifOutDiscards rate | Leading indicator of egress buffer exhaustion | Any nonzero sustained rate on a critical interface |
| ifInDiscards rate | Input queue overflow or policer and ACL drops | Sustained nonzero rate |
| ifHCInOctets / ifHCOutOctets | Utilization computation; must use 64-bit for links at or above 100 Mbps | Sustained above 80% indicates genuine capacity issue |
| ifInErrors / ifOutErrors | Distinguishes physical-layer faults from buffer drops | Any nonzero rate changes the diagnosis from buffer tuning to hardware investigation |
| ifCounterDiscontinuityTime | Validates counter continuity before rate calculation | Any change between polls invalidates the delta |
| Per-queue drop counters | Reveals which QoS class is actually dropping | Single queue accounting for most port-level drops |
| sysUpTime | Correlates counter resets with device reboots | Decrease between polls indicates reboot event |
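The warning-sign column can be turned into simple checks over a short history of per-interval values. In this sketch "sustained" is assumed to mean the last three intervals, the inputs are assumed to come from valid (non-discontinuous) deltas, and table_warnings is a hypothetical name.

```python
from typing import List, Sequence


def _last(values: Sequence[float], n: int) -> List[float]:
    """The most recent n values (fewer if the history is shorter)."""
    return list(values)[-n:]


def table_warnings(out_discard_rates: Sequence[float],
                   utilization_pcts: Sequence[float],
                   error_rates: Sequence[float],
                   window: int = 3) -> List[str]:
    """Apply the table's warning signs to per-interval values, oldest first."""
    warnings = []
    recent_discards = _last(out_discard_rates, window)
    if len(recent_discards) == window and all(r > 0 for r in recent_discards):
        warnings.append("sustained ifOutDiscards")
    recent_util = _last(utilization_pcts, window)
    if len(recent_util) == window and all(u > 80.0 for u in recent_util):
        warnings.append("sustained utilization above 80%")
    if any(e > 0 for e in _last(error_rates, window)):
        warnings.append("nonzero error rate: investigate hardware")
    return warnings
```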
Fixes
Microburst congestion
The fundamental fix is to increase available buffer or reduce burstiness. On platforms that support it, increase the buffer allocated to the affected queue. On Cisco Catalyst 9000, adjust the buffer ratio per class using queue-buffers ratio <0-100> inside the policy-map class configuration. On platforms with intra-ASIC buffer sharing (Cisco UADP 3.0-based Catalyst 9500 HP and 9600, from IOS XE 17.2.1), enable qos share-buffer in global configuration to allow AQM buffers to be shared between ASIC cores, which reduces microburst-induced discards on multi-core designs.
If buffer tuning is insufficient, the traffic pattern itself may need shaping. Apply shaping upstream of the congested port, or change scheduling, so bursts are smoothed before they reach the congested egress interface.
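Shaping is commonly modeled as a token bucket: traffic may leave in a burst of up to bucket_bytes, after which it is paced at the configured rate. The sketch below computes departure times for a packet sequence under that model. It is a generic illustration, not any vendor's shaper implementation, and it assumes no packet is larger than the bucket.

```python
from typing import List, Sequence, Tuple


def shape(arrivals: Sequence[Tuple[float, int]], rate_bps: float,
          bucket_bytes: int) -> List[float]:
    """Departure times for (arrival_s, size_bytes) packets, in arrival
    order, through a token-bucket shaper."""
    bytes_per_s = rate_bps / 8.0
    tokens = float(bucket_bytes)  # bucket starts full
    last = 0.0                    # time at which `tokens` was last updated
    departures = []
    for arrival, size in arrivals:
        # FIFO: a packet cannot leave before the one ahead of it.
        t = max(arrival, last)
        # Refill for the elapsed time, never beyond the bucket size.
        tokens = min(float(bucket_bytes), tokens + (t - last) * bytes_per_s)
        if tokens < size:
            # Wait until enough tokens accumulate for this packet.
            t += (size - tokens) / bytes_per_s
            tokens = float(size)
        tokens -= size
        last = t
        departures.append(t)
    return departures
```

Three back-to-back 1000-byte packets at 8 kbit/s with a 1000-byte bucket leave at 0, 1, and 2 seconds instead of all at once: the burst is spread out rather than dumped onto the congested port.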
Speed or duplex mismatch
Verify autonegotiation results on both ends of each link. A speed step-down in the path (traffic arriving on 10G and leaving on 1G) creates inherent egress drops on the slower port whenever bursts exceed what it can drain. Fix by matching interface speeds, shaping upstream traffic to the slower rate, or upgrading the slower link to match.
QoS policy drops
If discards are concentrated in a specific queue class, the QoS policy may be doing exactly what it was configured to do. Evaluate whether the drop rate is expected for that traffic class. If not, adjust the queue buffer ratio or the policer rate for that class.
Counter wrap
Use 64-bit HC octet counters (ifHCInOctets, ifHCOutOctets) for utilization on links at or above 100 Mbps. The 32-bit ifInOctets wraps in approximately 3.4 seconds at 10G line rate, 34 seconds at 1G, and 5.7 minutes at 100M. No 64-bit discard counters exist in IF-MIB, so discard rates must come from Counter32 values. Ensure your polling system handles Counter32 wrap with modulo-2^32 arithmetic and uses ifCounterDiscontinuityTime to detect resets, which a wrap-aware delta alone cannot distinguish from a wrap.
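The wrap times quoted above follow directly from the counter width and the line rate; the poll interval must stay below the wrap time, or a double wrap goes unnoticed. A quick check in Python (wrap_time_seconds is a hypothetical helper):

```python
def wrap_time_seconds(counter_bits: int, line_rate_bps: float) -> float:
    """Seconds for an octet counter to wrap at sustained line rate."""
    max_octets = 2 ** counter_bits
    return max_octets * 8 / line_rate_bps


# 32-bit at 10G: about 3.4 s; at 1G: about 34 s; at 100M: about 5.7 min.
# 64-bit at 100G: about 1.5e9 s, roughly 47 years.
```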
Prevention
- Poll at the shortest interval your collector can sustain. Five-minute polling misses microbursts entirely. One-minute polling catches multi-second spikes. Sub-second bursts remain invisible to SNMP polling regardless of interval.
- Use 64-bit HC counters for utilization. Never use 32-bit ifInOctets or ifOutOctets for links at or above 100 Mbps.
- Set load intervals to 30 seconds on critical interfaces to tighten the averaging window and reduce the gap between what the chart shows and what the buffer experienced.
- Monitor per-queue drop counters, not just port-level aggregates. Port-level ifOutDiscards hides which traffic class is affected.
- Track ifCounterDiscontinuityTime alongside discard counters. A counter reset without a corresponding sysUpTime reset indicates a counter-source bug or interface flap.
- Baseline discard behavior during normal operation. On Catalyst 9000, output drops may be reported in bytes by default, not packets, so keep the units consistent: (total output drops) / (total output bytes transmitted) × 100 when drops are in bytes, or divide by total output packets when drops are in packets. A value below 0.01% over a multi-week counter lifetime is typically transient microburst noise rather than a sustained problem.
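The baseline ratio as a small Python helper. The 0.01% cut-off is the rule of thumb from the list above, both arguments must use the same unit, and the function names are hypothetical.

```python
def drop_ratio_pct(total_drops: int, total_sent: int) -> float:
    """Drops as a percentage of traffic sent; both arguments in the same
    unit (bytes with bytes, packets with packets)."""
    if total_sent == 0:
        return 0.0
    return total_drops / total_sent * 100.0


def likely_transient(ratio_pct: float) -> bool:
    """Below 0.01% over a multi-week counter lifetime is usually noise."""
    return ratio_pct < 0.01
```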
How Netdata helps
Netdata collects interface-level SNMP counters including ifInDiscards, ifOutDiscards, ifInErrors, ifOutErrors, and 64-bit HC octets counters. The value for this scenario is correlation:
- Discard rate against utilization. Overlay ifOutDiscards delta against ifHCOutOctets-derived utilization on the same chart. Discards climbing while utilization stays low is the microburst signature.
- Discards against errors. Correlate ifInDiscards with ifInErrors on the same interface. Errors rising alongside discards shifts the investigation from buffer tuning to physical-layer inspection.
- Counter discontinuity detection. Netdata tracks counter continuity across polling intervals, filtering phantom spikes from wraps or resets.
- Per-interface alerting with context. Configure discard-rate alerts per interface criticality. A rising discard rate with positive second derivative (accelerating drops) signals congestion cascade risk before impact manifests.
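Independent of Netdata's own alert syntax, the accelerating-drops condition can be expressed as positive first and second differences over the most recent discard rates. A minimal sketch; the three-point window is an assumption, the function name is hypothetical, and noisy rates may need smoothing before this test is useful.

```python
from typing import Sequence


def accelerating(rates: Sequence[float], points: int = 3) -> bool:
    """True if the last `points` discard rates are rising and the rise
    itself is growing (positive first and second differences)."""
    if len(rates) < points:
        return False
    recent = list(rates)[-points:]
    first = [b - a for a, b in zip(recent, recent[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return all(d > 0 for d in first) and all(s > 0 for s in second)
```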