Interface input/output errors: finding the bad link with ifInErrors/ifOutErrors

A switch port reports 47,000 input errors and climbing. The interface is up and traffic is flowing. Your monitoring fired an alert because ifInErrors crossed its threshold. Now you need to determine whether this is a dirty fiber, a dying SFP, a duplex mismatch, buffer exhaustion, or a counter artifact from an interface flap.

ifInErrors (.1.3.6.1.2.1.2.2.1.14) and ifOutErrors (.1.3.6.1.2.1.2.2.1.20) are aggregate counters. They tell you something is wrong, but not what. Each one rolls up multiple error types: CRC, alignment, runts, giants, overruns, frame errors, and on some platforms, input drops. The aggregate counts errored frames, not error conditions: a frame that arrives with both a CRC error and a runt condition increments ifInErrors by exactly 1, not 2. As a result, you cannot reconcile the sub-counters against the aggregate by simple addition.

What this means

ifInErrors counts inbound packets that contained errors preventing delivery to a higher-layer protocol. ifOutErrors counts outbound packets that could not be transmitted due to errors. Both are Counter32 values defined in IF-MIB (RFC 2863).

On Ethernet-like interfaces, the sub-counters in dot3StatsTable (RFC 3635) provide the breakdown. ifInErrors typically includes alignment errors, FCS (CRC) errors, frame-too-long errors, and internal MAC receive errors. ifOutErrors typically includes late collisions, excessive collisions, internal MAC transmit errors, and carrier sense errors.

Vendor implementations may include additional sub-counters. Cisco IOS, for example, folds runts, giants, no-buffer, overrun, and ignored into the input error total.

There are no 64-bit high-capacity error counters. Unlike octet and packet counters, which have ifHCInOctets and ifHCOutOctets counterparts in ifXTable, error counters remain Counter32 even on 100GE interfaces. On high-speed or high-error-rate interfaces, poll frequently enough to avoid missing wrapped increments between samples.
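
As a rough worked example of the wrap risk, the sketch below estimates how long a Counter32 takes to wrap once at a sustained error rate. The function name and the sample rates are illustrative assumptions, not measurements from a real device.

```python
COUNTER32_MODULUS = 2**32  # a Counter32 wraps from 4294967295 back to 0


def seconds_to_wrap(errors_per_second: float) -> float:
    """Seconds for a Counter32 to wrap once at a sustained error rate."""
    if errors_per_second <= 0:
        raise ValueError("errors_per_second must be positive")
    return COUNTER32_MODULUS / errors_per_second


# Hypothetical rates: a noisy link versus a badly broken high-speed link.
for rate in (1_000, 1_000_000):
    hours = seconds_to_wrap(rate) / 3600
    print(f"{rate:>9} errors/s -> wraps after {hours:.2f} h")
```

At one million errors per second the counter wraps in roughly 72 minutes. A single wrap between two polls can still be recovered with modular arithmetic; two or more wraps in one interval cannot, which is why the polling interval has to stay well below the wrap time.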

Rate-of-change matters more than absolute value. A stable counter with a high absolute value from an old event is not actionable. An incrementing counter at any rate is actionable. Alert on the delta, not the total.
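
A minimal sketch of delta-based evaluation, assuming two raw Counter32 samples and a known polling interval. The helper names are hypothetical, and the modulo only recovers the true increment if the counter wrapped at most once between samples.

```python
COUNTER32_MODULUS = 2**32


def counter32_delta(previous: int, current: int) -> int:
    """Increment between two Counter32 samples, assuming at most one wrap."""
    return (current - previous) % COUNTER32_MODULUS


def error_rate(previous: int, current: int, interval_seconds: float) -> float:
    """Errors per second over the polling interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter32_delta(previous, current) / interval_seconds


# Stale counter: large total, zero delta, nothing to act on.
print(error_rate(47_000, 47_000, 60))      # 0.0
# Active fault: the counter moved by 120 in 60 seconds.
print(error_rate(47_000, 47_120, 60))      # 2.0
# A single wrap between samples still yields the right increment.
print(counter32_delta(4_294_967_290, 10))  # 16
```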

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Dirty fiber or cable degradation | Low, steady CRC error rate (1e-6 to 1e-4 of packets); no utilization correlation | Inspect and clean fiber end faces; check connector seating |
| SFP or optic failure | Sudden step-increase in CRC errors on previously clean link; may correlate with temperature | Swap the optic or move to a spare port |
| Duplex mismatch | Very high error rate (0.1 to 1 percent); errors may appear without explicit CRC; often asymmetric | Verify speed and duplex settings on both ends of the link |
| MTU misconfiguration (tunnels) | Input errors with zero CRC and zero frame on tunnel or sub-interfaces | Verify MTU consistency across the tunnel path |
| Receive buffer exhaustion (overrun) | High overrun count with zero CRC and zero frame; often on firewalls or ISR platforms | Check ingress queue depth, broadcast rate, and control-plane CPU |
| EMI on copper runs | Intermittent CRC and alignment errors correlating with environmental factors | Inspect cable routing near power sources; replace with shielded cable |
| ASIC fault | Output errors or errors on multiple ports simultaneously without physical-layer cause | Engage vendor TAC; prepare for RMA |

Quick checks

All SNMP commands below are read-only and safe to run in production.

```shell
# Poll ifInErrors for all interfaces
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.14

# Poll ifOutErrors for all interfaces
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.20

# Get interface names for context
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.1

# Detailed per-interface error breakdown (Cisco CLI; syntax varies by platform)
ssh <switch> 'show interfaces counters errors'

# Check ifInDiscards separately (some platforms fold drops into errors)
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.13

# Check ifCounterDiscontinuityTime to detect counter resets or flaps
snmpwalk -v2c -c <community> <device> IF-MIB::ifCounterDiscontinuityTime

# Poll Ethernet-like sub-counters (CRC, alignment, frame-too-long)
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.10.7.2

# Verify interface utilization for correlation
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.6   # ifHCInOctets
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.15  # ifHighSpeed
```
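
To read the walks side by side, the output can be joined on ifIndex. The sketch below assumes the usual Net-SNMP line shape (`<oid>.<index> = <TYPE>: <value>`) and parses text you have already captured; the interface names and values in the sample are made up for illustration.

```python
import re

# Matches lines such as "IF-MIB::ifInErrors.3 = Counter32: 47000"
# or "iso.3.6.1.2.1.2.2.1.14.3 = Counter32: 47000".
LINE = re.compile(r"^(?P<oid>\S+)\.(?P<index>\d+) = (?P<type>[\w-]+): (?P<value>.*)$")


def parse_walk(text: str) -> dict[int, str]:
    """Map ifIndex -> raw value string for one walked column."""
    rows: dict[int, str] = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows[int(match.group("index"))] = match.group("value").strip('"')
    return rows


# Hypothetical captured output from the ifName and ifInErrors walks.
names = parse_walk(
    "IF-MIB::ifName.1 = STRING: Gi1/0/1\n"
    "IF-MIB::ifName.2 = STRING: Gi1/0/2"
)
errors = parse_walk(
    "IF-MIB::ifInErrors.1 = Counter32: 0\n"
    "IF-MIB::ifInErrors.2 = Counter32: 47000"
)
for index, value in errors.items():
    print(names.get(index, f"ifIndex {index}"), int(value))
```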

How to diagnose it

  1. Confirm the counter is actually incrementing. Poll ifInErrors twice with a known interval, for example 60 seconds. Compute the delta. A non-zero delta confirms active errors, not a stale counter from an old event.

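  2. Check for counter discontinuity. Poll ifCounterDiscontinuityTime in ifXTable and compare it with the value from the previous poll. If the value changed, the interface flapped, the module was swapped, or the SNMP engine restarted between the two samples. Discard the delta across the discontinuity window to avoid a false spike from the counter resetting. A non-zero value that has not changed only records an earlier discontinuity and does not invalidate the current delta.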

  3. Break down the aggregate into sub-counters. Use vendor CLI (show interfaces counters errors on Cisco, equivalent on other platforms) or poll the Ethernet-like statistics table (dot3StatsTable at .1.3.6.1.2.1.10.7.2) for CRC, alignment, frame-too-long, and internal MAC errors. The sub-counter breakdown tells you which error type dominates.

  4. Classify the error pattern. Match the dominant sub-counter to a failure mode:

    • CRC or FCS errors dominant: physical-layer fault (cable, fiber, SFP, EMI).
    • Overrun dominant with zero CRC: receive buffer exhaustion, not cabling.
    • Input errors with no explicit CRC sub-counter and high rate: duplex mismatch.
    • Input errors on tunnel or sub-interfaces with zero CRC: MTU mismatch.
    • Output errors (collisions, carrier sense): check the local transmit path and remote-end negotiation.

  5. Correlate with utilization. Errors at low utilization point to physical-layer faults. Errors that scale with utilization point to congestion, buffer exhaustion, or duplex mismatch. Poll ifHCInOctets and ifHighSpeed to compute utilization alongside the error rate.

  6. Check ifInDiscards separately. Some platforms include queue overflows in ifInErrors; others separate them in ifInDiscards (.1.3.6.1.2.1.2.2.1.13). If your platform folds drops into errors, the “error” may actually be a congestion drop, not a physical-layer fault. Check vendor documentation for counter composition.

  7. Check syslog for errdisable or hardware events. A port that is erroring may be approaching an errdisable threshold. Correlate error spikes with syslog messages for port security, STP, or hardware alarms.
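
Steps 1 and 2 can be combined into a single check per polling cycle. This is a sketch under stated assumptions: the `Poll` record and its field names are hypothetical, and the values would come from whatever poller you already run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Poll:
    """One poll of a single interface (field names are illustrative)."""
    timestamp: float         # seconds, from the poller's clock
    in_errors: int           # ifInErrors
    discontinuity_time: int  # ifCounterDiscontinuityTime (TimeTicks)


def in_error_delta(prev: Poll, cur: Poll) -> Optional[int]:
    """Return the ifInErrors increment, or None if the delta must be discarded."""
    if cur.discontinuity_time != prev.discontinuity_time:
        return None  # counters were reset between polls; re-baseline on `cur`
    return (cur.in_errors - prev.in_errors) % 2**32  # tolerates one Counter32 wrap


first = Poll(0.0, 47_000, 0)
second = Poll(60.0, 47_120, 0)
after_flap = Poll(120.0, 35, 1_234_500)

print(in_error_delta(first, second))       # 120
print(in_error_delta(second, after_flap))  # None
```

Returning None rather than 0 keeps a discarded sample distinguishable from a genuinely clean interval.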

```mermaid
flowchart TD
    A["ifInErrors delta > 0"] --> B["Check ifCounterDiscontinuityTime"]
    B --> C{"Discontinuity detected?"}
    C -->|Yes| D["Discard delta; re-baseline"]
    C -->|No| E["Break down into sub-counters"]
    E --> F{"CRC or FCS > 0?"}
    F -->|Yes| G["Physical-layer fault"]
    F -->|No, overrun dominates| H["Receive buffer exhaustion"]
    F -->|No sub-counter matches| I{"Errors scale with utilization?"}
    I -->|Yes| J["Duplex mismatch or congestion"]
    I -->|No| K["Check MTU and tunnel config"]
    G --> L["Inspect cable, fiber, SFP"]
    H --> M["Check ingress queue, broadcast rate"]
    J --> N["Verify speed and duplex both ends"]
    K --> O["Verify MTU on tunnel endpoints"]
```
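
The decision flow above can be sketched as a small function. It assumes the discontinuity check has already passed and that per-interval deltas are available; the `ErrorDeltas` record and the returned strings are illustrative, and a real classifier would need platform-specific sub-counters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorDeltas:
    """Per-interval increments for one interface (names are illustrative)."""
    total: int = 0    # ifInErrors increment
    crc: int = 0      # CRC/FCS increment
    overrun: int = 0  # vendor overrun increment


def classify(deltas: ErrorDeltas, scales_with_utilization: bool) -> str:
    """Walk the same branches as the flowchart, top to bottom."""
    if deltas.total <= 0:
        return "no active errors"
    if deltas.crc > 0:
        return "physical-layer fault: inspect cable, fiber, SFP"
    if deltas.overrun > 0:
        return "receive buffer exhaustion: check ingress queue, broadcast rate"
    if scales_with_utilization:
        return "duplex mismatch or congestion: verify speed and duplex on both ends"
    return "check MTU and tunnel config"


print(classify(ErrorDeltas(total=120, crc=118), scales_with_utilization=False))
print(classify(ErrorDeltas(total=500, overrun=500), scales_with_utilization=True))
print(classify(ErrorDeltas(total=40), scales_with_utilization=False))
```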

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| ifInErrors rate of change (delta per interval) | Primary indicator of active link degradation | Any sustained non-zero delta on a critical interface |
| ifOutErrors rate of change | Output errors often indicate ASIC or transmit-path faults | Any sustained non-zero delta |
| Error rate as percentage of packet rate | Normalizes error count against traffic volume | Sustained above 0.01% on a critical interface |
| ifInDiscards and ifOutDiscards | Separates congestion drops from physical errors | Rising in correlation with errors may indicate buffer exhaustion |
| Interface utilization | Errors at high utilization suggest congestion; at low utilization suggest physical faults | Step change in utilization coinciding with error onset |
| ifCounterDiscontinuityTime | Detects counter resets that produce false error spikes | Value change between polls |
| Syslog errdisable events | Ports may be disabled by the platform when error thresholds are crossed | Errdisable message on the same interface |
| Error sub-counters (dot3Stats FCS and alignment; vendor overrun) | Breaks the aggregate into actionable components | Any non-zero sub-counter delta on a previously clean interface |
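
Two of the signals above are derived rather than polled directly. The sketch below shows one way to compute them from per-interval deltas. The function names are hypothetical. The error percentage adds the error delta to the denominator on the assumption that errored frames are not included in the delivered-packet counters.

```python
def error_percent(error_delta: int, packet_delta: int) -> float:
    """Errors as a percentage of frames seen (delivered packets plus errored frames)."""
    seen = packet_delta + error_delta
    return 100.0 * error_delta / seen if seen else 0.0


def utilization_percent(octet_delta: int, interval_seconds: float,
                        if_high_speed_mbps: int) -> float:
    """Inbound utilization from an ifHCInOctets delta and ifHighSpeed (Mb/s)."""
    capacity_bits = if_high_speed_mbps * 1_000_000 * interval_seconds
    return 100.0 * octet_delta * 8 / capacity_bits if capacity_bits else 0.0


# Hypothetical 60-second interval on a 1 Gb/s port.
print(round(error_percent(12, 99_988), 3))           # 0.012
print(utilization_percent(750_000_000, 60.0, 1000))  # 10.0
```

In this made-up interval the port sits at 10 percent utilization with an error percentage just above the 0.01% warning level, which points toward a physical-layer fault rather than congestion.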

Fixes

Physical-layer faults (CRC, alignment, frame errors)

Clean fiber connectors with appropriate cleaning tools. Reseat SFP modules and verify insertion. For copper, inspect for kinked or damaged cable runs and verify termination. If errors persist after cleaning and reseating, swap the optic or move the link to a spare port to isolate whether the fault is in the cable plant or the port hardware.

Swapping optics or moving to a spare port requires a maintenance window on production links, since it interrupts connectivity.

Duplex mismatch

Duplex mismatch produces very high error rates, often 0.1 to 1 percent of packets. It typically occurs when one end of a link is hardcoded to full-duplex and the other is set to autonegotiate: the autonegotiating end receives no negotiation signal from its peer and falls back to half-duplex. Fix by ensuring both ends use the same negotiation mode. If you must hardcode, hardcode both ends to the same speed and duplex. Document the configuration on both ends, since hardcoding removes the link’s ability to adapt to a replaced remote device with different defaults.

MTU misconfiguration

Tunnels with mismatched MTU settings can generate input errors with zero CRC and zero frame counts. Verify that the MTU is consistent across the entire path, including tunnel encapsulation overhead. Enable ICMP fragmentation-needed handling or configure TCP MSS clamping if path MTU discovery is broken. Scope TCP MSS clamping to the tunnel interface only, since applying it globally reduces maximum segment size for all transit traffic.
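
As a worked example of the overhead arithmetic, the sketch below assumes plain GRE over IPv4 on a 1500-byte underlay, with no IP or TCP options and no GRE key or sequence fields. Other encapsulations (IPsec, VXLAN, GRE with options) have different overheads, so treat the constants as assumptions to replace with your own.

```python
IPV4_HEADER = 20  # bytes, without options
TCP_HEADER = 20   # bytes, without options
GRE_HEADER = 4    # bytes, basic GRE header without key or sequence fields


def tunnel_mtu(physical_mtu: int, encapsulation_overhead: int) -> int:
    """Largest inner packet that fits in the underlay without fragmentation."""
    return physical_mtu - encapsulation_overhead


def clamped_mss(mtu: int) -> int:
    """TCP MSS that fits inside the given MTU for IPv4 without options."""
    return mtu - IPV4_HEADER - TCP_HEADER


gre_overhead = IPV4_HEADER + GRE_HEADER  # outer IPv4 header plus GRE header
mtu = tunnel_mtu(1500, gre_overhead)
print(mtu, clamped_mss(mtu))  # 1476 1436
```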

Receive buffer exhaustion

High overrun counts with zero CRC indicate that the receive path is dropping packets because buffers are full, not because frames are corrupted. This points to broadcast storms, undersized ingress queues, or control-plane CPU saturation. Check broadcast and multicast rates on the interface. Increase ingress queue depth if the platform allows it. Address the root cause of the traffic burst rather than just increasing buffer size, since larger buffers increase latency for all traffic on the interface.

ASIC or port hardware fault

Output errors on multiple ports simultaneously, or errors that persist after all physical-layer checks pass, suggest an ASIC fault. Engage vendor support and prepare for an RMA. Hardware replacement requires a maintenance window and may require reconfiguration if the replacement uses a different port layout.

Prevention

  • Alert on error rate, not absolute counter. Compute errors as a percentage of total packets (ifInErrors relative to unicast, multicast, and broadcast packet counts). A flat threshold on the raw counter produces false positives on high-traffic links and false negatives on low-traffic links.
  • Track counter discontinuity. Poll ifCounterDiscontinuityTime between samples and discard deltas where the value changes. This prevents false spikes from interface flaps or device reboots.
  • Baseline error rates per interface class. Uplinks, access ports, and tunnel interfaces have different expected error profiles. Baseline separately and alert on deviation from baseline, not on absolute values.
  • Include sub-counter polling for critical links. Poll the dot3Stats table for FCS (CRC), alignment, and frame-too-long sub-counters alongside the aggregate, and collect the vendor overrun counter where the platform exposes one. Early detection of a rising CRC sub-counter on a clean link is more actionable than a generic “input errors increased” alert.
  • Poll frequently enough for Counter32. Error counters are always 32-bit. On high-speed interfaces, poll at least every 60 seconds to reduce the chance of missing wrapped increments between samples.
  • Correlate errors with discards and utilization. Do not alert on errors in isolation. Correlate with ifInDiscards and interface utilization to distinguish physical faults from congestion.
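
One possible way to express "alert on deviation from baseline" is a mean-plus-k-sigma threshold over recent error rates. This technique is an assumption chosen for illustration, not something prescribed above, and the function name, `sigmas`, and `floor` parameters are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def deviates_from_baseline(history: Sequence[float], current: float,
                           sigmas: float = 3.0, floor: float = 0.0) -> bool:
    """True when the current error rate exceeds mean + sigmas * stddev of history."""
    if len(history) < 2:
        return current > floor  # too little history to form a baseline
    threshold = max(floor, mean(history) + sigmas * pstdev(history))
    return current > threshold


# Hypothetical errors-per-second history for one access port.
history = [0.0, 0.0, 0.1, 0.0, 0.0]
print(deviates_from_baseline(history, 2.0))   # True
print(deviates_from_baseline(history, 0.05))  # False
```

With an all-zero history the threshold collapses to the floor, so any non-zero rate on a previously clean interface is flagged. Keeping a separate history per interface class avoids comparing tunnel interfaces against uplinks.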

Correlation with Netdata

Netdata collects ifInErrors, ifOutErrors, dot3Stats sub-counters, ifInDiscards, and interface utilization on the same per-interface timeline, so you can see whether errors track traffic volume or occur independently. Deltas are computed on the raw Counter32 values, and deltas across ifCounterDiscontinuityTime changes are suppressed to prevent false spikes from interface flaps or reboots.

Alerts fire on sustained non-zero rate-of-change rather than the raw counter value, avoiding noise from stale historical counts. For per-interface baselining, Netdata learns expected error rates and surfaces anomalies relative to each interface’s own baseline.