Trap and syslog flood from link flaps: surviving the storm

A single bad SFP starts flapping. Within seconds, your trap receiver is processing hundreds of linkDown/linkUp pairs per second, your syslog pipeline is drowning in LINK-3-UPDOWN messages, and STP topology change notifications are cascading across the L2 domain. The kernel socket buffer on UDP 162 overflows, and the root-cause hardware alarm is as likely to be dropped as any other datagram in the flood.

The core problem is architectural. Traps and syslog arrive over UDP, a lossy transport with no retransmission. When a receiver socket buffer fills, the kernel silently discards datagrams. There is no per-source accounting. You cannot tell which device’s traps were dropped, only that some were. During a flood, any datagram arriving during the overflow window can vanish, including the one that matters most.

What this means

A link-flap cascade follows a predictable path. A physical-layer fault (bad cable, dirty fiber, failing transceiver, end-station NIC bug, power fluctuation) causes an interface to transition up and down rapidly. Each transition generates a linkDown trap and, moments later, a linkUp trap. On switching platforms, each transition also fires STP topology change notifications as the bridge re-evaluates the forwarding topology. The syslog stream fills with interface state change messages.

A single flapping access port can generate dozens of trap pairs per second. A cascade across multiple ports (STP instability, power event hitting a whole rack) multiplies this by the number of affected interfaces. The trap receiver on UDP 162 and the syslog receiver on UDP 514 each have their own socket receive buffer, but both are subject to the same kernel limits, and their drops land in the same system-wide UDP counters. When the burst outpaces the rate at which a receiver drains its socket buffer, datagrams are silently discarded.

There is no priority queueing for incoming UDP traps. Every datagram that arrives while the socket buffer is full is equally likely to be dropped, regardless of severity. Recovery confirmations, unrelated critical alarms, and the root-cause event itself all face the same odds during the overflow window.
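
The drop behavior described above can be illustrated with a toy model. This is only a sketch: it counts datagrams rather than bytes (a real socket buffer is sized in bytes), and the `fanFailure` label, the capacity, and the drain rate are made-up values chosen for illustration.

```python
from collections import deque

def simulate_tail_drop(ticks, capacity, drain_per_tick):
    """Model a fixed-size receive queue with no priority handling.

    ticks: list of lists; each inner list holds the datagram labels
           arriving in one time slice.
    Returns (delivered, dropped) as lists of labels.
    """
    queue = deque()
    delivered, dropped = [], []
    for arrivals in ticks:
        for label in arrivals:
            if len(queue) < capacity:
                queue.append(label)
            else:
                # Buffer full: discarded without looking at severity.
                dropped.append(label)
        # The receiver drains a bounded number of datagrams per tick.
        for _ in range(min(drain_per_tick, len(queue))):
            delivered.append(queue.popleft())
    delivered.extend(queue)  # whatever is still queued is read eventually
    return delivered, dropped

# One burst: three flap pairs, then the (hypothetical) alarm that explains them.
burst = [["linkDown", "linkUp"] * 3 + ["fanFailure"]]
delivered, dropped = simulate_tail_drop(burst, capacity=4, drain_per_tick=2)
print(dropped)  # ['linkDown', 'linkUp', 'fanFailure']
```

The alarm is lost only because it arrived after the queue filled, not because of what it was.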

flowchart TD
    A["Bad SFP / dirty fiber / NIC bug"] --> B["Interface flaps rapidly"]
    B --> C["linkDown/linkUp trap pairs"]
    B --> D["LINK-3-UPDOWN syslog messages"]
    B --> E["STP topology change notifications"]
    C --> F["Trap receiver UDP 162 flooded"]
    D --> G["Syslog receiver UDP 514 flooded"]
    F --> H["Kernel socket buffer overflows"]
    G --> H
    H --> I["Root-cause trap silently dropped"]
    I --> J["Operators see noise, miss signal"]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Bad cable, SFP, or dirty fiber | Single interface flapping with incrementing ifInErrors | ifInErrors rate and physical-layer syslog |
| End-station NIC bug or flooding | Access port flapping, no physical errors on the switch side | MAC behavior on the port, end-station driver logs |
| Power fluctuation to end device | Intermittent link-loss on one port, often time-correlated with other devices on same PDU | Power infrastructure, UPS logs |
| Speed or duplex mismatch | Very high error rates on the interface, carrier transitions | Interface speed/duplex negotiation on both ends |
| STP instability causing reconvergence | Multiple interfaces transitioning, TCN count rising rapidly | dot1dStpTopChanges and root bridge identity |
| LACP timer mismatch | LACP bundle member flapping, especially under CPU load | LACP periodic timer setting (SLOW vs FAST) |

Quick checks

# Check trap receive rate. Format depends on your snmptrapd logging config;
# adjust the awk field index to match your delimiter/layout.
awk -F'|' '{print $4}' /var/log/snmptrapd.log | sort | uniq -c | sort -rn | head

# Check syslog rate for LINK-3-UPDOWN flood
grep -c 'LINK-3-UPDOWN' /var/log/network-devices/*.log

# Check kernel UDP socket buffer drops (system-wide, all UDP ports)
grep '^Udp:' /proc/net/snmp
# nstat names this counter without the underscore
nstat -az UdpRcvbufErrors

# Check current trap receiver socket fill level
ss -lun '( sport = :162 )' -m

# Capture live trap traffic to see what is arriving
tcpdump -i eth0 -nn 'udp port 162' -c 100

# Check syslog receiver socket fill level
ss -lun '( sport = :514 )' -m

# Identify the flapping interface via SNMP (replace community and host)
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.8   # ifOperStatus

# Check input errors on the suspected interface
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.14  # ifInErrors

# Check STP topology change count
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.17.2.4    # dot1dStpTopChanges
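
For scripted checks, the system-wide UDP drop counter can be read from the same `/proc/net/snmp` text that the commands above print. The following is a minimal sketch, assuming the usual two-line layout (a `Udp:` header line followed by a `Udp:` value line); the function names are illustrative, not part of any tool.

```python
def parse_udp_counters(text):
    """Return {counter_name: int} from the two 'Udp:' lines of /proc/net/snmp."""
    rows = [line.split() for line in text.splitlines() if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("expected a header line and a value line starting with 'Udp:'")
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, (int(v) for v in values)))

def rcvbuf_error_delta(before_text, after_text):
    """Datagrams dropped for lack of receive buffer space between two snapshots."""
    before = parse_udp_counters(before_text)["RcvbufErrors"]
    after = parse_udp_counters(after_text)["RcvbufErrors"]
    return after - before

# Sample snapshots with made-up numbers; on a collector, read /proc/net/snmp twice.
snapshot_1 = ("Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
              "Udp: 1000 2 3 40 3 0\n")
snapshot_2 = ("Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
              "Udp: 9000 2 41 40 41 0\n")
print(rcvbuf_error_delta(snapshot_1, snapshot_2))  # 38
```

The counter covers every UDP socket on the host, so a nonzero delta says that something was dropped, not which listener or which sender was affected.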

How to diagnose it

  1. Identify the flapping interface. Parse the trap log for interface index values in linkDown/linkUp varbinds. The IF-MIB interface index maps to a physical port via ifDescr or ifName. If traps are being dropped (check Udp_RcvbufErrors), fall back to polling ifOperStatus directly.

  2. Confirm it is a flap, not a single transition. A linkDown/linkUp pair frequency above 1 per second on any interface is a flap. More than 5 transitions per minute warrants investigation. Cross-reference with ifInErrors to distinguish physical-layer faults from logical issues.

  3. Check for STP impact. If dot1dStpTopChanges is incrementing rapidly in correlation with the flap, the L2 domain is reconverging. This widens the blast radius beyond the single port.

  4. Verify receiver health. Check Udp_RcvbufErrors on the trap and syslog collector. Any nonzero increment during the event means datagrams were dropped. There is no per-source attribution. If the counter advanced, assume critical traps may be missing.

  5. Check for correlated events that may have been lost. If the flap coincided with a BGP session drop, a hardware alarm, or a config change, those traps or syslog messages may have been dropped in the flood. Poll the relevant OIDs directly to reconstruct state.

  6. Confirm snmptrapd is patched. A buffer overflow vulnerability in net-snmp’s snmptrapd could allow a remote attacker to crash the daemon via crafted trap packets, and a trap flood is an ideal delivery vector. If your snmptrapd is unpatched, the flood may take down the receiver entirely rather than just dropping packets.
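
The thresholds from step 2 can be applied mechanically once transitions have been extracted from the trap log or from polling. This is a rough sketch: the `(if_index, new_state)` event format and the labels it returns are assumptions made for illustration, and the function only counts transitions inside one observation window.

```python
from collections import Counter

def classify_interfaces(events, window_seconds):
    """Label each interface from its transition count within one window.

    events: iterable of (if_index, new_state) tuples, one per linkDown/linkUp.
    Returns {if_index: "flapping" | "investigate" | "ok"}.
    """
    transitions = Counter(if_index for if_index, _state in events)
    labels = {}
    for if_index, count in transitions.items():
        pairs_per_second = (count / 2.0) / window_seconds
        per_minute = count * 60.0 / window_seconds
        if pairs_per_second > 1.0:      # more than 1 down/up pair per second
            labels[if_index] = "flapping"
        elif per_minute > 5.0:          # more than 5 transitions per minute
            labels[if_index] = "investigate"
        else:
            labels[if_index] = "ok"
    return labels

# Hypothetical one-minute window: ifIndex 7 storms, 9 is suspicious, 11 bounced once.
events = ([(7, "down"), (7, "up")] * 65
          + [(9, "down"), (9, "up")] * 3
          + [(11, "down"), (11, "up")])
print(classify_interfaces(events, window_seconds=60))
# {7: 'flapping', 9: 'investigate', 11: 'ok'}
```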

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Trap receive rate | Sudden spike indicates an event cascade | Rate 10x or more above rolling baseline |
| linkDown/linkUp pair frequency | Identifies the flapping interface | More than 1 pair per second on any interface |
| Udp_RcvbufErrors | Silent datagram loss at the kernel socket buffer | Any nonzero increment during a flood |
| Syslog receive rate and severity | Parallel flood on UDP 514 | Rate spike without severity escalation means noise, not a real event |
| ifOperStatus transitions | Confirms the flap via polling when traps are unreliable | More than 3 transitions per minute |
| ifInErrors | Physical-layer root cause indicator | Any sustained increment |
| STP topology change count (dot1dStpTopChanges) | Measures L2 blast radius | Rapid increment correlated with the flap |
| Device control-plane CPU | Trap generation and processing consume device CPU | Sustained above 70% during the event |
| Trap source diversity | A single sender dominating trap volume is a finding | One device producing more than 50% of all traps |
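
The "10x above rolling baseline" warning sign for trap receive rate can be checked with a few lines. This is a sketch under assumptions: the baseline is a plain mean of recent per-interval counts, and the `floor` parameter (not from the table) keeps a near-silent baseline from flagging every stray trap.

```python
from statistics import mean

def is_rate_spike(history, current, factor=10.0, floor=1.0):
    """True when the current interval's trap count is at least
    `factor` times the rolling baseline.

    history: recent per-interval trap counts (the rolling window).
    floor:   minimum baseline, an assumption to avoid alerting on
             a handful of traps after a quiet period.
    """
    if not history:
        return False  # no baseline yet
    baseline = max(mean(history), floor)
    return current >= factor * baseline

recent = [4, 6, 5, 5]             # made-up traps per interval, mean is 5
print(is_rate_spike(recent, 49))  # False
print(is_rate_spike(recent, 50))  # True
```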

Fixes

Stop the flood at the source

The fastest mitigation is to administratively shut down the flapping interface. This stops trap generation immediately and lets the receiver drain its backlog.

# Identify flapping interfaces (exec mode, run from the management host)
ssh <device> 'show interfaces | include protocol.*down|reset|err-disable'

! Disruptive: shuts down the interface. Required config-mode commands:
device# configure terminal
device(config)# interface <if>
device(config-if)# shutdown
device(config-if)# end

This is disruptive to whatever is connected to that port, but it preserves visibility for the rest of the network. If the port is an access port serving a single end station, the tradeoff is almost always correct.

On Cisco IOS and IOS XE, an interface that flaps more than 5 times within 10 seconds enters errdisable state. Both a syslog message and an SNMP trap are sent upon port shutdown. Enable automatic recovery so the port does not require manual intervention:

device# configure terminal
device(config)# errdisable recovery cause link-flap
device(config)# errdisable recovery interval 300
device(config)# end

# View current flap thresholds and recovery config (exec mode)
ssh <device> 'show errdisable flap-values'
ssh <device> 'show errdisable recovery'

The default 300-second recovery interval gives the physical issue time to stabilize. If the port flaps again after recovery, it re-enters errdisable. This is a self-healing mechanism that prevents sustained floods.
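
The link-flap threshold is a sliding-window count, which the following sketch models. It follows the wording above ("more than 5 times within 10 seconds") literally; how a given platform counts flaps internally may differ, so treat this as an illustration rather than a description of the device's implementation.

```python
from collections import deque

def first_errdisable_time(flap_times, max_flaps=5, window_seconds=10.0):
    """Return the timestamp at which the number of flaps inside the
    sliding window first exceeds max_flaps, or None if it never does."""
    recent = deque()
    for t in sorted(flap_times):
        recent.append(t)
        # Forget flaps that have aged out of the window.
        while recent and t - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) > max_flaps:
            return t
    return None

print(first_errdisable_time([0, 1, 2, 3, 4, 5]))      # 5
print(first_errdisable_time([0, 3, 6, 9, 12, 15]))    # None
```

A port that flaps slowly, as in the second call, never trips the threshold, which is why rate limiting and receiver hardening are still needed alongside errdisable.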

Arista EOS detects continuously flapping interfaces and temporarily holds the link-down state to smooth out flap churn. If you are running a recent EOS version, verify it is enabled on access ports.

Set LACP periodic timer to SLOW (Juniper EX)

On Juniper EX2300 and EX3400 series, LACP bundles can flap during CPU-intensive events such as routing engine switchover or interface flaps elsewhere on the device. Setting the LACP periodic timer to SLOW reduces CPU overhead and prevents spurious member-link flaps that trigger trap floods.

Rate-limit syslog and traps on the device

Cisco IOS provides a global syslog rate-limit that caps messages per second.

device# configure terminal
device(config)# logging rate-limit 10
device(config)# end

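For per-message filtering, use a logging discriminator. A discriminator is defined in a single command, its match keywords are includes and drops, and its rate-limit applies to every message that passes the filter. Options vary by release, so confirm with the CLI help before deploying:

! Drop UPDOWN mnemonics (LINK-3-UPDOWN, LINEPROTO-5-UPDOWN); cap the rest at 5 msgs/sec
device(config)# logging discriminator LINKFLAP mnemonics drops UPDOWN rate-limit 5
! Apply the discriminator to a logging destination:
device(config)# logging host <host> discriminator LINKFLAP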

For trap suppression on specific interfaces, use per-interface link-status trap control. The interface-level command is no snmp trap link-status on most IOS and IOS XE trains. The global form is also available:

! Suppress linkup/linkdown traps globally
device(config)# no snmp-server enable traps snmp linkup linkdown

Suppressing linkup/linkdown traps globally is aggressive. It eliminates the signal entirely. A better approach for access ports is per-interface suppression while keeping traps enabled on uplinks and critical infrastructure ports.

Harden the receiver

Tune the UDP socket buffer on the collector to absorb bursts.

# Increase UDP socket buffer limits (requires root)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216

# Verify the trap listener is using the larger buffer
ss -lun '( sport = :162 )' -m

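Apply these persistently via sysctl configuration (for example, a file under /etc/sysctl.d/). If the collector process sets SO_RCVBUF explicitly, the kernel silently caps the request at rmem_max, so raise rmem_max before asking for a larger buffer. If it does not set SO_RCVBUF, it inherits rmem_default, which is why both sysctl keys matter. Note that rmem_default applies to every socket on the host that does not set its own buffer size, not only the trap and syslog listeners, and that already-open sockets keep the size they were created with, so restart the collector after changing it.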
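
To see what a collector process actually gets when it asks for a buffer size, request one with SO_RCVBUF and read the value back. This is a minimal sketch using Python's standard `socket` module; the function name and the 8 MiB request are illustrative. On Linux the value read back is double the amount stored, because the kernel reserves room for bookkeeping overhead, and the request is capped at rmem_max first.

```python
import socket

def open_udp_listener(host, port, requested_rcvbuf):
    """Open a UDP socket, request a receive buffer size, and report
    what the kernel granted. Returns (sock, granted_bytes)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Must be at or below net.core.rmem_max to be honored in full.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_rcvbuf)
    sock.bind((host, port))
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted

def check_listener(host="127.0.0.1", port=0, requested=8 * 1024 * 1024):
    """Print requested vs granted size. Port 0 picks an unprivileged
    ephemeral port; binding 162 or 514 needs elevated privileges."""
    sock, granted = open_udp_listener(host, port, requested)
    try:
        print(f"requested={requested} granted={granted}")
    finally:
        sock.close()
    return granted
```

If the granted value is far below the request, rmem_max is still the limiting factor.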

Patch snmptrapd against known buffer overflow vulnerabilities. During a link-flap flood, a malicious trap crafted to exploit a vulnerability would be indistinguishable from flood noise.

Prevention

Enable errdisable link-flap detection globally on all Cisco switching platforms. This converts a sustained flap into a single shutdown event with one trap pair instead of hundreds.

Configure syslog rate-limiting on all devices. This prevents the syslog pipeline from being overwhelmed during any event burst.

Suppress linkup/linkdown traps on access ports where the physical state is not operationally critical. Keep traps enabled on uplinks, peer-facing interfaces, and any port in the critical path.

Monitor Udp_RcvbufErrors on trap and syslog collectors continuously. Any nonzero increment means data was lost. This is the single most under-monitored signal in trap and syslog collection.

Watch trap source diversity. If a single device is producing more than 50% of all traps over any sustained window, that device is likely in distress. Alert on this ratio before the flood reaches the receiver.
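
The 50% ratio is simple to compute from the sender addresses seen in a window. This is a sketch assuming the source addresses have already been extracted from the trap log; the function name and the sample addresses are made up.

```python
from collections import Counter

def dominant_source(sources, threshold=0.5):
    """Return (source, share) if one sender produced more than
    `threshold` of the traps in the window, else None."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return None
    source, count = counts.most_common(1)[0]
    share = count / total
    return (source, share) if share > threshold else None

window = ["10.0.0.1"] * 8 + ["10.0.0.2"] * 2   # hypothetical senders
print(dominant_source(window))                  # ('10.0.0.1', 0.8)
```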

Be aware of platform-specific blind spots. On Cisco IOS XE, console trap display may be suppressed by interval-based message limiting during a storm, meaning an operator watching the console may miss the flood entirely.

Ensure NTP synchronization on all devices. During a flood, postmortem correlation depends on accurate timestamps. Two devices 200ms apart on the same flap will produce trap and syslog records that fail to align in reconstruction.
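
The effect of clock skew on reconstruction can be shown with a small matching routine. This is only a sketch: the nearest-neighbor pairing and the 100 ms tolerance are assumptions for illustration, not how any particular correlation tool works.

```python
def align_events(a_ms, b_ms, tolerance_ms):
    """Pair each timestamp in a_ms with the nearest unused one in b_ms.

    Timestamps are integer milliseconds. Returns (pairs, unmatched_a).
    """
    unused = sorted(b_ms)
    pairs, unmatched = [], []
    for t in sorted(a_ms):
        nearest = min(unused, key=lambda s: abs(s - t), default=None)
        if nearest is not None and abs(nearest - t) <= tolerance_ms:
            pairs.append((t, nearest))
            unused.remove(nearest)
        else:
            unmatched.append(t)
    return pairs, unmatched

flaps = [10_000, 11_000]               # device A, milliseconds
synced = [t + 20 for t in flaps]       # device B with a 20 ms offset
skewed = [t + 200 for t in flaps]      # device B with a 200 ms offset
print(align_events(flaps, synced, tolerance_ms=100))
# ([(10000, 10020), (11000, 11020)], [])
print(align_events(flaps, skewed, tolerance_ms=100))
# ([], [10000, 11000])
```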

How Netdata helps

Netdata correlates the signals that matter during a trap flood, providing cross-layer visibility that raw trap logs cannot:

  • Trap receive rate anomaly detection surfaces the spike within seconds, before the receiver buffer overflows.
  • Udp_RcvbufErrors monitoring on the collector catches silent datagram loss that would otherwise go unnoticed until a postmortem fails to reconstruct the event.
  • Syslog ingestion rate and severity distribution distinguish a noise flood (rate spike without severity escalation) from a real event cascade (rate spike with severity escalation).
  • ifOperStatus transition tracking via SNMP polling confirms the flapping interface even when traps are being dropped, providing a ground-truth fallback.
  • Device control-plane CPU monitoring during the event identifies devices under stress from trap generation and STP recalculation.
  • STP topology change correlation with interface state transitions reveals the L2 blast radius of the flap in real time.