Stale FDB/MAC tables: why endpoint location is wrong
Your topology platform says endpoint aa:bb:cc:dd:ee:ff is on switch port Gi1/0/24. Your security team sends someone to that port. The endpoint is not there. It moved hours ago, or it went offline, or it vMotioned to a different host. The FDB entry was stale and the platform presented it as current.
The Forwarding Database (FDB), also called the MAC address table or CAM table, maps MAC addresses to switch ports. Topology inference engines use FDB data, cross-referenced with ARP tables and CDP/LLDP neighbor data, to deduce where endpoints are physically connected. The inference is probabilistic. It degrades as input data freshness degrades.
The core problem: standard MIBs do not expose how long ago an entry was last refreshed. Staleness must be inferred from polling deltas, and most teams do not compute it. The result: security investigations go to the wrong switch port, the wrong building, the wrong rack. Positioning queries return wrong answers with high confidence.
What this means
An FDB entry is learned when the switch receives a frame with a source MAC on a port. It ages out after a configurable timer with no traffic from that MAC on that port. If the MAC reappears on a different port, a new entry is learned and the old entry is overwritten or removed.
The failure occurs in the gap between “entry aged out” and “topology engine noticed.” During that gap, the topology engine still has the old entry (or no entry) in its cache. If the endpoint moved, the engine does not know the new location. If the endpoint went offline, the engine thinks it is still connected.
The ARP table adds a second layer of staleness. ARP maps IP addresses to MAC addresses, and ARP timeouts are typically much longer than MAC aging timers. This asymmetry compounds the staleness problem. On Arista EOS, MAC aging defaults to 5 minutes (300 seconds) but the ARP timeout defaults to 4 hours (14400 seconds). On Cisco IOS, the ARP timeout also defaults to 4 hours. On Linux, neighbor cache lifetime depends on several sysctl parameters (`gc_stale_time`, `gc_thresh1/2/3`, `base_reachable_time`), and in practice entries can persist well beyond the MAC aging timer. A quiet host can therefore lose its MAC-to-port mapping while its IP-to-MAC mapping persists for hours. A Layer 3 switch routing into the VLAN still knows which MAC the IP belongs to, but it has forgotten which port that MAC is on. Unicast frames to that host are flooded to every port in the VLAN until the entry is relearned.
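The asymmetry is easy to quantify for a given device. The sketch below is a minimal POSIX sh example that assumes you already know both timers in seconds. `timer_gap` is a hypothetical helper, not a vendor tool.

```shell
# timer_gap MAC_AGING_S ARP_TIMEOUT_S
# Hypothetical helper: prints how long an ARP entry can outlive the
# MAC-to-port entry for a quiet host, and the integer ratio of the timers.
timer_gap() {
    mac_aging=$1
    arp_timeout=$2
    if [ "$mac_aging" -le 0 ]; then
        # 0 has platform-specific meaning; do not divide by it.
        echo "invalid MAC aging value: $mac_aging"
        return 1
    fi
    if [ "$arp_timeout" -le "$mac_aging" ]; then
        echo "no gap: ARP timeout does not exceed MAC aging"
        return 0
    fi
    # Window during which IP-to-MAC is known but MAC-to-port is not.
    echo "gap=$((arp_timeout - mac_aging))s ratio=$((arp_timeout / mac_aging))x"
}
```

With the Arista defaults quoted above, `timer_gap 300 14400` prints `gap=14100s ratio=48x`. A quiet host's IP-to-MAC mapping can outlive its MAC-to-port mapping by almost four hours.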
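MAC aging defaults also vary by platform. Cisco Catalyst switches running IOS typically default to 300 seconds. The Cisco Nexus 7000 series defaults to 1800 seconds (30 minutes), so stale entries linger six times longer than on Catalyst. The Linux bridge default is also 300 seconds. You can change it with `ip link set dev <br> type bridge ageing_time <value>`, where the kernel interprets the value in hundredths of a second, so 300 seconds is written as 30000. A value of 0 does not disable aging on a Linux bridge. Learned entries expire almost immediately, and the bridge then floods most unicast frames like a hub. On Cisco IOS, by contrast, `mac address-table aging-time 0` disables aging entirely, which is a common misconfiguration that causes permanent staleness. Open vSwitch uses `ovs-vsctl set bridge <br> other_config:mac-aging-time=<seconds>`, in seconds, with a default of 300.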
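The Linux bridge stores `ageing_time` in hundredths of a second, so a raw value read from the system is easy to misread by a factor of 100. The sketch below normalizes such a value. It assumes the centisecond unit used by current kernels (USER_HZ is 100 on common architectures), and `bridge_ageing_seconds` is a hypothetical helper.

```shell
# bridge_ageing_seconds RAW
# Hypothetical helper: converts a Linux bridge ageing_time value, reported
# and accepted by the kernel in hundredths of a second, into whole seconds.
# A zero value is reported separately so it gets a human review.
bridge_ageing_seconds() {
    raw=$1
    if [ "$raw" -eq 0 ]; then
        echo "0 (ageing_time is zero; verify this is intentional)"
    else
        echo "$((raw / 100))"
    fi
}
```

On a default bridge, `bridge_ageing_seconds "$(cat /sys/class/net/br0/bridge/ageing_time)"` should print `300`.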
```mermaid
flowchart TD
A["Endpoint goes quiet
or moves to new port"] --> B["MAC aging timer expires"]
B --> C["FDB entry removed
from old port"]
C --> D["ARP entry persists
timeout much longer"]
D --> E["Topology engine
next poll cycle"]
E --> F["FDB: no entry or
stale entry for MAC"]
F --> G["ARP: IP to MAC
still mapped"]
G --> H["Positioning returns
wrong or missing port"]
H --> I["Unicast floods until
MAC is relearned"]
```
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Aging timer asymmetry (ARP vs MAC) | Switch has ARP entry but no FDB entry for the MAC; unicast floods | Compare ARP timeout vs MAC aging timer on the device |
| Polling cadence longer than aging timer | Topology engine is always one cycle behind; entries expire between polls | Compare poll interval against MAC aging timer |
| Endpoint mobility (vMotion, failover) | MAC appears on new port but old entry persists in topology cache | Check FDB for the MAC across multiple switches in the VLAN |
| Stale entries not aged aggressively | Offline hosts remain in FDB for hours; positioning still reports high confidence | Check aging timer configuration; compute time since last refresh |
| FDB table full or eviction | Intermittent connectivity for previously-stable hosts; oldest entries evicted | Check FDB size against platform capacity |
| Aging disabled (aging time 0 on Cisco IOS) | Entries never expire; offline hosts persist permanently | Check the configured aging time; on Cisco IOS, 0 means disabled (on a Linux bridge, 0 instead causes flooding) |
| OVS static FDB aging bug (pre-Oct 2022) | Dynamic entries never age out when static entries coexist | Check OVS version against fix commit ccc24fc88d59 |
Quick checks
These are read-only commands safe to run during an incident.
```shell
# Check FDB entry count on a switch (Q-BRIDGE-MIB dot1qTpFdbPort, VLAN-aware).
# dot1qTpFdbAddress (.1) is an index column and is not readable, so walk .2.
snmpwalk -v2c -c <community> <switch> .1.3.6.1.2.1.17.7.1.2.2.1.2 | wc -l
# Show MAC address-table count via CLI
ssh <switch> 'show mac address-table count'
# Check ARP cache summary
ssh <device> 'show ip arp summary'
# Look up a specific MAC in the FDB (BRIDGE-MIB dot1dTpFdbTable).
# BRIDGE-MIB stores the MAC as dotted-decimal octets in the OID index,
# e.g. aa:bb:cc:dd:ee:ff appears as 170.187.204.221.238.255 in the walk output.
# -On prints numeric OIDs so the index is not rendered as a string.
# On Cisco IOS, BRIDGE-MIB tables are per VLAN: query with <community>@<vlan>.
snmpwalk -v2c -On -c <community> <switch> .1.3.6.1.2.1.17.4.3.1 | grep -F '<dotted-decimal-mac>'
# Check STP topology change count (a spike means reconvergence, which
# shortens MAC aging under 802.1D or flushes entries under RSTP)
snmpget -v2c -c <community> <switch> .1.3.6.1.2.1.17.2.4.0
# Check aging timer on a Linux bridge (value is in hundredths of a second)
ip -d link show br0 | grep -o 'ageing_time [0-9]*'
# Check aging timer on OVS (a "no key" error means the 300-second default applies)
ovs-vsctl get bridge br0 other_config:mac-aging-time
# Cross-reference ARP vs FDB for a specific endpoint
ssh <switch> 'show ip arp <IP>'
ssh <switch> 'show mac address-table address <MAC>'
# Check ARP cache via SNMP (IP-MIB ipNetToPhysicalTable, IPv4 and IPv6).
# Older agents may only implement ipNetToMediaTable (.1.3.6.1.2.1.4.22.1).
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.4.35.1
# Check OVS FDB learning, expiry, eviction and move counters (OVS 2.10+)
ovs-appctl fdb/stats-show <bridge>
```
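Converting a MAC into the dotted-decimal form for the BRIDGE-MIB grep by hand is error-prone. The POSIX sh and awk sketch below accepts colon, dash, or Cisco dotted notation. `mac_to_oid` is a hypothetical helper.

```shell
# mac_to_oid MAC
# Hypothetical helper: converts aa:bb:cc:dd:ee:ff, aa-bb-cc-dd-ee-ff or
# Cisco-style aabb.ccdd.eeff into 170.187.204.221.238.255 (OID index form).
# Returns non-zero if the input is not exactly 12 hex digits.
mac_to_oid() {
    # Strip separators and lowercase so every notation becomes 12 hex digits.
    hex=$(printf '%s' "$1" | tr -d ':.-' | tr 'A-F' 'a-f')
    printf '%s\n' "$hex" | awk '
        length($0) != 12 || $0 ~ /[^0-9a-f]/ { exit 1 }
        {
            digits = "0123456789abcdef"
            out = ""
            # Portable hex decoding (strtonum is gawk-only).
            for (i = 1; i <= 11; i += 2) {
                hi = index(digits, substr($0, i, 1)) - 1
                lo = index(digits, substr($0, i + 1, 1)) - 1
                out = out (i > 1 ? "." : "") (hi * 16 + lo)
            }
            print out
        }'
}
```

For example, `snmpwalk -v2c -On -c <community> <switch> .1.3.6.1.2.1.17.4.3.1.2 | grep -F ".$(mac_to_oid aabb.ccdd.eeff) ="` should return only the `dot1dTpFdbPort` row for that MAC. The leading dot and trailing ` =` anchor the match to the full six-octet index.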
How to diagnose it
1. Identify the endpoint. Get the MAC address and last-known IP. If you only have an IP, look it up in ARP first.
2. Check the FDB for the MAC on the expected switch. Use `show mac address-table address <MAC>`, or an SNMP walk of the Q-BRIDGE-MIB `dot1qTpFdbPort` column at `.1.3.6.1.2.1.17.7.1.2.2.1.2`. The MAC is part of that column's index, because `dot1qTpFdbAddress` itself is not readable. If the entry is absent, the MAC has aged out or moved. If present, note the port and VLAN.
3. Check the ARP table for the IP. Use `show ip arp <IP>`, or an SNMP walk of `ipNetToPhysicalEntry` at `.1.3.6.1.2.1.4.35.1`. If ARP has the entry but the FDB does not, you have aging timer asymmetry: the switch knows the IP-to-MAC mapping but has lost the MAC-to-port mapping.
4. Check neighboring switches. If the endpoint moved, the FDB entry may exist on a different switch in the same L2 domain. Walk the FDB on the access switches connected to the same VLAN.
5. Check the aging timer configuration. Compare the MAC aging timer against your topology engine's polling cadence. If aging (300 seconds by default) is shorter than or close to your poll interval, the topology engine is systematically behind reality.
6. Check STP topology changes. A spike in `dot1dStpTopChanges` at `.1.3.6.1.2.1.17.2.4` indicates reconvergence. Under legacy STP (802.1D), a topology change shortens the MAC aging timer to the forward delay (15 seconds by default) for the topology change period, which lasts `max_age + forward_delay` (35 seconds with defaults). Under RSTP (802.1w), the aging timer is not shortened. A bridge that receives a topology change notification instead flushes the MAC entries learned on all its ports except the port on which the notification arrived.
7. Check FDB capacity. If the FDB is near capacity, the platform either stops learning new MACs, whose traffic is then flooded, or evicts existing entries to make room. On OVS, `ovs-appctl fdb/stats-show <bridge>` exposes learning statistics. OVS evicts the oldest entry from the port that holds the most FDB entries.
8. Check topology inference confidence. If your topology engine exposes a per-endpoint confidence score, check whether it dropped for the endpoint in question. Low confidence means the engine itself is uncertain, but many platforms present low-confidence results with the same UI treatment as high-confidence ones.
9. Check the endpoint positioning orphan rate. If your topology engine tracks MAC addresses it cannot resolve to a physical switch port, a rising orphan rate indicates systematic positioning failure.
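Once both tables are exported as plain text, the ARP-versus-FDB comparison from the diagnosis steps can be scripted. The sketch below assumes a simple normalized input format, not raw device output: ARP lines are `ip mac`, FDB lines are `mac port`, and MACs use the same notation in both. `arp_without_fdb` is a hypothetical helper.

```shell
# arp_without_fdb ARP_FILE FDB_FILE
# Hypothetical helper. Assumed input formats (normalize device output first):
#   ARP_FILE: one "ip mac" pair per line
#   FDB_FILE: one "mac port" pair per line
# Prints "ip mac" for every ARP entry whose MAC has no FDB entry, which is
# the aging-asymmetry signature. Matching FILENAME (not NR == FNR) keeps
# the result correct even when FDB_FILE is empty.
arp_without_fdb() {
    awk -v fdb="$2" '
        FILENAME == fdb { seen[$1] = 1; next }
        NF >= 2 && !($2 in seen) { print $1, $2 }
    ' "$2" "$1"
}
```

Piping the result through `wc -l` and dividing by the number of ARP entries gives a rough orphan rate for that device.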
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| FDB entry count | Approaching capacity causes flooding and eviction | Sustained growth; above 80% of platform limit |
| FDB entry freshness (computed from polling deltas) | Stale entries mislead endpoint positioning | No refresh in 3 to 4 times the aging interval |
| ARP cache entry count and staleness | Long-lived ARP entries outlive MAC entries, hiding stale mappings | ARP entries persisting beyond MAC aging timeout |
| STP topology change count | Reconvergence flushes and rebuilds FDB | Sudden spike in dot1dStpTopChanges |
| Topology view consistency (CDP/LLDP vs FDB vs ARP) | Disagreement reveals stale data or topology change in progress | Persistent inconsistency across sources |
| Topology inference confidence score | Low confidence means positioning is unreliable | Sustained drop, especially after mobility events |
| Endpoint positioning orphan rate | MACs unresolved to physical ports indicate positioning failure | Sustained above 5% of discovered endpoints |
| Per-port MAC count | An abnormal concentration on one access port can indicate a daisy-chained unmanaged switch or a MAC flooding attack | Single access port with unexpectedly high count |
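The per-port MAC count can be derived from the same Q-BRIDGE-MIB walk used for the FDB entry count. The sketch below assumes net-snmp `snmpwalk -On` output for `dot1qTpFdbPort`. The values are bridge port numbers, which you map to interfaces through `dot1dBasePortIfIndex` (`.1.3.6.1.2.1.17.1.4.1.2`). `ports_over_limit` is a hypothetical helper.

```shell
# ports_over_limit LIMIT
# Hypothetical helper. Reads `snmpwalk -On` output for dot1qTpFdbPort
# (.1.3.6.1.2.1.17.7.1.2.2.1.2) on stdin and prints "bridge_port mac_count"
# for every bridge port holding more than LIMIT MAC entries.
ports_over_limit() {
    awk -v limit="$1" '
        / = INTEGER: / && $NF != 0 {   # port 0 means the port was not learned
            count[$NF]++
        }
        END {
            for (p in count)
                if (count[p] > limit) print p, count[p]
        }' | sort -n
}
```

For example: `snmpwalk -v2c -On -c <community> <switch> .1.3.6.1.2.1.17.7.1.2.2.1.2 | ports_over_limit 5`.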
Fixes
Align aging timers across ARP and MAC
The most common root cause is asymmetric aging. If your platform allows tuning both timers, narrow the gap. On Cisco IOS, set MAC aging globally with `mac address-table aging-time <seconds>` and the ARP timeout per interface with `arp timeout <seconds>`. On Linux bridges, use `ip link set dev <br> type bridge ageing_time <value>`, where the value is in hundredths of a second (30000 for 300 seconds). On OVS, use `ovs-vsctl set bridge <br> other_config:mac-aging-time=<seconds>`.
Tradeoff: shorter aging timers cause more unknown-unicast flooding and relearning overhead. Longer timers cause more staleness. Virtualization environments with frequent vMotion benefit from shorter timers. Stable access-layer environments can tolerate longer ones.
Increase polling cadence for FDB and ARP
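If your topology engine polls the FDB every 5 minutes and MAC aging is 300 seconds, the engine can be up to a full cycle behind. A host that sends a single burst of traffic can also be learned and aged out between two polls without ever being seen. Either poll at a fraction of the aging interval, or lengthen the aging timer so entries survive across poll cycles.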
Tradeoff: walking a large FDB table on a data center switch with 50,000 MAC entries can take seconds and spike device control-plane CPU. Target large-table walks to specific VLANs or use streaming telemetry where available.
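The cadence tradeoff is easier to reason about with worst-case numbers. The sketch below uses deliberately simplified assumptions: a host stops sending at a single instant, and the switch ages entries exactly at the timer. `staleness_window` is a hypothetical helper.

```shell
# staleness_window AGING_S POLL_S
# Hypothetical helper with simplified assumptions:
#   - an offline host's entry stays on the switch for up to AGING_S, and the
#     engine only notices the removal at the next poll;
#   - a moved host's new port (once it sends traffic) is only seen at the
#     next poll; its old entry can linger on other switches for up to AGING_S.
staleness_window() {
    aging=$1
    poll=$2
    echo "offline_worst_case=$((aging + poll))s move_worst_case=${poll}s"
    if [ "$poll" -ge "$aging" ]; then
        echo "warning: poll interval >= aging timer; an entry can be learned and aged out between two polls"
    fi
}
```

With a 300-second timer, `staleness_window 300 60` reports a worst case of 360 seconds for an offline host. `staleness_window 300 300` also prints the warning.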
Trigger gratuitous ARP after failover
After a failover event (HSRP, VRRP, F5 LTM HA), switches retain stale MAC and ARP entries pointing to the old active node. Explicitly clearing the MAC address table or triggering gratuitous ARPs accelerates convergence.
WARNING: `clear mac address-table dynamic` flushes all dynamically learned MAC entries on the switch. This causes temporary unicast flooding on all VLANs until entries are relearned. Where the platform allows, scope the command (for example with `vlan <id>` or `address <MAC>`). Run it only during a planned maintenance window or an active failover. The same applies to clearing ARP (`clear ip arp`) on Cisco platforms.
Tradeoff: clearing the MAC table disrupts traffic briefly while entries are relearned.
Patch or upgrade OVS if static entries block aging
Prior to a fix merged in late 2022 (commit ccc24fc88d59), OVS had a bug where the presence of static FDB entries prevented learned dynamic entries from aging out normally. If you are running an older OVS version and mixing static and dynamic FDB entries, upgrade.
Enable port-security to limit per-port MAC count
Port-security limits the number of MAC addresses learned per port, preventing stale or rogue entries from accumulating. Its absence is itself a finding.
Tradeoff: too-low limits disrupt legitimate virtualization hosts with many VM MACs. Tune per port class.
Prevention
- Compute FDB/ARP freshness from polling deltas. Standard MIBs do not expose “time since last refresh.” You must derive it by tracking when entries appear and disappear across poll cycles. Flag entries as stale if no poll has re-observed them for more than 3 to 4 times the aging interval.
- Track topology inference confidence per endpoint. Do not rely on aggregate confidence. Per-endpoint confidence tells you which specific endpoints have unreliable positioning. Endpoints behind wireless APs and VMs with MAC mobility often have persistently low confidence.
- Cross-validate topology sources. Compare CDP/LLDP neighbor data against FDB and ARP. When three sources agree, confidence is high. When one source disagrees, investigate. Persistent inconsistency across sources for more than 24 hours indicates a stale-data problem.
- Monitor endpoint positioning orphan rate. Track the percentage of MAC addresses the topology engine cannot resolve to a physical switch port. A rising orphan rate after network changes indicates systematic positioning failure.
- Align aging timers across the L2 domain. Ensure MAC aging and ARP timeout are configured consistently across switches in the same broadcast domain. Mismatched timers between adjacent switches cause asymmetric staleness.
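- Verify aging is not disabled. On Cisco IOS, an aging time of 0 disables aging entirely, so entries never expire. On Linux bridges, an `ageing_time` of 0 has the opposite effect: entries expire almost immediately and the bridge floods traffic. Confirm that the value is the one you intended. Check this on every bridge and switch in production.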
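Freshness tracking across polls needs only the poll timestamp, MAC, and port of each observed entry. The sketch below uses awk and assumes a hypothetical collector that appends one `poll_epoch mac port` line per FDB entry per poll, for a single switch and in poll order. The 3 to 4 times factor is passed in as a parameter. `fdb_freshness` is a hypothetical helper.

```shell
# fdb_freshness NOW_EPOCH AGING_S FACTOR < snapshots
# Hypothetical helper. stdin uses an assumed format from your own collector:
# one "poll_epoch mac port" line per FDB entry per poll, one switch only.
# For each MAC it prints: mac last_port seconds_since_last_seen status
#   stale: not re-observed for more than FACTOR * AGING_S seconds
#   moved: port changed between polls
#   fresh: otherwise
fdb_freshness() {
    awk -v now="$1" -v aging="$2" -v factor="$3" '
        NF == 3 {
            ts = $1; mac = $2; port = $3
            if ((mac in last) && ts < last[mac]) next   # ignore out-of-order lines
            if ((mac in port_of) && port_of[mac] != port) moved[mac] = 1
            last[mac] = ts
            port_of[mac] = port
        }
        END {
            for (m in last) {
                age = now - last[m]
                if (age > factor * aging) status = "stale"
                else if (m in moved) status = "moved"
                else status = "fresh"
                print m, port_of[m], age, status
            }
        }' | sort
}
```

A typical call would be `fdb_freshness "$(date +%s)" 300 3 < snapshots.txt`. The `moved` status deserves a second look because it marks the endpoints whose positioning changed inside the observation window.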
How Netdata helps
- Correlate FDB entry count with STP topology changes. Collect `dot1dStpTopChanges` and FDB entry count via SNMP in the same dashboard. A spike in topology changes followed by FDB churn is the signature of forced MAC relearning.
- Alert on FDB capacity thresholds. Set alarms on FDB entry count approaching platform limits to catch table exhaustion before it causes eviction and flooding.
- Track ARP cache size alongside FDB size. Divergence between ARP entry count and FDB entry count is the core symptom of aging timer asymmetry. Collect both via SNMP and alert when the gap grows.
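A spike in `dot1dStpTopChanges` is a change between samples, not an absolute value, and the object is a Counter32 that wraps at 2^32. If you compute the delta yourself, for example to line it up against FDB churn, the sketch below shows wrap-safe subtraction. `counter32_delta` is a hypothetical helper.

```shell
# counter32_delta PREV CURR
# Hypothetical helper: increase of a Counter32 such as dot1dStpTopChanges
# between two samples, assuming at most one wrap at 2^32 in between.
# A device reboot also resets the counter; check sysUpTime before trusting
# a large wrapped delta.
counter32_delta() {
    awk -v prev="$1" -v curr="$2" 'BEGIN {
        d = curr - prev
        if (d < 0) d += 4294967296   # 2^32
        printf "%.0f\n", d           # %.0f avoids %d overflow in some awks
    }'
}
```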







