SNMP timeouts and retries: why devices show as down when they aren’t

SNMP runs over UDP port 161, a transport with no delivery guarantee. When your monitoring platform reports that devices are down, the first question is not “why is the network broken” but “is this actually a network problem, or is my polling stack the problem.” SNMP timeout and retry behavior is one of the most common causes of false-positive “device down” alerts, and it is also one of the most misdiagnosed.

The classic symptom: a dashboard lights up with dozens of devices showing as unreachable, but when you SSH into them or ping their management IPs, they respond fine. The SNMP agent is alive, the network path is intact, and the device is forwarding traffic. What failed was the SNMP poll, not the device.

This article covers how to distinguish a real outage from an SNMP timeout cascade, how to identify whether the bottleneck is on the collector side or the device side, and how to tune timeout and retry parameters so they stop producing false alarms without masking real failures.

What this means

When an SNMP poll times out, the collector marks the device as unreachable for that poll cycle. If retries are configured, the collector sends additional requests before declaring the device down. Each retry consumes a worker thread for the duration of the timeout window. The cumulative effect across many devices can starve the poller pool, causing other devices to time out in turn. This is the SNMP polling storm cascade: a feedback loop where retries amplify load on both the collector and the monitored devices.

Because SNMP over UDP provides no transport-layer acknowledgment or retransmission, a dropped, congested, or delayed request packet is indistinguishable from a genuinely unreachable device at the application layer. The retry mechanism is the only recovery path, and its tuning determines whether your monitoring produces signal or noise.

A single missed poll is normal under transient packet loss. A common practice is to wait for three consecutive failed polls at the configured interval before declaring a device down. Sustained SNMP failure with healthy ICMP and syslog usually points to an agent-specific problem rather than a device outage, and should be demoted in severity.

flowchart TD
    A[SNMP poll starts] --> B{Response within timeout?}
    B -- Yes --> C[Device marked UP]
    B -- No --> D[Timeout fires]
    D --> E{Retries remaining?}
    E -- Yes --> F[Send retry, hold worker]
    F --> B
    E -- No --> G[Device marked DOWN]
    G --> H[Worker released late]
    H --> I[Other polls delayed]
    I --> J[More timeouts fleet-wide]
    J --> K[False-positive cascade]
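
The consecutive-failure rule described above can be sketched as a small shell helper. This is a minimal illustration, not any particular NMS's logic: the function names `trailing_failures` and `poll_verdict`, the `ok`/`fail` history format, and the intermediate `suspect` state are all assumptions made for the example.

```shell
# Hypothetical helpers: poll history is passed oldest-first as "ok"/"fail" words.
DOWN_THRESHOLD=3   # three consecutive failed polls, per the rule above

# Count the run of failed polls at the end of the history.
trailing_failures() (
  count=0
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      count=$((count + 1))
    else
      count=0   # any successful poll resets the run
    fi
  done
  echo "$count"
)

# Map the trailing failure run to a status.
poll_verdict() (
  fails=$(trailing_failures "$@")
  if [ "$fails" -ge "$DOWN_THRESHOLD" ]; then
    echo "down"
  elif [ "$fails" -gt 0 ]; then
    echo "suspect"   # missed polls, but not enough to declare an outage
  else
    echo "up"
  fi
)
```

For example, `poll_verdict ok fail fail` stays at `suspect`, and only a third consecutive `fail` moves the device to `down`.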

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Collector overload | Many devices time out simultaneously; ICMP to those devices succeeds | Collector CPU, poll cycle duration vs. configured interval |
| Device control-plane CPU saturation | One device times out intermittently; device CPU is elevated during poll windows | Device CPU via SNMP or CLI |
| Aggressive bulk MIB walks | A specific OID walk consistently exceeds timeout | Per-device SNMP latency; which OID was being polled |
| Management network congestion | ICMP RTT to device is elevated during polling windows | ICMP latency baseline vs. SNMP poll latency |
| Incorrect community string or credentials | Device never responds; poll always reaches full timeout | SNMP authentication failure counters |
| SNMPv3 auth/priv overhead | SNMPv3 polls are slower than SNMPv2c on the same device | Compare v2c vs. v3 latency on same OID |
| Timeout too low for WAN or SD-WAN paths | Remote branch devices time out intermittently; local devices are fine | ICMP RTT to the remote device |

Quick checks

# Check if the device responds to SNMP at all, with a generous timeout and no retries
time snmpget -v2c -c <community> -t 5 -r 0 <device> .1.3.6.1.2.1.1.3.0

# Compare SNMP reachability against ICMP reachability
ping -c 5 -i 0.2 <device>

# Check device control-plane CPU (Cisco cpmCPUTotal5secRev; index .1 is the first CPU entry)
snmpget -v2c -c <community> <device> .1.3.6.1.4.1.9.9.109.1.1.1.1.7.1
# If the index is unknown, walk the table instead:
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.9.9.109.1.1.1.1.7

# Check SNMP authentication failure counter (snmpInBadCommunityNames)
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.11.4.0

# Measure SNMP poll latency variance over 10 polls (requires /usr/bin/time, not the shell builtin).
# -r 0 disables retries so each sample times exactly one request.
for i in $(seq 1 10); do
  /usr/bin/time -f "%e" snmpget -v2c -c <community> -t 5 -r 0 <device> .1.3.6.1.2.1.1.3.0 2>&1 | tail -1
done

# Check collector CPU per core (look for single-core saturation, e.g. from receive-side scaling (RSS) misconfiguration)
mpstat -P ALL 1 5

# Check whether the Netdata SNMP collector is behind on its poll cycle
curl -s http://localhost:19999/api/v1/allmetrics | grep -iE 'snmp.*timeout|poll.*duration|collector.*duration'

How to diagnose it

  1. Corroborate with ICMP. Ping the devices showing as down. If ICMP succeeds but SNMP does not, the device is alive and the problem is SNMP-specific: agent overload, ACL blocking port 161, credential mismatch, or a collector-side issue. If both fail, the problem is the network path or the device itself.

  2. Determine the blast radius. Check whether the timeouts affect one device, a group of devices sharing the same network path, or the entire fleet. One device points to a device-side issue. A group sharing a path points to management-network congestion. The entire fleet points to the collector.

  3. Check the poll cycle duration. If the collector takes longer to complete a full poll cycle than the configured interval, polls are drifting. Data goes stale before it is collected. This is the most under-monitored meta-signal in network monitoring: when the poller falls behind, all other signals become unreliable.

  4. Identify the slowest device. Look at per-device SNMP latency. A single device with a 10-second bulk walk can stall a worker thread and cascade delays across the schedule. Large table walks such as cdpCacheTable or dot1dTpFdbTable on a large switch routinely exceed default timeouts.

  5. Check device control-plane CPU. SNMP processing consumes device CPU. If CPU is above 70 percent during poll windows, the SNMP agent may be starved of cycles. Walking a large FDB table on a data center switch with tens of thousands of MAC entries can spike device CPU for seconds and degrade other control-plane functions like BGP hold-time processing.

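  6. Check for authentication failures. A wrong community string causes an SNMPv1/v2c agent to silently drop the request, so the poll waits the full timeout before failing. SNMPv3 agents typically answer a bad user name or key with a Report PDU instead, so those failures tend to return quickly rather than time out. Check the authentication failure counter at .1.3.6.1.2.1.11.4.0 (snmpInBadCommunityNames) for SNMPv2c or the USM stats OIDs for SNMPv3.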

  7. Check ICMP RTT to the device. If RTT to the management IP is elevated, the management path is congested. SNMP requests and replies traverse that same path. SNMPv3 with auth and privacy adds processing overhead compared to SNMPv2c, compounding the delay on high-latency paths.
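
The first diagnostic step, corroborating SNMP against ICMP, can be scripted roughly as follows. This is a sketch under stated assumptions: it assumes Linux iputils `ping` (where `-W` is a timeout in seconds) and the Net-SNMP `snmpget` CLI, and the function names and verdict labels are hypothetical. The `icmp-filtered` case is not discussed in the steps above; it is included only so the function covers all four combinations.

```shell
# Hypothetical triage helper. Inputs are 1 (probe succeeded) or 0 (probe failed).
triage_verdict() (
  icmp_ok=$1
  snmp_ok=$2
  if [ "$icmp_ok" -eq 1 ] && [ "$snmp_ok" -eq 1 ]; then
    echo "healthy"
  elif [ "$icmp_ok" -eq 1 ]; then
    # Device is alive: suspect agent overload, ACL on UDP 161, credentials, or the collector.
    echo "snmp-specific"
  elif [ "$snmp_ok" -eq 1 ]; then
    # Assumption: SNMP answers but ping does not, so ICMP is probably filtered.
    echo "icmp-filtered"
  else
    # Both probes failed: suspect the network path or the device itself.
    echo "path-or-device"
  fi
)

# Run both probes against one device and print the verdict.
triage_device() (
  device=$1
  community=$2
  icmp_ok=0
  snmp_ok=0
  ping -c 3 -W 2 "$device" >/dev/null 2>&1 && icmp_ok=1
  snmpget -v2c -c "$community" -t 5 -r 0 "$device" .1.3.6.1.2.1.1.3.0 >/dev/null 2>&1 && snmp_ok=1
  triage_verdict "$icmp_ok" "$snmp_ok"
)
```

Running `triage_device <device> <community>` over the list of devices that the dashboard shows as down also gives a quick view of the blast radius from step 2.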

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| SNMP timeout rate | Direct measure of polling reliability | Above 5 percent sustained is early warning; above 30 percent means data is unreliable for alerting |
| SNMP poll response latency | Indicates agent or path degradation | Latency above 1 second on a normally fast device |
| Poll cycle duration vs. interval | Detects schedule drift before it causes false downs | Cycle duration exceeding configured interval |
| Device control-plane CPU | SNMP starvation under CPU pressure | Above 70 percent sustained; above 90 percent warrants paging |
| ICMP reachability and RTT | Distinguishes network path issues from SNMP issues | SNMP down with ICMP up equals agent or collector problem |
| Collector CPU per core | Single-core saturation from RSS or thread contention | One core at 100 percent while others are idle |
| SNMP authentication failures | Credential mismatch or scanning | Any nonzero in production |
| Poller worker queue depth | Growing queue means collector cannot keep up | Queue depth above baseline or growing without bound |

Fixes

Collector-side: scheduler and worker pool

If many devices time out simultaneously and ICMP to those devices succeeds, the collector is the bottleneck. Reduce poller concurrency so the collector does not oversubscribe its own CPU or the management network. A smaller concurrency limit with a well-tuned timeout produces more reliable results than a large pool with aggressive timeouts.

Check the poll cycle duration against the configured interval. If the cycle takes longer than the interval, the collector is already behind. Either reduce the number of devices per collector, increase the poll interval, or distribute devices across additional collectors.
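
A rough way to sanity-check whether a cycle can fit inside the interval is to estimate it from the device count, the worker count, and the time spent per device. The sketch below uses a deliberately simple model that is an assumption of this example, not a description of any specific collector: every worker polls devices one after another, and every device takes the same time. The function names are hypothetical.

```shell
# Rough model (assumption): cycle time = ceil(devices / workers) * per-device time.
estimate_cycle_ms() (
  devices=$1
  workers=$2
  per_device_ms=$3
  rounds=$(( (devices + workers - 1) / workers ))   # integer ceiling division
  echo $(( rounds * per_device_ms ))
)

# Succeeds (exit 0) when the estimated cycle fits within the poll interval.
cycle_fits() (
  cycle_ms=$(estimate_cycle_ms "$1" "$2" "$3")
  interval_ms=$4
  [ "$cycle_ms" -le "$interval_ms" ]
)
```

With 500 devices, 10 workers, and 200 ms per device the estimate is 10 seconds, which fits a 60-second interval. If those same devices each burn a 6-second worst-case timeout, the estimate grows to 300 seconds and the collector falls behind.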

Timeout and retry tuning

The Net-SNMP CLI tools default to a 1-second timeout and 5 retries, which means up to six requests and roughly 6 seconds before a poll is reported as failed. For LAN-attached devices with fast agents, these defaults are often adequate. For WAN, SD-WAN, or cloud-managed devices, they are frequently too aggressive.

The key principle: set the timeout to accommodate the worst-case legitimate response time, then set retries conservatively. A single timeout should be long enough to absorb network RTT plus agent processing time plus a safety margin. If ICMP RTT to the device is 50ms, a 1-second timeout is generous. If the device is across an SD-WAN overlay with 200ms RTT and the agent takes 500ms to respond to a bulk walk, the baseline response time is already about 700ms, so a 1-second timeout leaves only about 300ms of headroom and will fail intermittently whenever jitter, queuing, or a busy agent adds more than that.
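
The sizing rule above (RTT plus agent processing time plus a safety margin) can be written as a one-line calculation. In this sketch the function name `suggest_timeout_ms` is hypothetical, and expressing the margin as a percentage of the baseline is an assumption; the text does not prescribe a specific margin.

```shell
# Hypothetical helper: timeout = (RTT + agent time) plus a percentage safety margin.
# All values are integer milliseconds; margin_pct is an assumed tuning knob.
suggest_timeout_ms() (
  rtt_ms=$1
  agent_ms=$2
  margin_pct=$3
  base_ms=$(( rtt_ms + agent_ms ))
  echo $(( base_ms + base_ms * margin_pct / 100 ))
)
```

For the SD-WAN example, `suggest_timeout_ms 200 500 50` gives 1050 ms, which is already above a 1-second timeout.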

A practical rule from operator consensus: ensure the cumulative worst-case timeout, which is the per-request timeout multiplied by (retries + 1), is shorter than the poll interval. This prevents a single stalled device from overlapping into the next scheduled poll.
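
As a quick check of this rule, the sketch below computes the worst case for a fixed per-request timeout (no backoff) and compares it with the poll interval. The function names are hypothetical and all values are integer milliseconds.

```shell
# Worst case with a fixed timeout: one initial request plus each retry.
worst_case_ms() (
  timeout_ms=$1
  retries=$2
  echo $(( timeout_ms * (retries + 1) ))
)

# Succeeds (exit 0) when the worst case is strictly shorter than the poll interval.
fits_poll_interval() (
  timeout_ms=$1
  retries=$2
  interval_ms=$3
  [ "$(worst_case_ms "$timeout_ms" "$retries")" -lt "$interval_ms" ]
)
```

A 1-second timeout with 5 retries has a 6-second worst case and fits a 60-second interval. A 5-second timeout with 5 retries reaches 30 seconds and does not fit a 30-second interval.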

Do not increase both timeout and retries simultaneously. Some NMS implementations use exponential or doubling backoff per retry. With doubling, an initial timeout of 8 seconds and 5 retries waits 8 + 16 + 32 + 64 + 128 + 256 = 504 seconds, more than 8 minutes for a single device, which ties up a worker thread and accelerates schedule drift.
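
To see how quickly doubling backoff grows, the following sketch sums the wait for the initial request and each retry. It assumes a pure doubling schedule; real implementations may cap the wait or use a different multiplier, and the function name is hypothetical.

```shell
# Cumulative wait in seconds when the timeout doubles after every attempt.
doubling_worst_case_s() (
  wait_s=$1
  retries=$2
  total_s=0
  attempt=0
  # One initial request plus "retries" retries.
  while [ "$attempt" -le "$retries" ]; do
    total_s=$(( total_s + wait_s ))
    wait_s=$(( wait_s * 2 ))
    attempt=$(( attempt + 1 ))
  done
  echo "$total_s"
)
```

`doubling_worst_case_s 8 5` prints 504, while the same 5 retries starting from a 1-second timeout total 63 seconds.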

Slow device vs. packet loss: opposite tuning

The tuning direction depends on the failure pattern:

  • Scattered individual timeouts suggest slow device response. Increase the timeout by roughly 50 percent and reduce retries by one. This gives each request more time to succeed without compounding the cumulative delay.
  • Entire subnet or path timing out suggests packet loss on the management network. Increase retries by one and reduce the timeout by roughly 50 percent. Additional retry packets improve the probability of delivery over a lossy path. A shorter timeout means each retry fires sooner.
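
The two opposite adjustments can be captured in a small helper. This is a sketch of the rule of thumb above, not a tuning algorithm from any product: the function name and the `scattered` and `path` labels are hypothetical, values are integer milliseconds, and retries are never reduced below zero.

```shell
# Print the adjusted "timeout_ms retries" pair for a given failure pattern.
tune_for_pattern() (
  pattern=$1
  timeout_ms=$2
  retries=$3
  case "$pattern" in
    scattered)
      # Slow device: about 50 percent more time per request, one retry fewer.
      timeout_ms=$(( timeout_ms * 3 / 2 ))
      if [ "$retries" -gt 0 ]; then
        retries=$(( retries - 1 ))
      fi
      ;;
    path)
      # Lossy path: about 50 percent less time per request, one retry more.
      timeout_ms=$(( timeout_ms / 2 ))
      retries=$(( retries + 1 ))
      ;;
    *)
      echo "unknown pattern: $pattern" >&2
      exit 1
      ;;
  esac
  echo "$timeout_ms $retries"
)
```

Starting from a 1000 ms timeout and 5 retries, `tune_for_pattern scattered 1000 5` prints `1500 4` and `tune_for_pattern path 1000 5` prints `500 6`.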

Device-side: control-plane CPU

If the device times out only during specific operations such as large bulk walks or FDB table queries, exclude those OID walks from the default polling schedule. Walk them on a separate, longer-interval schedule, or use targeted polling for specific OIDs instead of full table walks.

Check whether the device has a control-plane policer (CoPP) that is rate-limiting SNMP. On some platforms, SNMP is classified as a low-priority control-plane service and is dropped first under CPU pressure.

Credential and ACL issues

If a device never responds to SNMP but ICMP succeeds, verify the community string or SNMPv3 credentials. An incorrect community string causes the device to silently drop the request, and the poll waits the full timeout before failing. Check the SNMP authentication failure counter on the device. Also verify that the collector’s source IP is permitted by the device’s SNMP ACL on UDP port 161.

Prevention

  • Monitor the SNMP timeout rate as a first-class signal. A timeout rate above 5 percent sustained is the early-warning threshold. Above 30 percent, polling data is unreliable for alerting and capacity decisions.
  • Monitor poll cycle duration against the configured interval. If the cycle duration exceeds the interval, alerts based on stale data will produce false positives.
  • Baseline SNMP latency per device. A device that was responding in 20ms and is now responding in 800ms is degrading. Catching this before it crosses the timeout threshold prevents a false-down alert.
  • Separate bulk walks from standard polling. Large table walks (FDB, CDP cache, ARP) should run on their own schedule with their own timeout, not in the same worker pool as device-liveness polls.
  • Size the worker pool to the collector’s capacity, not to the device count. A collector with 500 devices and a small fixed worker pool will fall behind if any device stalls.
  • Correlate SNMP status with ICMP status in alerting. When ICMP to the same device is healthy, do not page on “SNMP device down.” SNMP failure with healthy ICMP is an agent or collector problem, not a device outage, and should be opened as a lower-severity ticket rather than a page.
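
The timeout-rate thresholds from the first bullet (5 percent and 30 percent) can be checked with integer arithmetic over raw counters. In this sketch the function name and the status labels are hypothetical, and the inputs are assumed to be timeout and total-poll counts over a sustained window that you choose yourself.

```shell
# Classify the timeout rate over a window of polls.
timeout_rate_status() (
  timeouts=$1
  polls=$2
  if [ "$polls" -eq 0 ]; then
    echo "no-data"
  elif [ $(( timeouts * 100 )) -gt $(( polls * 30 )) ]; then
    echo "unreliable"   # above 30 percent: do not trust polling data for alerting
  elif [ $(( timeouts * 100 )) -gt $(( polls * 5 )) ]; then
    echo "warning"      # above 5 percent: early warning
  else
    echo "ok"
  fi
)
```

For example, 6 timeouts out of 100 polls is a `warning`, and 31 out of 100 is `unreliable`.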

How Netdata helps

  • Netdata’s SNMP collector exposes per-device poll latency and timeout metrics, letting you identify the slow device before its delays cascade to the rest of the schedule.
  • Per-core CPU metrics on the collector host catch single-core saturation from RSS misconfiguration or thread contention before it causes a polling cascade.
  • Correlating SNMP agent reachability with ICMP reachability and syslog activity in a single dashboard makes the distinction between “agent down,” “device down,” and “network down” immediate.
  • Netdata’s anomaly detection on SNMP response latency surfaces gradual degradation before it crosses the timeout threshold and triggers a false alert.
  • Built-in alerting can require multiple corroborating signals before paging, reducing false-positive device-down alerts without missing real outages.