Network monitoring checklist: the signals every production network needs

This checklist covers the signals production networks need, organized by detection priority and mapped to maturity levels from survival to expert.

A network performance monitoring (NPM) stack is a federation of collectors, parsers, enrichment services, storage tiers, and an analytics core. Most production incidents are not “the network broke” but “a collector’s UDP buffer dropped packets,” “the NetFlow v9 template cache went stale after a device reboot,” or “the polling worker pool fell behind and now a healthy device looks down.” The checklist is organized to surface those failure modes, not just the top-level symptoms.

The federation at a glance

When a signal is missing, stale, or wrong, the fault is usually one or two subsystems upstream of the dashboard. The typical NPM stack includes:

  • Time synchronization substrate (NTP/PTP). Every cross-collector correlation depends on accurate, monotonic time across collectors, polled devices, and API endpoints.
  • Polling transport. ICMP, UDP/161 (SNMP), TCP/22 (SSH/CLI scrape), and HTTPS (vendor APIs) reaching each managed endpoint.
  • SNMP polling engine. A scheduler fanning OID requests across devices with timeouts, counter tables, and a device state machine (UP / STALE / UNKNOWN / DOWN).
  • Flow collection subsystem. NetFlow v5/v9 and IPFIX collectors with template caches, sampling-rate awareness, and flow record storage. sFlow is sample-datagram oriented, not template-flow, and has a different failure profile.
  • Topology inference engine. Fuses CDP/LLDP neighbor tables, FDB entries, ARP tables, STP state, and routing tables to derive Layer-2 and Layer-3 topology.
  • BGP monitoring subsystem. Active or passive sessions tracking FSM state, prefix announcements, AS-path changes, and RPKI validity.
  • Syslog and trap ingestion. UDP/TCP/TLS listeners with parser backpressure, facility/severity handling, and deduplication.
  • Vendor API integration layer. Pull-mode clients for SD-WAN controllers, cloud platforms, and modern firewalls, each with their own auth, rate limit, and pagination semantics.
  • Storage tiers. Counter TSDB (downsampled for long retention), full-resolution flow store, topology graph DB, raw syslog store, and event/alert log.

Signal domains by detection priority

The domains below are ordered by detection priority: those that surface real issues earliest, with the best signal-to-noise ratio, come first. Within each domain, the most operationally critical signals are listed first.

Availability

| Signal | Source | Why it matters |
| --- | --- | --- |
| SNMP agent reachability (sysUpTime) | SNMP GET .1.3.6.1.2.1.1.3.0 | No response means agent down, partition, ACL block, or credential issue. Value decrease means reboot. SNMP down with healthy ICMP means agent problem, not device outage. |
| ICMP reachability | ping, fping | Liveness independent of SNMP. ICMP down plus SNMP down equals network problem. ICMP down plus SNMP up equals ICMP rate-limited or blocked (common on firewalls, CoPP). |
| Vendor API reachability and validity | HTTPS to vendor endpoint | For SD-WAN/cloud, the API may be the only telemetry source. HTTP 200 with empty or error payload (PAN-OS <response status="error"> inside HTTP 200) is a silent failure. |
| Flow UDP packet receipt rate | /proc/net/udp, collector stats, nstat | Drop to 0 from one exporter means exporter stopped or partitioned. Drop from all exporters means collector-side failure. |
| Syslog receipt rate and severity | UDP/TCP/TLS port 514 listener | Rate spike with severity escalation means device event. Spike without escalation means noise storm. Silence from a normally-chatty device means isolation or logging failure. |
| SNMP trap rate and type | UDP port 162 listener | linkDown/linkUp pairs mean flap. coldStart means reboot. Silence from a noisy device means trap path broken. |
| BGP session state (FSM) | BGP4-MIB .1.3.6.1.2.1.15.3.1.2, CLI, BMP | Established means exchanging routes. Established with no UPDATE traffic (stale session) is a worse failure than Idle. |
| Interface operational status | IF-MIB .1.3.6.1.2.1.2.2.1.8 (ifOperStatus) | Admin up plus oper down means physical or link-layer failure. Flapping means link instability. |
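
The ICMP and SNMP rows above amount to a small decision table. A minimal Python sketch of it follows; the `Verdict` enum and `classify_reachability` function are hypothetical names introduced for illustration, and the probe results are assumed to come from whatever poller is already in place.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "icmp and snmp both answer"
    AGENT_PROBLEM = "snmp agent problem (agent down, ACL, or credentials)"
    ICMP_FILTERED = "icmp rate-limited or blocked"
    NETWORK_PROBLEM = "network problem or device outage"


def classify_reachability(icmp_ok: bool, snmp_ok: bool) -> Verdict:
    """Combine two independent liveness probes into one verdict."""
    if icmp_ok and snmp_ok:
        return Verdict.HEALTHY
    if icmp_ok and not snmp_ok:
        # Device answers ping, so the path is fine; suspect the agent.
        return Verdict.AGENT_PROBLEM
    if snmp_ok and not icmp_ok:
        # SNMP works, so the device is reachable; ICMP is being filtered.
        return Verdict.ICMP_FILTERED
    return Verdict.NETWORK_PROBLEM


if __name__ == "__main__":
    for icmp_ok in (True, False):
        for snmp_ok in (True, False):
            verdict = classify_reachability(icmp_ok, snmp_ok)
            print(f"icmp={icmp_ok} snmp={snmp_ok} -> {verdict.value}")
```

Keeping the two probes separate until this final step is the point: collapsing them into a single "device up" boolean loses the distinction between an agent fault and an outage.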

Errors

| Signal | Source | Why it matters |
| --- | --- | --- |
| Interface errors (ifInErrors, ifOutErrors) | IF-MIB .1.3.6.1.2.1.2.2.1.14, .20 | Incrementing counters mean cable/fiber degradation, SFP failure, duplex mismatch, or EMI. Rate of change matters more than absolute value. |
| Interface discards (ifInDiscards, ifOutDiscards) | IF-MIB .1.3.6.1.2.1.2.2.1.13, .19 | Queue or buffer overflow, or ACL drops. Often the leading indicator of congestion before utilization shows 100%. |
| UDP socket buffer drops (Udp_RcvbufErrors) | /proc/net/snmp (Udp: RcvbufErrors), nstat -az UdpRcvbufErrors | The number one silent killer for flow, trap, and syslog collectors. Datagrams arrive at the kernel but the application was too slow to drain. Any nonzero value means lost telemetry. |
| SNMP timeout and retry rate | Collector stats, time snmpget | Rising across many devices means collector-side issue. Rising on one device means device-side agent or CPU issue. |
| BGP NOTIFICATION and Cease messages | bgpBackwardTransition trap, CLI, syslog | Cease/1 is maximum prefixes reached. Cease/2 is administrative shutdown. Hold Timer Expired (NOTIFICATION code 4) often indicates control-plane CPU saturation or keepalive loss on the path. |
| License and feature validity | Vendor MIBs, PAN-OS API, Meraki API, Cato GraphQL | Feature silently disabled at midnight. Users complain at 09:00. The most common root cause of “the firewall stopped doing what we paid for.” |
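
As a rough illustration of the UDP socket buffer drop signal, the sketch below parses the `Udp:` header and value lines of `/proc/net/snmp` into a dictionary. The function name and the sample text are assumptions made for the example; on a real collector the text would be read from `/proc/net/snmp` on each poll and the change in `RcvbufErrors` between polls would be alerted on.

```python
# Illustrative sample in the layout of /proc/net/snmp (values are made up).
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 918273 12 42 5511 42 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
"""


def parse_proto_counters(text: str, proto: str = "Udp") -> dict:
    """Return {counter_name: value} for one protocol block.

    Each protocol appears as two lines with the same prefix:
    a header line of names followed by a line of values.
    """
    prefix = proto + ":"  # "Udp:" does not match "UdpLite:"
    rows = [line.split()[1:] for line in text.splitlines()
            if line.startswith(prefix)]
    if len(rows) < 2:
        raise ValueError(f"no {proto} block found")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


if __name__ == "__main__":
    counters = parse_proto_counters(SAMPLE)
    print("RcvbufErrors:", counters["RcvbufErrors"])
```

The counter is cumulative since boot, so the useful alert is on the difference between two successive readings, not on the absolute value.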

Saturation

| Signal | Source | Why it matters |
| --- | --- | --- |
| Interface utilization (% of ifHighSpeed) | IF-MIB ifHCInOctets .1.3.6.1.2.1.31.1.1.1.6, ifHCOutOctets .10, ifHighSpeed .15 | 95% sustained for over 5 min on critical interface means congestion with drops and latency. Use 64-bit HC counters. 32-bit ifInOctets wraps in approximately 3.4 seconds at 10G line rate. |
| NIC RX/TX drops on collector | /proc/net/dev, ethtool -S | Ring buffer overflow before packets reach the socket layer. rx_missed_errors is the most actionable counter. |
| Collector CPU (per-core, %soft) | mpstat -P ALL, /proc/softirqs | High %soft on one core means RSS funneling all packet processing to one CPU. Total CPU may look fine while one core is pinned. |
| Collector disk and TSDB write queue | df, iostat, collector metrics | Cardinality inflation (new subnet, NAT pool, scanner traffic) can fill disk in hours. Write queue growing means TSDB cannot keep up with ingestion. |
| Device control-plane CPU | Cisco .1.3.6.1.4.1.9.9.109.1.1.1.1.7, Juniper .1.3.6.1.4.1.2636.3.1.13.1.8 | Sustained over 90% means SNMP starvation, BGP hold-time expiry, and session drops. |
| Device memory utilization | Cisco .1.3.6.1.4.1.9.9.48.1.1.1.5, HOST-RESOURCES-MIB | Free memory approaching 0 means OOM imminent. Rate of increase over 1%/min means memory leak. |
| BGP RIB and FIB size | BGP4-MIB prefix counts, CLI | Sudden change over 20% in 5 min means route leak or mass withdrawal. Full IPv4 DFZ in 2026 is approximately 940k prefixes. |
| NAT and session table utilization | PAN-OS API, vendor CLI | Approaching limit means new connections denied. Sustained growth means traffic outpacing NAT capacity. |
| API rate-limit remaining | HTTP headers (Retry-After, X-RateLimit-Remaining) | Remaining quota approaching 0 means imminent throttling (HTTP 429) and telemetry gaps. Example limit: Meraki allows 10 requests/sec per organization. |
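
The utilization row rests on two pieces of arithmetic: the time a counter takes to wrap at line rate, and the conversion of an octet delta into a percentage of ifHighSpeed (which is reported in units of 1,000,000 bits per second). The Python sketch below works through both; the function names are illustrative, and the modular differencing assumes at most one wrap between polls.

```python
def wrap_seconds(counter_bits: int, rate_bps: float) -> float:
    """Seconds for an octet counter to wrap at a sustained bit rate."""
    return (2 ** counter_bits) * 8 / rate_bps


def counter_delta(prev: int, curr: int, counter_bits: int = 64) -> int:
    """Difference between two samples, tolerating a single wrap."""
    return (curr - prev) % (2 ** counter_bits)


def utilization_pct(prev_octets: int, curr_octets: int, interval_s: float,
                    if_high_speed_mbps: int, counter_bits: int = 64) -> float:
    """Percent of link capacity used over one polling interval."""
    bits_moved = counter_delta(prev_octets, curr_octets, counter_bits) * 8
    capacity_bits = interval_s * if_high_speed_mbps * 1_000_000
    return 100.0 * bits_moved / capacity_bits


if __name__ == "__main__":
    # 32-bit octet counter on a saturated 10G link
    print(f"32-bit wrap at 10G: {wrap_seconds(32, 10e9):.2f} s")
    # 10G interface (ifHighSpeed = 10000), 60 s poll, 37.5 GB transferred
    print(f"utilization: {utilization_pct(0, 37_500_000_000, 60, 10_000):.1f} %")
```

Modular differencing cannot recover from several wraps inside one polling interval, which is why a 32-bit counter polled every 60 seconds on a 10G link is unusable no matter how the delta is computed.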

Internal state, replication, and correctness

| Signal | Source | Why it matters |
| --- | --- | --- |
| Device uptime (sysUpTime) | SNMP .1.3.6.1.2.1.1.3.0 | Decrease means reboot. 32-bit wrap at approximately 497 days looks like reboot; track wraps separately. |
| Temperature, fan, power supply | ENTITY-SENSOR-MIB .1.3.6.1.2.1.99.1.1.1.4 | Thermal failure, cooling failure, or redundancy lost. Use vendor-defined thresholds, not arbitrary absolute numbers. |
| Interface counter discontinuity | ifCounterDiscontinuityTime .1.3.6.1.2.1.31.1.1.1.19 | Counter reset without sysUpTime reset means SNMP agent inconsistency or counter-source bug. |
| Cross-collector time skew | ntpq -p, chronyc tracking | Over 100ms skew degrades cross-site flow correlation. Over 1s breaks it entirely. |
| NTP offset on monitored devices | hrSystemDate .1.3.6.1.2.1.25.1.2.0 | Device clock drift causes postmortem correlation failure. Consistently the most under-monitored NTP signal. |
| Topology view consistency | CDP/LLDP vs FDB vs ARP cross-validation | Inconsistency means stale data, topology change in progress, or device bug. Three sources agreeing is high confidence; one source alone is low. |
| Flow sampling rate consistency | sFlow MIB, NetFlow v9 template fields | Mismatch means analytics wrong by orders of magnitude. Without sampling-rate correction, sFlow at 1:1000 reports 1/1000 of true traffic. |
| STP root bridge and TCN | BRIDGE-MIB .1.3.6.1.2.1.17.2 | Root bridge change means reconvergence. TCN rate over 5/min means instability. |
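
One possible way to track sysUpTime wraps separately from reboots is to compare the new value against where the counter should be after the elapsed polling interval. sysUpTime counts hundredths of a second in a 32-bit value, which is where the 497-day figure comes from. The function name and the 30-second tolerance below are assumptions for the sketch, not a standard.

```python
TICKS_PER_SECOND = 100      # sysUpTime is in hundredths of a second
TICKS_MODULUS = 2 ** 32     # 32-bit TimeTicks


def wrap_period_days() -> float:
    """Days until a 32-bit TimeTicks counter wraps."""
    return TICKS_MODULUS / TICKS_PER_SECOND / 86_400


def classify_uptime_drop(prev_ticks: int, curr_ticks: int,
                         elapsed_s: float, tolerance_s: float = 30.0) -> str:
    """Return 'running', 'wrap', or 'reboot' for two sysUpTime samples."""
    if curr_ticks >= prev_ticks:
        return "running"
    # Where the counter would be if the device simply kept running.
    expected = (prev_ticks + round(elapsed_s * TICKS_PER_SECOND)) % TICKS_MODULUS
    if abs(curr_ticks - expected) <= tolerance_s * TICKS_PER_SECOND:
        return "wrap"
    return "reboot"


if __name__ == "__main__":
    print(f"wrap period: {wrap_period_days():.1f} days")
    print(classify_uptime_drop(2 ** 32 - 3000, 3010, elapsed_s=60))
    print(classify_uptime_drop(5_000_000, 6000, elapsed_s=60))
```

A device that reboots within the tolerance window of its own wrap would be misclassified, so a coldStart trap or a change in ifCounterDiscontinuityTime is still worth checking as a second opinion.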

Latency, throughput, and security

| Signal | Source | Why it matters |
| --- | --- | --- |
| SNMP poll response latency | time snmpget, collector stats | Over 1s on a normally-fast device means agent or management-network degradation. |
| ICMP round-trip time | ping, fping | p99 over 2x rolling baseline means congestion or path change. High jitter means unstable path. |
| Active path probes | Cisco IPSLA RTTMON MIB, TWAMP, HTTP GET | RTT and loss per path, independent of application. Loss over 1% sustained is degraded. |
| Flow bytes per conversation | NetFlow/sFlow/IPFIX records | Top talkers, DDoS patterns, data exfiltration signals. sFlow requires sampling-rate multiplication for accurate byte counts. |
| Poller poll cycle duration | Collector internal stats | Cycle exceeding configured interval means data is drifting stale. The most under-monitored meta-signal in NPM. |
| Flow exporter drop rate (device-side) | Cisco cnfESPktsDropped .1.3.6.1.4.1.9.9.387.1.4.6 | Device dropped flows that never reached collector. Invisible to collector alone. Compare device-exported rate against collector inbound rate for end-to-end loss detection. |
| Unauthorized SNMP access | snmpInBadCommunityNames .1.3.6.1.2.1.11.4, USM stats | Burst from single source means scanning. Persistent events from many sources means community string “public” still configured. |
| BGP RPKI/ROA invalid acceptance | Vendor CLI show bgp rpki, validators | Any RPKI-invalid route accepted in production is a security event. Verify with public validators before alerting; stale cache produces false invalids. |
| Config changes without ticket | Syslog CONFIG-I, AAA logs, config diff | Change outside maintenance window without change ticket means unauthorized or emergency. Change followed within 30 min by incident is a high-correlation root-cause candidate. |
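
The sampling-rate multiplication mentioned for flow bytes per conversation can be sketched in a few lines of Python. The record layout `(src, dst, sampled_bytes, sampling_rate)` is a simplification assumed for the example; real sFlow and NetFlow records carry many more fields, and the sampling rate may arrive out of band.

```python
from collections import Counter


def estimate_conversation_bytes(records):
    """Scale sampled byte counts by each record's sampling rate.

    records: iterable of (src, dst, sampled_bytes, sampling_rate),
    where sampling_rate is N for 1:N sampling (1 means unsampled).
    """
    totals = Counter()
    for src, dst, sampled_bytes, sampling_rate in records:
        totals[(src, dst)] += sampled_bytes * sampling_rate
    return totals


if __name__ == "__main__":
    records = [
        ("10.0.0.1", "10.0.0.2", 1_500, 1000),   # sFlow exporter at 1:1000
        ("10.0.0.3", "10.0.0.4", 900_000, 1),    # unsampled exporter
    ]
    for pair, est in estimate_conversation_bytes(records).most_common():
        print(pair, est)
```

In this made-up data the raw counts would rank the unsampled conversation first, while the scaled estimate puts the sampled one on top, which is the ranking error that skipped normalization produces in top-talker views.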

Monitoring maturity levels

These levels are sequential and cumulative. Each level includes everything below it.

flowchart TD
    L4["L4 Expert<br/>BMP, RPKI integrity, per-VRF,<br/>sampling-rate forensics"]
    L3["L3 Mature<br/>UDP drops, RSS, flow end-to-end loss,<br/>NTP on devices, topology confidence"]
    L2["L2 Operational<br/>CPU and memory, traps, licenses,<br/>topology discovery, API status"]
    L1["L1 Survival<br/>sysUpTime, ifOperStatus, BGP state,<br/>utilization, errors, syslog, trap port"]
    L4 --> L3
    L3 --> L2
    L2 --> L1

L1: survival

The absolute minimum to know if the network is alive and not on fire:

  • SNMP reachability (sysUpTime GET) for every critical-path device
  • Interface operational status (ifOperStatus) for critical interfaces
  • BGP FSM state for critical eBGP and iBGP peers
  • Interface utilization (ifHCInOctets / ifHCOutOctets vs ifHighSpeed) for top-10 interfaces
  • Interface error counters (ifInErrors, ifOutErrors) for critical interfaces
  • Syslog severity 0-3 (EMERG through ERR) forwarded from critical devices
  • Flow collector port listening (UDP 2055 for NetFlow, 6343 for sFlow, 4739 for IPFIX)
  • Trap receiver bound on UDP 162

A team at L1 catches hard outages. Nothing else.

L2: operational

Everything in L1, plus:

  • All interfaces for status, utilization, errors
  • All BGP peers for FSM state and prefix count
  • SNMP poll latency and timeout rate per device
  • Device control-plane CPU and memory
  • Flow records received per second; syslog source count
  • License days-to-expiry for all licensed features
  • Temperature, fan, power supply state
  • Topology discovery (CDP/LLDP)
  • STP root bridge identity and topology change count
  • Vendor API HTTP status for SD-WAN and cloud
  • SNMP authentication failure rate
  • coldStart/warmStart trap detection with alerting

A team at L2 has visibility into most failures. They still miss silent failures, license cliffs, and topology staleness.

L3: mature

Everything in L2, plus:

  • UDP socket buffer drops (Udp_RcvbufErrors) on flow, trap, and syslog collectors
  • NIC RX/TX drops and RSS IRQ distribution
  • Collector CPU (per-core, %soft) and disk space
  • TSDB write queue depth and series cardinality
  • Flow export-to-ingest latency and sampling rate consistency
  • Flow exporter drop rate (device-side) and inbound-vs-exported comparison
  • Interface counter discontinuity detection
  • NAT/session table utilization
  • Vendor API request latency, error rate, and rate-limit remaining
  • Active path probes (IPSLA/TWAMP/HTTP) on critical paths
  • Cross-collector time skew and NTP offset on monitored devices
  • Topology view consistency and inference confidence score
  • BGP route advertisement vs reception symmetry
  • Poller poll cycle duration vs configured interval
  • ARP cache entry count and staleness
  • RPKI/ROA validation state for all BGP sessions
  • BGP NOTIFICATION Cease subcode parsing (RFC 4486, RFC 8538, RFC 9384)
  • Configuration drift detection
  • Endpoint positioning orphan rate

A team at L3 catches most incidents in their early stages.

L4: expert

Everything in L3, plus the signals operators add after multiple major incidents:

  • BMP (RFC 7854) for Adj-RIB-In visibility (pre-policy and post-policy routes). BGP4-MIB bgp4PathAttrTable only reflects best-path routes; Adj-RIB-In entries are not accessible over SNMP.
  • BGP AS-path baseline deviation detection for own prefixes and upstreams
  • RPKI validator health monitoring; alert on “Unknown” rate changes (signals validator outage)
  • Sub-prefix hijack detection: alert when a more-specific appears without a less-specific in the RIB
  • Smart License and vendor license server reachability monitored continuously
  • Per-VRF and per-tenant isolation: BGP RIB size, flow volume, license utilization tracked per VRF
  • Per-priority-queue discard counters (vendor QoS MIBs) revealing QoS queue saturation behind moderate utilization
  • CoPP (control-plane policer) drop counters
  • NIC per-queue drop counters via ethtool -S (rx_missed_errors, rx_no_dma_resources)
  • /proc/net/softnet_stat for kernel packet processing backpressure
  • FDB/ARP entry freshness (time since last refresh, computed from polling deltas)
  • Flow template cache hit/miss ratio for NetFlow v9/IPFIX
  • License grace-period state with feature-specific counter validation (IPS drops at 0 when traffic flows after license expiry)
  • Asymmetric routing detection (forward vs reverse probe comparison)
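
For the `/proc/net/softnet_stat` item, a minimal Python sketch of reading the file's hexadecimal columns follows. It assumes the long-standing layout in which the first three columns are packets processed, packets dropped, and time_squeeze, with one row per CPU in order; newer kernels add further columns, so treat the row index as an approximation of the CPU number. The sample text is made up.

```python
# Illustrative sample in the layout of /proc/net/softnet_stat (hex values).
SAMPLE = """\
0000272d 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0001a2b3 0000001f 00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""


def parse_softnet_stat(text: str) -> list:
    """Return one dict per row with processed, dropped, time_squeeze."""
    stats = []
    rows = [line.split() for line in text.splitlines() if line.strip()]
    for row_index, fields in enumerate(rows):
        if len(fields) < 3:
            continue  # malformed row; skip rather than guess
        processed, dropped, time_squeeze = (int(f, 16) for f in fields[:3])
        stats.append({
            "row": row_index,
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return stats


def rows_with_drops(stats: list) -> list:
    """Row indexes whose backlog queue has dropped packets."""
    return [s["row"] for s in stats if s["dropped"] > 0]


if __name__ == "__main__":
    parsed = parse_softnet_stat(SAMPLE)
    print(rows_with_drops(parsed))
```

As with the socket-buffer counters, these values are cumulative, so the signal is the growth of `dropped` and `time_squeeze` between polls rather than the totals themselves.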

The signals most teams miss

These are the systematic blind spots that keep causing incidents:

  1. UDP socket buffer drops are not monitored. Udp_RcvbufErrors is the number one missed signal in flow collection. Charts show declining traffic during incidents that are actually traffic spikes. Production flow collectors need net.core.rmem_max raised to 16 MB or higher, tuned to actual ingress volume.

  2. License expiry is noticed only when it is too late. A licensed feature (IPS, VPN, threat prevention) silently disables at midnight. The device stays up. The syslog message is low severity and buried. Users notice at 09:00.

  3. NTP drift on monitored devices is not watched. Two devices whose clocks are 200ms apart log the same flap with timestamps that do not correlate. Once drift reaches seconds, postmortems cannot reconstruct the order of events at all.

  4. BGP “Established but stale” is not detected. The FSM reports Established but UPDATE exchange stopped. Graceful Restart keeps the FSM green while the session is gone. Track bgpPeerInUpdates rate and the timestamp of last received prefix.

  5. Trap receiver drops are invisible. During a trap flood, the highest-priority trap (root cause) is statistically the most likely to be dropped. There is no per-source drop counter.

  6. Vendor API silent failures are not detected. HTTP 200 with empty payload is treated as “no data” rather than “API is broken.” PAN-OS returns <response status="error"> inside HTTP 200.

  7. NetFlow v9/IPFIX template desync is invisible. After a device reboot or upgrade, templates arrive on a 5-30 minute interval. Until then, all data records are silently discarded.

  8. Sampling rate normalization is skipped. sFlow analytics report raw counts without scaling. Bandwidth charts are wrong by the sampling factor (often 1:1000 or worse).

  9. 32-bit counter rollover is treated as a real spike. ifInOctets wraps in approximately 3.4 seconds at 10G line rate. Naive differencing produces terabit spikes or negative utilization.

  10. Poller fall-behind is not detected. The scheduler oversubscribes devices, retries compound, control-plane CPU spikes, and healthy devices appear “down.” The platform is the problem; the network is not.
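
Blind spot 6 comes down to treating the HTTP status code as necessary but not sufficient. The Python sketch below validates an XML payload of the PAN-OS style quoted above, where the real outcome lives in the `status` attribute of the `<response>` element. The function name and the returned reason strings are illustrative, and other vendor APIs would need their own payload checks.

```python
import xml.etree.ElementTree as ET


def check_xml_api_response(http_status: int, body: str):
    """Return (ok, reason) for an XML API reply.

    HTTP 200 alone is not trusted: the body must be non-empty,
    parseable, and must not declare an error status.
    """
    if http_status != 200:
        return False, f"http {http_status}"
    if not body.strip():
        return False, "empty payload"
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False, "unparseable payload"
    if root.tag == "response" and root.get("status") != "success":
        return False, f"api status {root.get('status')}"
    return True, "ok"


if __name__ == "__main__":
    samples = [
        (200, '<response status="success"><result/></response>'),
        (200, '<response status="error" code="403"><msg>denied</msg></response>'),
        (200, ""),
        (503, "Service Unavailable"),
    ]
    for status, body in samples:
        print(status, check_xml_api_response(status, body))
```

A successful status with an empty result still passes this check, so a poller that normally returns data should also compare the payload size or record count against its own baseline.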

How Netdata helps

Netdata collects many of the collector-side signals in this checklist that most NPM platforms miss:

  • SNMP data collection polls sysUpTime, ifOperStatus, ifHCInOctets/ifHCOutOctets, ifInErrors/ifOutErrors, ifInDiscards/ifOutDiscards, and device CPU/memory with configurable intervals down to 1 second.
  • Linux system plugins expose Udp_RcvbufErrors, NIC RX/TX drops from /proc/net/dev, per-core softirq from /proc/softirqs, and /proc/net/softnet_stat for kernel packet processing backpressure.
  • Network interface metrics include per-NIC ring buffer drops and ethtool -S counters like rx_missed_errors, so you can distinguish NIC-level drops from socket-buffer drops.
  • Cross-layer correlation lets you join rising Udp_RcvbufErrors with rising flow receive rate and rising collector CPU in a single view, which is the diagnostic chain for silent UDP flow loss.
  • NTP metrics surface offset and drift on collectors and, where SNMP exposes hrSystemDate, on monitored devices.