Cold-start topology: why your map is incomplete after a collector restart
You restart your flow collector or topology engine for a routine upgrade. The process comes back up cleanly. The dashboard loads. But the topology map is half-empty, endpoint positions are wrong, and within minutes someone pages you asking why a security investigation points to the wrong switch port.
The root cause is not a bug. After a restart, the topology inference engine has no cached neighbor tables, no FDB entries, no ARP data, and potentially no flow templates. It must rebuild all of these from live polling and flow data before it can produce a reliable view. The window between restart and first complete topology ranges from a few minutes to over 30 minutes, depending on poll cadence, template refresh intervals, and which sources your topology engine fuses.
The danger: topology engines often return answers during this warmup window without flagging them as incomplete. Confidence scores may be low or absent from the UI. Operators query endpoint positions, get results, and act on them. A security investigation traces a MAC address to a stale or partially constructed switch port, and the wrong team gets paged.
What happens during cold start
Cold-start topology is the state where a topology inference engine has been restarted and has not yet accumulated enough data from its input sources to produce a complete and reliable topology view.
The engine fuses multiple independent data sources to derive Layer-2 and Layer-3 topology. These include CDP/LLDP neighbor tables, FDB entries, ARP tables, STP state, routing tables, and flow records. Each source repopulates at its own cadence after a restart:
- CDP/LLDP neighbor data repopulates when the next SNMP poll cycle reaches each device and walks the neighbor tables.
- FDB and ARP data repopulates in the engine on the next poll, but a poll can only return what each device currently holds. Devices learn MACs and resolve IPs only as traffic flows through them, so a switch port with no active traffic has no FDB entries regardless of what is physically connected.
- Flow records for NetFlow v9/IPFIX cannot be decoded until the collector has received the matching template. Exporters resend templates over UDP on a configurable schedule, typically every 5 to 30 minutes. Until the first template arrives, data records from that exporter are undecodable, and most collectors silently discard them.
- Endpoint positioning (which switch port a given MAC or IP is connected to) is probabilistic, derived from the agreement of multiple sources. With partial input, confidence is low.
Until enough sources converge, the topology view is partial, confidence scores are low, and endpoint positioning queries may return stale, cached, or incorrect data.
flowchart TD
A[Collector or topology engine restart] --> B[Template cache wiped]
A --> C[Neighbor table cache cleared]
A --> D[FDB and ARP cache cleared]
B --> E[Waiting for template refresh from exporter]
E -->|5 to 30 min typical| F[Templates received]
F --> G[Flow records decodable]
C --> H[First poll cycle completes]
D --> I[Devices repopulate FDB and ARP as traffic flows]
H --> J[CDP and LLDP neighbors mapped]
I --> K[Endpoint positions inferable]
G --> L[Flow-derived topology available]
J --> M{Multiple sources converging?}
K --> M
L --> M
M -->|poll cycle x 3 typical| N[Full topology with high confidence]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Template cache eviction after collector restart | Flow datagrams arriving but zero records decoded; collector logs show template not found or cache miss | Collector logs for template-related messages |
| Topology engine restart with no persistent state | All confidence scores at zero; endpoint queries return unknown or stale cached results | Topology confidence score endpoint or dashboard |
| Slow poll cycle on a large device estate | Topology slowly fills in over many minutes; some devices still missing after the first cycle | Poll cycle duration vs configured poll interval |
| FDB/ARP not yet repopulated on devices | Endpoints show as orphaned or positioned at the wrong port | FDB entry count and freshness on key switches |
| UDP template packet loss during cold start | Templates expected but not received; gap persists beyond nominal refresh interval | tcpdump on collector NIC for template datagrams |
Quick checks
All commands are read-only and safe during an active investigation.
# Check if flow records are being decoded vs just received
curl -s http://localhost:<stats-port>/metrics | grep -E 'flow.*received|flow.*decoded'
# Look for template cache miss messages in collector logs
grep -i 'template' /var/log/<collector>.log | tail -20
# Check topology inference confidence score
curl -s http://localhost:<port>/api/topology/confidence | jq .
# Check poll cycle duration vs configured interval
curl -s http://localhost:<port>/metrics | grep -E 'poll.*cycle|poll.*duration'
# Verify FDB is repopulating on a key switch (Q-BRIDGE-MIB, VLAN-aware)
snmpwalk -v2c -c <community> <switch> .1.3.6.1.2.1.17.7.1.2.2.1.2 | wc -l
# Confirm UDP datagrams are arriving at the collector (NetFlow on 2055, IPFIX on 4739)
tcpdump -i <interface> -nn -c 100 'udp port 2055 or udp port 4739'
# Check UDP socket buffer drops (data may be arriving but being dropped)
grep '^Udp:' /proc/net/snmp
# Verify NTP sync on the collector (clock skew widens the template gap)
chronyc tracking 2>/dev/null || ntpq -p
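The two Udp: lines in /proc/net/snmp are positional (one line of column names, one line of values), so it is easy to misread which number is RcvbufErrors. The following is a minimal sketch of a helper that looks a counter up by name. The function name udp_counter is hypothetical, and the optional second argument exists only so the parser can be tried against a saved sample file.

```shell
# Print one UDP counter from /proc/net/snmp by column name.
# udp_counter is an illustrative helper, not part of any collector.
# Usage: udp_counter RcvbufErrors [file]
udp_counter() {
    awk -v want="$1" '
        # The first "Udp:" line holds column names, the second holds values.
        $1 == "Udp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        $1 == "Udp:" && seen  { if (want in col) print $(col[want]); exit }
    ' "${2:-/proc/net/snmp}"
}
```

Calling it twice a few seconds apart (for example `udp_counter RcvbufErrors; sleep 10; udp_counter RcvbufErrors`) shows whether the drop counter is still incrementing during the cold-start window.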
How to diagnose it
1. Confirm the restart happened and note the timestamp. Check collector process start time or sysUpTime. The warmup window starts from the restart, not from when you first noticed the problem.
2. Determine whether the gap is template-related, topology-engine-related, or both. If flow records are received but decoded is zero, the template cache is the bottleneck. If flow decoding is working but the topology is still sparse, the topology engine is still warming up from polling data.
3. Check which input sources have repopulated. Run CDP/LLDP walks, FDB walks, and ARP walks on key devices. Compare entry counts to your known baseline. A switch that normally has 500 MACs showing 50 means the FDB has not repopulated.
4. Query the topology confidence score. If your engine exposes it, check whether average confidence is still near zero or climbing. A confidence score below baseline after a restart means the topology is not yet trustworthy.
5. Calculate the expected warmup window. Allow poll cycle duration multiplied by 3 for a first complete topology view. Add the template refresh interval if your topology engine depends on flow data for endpoint positioning. Platforms with a 60-second template refresh produce a short gap; platforms with a 30-minute refresh produce a substantial one.
6. Verify that templates are actually arriving. Use tcpdump to confirm template datagrams are reaching the collector. If the exporter’s template refresh interval has passed and no template has arrived, suspect UDP packet loss or a network path issue between the exporter and collector.
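The warmup arithmetic described above is simple enough to script. This is a rough sketch under the guide's own rule of thumb (three poll cycles, plus the template refresh interval when flow data feeds endpoint positioning). The function name and its three arguments are hypothetical.

```shell
# Estimate the cold-start warmup window in seconds.
# Rule of thumb from this guide: 3 x poll cycle duration, plus the template
# refresh interval when endpoint positioning depends on flow data.
# Arguments (hypothetical): poll cycle seconds, template refresh seconds,
# and 1 or 0 for "the topology engine uses flow data".
expected_warmup_seconds() {
    poll_cycle_s=$1
    template_refresh_s=$2
    uses_flow=$3
    warmup=$((poll_cycle_s * 3))
    if [ "$uses_flow" -eq 1 ]; then
        warmup=$((warmup + template_refresh_s))
    fi
    echo "$warmup"
}

# A 5-minute poll cycle with a 30-minute template refresh:
expected_warmup_seconds 300 1800 1   # prints 2700 (45 minutes)
```

With the same poll cycle and no dependence on flow data, the estimate drops to 900 seconds, which shows how much of the window the template refresh interval can account for.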
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Topology inference confidence score | Tells you whether endpoint positions are reliable enough to act on | Average confidence below baseline or near zero after restart |
| Flow records decoded vs received ratio | Template cache miss means data records are silently discarded | Received greater than zero but decoded equals zero |
| FDB/MAC table freshness | Stale or empty FDB means endpoint positioning will be wrong or missing | Entry count well below baseline; entries older than 3 to 4 times the refresh interval |
| Poll cycle duration vs configured interval | Slow cycles delay the topology rebuild proportionally | Cycle duration approaching or exceeding the configured poll interval |
| ARP cache entry count and staleness | Stale ARP means wrong IP-to-MAC mappings for endpoint positioning | Entry count well below baseline after restart |
| Template cache hit/miss ratio | Misses indicate the collector cannot decode incoming flow records | Miss ratio climbing after restart without recovery |
| UDP socket buffer drops (Udp_RcvbufErrors) | Template packets may be dropped before the collector processes them | Nonzero and incrementing during the cold-start window |
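The decoded-versus-received signal in the table reduces to a three-way classification. The sketch below assumes you can sample two counters over the same interval. The function name, the argument order, and the state labels are illustrative and do not come from any particular collector.

```shell
# Classify the flow pipeline from two counters sampled over the same interval.
# Counter semantics and state labels are illustrative assumptions.
flow_decode_state() {
    received=$1
    decoded=$2
    if [ "$received" -eq 0 ]; then
        echo "no-traffic"      # nothing arriving: check the network path
    elif [ "$decoded" -eq 0 ]; then
        echo "template-gap"    # data arriving but nothing decodable yet
    else
        echo "decoding"        # templates are in place for at least some exporters
    fi
}
```

A result of template-gap that persists beyond the exporter's nominal refresh interval is the case worth alerting on, since it suggests the templates themselves are being lost.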
Fixes
Template cache gaps
The most impactful fix is shortening the template refresh interval on exporters before planned maintenance. On Cisco IOS with traditional NetFlow v9 export, you can temporarily force frequent template resends from global configuration mode (Flexible NetFlow exporters use a different, exporter-level setting):
! Resend templates after every export packet; restore the normal rate afterwards
ip flow-export template refresh-rate 1
This does not disrupt data forwarding, though it adds a small amount of export traffic. Restore the normal refresh rate after the collector confirms template receipt.
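For collectors that support it, configure template cache persistence across restarts. Where the feature exists it usually has to be enabled explicitly; for nfdump and similar tools, check the documentation for your version rather than assuming templates survive a restart.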
If your collector does not persist templates, the gap is unavoidable on restart. Plan restarts outside of security-critical windows.
Topology engine warmup
The primary fix is operational discipline, not configuration:
- Wait for poll cycle x 3 before trusting topology queries. This is the rule of thumb used throughout this guide for a first complete topology view after a collector restart.
- Check confidence scores before acting on endpoint positioning results. If confidence is low, the answer is not trustworthy.
- Suppress automated actions that depend on topology during the warmup window. If your incident response automation pages a team based on endpoint positioning, add a confidence check or a post-restart cooldown period.
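The confidence check and post-restart cooldown from the list above can be combined into one gate in front of any automation that acts on endpoint positions. This is a sketch only: it assumes confidence is available as an integer percentage, and the function name and thresholds are hypothetical.

```shell
# Exit status 0 only when the warmup window has passed AND confidence meets
# the threshold. Confidence as an integer percentage (0-100) is an assumption;
# adapt it to whatever your engine actually exposes.
topology_trustworthy() {
    uptime_s=$1
    warmup_s=$2
    confidence_pct=$3
    min_confidence_pct=$4
    [ "$uptime_s" -ge "$warmup_s" ] && [ "$confidence_pct" -ge "$min_confidence_pct" ]
}

# 10 minutes after restart, with a 45-minute warmup window:
if topology_trustworthy 600 2700 85 80; then
    echo "act on endpoint position"
else
    echo "hold: topology still warming up"   # this branch runs
fi
```

Requiring both conditions is deliberate: a high confidence score computed from only one or two repopulated sources should not be enough to page a team on its own.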
FDB and ARP repopulation delays
FDB and ARP entries only repopulate as traffic flows. On quiet switch ports, the FDB may remain empty for extended periods. There is no safe way to force population without generating traffic.
If your topology engine depends on FDB freshness for endpoint positioning, ensure your poll cadence is fast enough relative to the FDB aging timeout on your switches. If the aging timeout is 4 hours and your poll cycle is 30 minutes, entries are refreshed frequently enough under normal operation. Many switches default to a much shorter MAC aging time (300 seconds is common), in which case a 30-minute poll only sees hosts that were active shortly before the poll ran. Either way, after a restart the first complete view still requires waiting for the poll to complete and for devices to have learned MACs.
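Both comparisons in this section (entry count against baseline, and poll interval against aging timeout) are integer arithmetic. A minimal sketch follows, with hypothetical function names and a hypothetical percentage threshold you would tune per switch.

```shell
# Exit status 0 when the current FDB entry count has reached at least
# min_pct percent of the known baseline. The threshold is an assumption.
fdb_repopulated() {
    current=$1
    baseline=$2
    min_pct=$3
    [ $((current * 100)) -ge $((baseline * min_pct)) ]
}

# How many polls fit inside one aging window (integer division).
# 0 means an entry can be learned and aged out entirely between two polls.
polls_per_aging_window() {
    aging_s=$1
    poll_s=$2
    echo $((aging_s / poll_s))
}
```

For the 4-hour aging timeout and 30-minute poll cycle mentioned above, `polls_per_aging_window 14400 1800` prints 8. With a hypothetical 300-second aging time and the same poll cycle it prints 0, which is the case where polling alone cannot keep endpoint positions fresh.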
Prevention
- Mark cold-start state explicitly. If your topology engine does not flag incomplete views, add a wrapper or dashboard annotation that shows time since last restart and expected warmup completion time.
- Monitor confidence scores as a first-class signal. Alert when average confidence drops below baseline for sustained periods, not just after restarts.
- Use NTP synchronization on collectors and exporters. Clock skew can cause template refresh timestamp comparisons to reject otherwise valid templates, extending the effective blind spot beyond the nominal refresh interval.
- Plan restarts during low-risk windows. The cold-start gap is most dangerous when it coincides with a security event that requires flow forensics.
- Consider NetFlow v5 for fixed-format export where template gaps are unacceptable. NetFlow v5 has a fixed record format and does not require template exchange. However, v5 carries IPv4 flows only and is deprecated on many modern platforms in favor of v9/IPFIX.
- Shorten template refresh intervals on exporters. A 60-second refresh produces a shorter gap than a 30-minute refresh. Balance against the increased management-plane traffic from more frequent template sends.
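For the first item in this list, a wrapper can be as small as one function that turns the restart time and the expected warmup length into a banner line. The sketch below is illustrative: the function name, its arguments (Unix epochs and seconds), and the message wording are all assumptions, and how you obtain the restart time depends on your collector.

```shell
# Print a one-line cold-start annotation for a dashboard or CLI wrapper.
# Inputs: restart time and current time as Unix epochs, warmup in seconds.
# All names and the message wording are illustrative.
cold_start_banner() {
    restart_epoch=$1
    now_epoch=$2
    warmup_s=$3
    elapsed=$((now_epoch - restart_epoch))
    remaining=$((warmup_s - elapsed))
    if [ "$remaining" -gt 0 ]; then
        echo "WARMING UP: restarted ${elapsed}s ago, ${remaining}s until topology is expected to be complete"
    else
        echo "WARM: restarted ${elapsed}s ago"
    fi
}
```

In practice the second argument would be the current time, for example `cold_start_banner "$restart_epoch" "$(date +%s)" 900`, with the output prepended to any endpoint-position query result.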
How Netdata helps
- Monitors the collector’s flow packet receive rate alongside the decoded record rate, making template cache gaps visible as a divergence between received and decoded counters.
- Collects Udp_RcvbufErrors from /proc/net/snmp by default on Linux, catching template packets dropped at the kernel socket buffer before the application sees them.
- Per-core CPU metrics help verify the collector is not CPU-starved during the ingestion burst that follows restart, when all exporters resume sending simultaneously.
- Collector disk space and I/O metrics catch the TSDB write spike that accompanies cold-start ingestion.
- SNMP poll latency and timeout metrics help verify the poller is completing cycles fast enough for the topology engine to rebuild within the expected window.
- Cross-metric correlation in Netdata dashboards lets you align collector restart timestamps with confidence score drops, flow decode gaps, and FDB repopulation curves in a single view.