NetFlow v9/IPFIX template desync: flows decoded wrong or dropped after a reboot

You rebooted a router or upgraded its firmware. Minutes later, your flow collector shows a gap or anomaly. The exporter is still sending data: UDP packet counters are nonzero and climbing. But decoded flow records are zero, suspiciously low, or the field values are shifted and garbled.

This is NetFlow v9 or IPFIX template desync. The collector holds cached template definitions that no longer match what the exporter is sending. Until it receives and caches the correct templates, it either drops records silently or misinterprets the byte layout, producing garbage fields.

The data loss window ranges from seconds to 30 minutes, depending on how frequently the exporter retransmits templates. There is no decoder error in many cases: the flows simply stop appearing in the collector.

What this means

NetFlow v9 (RFC 3954) and IPFIX (RFC 7011) separate template definitions from data records. Templates describe the field layout: which Information Elements appear, in what order, and with what lengths. Byte offsets are not sent; they follow from the order and lengths. Data records carry only values and reference a template ID. A collector cannot decode any data record until it has cached the matching template for the same (source IP, observation domain ID, template ID) triple. NetFlow v9 calls the observation domain ID the Source ID.
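To make the cache lookup concrete, here is a minimal sketch of a template cache keyed by that triple. The `TemplateCache` class and its methods are hypothetical names for illustration, not any collector's API, and the template is simplified to an ordered list of (field type, length) pairs.

```python
from typing import Dict, List, Optional, Tuple

# Cache key: (exporter source IP, observation domain ID, template ID).
CacheKey = Tuple[str, int, int]
# Simplified template: ordered (field_type, field_length_in_bytes) pairs.
Template = List[Tuple[int, int]]


class TemplateCache:
    """Hypothetical in-memory template cache, for illustration only."""

    def __init__(self) -> None:
        self._templates: Dict[CacheKey, Template] = {}

    def store(self, source_ip: str, domain_id: int, template_id: int,
              fields: Template) -> None:
        # A newer template with the same key replaces the old layout.
        self._templates[(source_ip, domain_id, template_id)] = list(fields)

    def decode(self, source_ip: str, domain_id: int, template_id: int,
               record: bytes) -> Optional[Dict[int, bytes]]:
        template = self._templates.get((source_ip, domain_id, template_id))
        if template is None:
            return None  # no template cached: the record cannot be decoded
        if sum(length for _, length in template) > len(record):
            return None  # record shorter than the layout: likely a stale template
        values: Dict[int, bytes] = {}
        offset = 0
        for field_type, length in template:
            values[field_type] = record[offset:offset + length]
            offset += length
        return values


cache = TemplateCache()
# Field types 8, 12, 1 are IPV4_SRC_ADDR, IPV4_DST_ADDR, IN_BYTES in NetFlow v9.
cache.store("192.0.2.1", 0, 256, [(8, 4), (12, 4), (1, 4)])
record = bytes([10, 0, 0, 1, 10, 0, 0, 2, 0, 0, 5, 220])
decoded = cache.decode("192.0.2.1", 0, 256, record)
# Same template ID from a different source IP is a different key, so it misses.
missing = cache.decode("192.0.2.99", 0, 256, record)
```

The second lookup misses even though the template ID is identical, which is the same effect a NAT or failover source IP change has on a real collector.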

Templates travel in their own FlowSets over UDP and are resent at a configurable interval (typically 5 to 30 minutes). FlowSet IDs below 256 are reserved for templates and other special sets: NetFlow v9 uses 0 for templates and 1 for options templates, while IPFIX uses Set IDs 2 and 3. Data FlowSets use IDs of 256 or higher, and that ID is the template ID needed to decode them. When an exporter reboots or changes its flow configuration, template definitions may change and the exporter restarts its sequence number counter at zero. If the collector still holds stale templates keyed to the pre-reboot session, it either ignores the new templates or applies old field layouts to new data.
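The sketch below walks a NetFlow v9 packet and labels each FlowSet by its ID, which is a quick way to see whether a capture contains any template sets at all. It assumes the RFC 3954 layout (20-byte packet header, 4-byte FlowSet header whose length includes the header itself); `classify_flowset` and `walk_v9_packet` are hypothetical helper names, and IPFIX would need a different header and Set IDs 2 and 3.

```python
import struct
from typing import Dict, List, Tuple

# NetFlow v9 packet header: version, count, sysUpTime, unix_secs, sequence, source_id.
V9_HEADER = struct.Struct("!HHIIII")
# FlowSet header: flowset_id, length (length includes these 4 bytes).
FLOWSET_HEADER = struct.Struct("!HH")


def classify_flowset(flowset_id: int) -> str:
    if flowset_id == 0:
        return "template"
    if flowset_id == 1:
        return "options-template"
    if flowset_id >= 256:
        return "data"  # the FlowSet ID is the template ID to look up
    return "reserved"


def walk_v9_packet(packet: bytes) -> Tuple[Dict[str, int], List[Tuple[str, int, int]]]:
    version, count, uptime, secs, sequence, source_id = V9_HEADER.unpack_from(packet, 0)
    if version != 9:
        raise ValueError(f"not a NetFlow v9 packet (version={version})")
    header = {"count": count, "sys_uptime_ms": uptime, "unix_secs": secs,
              "sequence": sequence, "source_id": source_id}
    flowsets = []
    offset = V9_HEADER.size
    while offset + FLOWSET_HEADER.size <= len(packet):
        flowset_id, length = FLOWSET_HEADER.unpack_from(packet, offset)
        if length < FLOWSET_HEADER.size or offset + length > len(packet):
            break  # malformed or truncated FlowSet: stop rather than misread
        flowsets.append((classify_flowset(flowset_id), flowset_id, length))
        offset += length
    return header, flowsets


# Build a small synthetic packet: one template set and one data set.
template_record = struct.pack("!HHHH", 256, 1, 8, 4)  # template 256, one field: type 8, length 4
template_set = FLOWSET_HEADER.pack(0, 4 + len(template_record)) + template_record
data_set = FLOWSET_HEADER.pack(256, 8) + bytes([10, 0, 0, 1])
packet = V9_HEADER.pack(9, 2, 1000, 1700000000, 0, 1) + template_set + data_set
header, flowsets = walk_v9_packet(packet)
```

A sequence value of 0 in `header` shortly after a gap is consistent with the exporter having restarted its counter.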

The canonical error signature is some variant of “template not yet received”, “no template definition received with id X”, or “unable to decode flow set data”. These messages indicate the decoder received a data FlowSet referencing a template ID it does not have in cache, so it discards the records.

flowchart TD
    A[Exporter reboots or upgrades] --> B[Sequence resets to 0]
    B --> C[Templates may change or re-emit]
    C --> D[Exporter sends templates on next refresh interval]
    D --> E{Template packet arrives at collector?}
    E -- Yes, before collector expiry --> F[Collector updates cache]
    F --> G[Flows decode correctly]
    E -- No, lost or collector expired session --> H[Collector holds stale or empty cache]
    H --> I[Data records reference unknown template IDs]
    I --> J[Records dropped or decoded wrong]
    J --> K[Forensic gap until next template refresh]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Exporter reboot or upgrade | Decoded flows drop to zero; exporter sysUpTime recently reset | Poll sysUpTime and correlate with the decoded-flow gap |
| Collector restart | All exporters affected simultaneously; template cache lost | Check whether template cache persists across restarts |
| Template timeout mismatch | Periodic gaps at a fixed interval for one exporter | Compare exporter refresh interval to collector session expiry |
| NAT or failover changing source IP | Templates arrive from a “new” source IP; old session orphaned | Compare packet source IP against template cache entries |
| UDP packet loss dropping template datagrams | Intermittent, correlated with high traffic or buffer pressure | Check Udp_RcvbufErrors and NIC RX drops |
| Schema change after vendor upgrade | New vendor-specific IEs appear; decoder errors with unknown fields | Check collector version support and vendor changelog |

Quick checks

These are safe, read-only commands to narrow the problem quickly.

# Check whether the exporter recently rebooted (sysUpTime discontinuity)
snmpget -v2c -c <community> <exporter> .1.3.6.1.2.1.1.3.0

# Check collector logs for template-related errors
grep -iE 'template|cache miss|unable to decode' /var/log/<collector>.log | tail -50

# Verify that UDP flow packets are still arriving at the collector
tcpdump -i eth0 -nn 'udp port 2055' -c 100   # NetFlow v5/v9
tcpdump -i eth0 -nn 'udp port 4739' -c 100   # IPFIX

# Check for UDP socket buffer drops (template packets may be among the drops)
grep '^Udp:' /proc/net/snmp
# nstat names this counter UdpRcvbufErrors (no underscore)
nstat -az UdpRcvbufErrors

# Check current receive buffer fill on the flow listener
ss -lun '( sport = :2055 )' -m

# On Cisco: check exporter-side statistics
ssh <exporter> 'show flow exporter statistics'

# On Cisco: check device-side export and drop counters via SNMP
# Unverified: confirm these OIDs against CISCO-NETFLOW-MIB before relying on them.
# snmpget needs the full instance OID (a trailing .0 for scalar objects).
snmpget -v2c -c <community> <exporter> .1.3.6.1.4.1.9.9.387.1.4.4   # cnfESPktsExported
snmpget -v2c -c <community> <exporter> .1.3.6.1.4.1.9.9.387.1.4.6   # cnfESPktsDropped

# If using nfdump, inspect raw records for template references
nfdump -R /var/nfdump/ -o raw | head -20
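If you want to track the UDP drop counter over time rather than eyeball it, the following sketch parses the `Udp:` lines of /proc/net/snmp into a dictionary. `parse_udp_counters` is a hypothetical helper, and the sample text is illustrative, not captured from a real host.

```python
from typing import Dict


def parse_udp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the 'Udp:' header and value lines from /proc/net/snmp content."""
    # 'UdpLite:' lines do not start with 'Udp:', so they are excluded.
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("expected a Udp: header line and a Udp: value line")
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))


# Illustrative sample of the relevant part of /proc/net/snmp.
sample = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1000 2 5 900 5 0\n"
    "UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "UdpLite: 0 0 0 0 0 0\n"
)
counters = parse_udp_counters(sample)
rcvbuf_errors = counters["RcvbufErrors"]
```

On a live collector you would read the file with `open("/proc/net/snmp").read()` twice, a few seconds apart, and compare the two `RcvbufErrors` values. Any increase means the kernel dropped datagrams in that window.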

How to diagnose it

  1. Confirm the exporter rebooted. Poll sysUpTime at .1.3.6.1.2.1.1.3.0. A value lower than the previous poll means a reboot or SNMP agent restart, unless the 32-bit counter simply wrapped, which happens after roughly 497 days of uptime. Correlate the reboot timestamp with the onset of the decoded-flow gap. If you have coldStart or warmStart traps, confirm the timing matches.

  2. Verify packets are arriving but records are not decoding. Check the collector’s flow-packets-received counter. If it is nonzero and rising but decoded-records is zero or anomalously low, the issue is template resolution, not transport. Compare device-side export rate against collector inbound rate: if the device is exporting but the collector shows zero decoded records, templates are the problem.

  3. Check for template error messages. Search collector logs for “template”, “cache miss”, or “unable to decode”. The specific template ID in the error message tells you which template definition the collector is missing.

  4. Check for UDP drops that may have consumed template packets. If Udp_RcvbufErrors is incrementing, template datagrams may be among the dropped packets. Template packets travel at the same UDP priority as data packets. Under buffer pressure, template loss is just as likely as data loss, but its impact is broader: if the collector has no valid copy cached, one lost template packet leaves every subsequent data record for that template undecodable until the next refresh.

  5. Check for NAT or source IP changes. Templates are bound to the (source IP, observation domain ID) pair. If the exporter’s source IP changed after a failover or NAT remap, the collector may have orphaned templates for the old IP and no templates for the new one. Compare the source IP in current packets against the template cache.

  6. Check collector restart coincidence. If all exporters are affected simultaneously, the collector likely restarted and lost its in-memory template cache. Not all collectors persist templates to disk. Check whether your collector has template cache persistence configured.

  7. Verify template refresh intervals. The exporter’s template refresh interval must be shorter than the collector’s template expiry interval. If the exporter sends templates every 30 minutes and the collector also expires sessions after 30 minutes, the system is permanently on the edge of desync. RFC 5153 (IPFIX Implementation Guidelines) explicitly calls out this interoperability risk.
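The sysUpTime check in step 1 can be automated. This is a minimal sketch, assuming sysUpTime is a 32-bit TimeTicks value in hundredths of a second and that you record the wall-clock seconds between polls. `detect_restart` and its tolerance parameter are hypothetical.

```python
TIMETICKS_MODULUS = 2 ** 32  # sysUpTime is a 32-bit counter of centiseconds


def detect_restart(prev_ticks: int, curr_ticks: int, elapsed_seconds: float,
                   tolerance_seconds: float = 60.0) -> bool:
    """Return True if sysUpTime went backwards and a counter wrap does not explain it."""
    if curr_ticks >= prev_ticks:
        # No decrease seen. A reboot is still possible if the poll interval
        # was longer than the previous uptime, which this check cannot detect.
        return False
    # If the counter merely wrapped, the new value should sit close to
    # (previous + elapsed) modulo 2**32.
    expected_if_wrapped = (prev_ticks + int(elapsed_seconds * 100)) % TIMETICKS_MODULUS
    return abs(curr_ticks - expected_if_wrapped) > tolerance_seconds * 100


# Uptime fell from 24 hours to 2 minutes across a 5-minute poll interval.
rebooted = detect_restart(8_640_000, 12_000, elapsed_seconds=300)
# Normal forward progress between polls.
steady = detect_restart(1_000, 31_000, elapsed_seconds=300)
# Counter wrapped at about 497 days; the value is low but matches the wrap.
wrapped = detect_restart(2 ** 32 - 10_000, 20_050, elapsed_seconds=300)
```

Treating a wrap separately avoids a false reboot alert on long-lived devices, at the cost of needing the poll interval as an input.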

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| sysUpTime discontinuity | Detects the reboot that triggered the desync | Any decrease from previous poll |
| Flow packets received rate | Confirms UDP transport is intact | Rising while decoded records are zero |
| Flow records decoded | Primary indicator of template cache health | Zero or anomalously low while received is nonzero |
| Flow exporter drop rate (cnfESPktsDropped) | Device-side drops mean templates never left the exporter | Any increment |
| Udp_RcvbufErrors | Kernel dropped datagrams, possibly including template packets | Any nonzero increment on a flow collector |
| Collector inbound vs device exported rate | End-to-end loss detection across the UDP path | Device exported consistently higher than collector inbound |
| Template cache hit/miss ratio | Direct measure of cache freshness | Rising miss rate for a specific exporter |
| Exporter syslog: reboot or config change | Correlation window for desync onset | Config commit or reload event preceding the gap |
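The received-versus-decoded signal can be turned into a simple alert condition. The sketch below compares two counter samples for one exporter. The `CollectorSample` type, the function name, and both thresholds are assumptions chosen for illustration, so tune them to your traffic.

```python
from dataclasses import dataclass


@dataclass
class CollectorSample:
    packets_received: int  # cumulative flow packets received from one exporter
    records_decoded: int   # cumulative flow records successfully decoded


def template_loss_suspected(prev: CollectorSample, curr: CollectorSample,
                            min_packets: int = 10, min_ratio: float = 0.5) -> bool:
    """Flag intervals where packets keep arriving but few or no records decode."""
    packets = curr.packets_received - prev.packets_received
    records = curr.records_decoded - prev.records_decoded
    if packets < min_packets:
        return False  # too little traffic to judge, or a counter reset
    return records / packets < min_ratio


before = CollectorSample(packets_received=1000, records_decoded=20000)
stalled = template_loss_suspected(before, CollectorSample(1500, 20000))
healthy = template_loss_suspected(before, CollectorSample(1500, 30000))
quiet = template_loss_suspected(before, CollectorSample(1005, 20000))
```

Requiring a minimum packet count keeps idle exporters from triggering the alert.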

Fixes

Force immediate template retransmission

The fastest way to close the gap is to make the exporter resend templates immediately. On Cisco IOS, temporarily reduce the template timeout interval to 1 minute:

ip flow-export template timeout-rate 1

Use timeout-rate (minutes), not refresh-rate (packets between retransmissions). After confirming the collector has received and cached the new templates, restore the production interval.

On Cisco IOS XR, the default Data Template Timeout and Options Template Timeout are both 1800 seconds. Reduce the timeout in the exporter-map temporarily to force faster retransmission, then restore after the collector has rebuilt its cache.

Restart the collector with cache persistence

If the collector lost its template cache on restart and does not persist templates to disk, the fix depends on the platform. nfdump and similar tools require explicit cache persistence configuration; verify whether yours has it enabled.

FastNetMon stores template caches on disk under /var/cache/fastnetmon. After an exporter change, operators must delete and recreate this directory to purge stale templates. This is destructive: it flushes all cached templates for all exporters, not just the affected one.

For distributed collectors like Cribl Stream, template state is replicated from a Leader Node to Worker Nodes. If the Leader goes offline, Workers retain their local caches but do not receive new or updated templates. Workers that lack the updated template will drop flowsets that reference it. Verify Leader Node health and confirm Workers have received replicated state after any exporter change.

Fix template timeout mismatch

Set the exporter’s template refresh interval shorter than the collector’s template expiry interval. If using libfixbuf (the reference IPFIX collector library used by many open-source collectors), sessions expire after 30 minutes of inactivity. When no packet arrives from a session for 30 minutes, libfixbuf frees all templates for that session. The exporter’s template data timeout must be set below this threshold. Operators commonly reduce it to 300 seconds or less.
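A quick way to audit this across many exporters is to compare each configured refresh interval against the collector expiry with a safety margin. The helper below is a sketch. The function name and the default factor of 3 are assumptions, not a value taken from libfixbuf or any RFC.

```python
def refresh_interval_ok(exporter_refresh_s: int, collector_expiry_s: int,
                        safety_factor: float = 3.0) -> bool:
    """Return True if the exporter resends templates comfortably before expiry.

    safety_factor is a hypothetical margin: with 3.0, the collector expiry
    must be at least three exporter refresh intervals long.
    """
    if exporter_refresh_s <= 0 or collector_expiry_s <= 0:
        raise ValueError("intervals must be positive")
    return exporter_refresh_s * safety_factor <= collector_expiry_s


edge_of_desync = refresh_interval_ok(1800, 1800)  # 30 min refresh vs 30 min expiry
comfortable = refresh_interval_ok(300, 1800)      # 5 min refresh vs 30 min expiry
fragile = refresh_interval_ok(1200, 1800)         # 20 min refresh vs 30 min expiry
```

A margin of several refresh intervals means one lost template packet does not, by itself, push the collector past its expiry.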

On MikroTik RouterOS, the default v9-template-timeout is 20 minutes. This is close enough to libfixbuf’s 30-minute session expiry to be fragile under any packet delay. Explicitly set v9-template-refresh and v9-template-timeout to values well below 30 minutes when using libfixbuf-based collectors.

Address UDP loss masking template loss

At high flow rates (100,000 to 300,000 flows per second), UDP socket buffer exhaustion can drop template packets before the decoder sees them. This creates apparent template desync even when the exporter is correctly transmitting templates. Check Udp_RcvbufErrors and the Linux net.core.rmem_max setting.

Production deployments should set net.core.rmem_max to 16 MB or higher, with 33 MB for very high-volume collectors. Ensure the collector explicitly sets SO_RCVBUF on the listener socket.
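As an illustration of setting SO_RCVBUF explicitly, the sketch below opens a UDP listener, requests a receive buffer, and reads back what the kernel granted. `open_flow_listener` is a hypothetical helper, and the port and buffer size are example values.

```python
import socket
from typing import Tuple


def open_flow_listener(host: str = "0.0.0.0", port: int = 2055,
                       rcvbuf_bytes: int = 16 * 1024 * 1024) -> Tuple[socket.socket, int]:
    """Open a UDP socket with an explicit receive buffer request."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Request the buffer size before binding so it applies from the first packet.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    sock.bind((host, port))
    # Read back the effective size. On Linux the request is capped by
    # net.core.rmem_max, and the kernel reports double the value it accepted.
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted


# Example: bind to an ephemeral local port and inspect the granted size.
listener, granted_bytes = open_flow_listener("127.0.0.1", 0, 256 * 1024)
listener.close()
```

If the granted size is well below what you requested, the kernel limit is the bottleneck, and raising `net.core.rmem_max` is the next step.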

For Splunk Stream, the correct tuning knob for decoding throughput is netflowReceiver.N.decodingThreads, not processingThreads. Under-scaled decoding threads cause internal queue saturation (“Netflow processing queues are full”) and large packet drops that can include template datagrams.

For broader context on UDP flow data loss and kernel buffer tuning, see “Silent UDP flow data loss: why your NetFlow collector is dropping records”.

Handle vendor-specific Information Elements after upgrades

After a vendor software upgrade, new template fields may appear. NetFlow v9 Information Element IDs above 346 are Cisco-proprietary. Some collectors handle these by mapping them to enterprise ID 9999 for decoding. If the collector does not recognize the new fields, it may log decoding errors (typically hex dumps of unrecognized data) rather than silently dropping. Verify the collector version supports the new template fields and update if necessary.

Prevention

  • Monitor sysUpTime resets. Every reboot is a potential template desync event. Alert on sysUpTime discontinuity so the team knows to expect a flow data gap and can verify recovery.
  • Set exporter template refresh well below collector expiry. A 5-minute refresh interval with a 30-minute collector expiry provides ample margin. RFC 5153 warns about the risk when the exporter interval approaches or exceeds the collector interval.
  • Verify template cache persistence. Know whether your collector survives restarts with its template cache intact. If not, plan for a warmup gap of up to one full template refresh interval after every collector restart.
  • Monitor decoded-records versus received-packets ratio. A sudden divergence is the earliest signal of template loss. Alert when decoded records drop to zero while received packets remain nonzero.
  • Monitor Leader-to-Worker replication in distributed topologies. A silent Leader failure leaves Workers with stale templates and no path to receive updates.
  • Document collector-specific cache purge procedures. Know the exact purge command for your collector before you need it in an incident.

How Netdata helps

  • Netdata’s SNMP collector polls sysUpTime (.1.3.6.1.2.1.1.3.0) at 1-second resolution, giving immediate visibility into reboot events that correlate with flow data gaps.
  • The cnfESPktsExported and cnfESPktsDropped counters from CISCO-NETFLOW-MIB let you compare device-side export rates against collector-side receive rates, confirming whether templates are leaving the exporter and whether packets are being dropped in transit.
  • System-level monitoring of Udp_RcvbufErrors in /proc/net/snmp catches the kernel-level drops that can consume template datagrams before they reach the decoder.
  • Per-core CPU monitoring, including softirq time, identifies RSS misconfiguration that funnels all flow traffic to a single core, slowing packet processing and dropping template packets.
  • Correlating sysUpTime resets with flow collector inbound rate changes across the same time window pinpoints the desync window and its duration for postmortem analysis.