NTP drift on network devices: the silent killer of event correlation

Clock drift on network devices produces no visible symptom. The device stays up, interfaces carry traffic, BGP sessions remain Established, SNMP keeps responding. The damage surfaces hours or days later, in a postmortem where two devices’ timestamps disagree by hundreds of milliseconds and the analyst cannot reconstruct the event sequence. Every cross-device correlation in the monitoring stack depends on accurate, monotonic time across every collector and every polled device.

The telemetry itself looks fine. Syslog messages arrive with timestamps. Flow records carry timestamps. BGP NOTIFICATION traps are time-stamped. The problem is that those timestamps are wrong relative to each other, and nothing flags the discrepancy. A device 200 milliseconds off its peers produces records that technically arrive but correlate poorly with records from correctly synchronized devices.

What NTP drift is and why it breaks correlation

NTP drift is the gradual divergence of a device’s system clock from authoritative time. Under healthy conditions, NTP maintains synchronization to within approximately one millisecond on a LAN. The protocol slews rather than steps the clock for small offsets: it adjusts the clock rate gradually instead of jumping the time. When the offset exceeds a threshold (commonly 128ms in ntpd), the daemon steps the clock, producing a visible discontinuity in logs and flow records. Because slewing leaves no discontinuity, an offset can grow past 100 milliseconds before any correction becomes visible to an operator.
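
The slew-or-step decision can be sketched in a few lines. The 128 ms threshold is the ntpd figure quoted above; the 500 ppm maximum slew rate is an assumption based on ntpd's documented default and will differ in chrony and in vendor implementations. The function names are illustrative, not part of any NTP daemon.

```python
STEP_THRESHOLD_MS = 128.0  # ntpd step threshold quoted in the text
MAX_SLEW_PPM = 500.0       # assumption: ntpd's default maximum slew rate


def correction_mode(offset_ms: float) -> str:
    """Return 'slew' for small offsets and 'step' once the threshold is exceeded."""
    return "step" if abs(offset_ms) > STEP_THRESHOLD_MS else "slew"


def min_slew_seconds(offset_ms: float) -> float:
    """Shortest time needed to remove an offset when slewing at the maximum rate."""
    offset_s = abs(offset_ms) / 1000.0
    return offset_s / (MAX_SLEW_PPM * 1e-6)


for offset in (5.0, 100.0, 250.0):
    print(offset, correction_mode(offset), round(min_slew_seconds(offset)))
```

Under these assumptions a 100 ms offset takes at least 200 seconds to slew away. Exported timestamps stay wrong for that whole period, yet the logs show no jump.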

NTP operates exclusively over UDP port 123. There is no TCP fallback. If a firewall, control-plane filter, or routing policy blocks UDP 123 in either direction, the device stops synchronizing and its clock free-runs on the local oscillator. The quality of that oscillator determines how fast the clock drifts. A typical TCXO drifts 1-2 ppm (roughly 86-172 ms/day); a standard crystal can be 10-50 ppm (0.8-4.3 seconds/day).
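
The ppm figures convert to daily drift with simple arithmetic: one part per million of frequency error is one microsecond of clock error per second of elapsed time. A minimal sketch, assuming the frequency error stays constant (real oscillators vary with temperature and age); the helper names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def drift_ms_per_day(ppm: float) -> float:
    """Clock error in milliseconds accumulated over one day of free-running."""
    return ppm * 1e-6 * SECONDS_PER_DAY * 1000.0


def hours_until_offset(ppm: float, offset_ms: float) -> float:
    """Hours of free-running before the clock error reaches offset_ms."""
    return offset_ms / (drift_ms_per_day(ppm) / 24.0)


for ppm in (1, 2, 10, 50):
    print(f"{ppm:>2} ppm -> {drift_ms_per_day(ppm):7.1f} ms/day, "
          f"100 ms reached after {hours_until_offset(ppm, 100.0):5.1f} h")
```

At 10 ppm, a device that loses NTP reachability crosses 100 milliseconds of offset in under three hours.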

The threshold for operational impact is lower than most operators expect. The NPM playbook calls for a TICKET at 100 milliseconds of NTP offset and a PAGE at 1 second. Distributed databases enforce maximum clock offset limits (CockroachDB defaults to 500ms). Even a 200-millisecond disagreement between two devices reporting the same link flap is enough to break flow-to-syslog correlation in a postmortem reconstruction.
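
Those two thresholds map directly onto a severity function. This is a minimal sketch: the 100 ms and 1 s values come from the text, while the function name, the "OK" label, and the use of inclusive comparisons are assumptions. It evaluates a single sample and does not model how long an offset has been sustained.

```python
TICKET_MS = 100.0   # ticket threshold from the text
PAGE_MS = 1000.0    # page threshold from the text


def classify_offset(offset_ms: float) -> str:
    """Map a signed NTP offset in milliseconds to an alert severity."""
    magnitude = abs(offset_ms)  # a clock that is behind is as bad as one ahead
    if magnitude >= PAGE_MS:
        return "PAGE"
    if magnitude >= TICKET_MS:
        return "TICKET"
    return "OK"


for sample in (0.8, -200.0, 1500.0):
    print(sample, classify_offset(sample))
```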

How clock skew corrupts downstream telemetry

Every event that leaves a network device carries a timestamp derived from the device’s local clock. Syslog headers, NetFlow v9 and IPFIX records, sFlow samples, and SNMP trap varbinds all embed the device’s notion of the current time. When two devices disagree about what time it is, their events cannot be reliably ordered, and the ordering errors compound across every correlation the monitoring platform attempts.

flowchart TD
    A["Device oscillator drifts without NTP correction"] --> B["Syslog, flow, trap timestamps diverge from peers"]
    B --> C["Collector ingests data with skewed timestamps"]
    C --> D{"Cross-device correlation"}
    D -->|"Clocks aligned"| E["Events ordered correctly; postmortem succeeds"]
    D -->|"Offset over 200ms"| F["Causality reversed; events look unrelated"]
    F --> G["Root cause unidentified; postmortem fails"]

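Consider a link flap that triggers a BGP session reset. The expected sequence is: interface transitions to down, BGP detects the loss (hold timer expiry, or within milliseconds where BFD or fast fallover is enabled), session drops, route withdrawal propagates. If the switch reporting the interface-down event has a clock 800 milliseconds ahead of the router reporting the BGP drop, and the session drops less than 800 milliseconds after the interface fails, the switch stamps the interface-down event later than the router stamps the session drop. The timestamps then suggest the session dropped before the interface failed. The postmortem analyst sees causality reversed and investigates the wrong root cause.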
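
The reversal can be reproduced with two events and one skewed clock. In this sketch the switch's timestamps land 800 ms late relative to the router's, and the BGP drop follows the link failure by a hypothetical 300 ms. Reversal needs both conditions: the device reporting the cause stamps late, and the real gap is smaller than the skew. Device names, the `Event` type, and the 300 ms gap are all illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    device: str
    message: str
    stamped: datetime  # timestamp as written by the device's own clock


def ordered(events, clock_error=None):
    """Sort events by timestamp after subtracting each device's known clock error."""
    clock_error = clock_error or {}
    return sorted(
        events,
        key=lambda e: e.stamped - clock_error.get(e.device, timedelta(0)),
    )


t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # true link-down time
skew = {"switch1": timedelta(milliseconds=800)}            # switch clock minus true time

events = [
    Event("switch1", "interface down", t0 + skew["switch1"]),
    Event("router1", "BGP session dropped", t0 + timedelta(milliseconds=300)),
]

as_logged = [e.message for e in ordered(events)]
corrected = [e.message for e in ordered(events, skew)]
print(as_logged)
print(corrected)
```

Sorting by the raw timestamps puts the BGP drop first. Sorting after removing the switch's known error restores the real order, which is only possible if the offset was measured and recorded at the time.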

The same mechanism breaks flow correlation across sites. Matching the same 5-tuple at two collectors requires a temporal join window. If the collectors’ clocks disagree by even a few hundred milliseconds, the join window misses matching records. The flow appears to traverse one edge but not the other, and multi-hop path reconstruction returns inconsistent results.
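
A temporal join of this kind can be sketched as follows. Records are simplified to `(five_tuple, timestamp_ms)` pairs, the addresses are documentation ranges, and the 100 ms window and 300 ms skew are illustrative values rather than recommendations.

```python
def join_flows(site_a, site_b, window_ms, skew_b_ms=0.0):
    """Pair records that share a 5-tuple and fall within window_ms of each other.

    skew_b_ms is collector B's clock error relative to collector A; it is
    subtracted from B's timestamps before the comparison.
    """
    pairs = []
    for key_a, ts_a in site_a:
        for key_b, ts_b in site_b:
            if key_a == key_b and abs(ts_a - (ts_b - skew_b_ms)) <= window_ms:
                pairs.append((key_a, ts_a, ts_b))
    return pairs


flow = ("192.0.2.10", "198.51.100.20", 40000, 443, "tcp")
site_a = [(flow, 1_000.0)]          # seen at edge A at t = 1000 ms
site_b = [(flow, 1_020.0 + 300.0)]  # seen 20 ms later, stamped by a clock 300 ms ahead

missed = join_flows(site_a, site_b, window_ms=100)
matched = join_flows(site_a, site_b, window_ms=100, skew_b_ms=300.0)
print(len(missed), len(matched))
```

Widening the window to 400 ms would also recover this pair, but it would admit unrelated flows that reuse the same 5-tuple. Correcting the clock is the narrower fix.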

Larger offsets also break adjacent systems. A five-minute offset can put the device clock outside the validity window of a freshly issued TLS certificate or a short-lived OCSP response, breaking automated certificate renewal, CRL fetching, and OCSP stapling. The operator sees a PKI enrollment failure, not a clock problem. License-window calculations and time-based access controls have the same dependency on accurate device time.

Where it shows up in production

| Scenario | What it looks like | Root cause |
| --- | --- | --- |
| Post-power-outage oscillator drift | Large timestamp discontinuity in logs after device recovery | Local oscillator drifted during extended power loss; device accepts a large step correction on NTP re-sync |
| VM cold-start time drift | Virtualized appliance exports stale timestamps after live migration or snapshot restore | Guest NTP daemon has not re-converged; exported telemetry is already wrong before correction |
| Juniper control-plane filter blocking NTP | show ntp associations shows no reachable peers despite NTP being configured | Loopback input filter blocks self-originated UDP 123 replies (lo0 to lo0) |
| Cisco IOS XR authentication key gap | Device loses NTP synchronization silently; no trap, no syslog escalation | NTP authentication keys expired without overlapping replacement keys configured |
| Cross-collector timestamp divergence | Events from devices behind different collectors cannot be correlated | Collectors have independent NTP sources with offset of seconds between them |
| Drift file loss after reimaging | Cold-start frequency hunt lasting minutes to hours | Missing drift file forces the daemon to relearn oscillator frequency error from scratch |

The cross-collector divergence scenario deserves particular attention. When multiple SIEM collectors or log aggregators have independent NTP sources, a three-second offset between two collectors makes it impossible to correlate events from devices behind each collector. This is a physical-timing problem masquerading as a software issue. The fix is NTP consistency across the collector fleet, not SIEM normalization or wider join windows.

Why device-side NTP goes unmonitored

Operators monitor their own infrastructure’s NTP but not the NTP state of the devices they poll. The collector’s chronyc tracking output may be on a dashboard, but the router’s stratum, clock offset, and peer reachability are not. Several factors compound the gap:

Limited SNMP support. Many basic switches do not implement hrSystemDate (.1.3.6.1.2.1.25.1.2.0), the standard HOST-RESOURCES-MIB OID for device time. Without it, there is no programmatic way to check the clock without a CLI scrape. Syslog header timestamps can serve as a proxy signal, but parsing them requires deliberate instrumentation that most teams have not built.
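
Where hrSystemDate is available, the value comes back as an SNMP DateAndTime octet string of 8 or 11 bytes (RFC 2579). The sketch below decodes that string and compares it to a reference time. It assumes the raw bytes have already been fetched by whatever SNMP library is in use; the function names are illustrative. The 8-byte form carries no UTC offset, so the result is a naive local time in that case.

```python
import struct
from datetime import datetime, timedelta, timezone


def parse_date_and_time(raw: bytes) -> datetime:
    """Decode an SNMP DateAndTime value (8 or 11 octets) into a datetime."""
    if len(raw) not in (8, 11):
        raise ValueError(f"unexpected DateAndTime length: {len(raw)}")
    year, month, day, hour, minute, second, deci = struct.unpack(">HBBBBBB", raw[:8])
    tz = None
    if len(raw) == 11:
        sign = -1 if raw[8:9] == b"-" else 1
        tz = timezone(sign * timedelta(hours=raw[9], minutes=raw[10]))
    # The seconds field may be 60 during a leap second; datetime cannot hold that.
    return datetime(year, month, day, hour, minute, min(second, 59),
                    deci * 100_000, tzinfo=tz)


def offset_ms(device_time: datetime, reference: datetime) -> float:
    """Signed device clock error in milliseconds (positive means device is ahead)."""
    return (device_time - reference).total_seconds() * 1000.0


raw = bytes([0x07, 0xE8, 3, 15, 12, 30, 45, 5]) + b"+" + bytes([0, 0])
device_time = parse_date_and_time(raw)
reference = datetime(2024, 3, 15, 12, 30, 45, tzinfo=timezone.utc)
print(device_time.isoformat(), offset_ms(device_time, reference))
```

DateAndTime resolves only to tenths of a second, and the poll itself adds round-trip latency, so this check can confirm offsets of a few hundred milliseconds or more but cannot measure anything finer.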

Set-and-forget culture. NTP configuration is written once, during initial device provisioning, and never revisited. The assumption is that if NTP was configured, it works. In practice, NTP associations fail silently when peers become unreachable, when authentication keys expire, when control-plane filters are applied during a security hardening pass, or when the management VRF changes.

The drift file dependency. Both ntpd and chrony record measured oscillator frequency error in a drift file: /var/lib/ntp/drift for ntpd, /var/lib/chrony/drift for chrony on most Linux distributions. After a reboot or reimage, this file allows faster lock-in by restoring the learned frequency correction. Without it, the daemon enters a cold-start frequency hunt that can last minutes to hours, during which every exported timestamp is wrong. Loss of this file is not monitored by default.

The 2036 rollover risk. NTP timestamps use a 32-bit seconds field with epoch 1 January 1900. The first rollover occurs at 06:28:16 UTC on 7 February 2036. NTPv4 includes mechanisms to handle this, but older ntpd implementations and 32-bit embedded network device firmware may lose synchronization. Devices running unsupported firmware are a latent risk that will surface as a fleet-wide synchronization failure rather than a single-device drift.
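
The rollover date follows from the field width: 2^32 seconds after the 1900 epoch. A short sketch of that arithmetic, with illustrative function names, also shows what an implementation that ignores the era number would conclude from a wrapped timestamp.

```python
from datetime import datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)
ERA_SECONDS = 2 ** 32  # range of the 32-bit seconds field


def era_start(era: int) -> datetime:
    """First instant of the given NTP era."""
    return NTP_EPOCH + timedelta(seconds=era * ERA_SECONDS)


def to_datetime(seconds_field: int, era: int = 0) -> datetime:
    """Interpret a 32-bit NTP seconds value within the given era."""
    return era_start(era) + timedelta(seconds=seconds_field)


print(era_start(1).isoformat())           # moment the seconds field wraps to zero
print(to_datetime(60, era=1).isoformat()) # one minute after rollover, era-aware
print(to_datetime(60, era=0).isoformat()) # same bits read by era-unaware code
```

The last line is the failure mode: era-unaware code reads a timestamp taken one minute after rollover as one minute past midnight on 1 January 1900.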

Signals to watch in production

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Cross-collector NTP offset | Correlation between collectors breaks when their clocks diverge | Offset over 100ms sustained (TICKET); over 1s (PAGE) |
| Device-side clock via hrSystemDate | Device timestamps corrupt all downstream telemetry | Offset over 100ms from authoritative time |
| NTP stratum on monitored devices | High stratum means the device is far from its reference and more susceptible to drift | Stratum 16 (unsynchronized); any increase from baseline |
| NTP peer reachability via CLI | Loss of all peers means the clock is free-running on its local oscillator | Reachability octal of 0 in show ntp associations |
| Clock offset in show ntp status | Direct measurement of how far the device clock has drifted | Offset over 100ms warrants investigation |
| Syslog header vs. collector receive time | Proxy for device clock skew when NTP MIBs are unavailable | Consistent delta over 500ms between header timestamp and ingest time |
| chronyc tracking / ntpq -p on collectors | Collector drift corrupts every telemetry stream it processes | Offset field over 100ms |
| Post-reboot NTP re-sync state | Devices frequently lose sync after reboot and must re-converge | sysUpTime reset without corresponding NTP synchronization within minutes |
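
The reachability value in the table is an 8-bit shift register printed in octal: each poll shifts it left by one and sets the lowest bit if a reply arrived. A minimal decoder, assuming the octal string has already been scraped from CLI output; the function names are illustrative.

```python
def decode_reach(reach_octal: str) -> list[bool]:
    """Return the last eight poll results, most recent first."""
    value = int(reach_octal, 8)
    if not 0 <= value <= 0o377:
        raise ValueError(f"reach value out of range: {reach_octal}")
    return [bool((value >> bit) & 1) for bit in range(8)]


def is_free_running(reach_values: list[str]) -> bool:
    """True when no configured peer has answered any of its last eight polls."""
    return all(not any(decode_reach(r)) for r in reach_values)


print(decode_reach("377"))            # every recent poll answered
print(decode_reach("376"))            # only the most recent poll missed
print(is_free_running(["0", "0"]))    # both peers silent
```

A value that decays from 377 through 376, 374, 370 toward 0 shows a peer that stopped answering, which gives a few polling intervals of warning before the register reaches zero.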

The last signal is easy to add to an existing monitoring pipeline. If you already track sysUpTime for reboot detection, correlate a reset event with an NTP offset check immediately after. A device that has been up for less than five minutes and shows an offset over 100 milliseconds is still converging. A device that has been up for an hour and still shows the same offset has a broken NTP association.
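
That rule can be written as a small classifier. The five-minute, one-hour, and 100 ms values come from the paragraph above; the function name, the state labels, and the "watch" state for the period in between are assumptions. sysUpTime is reported in hundredths of a second, so the sketch converts it first.

```python
def post_reboot_state(sysuptime_ticks: int, offset_ms: float,
                      threshold_ms: float = 100.0,
                      grace_s: float = 300.0,
                      broken_after_s: float = 3600.0) -> str:
    """Classify NTP state from sysUpTime (TimeTicks) and measured clock offset."""
    uptime_s = sysuptime_ticks / 100.0  # TimeTicks are hundredths of a second
    if abs(offset_ms) <= threshold_ms:
        return "synchronized"
    if uptime_s < grace_s:
        return "converging"          # recently rebooted, still re-syncing
    if uptime_s >= broken_after_s:
        return "broken association"  # has had ample time and never converged
    return "watch"                   # assumption: recheck before escalating


print(post_reboot_state(12_000, 250.0))    # 2 minutes up
print(post_reboot_state(400_000, 250.0))   # over an hour up
print(post_reboot_state(400_000, 3.0))
```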

How Netdata helps

  • Collect chrony and ntpq metrics on hosts running the Netdata agent, giving continuous visibility into collector clock offset and peer state.
  • Poll hrSystemDate (.1.3.6.1.2.1.25.1.2.0) on monitored network devices through SNMP data collection to surface device-side clock drift that would otherwise remain invisible until a postmortem fails.
  • Correlate timestamp discontinuities with sysUpTime resets to distinguish a device reboot (where a clock reset is expected) from an NTP source change (where a step correction may indicate a broken association).
  • Alert on NTP offset thresholds: 100ms for warning, 1s for page-level escalation.
  • Apply anomaly detection to clock offset to catch gradual slew that accumulates below static thresholds before a large step correction produces a visible timestamp jump.