NTP drift on network devices: the silent killer of event correlation

Clock drift on network devices produces no visible symptom. The device stays up, interfaces carry traffic, BGP sessions remain Established, SNMP keeps responding. The damage surfaces hours or days later, in a postmortem where two devices’ timestamps disagree by hundreds of milliseconds and the analyst cannot reconstruct the event sequence. Every cross-device correlation in the monitoring stack depends on accurate, monotonic time across every collector and every polled device.

The telemetry itself looks fine. Syslog messages arrive with timestamps. Flow records carry timestamps. SNMP traps for BGP state changes are time-stamped. The problem is that those timestamps are wrong relative to each other, and nothing flags the discrepancy. A device 200 milliseconds off its peers produces records that technically arrive but correlate poorly with records from correctly synchronized devices.

What NTP drift is and why it breaks correlation

NTP drift is the gradual divergence of a device’s system clock from authoritative time. Under healthy conditions, NTP maintains synchronization to within approximately one millisecond on a LAN. The protocol slews rather than steps the clock for small offsets: it adjusts the clock rate gradually instead of jumping the time. When the offset exceeds a threshold (commonly 128ms in ntpd), the daemon steps the clock, producing a visible discontinuity in logs and flow records. Because slewing leaves no discontinuity, an offset of more than a hundred milliseconds can persist below the step threshold without any correction becoming visible to an operator.

NTP operates exclusively over UDP port 123. There is no TCP fallback. If a firewall, control-plane filter, or routing policy blocks UDP 123 in either direction, the device stops synchronizing and its clock free-runs on the local oscillator. The quality of that oscillator determines how fast the clock drifts. A typical TCXO drifts 1-2 ppm (roughly 86-172 ms/day); a standard crystal can be 10-50 ppm (0.8-4.3 seconds/day).
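The ppm figures convert to daily drift with simple arithmetic. The sketch below assumes a constant frequency error, which real oscillators do not have (temperature and aging move it), so treat the output as an order-of-magnitude estimate. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def drift_ms_per_day(ppm: float) -> float:
    """Drift in milliseconds per day for a constant frequency error in ppm."""
    return ppm * 1e-6 * SECONDS_PER_DAY * 1000


def hours_until_offset(ppm: float, offset_ms: float) -> float:
    """Hours of free-running needed to accumulate offset_ms at a constant ppm error."""
    return offset_ms / drift_ms_per_day(ppm) * 24


if __name__ == "__main__":
    for label, ppm in [("TCXO", 1), ("TCXO", 2), ("crystal", 10), ("crystal", 50)]:
        print(
            f"{label:8s} {ppm:>3} ppm: {drift_ms_per_day(ppm):7.1f} ms/day, "
            f"reaches 100 ms after {hours_until_offset(ppm, 100):5.1f} h"
        )
```

At 10 ppm a free-running clock crosses 100 milliseconds in under three hours, so a blocked UDP 123 path becomes a correlation problem within a single shift.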

The threshold for operational impact is lower than most operators expect. The NPM playbook calls for a TICKET at 100 milliseconds of NTP offset and a PAGE at 1 second. Distributed databases enforce maximum clock offset limits (CockroachDB defaults to 500ms). Even 200 milliseconds between two devices on the same link flap is enough to break flow-to-syslog correlation in a postmortem reconstruction.
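The two thresholds can be expressed as a small classifier. This is a sketch, not a prescribed implementation: treating the thresholds as inclusive, and treating "sustained" as "every sample in the window meets the threshold", are assumptions made here for illustration.

```python
from typing import Sequence

TICKET_MS = 100.0   # from the text: TICKET at 100 ms of offset
PAGE_MS = 1000.0    # from the text: PAGE at 1 s of offset


def classify_offset(offset_ms: float) -> str:
    """Map a single offset sample to a severity; the sign of the offset is ignored."""
    magnitude = abs(offset_ms)
    if magnitude >= PAGE_MS:
        return "PAGE"
    if magnitude >= TICKET_MS:
        return "TICKET"
    return "OK"


def classify_sustained(samples_ms: Sequence[float]) -> str:
    """Severity that every sample in the window meets (assumed meaning of 'sustained')."""
    if not samples_ms:
        return "OK"
    # The smallest magnitude in the window is the level all samples reach.
    return classify_offset(min(abs(s) for s in samples_ms))


if __name__ == "__main__":
    print(classify_offset(-250))                  # TICKET
    print(classify_sustained([120, -150, 90]))    # OK: one sample dipped below 100 ms
    print(classify_sustained([1500, -1200]))      # PAGE
```

Classifying on the window minimum keeps a single noisy sample from escalating, at the cost of reacting one window later.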

How clock skew corrupts downstream telemetry

Every event that leaves a network device carries a timestamp derived from the device’s local clock. Syslog headers, NetFlow v9 and IPFIX records, sFlow samples, and SNMP trap varbinds all embed the device’s notion of the current time. When two devices disagree about what time it is, their events cannot be reliably ordered, and the ordering errors compound across every correlation the monitoring platform attempts.

flowchart TD
    A["Device oscillator drifts<br/>without NTP correction"] --> B["Syslog, flow, trap<br/>timestamps diverge from peers"]
    B --> C["Collector ingests data<br/>with skewed timestamps"]
    C --> D{"Cross-device<br/>correlation"}
    D -->|"Clocks aligned"| E["Events ordered correctly<br/>Postmortem succeeds"]
    D -->|"Offset over 200ms"| F["Causality reversed<br/>Events look unrelated"]
    F --> G["Root cause unidentified<br/>Postmortem fails"]

Consider a link flap that triggers a BGP session reset. The expected sequence is: the interface transitions to down, the BGP session drops (within milliseconds where fast fallover or BFD is in use, otherwise when the hold timer expires), and the route withdrawal propagates. If the session drops almost immediately and the switch reporting the interface-down event has a clock 800 milliseconds ahead of the router reporting the BGP drop, the interface event is stamped later than the session drop, and the timestamps suggest the session dropped before the interface failed. The postmortem analyst sees causality reversed and investigates the wrong root cause.
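A minimal sketch of how a relative clock error reorders two events. The device names and the 50-millisecond gap between the events are hypothetical. A positive error means the device clock runs ahead of reference time. The order flips only when the device reporting the earlier event stamps it later than its peer does.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    device: str
    name: str
    true_time_ms: int  # when the event really happened, on a shared reference clock


def reported_order(events: List[Event], clock_error_ms: Dict[str, int]) -> List[str]:
    """Order events by the timestamp each device would actually stamp on them."""
    stamped = [
        (e.true_time_ms + clock_error_ms.get(e.device, 0), e.name) for e in events
    ]
    return [name for _, name in sorted(stamped)]


events = [
    Event("switch-a", "interface down", 0),
    Event("router-b", "BGP session down", 50),  # hypothetical: 50 ms after the flap
]

if __name__ == "__main__":
    print(reported_order(events, {}))                  # clocks aligned
    print(reported_order(events, {"switch-a": 800}))   # switch stamps 800 ms late
    print(reported_order(events, {"switch-a": -800}))  # switch stamps 800 ms early
```

Only the second case reverses the apparent order. The third keeps the order but inflates the gap between the events to 850 milliseconds, which misleads a postmortem in its own way.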

The same mechanism breaks flow correlation across sites. Matching the same 5-tuple at two collectors requires a temporal join window. If the collectors’ clocks disagree by even a few hundred milliseconds, the join window misses matching records. The flow appears to traverse one edge but not the other, and multi-hop path reconstruction returns inconsistent results.
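The temporal join can be sketched as follows. The record layout (a timestamp in milliseconds plus a 5-tuple), the addresses, the 20-millisecond transit time, and the 100-millisecond window are all assumptions chosen for illustration, not values from any particular collector.

```python
from collections import defaultdict
from typing import List, Tuple

FiveTuple = Tuple[str, str, int, int, str]
Record = Tuple[int, FiveTuple]  # (timestamp_ms, 5-tuple)


def join_flows(site_a: List[Record], site_b: List[Record], window_ms: int):
    """Pair records that share a 5-tuple and whose timestamps differ by at most window_ms."""
    by_key = defaultdict(list)
    for ts, key in site_b:
        by_key[key].append(ts)

    matches = []
    for ts_a, key in site_a:
        for ts_b in by_key.get(key, []):
            if abs(ts_a - ts_b) <= window_ms:
                matches.append((key, ts_a, ts_b))
                break  # first match in the window is enough for this sketch
    return matches


flow = ("192.0.2.10", "198.51.100.20", 40000, 443, "tcp")
site_a = [(1_000, flow)]
site_b_aligned = [(1_020, flow)]        # 20 ms transit, clocks aligned
site_b_skewed = [(1_020 + 300, flow)]   # same record, collector clock 300 ms ahead

if __name__ == "__main__":
    print(len(join_flows(site_a, site_b_aligned, window_ms=100)))  # 1
    print(len(join_flows(site_a, site_b_skewed, window_ms=100)))   # 0
```

Widening the window would recover the skewed match, but it also admits unrelated flows that reuse the same 5-tuple, which is why the durable fix is clock alignment.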

Larger offsets also break adjacent systems. An offset of five minutes or more can put the device clock outside the validity window of a newly issued TLS certificate or a short-lived OCSP response, breaking automated certificate renewal, CRL fetching, and OCSP stapling. The operator sees a PKI enrollment failure, not a clock problem. License-window calculations and time-based access controls have the same dependency on accurate device time.

Where it shows up in production

Scenario	What it looks like	Root cause
Post-power-outage oscillator drift	Large timestamp discontinuity in logs after device recovery	Local oscillator drifted during extended power loss; device accepts a large step correction on NTP re-sync
VM cold-start time drift	Virtualized appliance exports stale timestamps after live migration or snapshot restore	Guest NTP daemon has not re-converged; exported telemetry is already wrong before correction
Juniper control-plane filter blocking NTP	`show ntp associations` shows no reachable peers despite NTP being configured	Loopback input filter blocks self-originated UDP 123 replies (lo0 to lo0)
Cisco IOS XR authentication key gap	Device loses NTP synchronization silently; no trap, no syslog escalation	NTP authentication keys expired without overlapping replacement keys configured
Cross-collector timestamp divergence	Events from devices behind different collectors cannot be correlated	Collectors have independent NTP sources with offset of seconds between them
Drift file loss after reimaging	Cold-start frequency hunt lasting minutes to hours	Missing drift file forces the daemon to relearn oscillator frequency error from scratch

The cross-collector divergence scenario deserves particular attention. When multiple SIEM collectors or log aggregators have independent NTP sources, a three-second offset between two collectors makes it impossible to correlate events from devices behind each collector. This is a physical-timing problem masquerading as a software issue. The fix is NTP consistency across the collector fleet, not SIEM normalization or wider join windows.

Why device-side NTP goes unmonitored

Operators monitor their own infrastructure’s NTP but not the NTP state of the devices they poll. The collector’s chronyc tracking output may be on a dashboard, but the router’s stratum, clock offset, and peer reachability are not. Several factors compound the gap:

Limited SNMP support. Many basic switches do not implement hrSystemDate (.1.3.6.1.2.1.25.1.2.0), the standard HOST-RESOURCES-MIB OID for device time. Without it, there is no programmatic way to check the clock without a CLI scrape. Syslog header timestamps can serve as a proxy signal, but parsing them requires deliberate instrumentation that most teams have not built.
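Where hrSystemDate is available, the value arrives as an SNMP DateAndTime octet string (RFC 2579) of 8 or 11 bytes. The sketch below decodes it and computes an offset against a reference time. It assumes the raw bytes have already been fetched by whatever SNMP library is in use; the function names are illustrative.

```python
import struct
from datetime import datetime, timedelta, timezone


def decode_date_and_time(raw: bytes) -> datetime:
    """Decode an SNMP DateAndTime value (8 octets local time, 11 octets with UTC offset)."""
    if len(raw) not in (8, 11):
        raise ValueError(f"DateAndTime must be 8 or 11 octets, got {len(raw)}")
    year, month, day, hour, minute, second, deci = struct.unpack(">HBBBBBB", raw[:8])
    tz = None
    if len(raw) == 11:
        sign = -1 if raw[8:9] == b"-" else 1
        tz = timezone(sign * timedelta(hours=raw[9], minutes=raw[10]))
    # The seconds field may be 60 during a leap second; clamp it for datetime.
    return datetime(year, month, day, hour, minute, min(second, 59),
                    deci * 100_000, tzinfo=tz)


def offset_ms(device_time: datetime, reference: datetime) -> float:
    """Device clock minus reference, in ms. Both values must be timezone-aware."""
    return (device_time - reference).total_seconds() * 1000


if __name__ == "__main__":
    # Example value from RFC 2579: 1992-5-26,13:30:15.0,-4:0
    raw = struct.pack(">HBBBBBB", 1992, 5, 26, 13, 30, 15, 0) + b"-" + bytes([4, 0])
    print(decode_date_and_time(raw).isoformat())
```

DateAndTime carries only deci-second resolution, and the poll itself adds round-trip delay, so this method is suited to catching offsets of several hundred milliseconds or more rather than measuring near the 100-millisecond threshold. An 8-octet value has no UTC offset and decodes to a naive datetime, which must be given a timezone before it can be compared.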

Set-and-forget culture. NTP configuration is written once, during initial device provisioning, and never revisited. The assumption is that if NTP was configured, it works. In practice, NTP associations fail silently when peers become unreachable, when authentication keys expire, when control-plane filters are applied during a security hardening pass, or when the management VRF changes.

The drift file dependency. Both ntpd and chrony record measured oscillator frequency error in a drift file: typically /var/lib/ntp/drift for ntpd and /var/lib/chrony/drift for chrony, though the exact path and file name vary by distribution. After a reboot or reimage, this file allows faster lock-in by restoring the learned frequency correction. Without it, the daemon enters a cold-start frequency hunt that can last minutes to hours, during which exported timestamps are less accurate than in steady state. Loss of this file is not monitored by default.
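A check for a missing or unreadable drift file can be small. This sketch assumes the first whitespace-separated field of the file is the frequency error in ppm, which matches the single-value ntpd format and the two-value chrony format as far as I know. The candidate paths are the ones named above and may differ on a given system.

```python
from pathlib import Path
from typing import Optional

# Paths named in the text; adjust for the distribution in use.
CANDIDATE_DRIFT_FILES = ["/var/lib/ntp/drift", "/var/lib/chrony/drift"]


def read_drift_ppm(path: str) -> Optional[float]:
    """Return the stored frequency error in ppm, or None if missing or unparseable."""
    try:
        fields = Path(path).read_text().split()
        return float(fields[0])
    except (OSError, IndexError, ValueError):
        return None


if __name__ == "__main__":
    for candidate in CANDIDATE_DRIFT_FILES:
        ppm = read_drift_ppm(candidate)
        status = "missing or unreadable" if ppm is None else f"{ppm:+.3f} ppm"
        print(f"{candidate}: {status}")
```

Running a check like this after every reimage flags hosts that will start with a cold frequency hunt.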

The 2036 rollover risk. NTP timestamps use a 32-bit seconds field with epoch 1 January 1900. The first rollover occurs at 06:28:16 UTC on 7 February 2036. NTPv4 includes mechanisms to handle this, but older ntpd implementations and 32-bit embedded network device firmware may lose synchronization. Devices running unsupported firmware are a latent risk that will surface as a fleet-wide synchronization failure rather than a single-device drift.
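The rollover date follows directly from the field width and the epoch, as this short check shows. The function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)
ERA_SECONDS = 2 ** 32  # span of a 32-bit seconds field


def era_start(era: int) -> datetime:
    """UTC instant at which the given NTP era begins (era 0 starts at the epoch)."""
    return NTP_EPOCH + timedelta(seconds=era * ERA_SECONDS)


if __name__ == "__main__":
    print(era_start(1).isoformat())  # 2036-02-07T06:28:16+00:00
```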

Signals to watch in production

Signal	Why it matters	Warning sign
Cross-collector NTP offset	Correlation between collectors breaks when their clocks diverge	Offset over 100ms sustained (TICKET); over 1s (PAGE)
Device-side clock via `hrSystemDate`	Device timestamps corrupt all downstream telemetry	Offset over 100ms from authoritative time
NTP stratum on monitored devices	High stratum means the device is far from its reference and more susceptible to drift	Stratum 16 (unsynchronized); any increase from baseline
NTP peer reachability via CLI	Loss of all peers means the clock is free-running on its local oscillator	Reachability octal of 0 in `show ntp associations`
Clock offset in `show ntp status`	Direct measurement of how far the device clock has drifted	Offset over 100ms warrants investigation
Syslog header vs. collector receive time	Proxy for device clock skew when NTP MIBs are unavailable	Consistent delta over 500ms between header timestamp and ingest time
`chronyc tracking` / `ntpq -p` on collectors	Collector drift corrupts every telemetry stream it processes	Offset field over 100ms
Post-reboot NTP re-sync state	Devices frequently lose sync after reboot and must re-converge	`sysUpTime` reset without corresponding NTP synchronization within minutes

The last signal is easy to add to an existing monitoring pipeline. If you already track sysUpTime for reboot detection, correlate a reset event with an NTP offset check immediately after. A device that has been up for less than five minutes and shows an offset over 100 milliseconds is still converging. A device that has been up for an hour and still shows the same offset has a broken NTP association.
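The uptime-versus-offset rule can be sketched as a small function. The five-minute and one-hour boundaries come from the paragraph above. The intermediate "watch" state for the period in between, and the state names themselves, are assumptions added here. sysUpTime is reported in TimeTicks, which are hundredths of a second.

```python
def timeticks_to_seconds(timeticks: int) -> float:
    """Convert an SNMP TimeTicks value (hundredths of a second) to seconds."""
    return timeticks / 100.0


def ntp_state(
    uptime_s: float,
    offset_ms: float,
    threshold_ms: float = 100.0,
    converge_s: float = 300.0,   # five minutes, from the text
    broken_s: float = 3600.0,    # one hour, from the text
) -> str:
    """Classify a device from its uptime and current NTP offset."""
    if abs(offset_ms) <= threshold_ms:
        return "synchronized"
    if uptime_s < converge_s:
        return "converging"   # recently rebooted, still expected to settle
    if uptime_s < broken_s:
        return "watch"        # assumed intermediate state: re-check before escalating
    return "broken"           # long uptime and still off: NTP association suspect


if __name__ == "__main__":
    print(ntp_state(timeticks_to_seconds(12_000), 250))   # converging
    print(ntp_state(timeticks_to_seconds(360_000), 250))  # broken
```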

How Netdata helps

Collect chrony and ntpq metrics on hosts running the Netdata agent, giving continuous visibility into collector clock offset and peer state.
Poll hrSystemDate (.1.3.6.1.2.1.25.1.2.0) on monitored network devices through SNMP data collection to surface device-side clock drift that would otherwise remain invisible until a postmortem fails.
Correlate timestamp discontinuities with sysUpTime resets to distinguish a device reboot (where a clock reset is expected) from an NTP source change (where a step correction may indicate a broken association).
Alert on NTP offset thresholds: 100ms for warning, 1s for page-level escalation.
Apply anomaly detection to clock offset to catch gradual slew that accumulates below static thresholds before a large step correction produces a visible timestamp jump.

