Audit log gaps: detecting syslog/trap tampering or loss

An audit log gap is any period where expected syslog messages or SNMP traps from a network device fail to arrive at the collector. UDP syslog on port 514 and SNMP traps on port 162 are fire-and-forget transports with no delivery guarantee. The kernel silently drops datagrams when socket buffers fill, and the application layer never sees the loss. TCP syslog can stall under collector backpressure. Most gaps are operational: network loss, device-side buffer overflow, or logging subsystem failure. The difficulty is distinguishing those from deliberate log suppression after compromise.

Tamper-evident logging requires cryptographic signing, and few network device platforms support this natively. Detection relies on indirect signals: per-device rate analysis, heartbeat monitoring, sequence number analysis where available, and cross-source correlation. The goal is to build enough independent evidence to either rule out operational causes or escalate to a security investigation.

What constitutes a gap

A gap is a statistically significant departure from the expected cadence for a specific device. A core switch generating 200 syslog messages per hour has a different gap signature than a distribution switch generating 5 per hour. Detection requires per-device baselines, not a single global threshold.
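
One way to turn a per-device baseline into a silence threshold is to ask how long a device must stay quiet before the silence is unlikely to be chance. The sketch below assumes messages arrive as a Poisson process at the baseline rate. Real syslog traffic is bursty and only approximates this, so treat the result as a starting point. The function name `silence_threshold_minutes` and the 0.001 false-alarm probability are illustrative choices, not values from this guide.

```shell
#!/bin/sh
# Under a Poisson assumption, P(no messages in t hours) = exp(-rate * t).
# Solving exp(-rate * t) = p for t gives t = -ln(p) / rate.
# $1 = baseline messages per hour, $2 = acceptable false-alarm probability
silence_threshold_minutes() {
    LC_ALL=C awk -v rate="$1" -v p="$2" \
        'BEGIN { printf "%.1f\n", (-log(p) / rate) * 60 }'
}

silence_threshold_minutes 200 0.001   # core switch: prints 2.1
silence_threshold_minutes 5 0.001     # distribution switch: prints 82.9
```

Under these assumptions, two minutes of silence is already significant for the 200-per-hour device, while the 5-per-hour device goes 30 minutes without a message roughly 8% of the time by chance alone. This is why a single global threshold misfires.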

Recommended thresholds for gap detection:

Gap exceeding 2x the poll interval from any device generates a TICKET.
Gap exceeding 1 hour from a critical device escalates to PAGE.
Heartbeat missing for more than 5 minutes from a device that should be sending one is PAGE.
Syslog rate dropping to 0 from a device for more than 30 minutes, while other devices still report, indicates the device is isolated or its syslog agent failed.
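
The heartbeat threshold above can be checked with a few lines of shell. This is a minimal sketch that assumes an earlier step has already produced one line per device in the form `<device> <epoch seconds of last heartbeat>`. That input format, the function name `stale_heartbeats`, and the sample device names are hypothetical.

```shell
#!/bin/sh
# $1 = current time in epoch seconds, $2 = maximum heartbeat age in seconds
# (defaults to 300, the 5-minute PAGE threshold). Reads the table on stdin.
stale_heartbeats() {
    awk -v now="$1" -v max="${2:-300}" \
        'now - $2 > max { printf "%s silent for %ds\n", $1, now - $2 }'
}

# Sample data: core1 last seen 700 s ago, edge2 last seen 50 s ago.
printf '%s\n' 'core1 1000' 'edge2 1650' | stale_heartbeats 1700
# prints: core1 silent for 700s
```

In practice the first argument would be `"$(date +%s)"`. Devices that have never sent a heartbeat do not appear in the input at all, so the list of expected devices has to be checked separately.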

After detection, the diagnostic questions are: was the device reachable during the gap, was the gap isolated or fleet-wide, did the collector process stay up, and did anything change on the device or in the transport path during the window.

The hardest case is the gap with no obvious cause. ICMP is healthy. SNMP responds. Other devices on the same collector report normally. The device simply stopped sending syslog or traps for a period and then resumed. This pattern requires checking UDP buffer drops, collector health, device-side logging configuration, and config-change logs.

flowchart TD
    A["Gap detected in syslog/trap stream"] --> B{"Multiple devices affected?"}
    B -->|"Yes"| C["Collector-side or transport issue"]
    B -->|"No"| D{"Device reachable via ICMP and SNMP?"}
    D -->|"No"| E["Network partition or device outage"]
    D -->|"Yes"| F{"Udp_RcvbufErrors incrementing?"}
    F -->|"Yes"| G["Socket buffer overflow"]
    F -->|"No"| H{"Config changes during gap window?"}
    H -->|"Yes"| I["Investigate: unauthorized or accidental change"]
    H -->|"No"| J["Check device logging subsystem and buffer"]

Common causes

Cause	What it looks like	First thing to check
UDP socket buffer overflow	Rate drops during high-traffic windows; ICMP to device healthy; resumes after burst subsides	`Udp_RcvbufErrors` in `/proc/net/snmp`
Device-side log buffer overflow	Gap from one device; others normal; device was generating high log volume before the gap	Device logging buffer config; severity distribution before gap
Collector outage or restart	Gap from all devices simultaneously; collector process restarted or disk filled	Collector process state; disk queue persistence config
Transport partition (firewall or ACL change)	Gap from devices behind a specific network path; ICMP may or may not be affected	ICMP and SNMP to affected devices; firewall rule changes
Logging subsystem failure on device	Device reachable via ICMP and SNMP but produces no syslog or traps	Device logging config; local log buffer on device
Deliberate suppression (post-compromise)	Gap during or after incident window; config changes around same time; missing heartbeats with device reachable	Config change logs; AAA logs; sequence number analysis
Time skew corrupting timestamps	Data exists but appears in wrong time bucket in the time-series view	NTP offset on device; syslog header timestamps

Quick checks

# Check system-wide UDP socket buffer drops (syslog 514, traps 162, flows all use UDP)
# nstat names this counter UdpRcvbufErrors (no underscore); values are cumulative since boot
grep '^Udp:' /proc/net/snmp
nstat -az UdpRcvbufErrors

# Check syslog listener buffer fill and limits
ss -lun '( sport = :514 )' -m

# Check trap listener buffer fill
ss -lun '( sport = :162 )' -m

# Count messages per minute across all device logs; minutes missing from the output are gaps
# Assumes each line starts with an ISO 8601 timestamp; adjust field parsing for your format
awk -F'[T:]' '{print $1"T"$2":"$3}' /var/log/network-devices/*.log | sort | uniq -c

# Verify syslog traffic is arriving on the wire
tcpdump -i eth0 -nn 'udp port 514' -c 100

# Verify trap traffic is arriving on the wire
tcpdump -i eth0 -nn 'udp port 162' -c 100

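# Confirm the device answers SNMP now and read sysUpTime
# An uptime shorter than the time elapsed since the gap means the device restarted in or after the window
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.1.3.0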

# Check for config changes during the gap window (Cisco example)
grep -E '%SYS-5-CONFIG_I|configured from' /var/log/network-devices/*.log

# Check AAA logs for authentication events during the gap
grep -E 'fail|denied' /var/log/tacacs.log
grep -E 'authentication fail|login fail|priv-lvl' /var/log/network-devices/*.log

# Check rsyslog internal stats (if impstats module is loaded)
grep -i 'impstats\|discarded\|fromhost' /var/log/rsyslog.log | tail -50

All commands above are read-only and safe to run on a production collector. None modify the logging pipeline.
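
To list silent intervals rather than message counts, arrival times can be compared pairwise per device. The sketch below assumes the collector can emit one line per message as `<epoch seconds> <device>` (for example through an output template). That format, the function name `find_gaps`, and the sample hosts are assumptions for illustration. Like the commands above, it only reads data.

```shell
#!/bin/sh
# $1 = minimum silence, in seconds, to report.
# Reads "<epoch> <device>" lines on stdin, in any order.
find_gaps() {
    # Group by device, then order each device's messages by arrival time.
    LC_ALL=C sort -k2,2 -k1,1n | awk -v max="$1" '
        # Same device as the previous line and too long since its last message.
        $2 == dev && $1 - prev > max {
            printf "%s gap %ds from %d to %d\n", dev, $1 - prev, prev, $1
        }
        { dev = $2; prev = $1 }
    '
}

# Sample data: sw1 goes quiet between 1060 and 5000; sw2 stays regular.
printf '%s\n' '1000 sw1' '1000 sw2' '1060 sw1' '1030 sw2' '5000 sw1' | find_gaps 1800
# prints: sw1 gap 3940s from 1060 to 5000
```

A gap that is still open has no closing message and will not appear in this output. For that case, compare each device's last timestamp against the current time.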

How to diagnose it

Confirm the gap is real. Parse the device’s log stream into time buckets (per-minute or per-5-minute). A gap is a bucket with zero messages where the historical average is nonzero. Use the per-device baseline. A quiet device going quiet is not a gap. A chatty device going quiet is.
Check device reachability during the gap window. Was ICMP healthy? Did SNMP polls succeed? If both were down, the gap is a network or device outage, not a logging problem. If both were healthy, the fault is confined to the logging path (the device's logging subsystem, the transport for ports 514 and 162, or the collector), which narrows the cause.
Scope the gap. Did it affect one device, a subset, or all devices on the collector? A single-device gap points to device-side issues or transport issues specific to that device. A fleet-wide gap points to collector-side failure or a broad transport change.
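Check UDP buffer drops. Run nstat -az UdpRcvbufErrors (the nstat name for Udp_RcvbufErrors) and compare the value against a sample taken before the gap, or review the counter's history in your monitoring system. The counter is cumulative since boot, so a nonzero absolute value only proves that drops occurred at some point. An increment during the gap window means the kernel discarded datagrams because the application was too slow to drain the socket buffer. This is the most common cause of silent syslog and trap loss.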
Check collector health. Was the collector process running during the gap? If using rsyslog with disk-assisted queues, verify the queue did not overflow. If using syslog-ng with statistics enabled, run syslog-ng-ctl stats, which queries the internal counters over the control socket, and look for dropped messages. A collector restart without persistent queue state loses in-flight messages.
Check for config changes. Look at config-change syslog messages (for example, %SYS-5-CONFIG_I on Cisco) and AAA logs for the gap window. A config change to logging settings, ACLs, or management interfaces during the gap is a strong signal of either accidental or deliberate modification. Correlate every config change with a change ticket.
Check transport path changes. Firewall rule changes, routing changes, or NAT modifications can break the syslog or trap path without affecting ICMP or SNMP. Check firewall logs and routing tables for the gap window.
Evaluate tampering indicators. If all operational causes are excluded, escalate to a security investigation. Key indicators: the gap coincides with an incident window, config changes occurred during the gap without a change ticket, authentication events appear in AAA logs from unexpected sources, or the device’s running config has been modified to redirect or suppress logging. Sequence number analysis, where the syslog transport supports it, can distinguish a dropped message from a deleted one.
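
Where devices stamp each message with a sequence number (Cisco IOS does this with `service sequence-numbers`, for example), missing ranges can be listed once the numbers are extracted. The following is a sketch only: it assumes one integer per line in arrival order from a single device, ignores reordering and counter resets after a reboot, and uses a hypothetical function name.

```shell
#!/bin/sh
# Reads one sequence number per line on stdin and reports each skipped range.
missing_sequences() {
    awk '
        NR > 1 && $1 > prev + 1 {
            printf "missing %d-%d (%d messages)\n", prev + 1, $1 - 1, $1 - prev - 1
        }
        { prev = $1 }
    '
}

printf '%s\n' 101 102 103 107 108 | missing_sequences
# prints: missing 104-106 (3 messages)
```

A jump in the numbers across the time gap suggests the device generated messages that never reached the collector. Contiguous numbers on either side of a long silence suggest the device emitted nothing, which points at the device's logging configuration rather than the transport.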

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Syslog message receipt rate per source device	A rate drop to zero from a normally chatty device is the primary gap signal	Rate = 0 for more than 30 minutes when other devices still report
SNMP trap receive rate per source device	Traps are UDP fire-and-forget; total silence from a normally noisy device means the trap path is broken	Rate = 0 from a device that previously sent traps regularly
UDP socket buffer drops (`Udp_RcvbufErrors`)	The only kernel-level signal for silent UDP datagram loss; applies to syslog 514, traps 162, and flow ports	Any nonzero increment; rate proportional to incoming packet rate means chronic undersize
Syslog source count	Number of distinct devices actively sending; a drop below the expected count is a fleet-wide gap signal	Count drops below expected baseline
Heartbeat receipt from devices	A missing heartbeat with the device reachable via ICMP and SNMP means logging is disabled or the heartbeat service failed	Heartbeat missing for more than 5 minutes
Collector process CPU and memory	Parser saturation causes backpressure that looks like device-side gaps	Collector CPU above 90% sustained; parser thread pool exhausted
NTP offset on monitored devices	Time skew corrupts timestamps, making gaps appear where data exists or hiding gaps in wrong time buckets	Offset above 100 ms sustained on a critical device
Config change events (syslog plus AAA)	Changes to logging configuration, ACLs, or management interfaces during a gap window are the primary tampering signal	Any config change without a change ticket, especially during an incident window
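
Because Udp_RcvbufErrors is a cumulative counter, the useful signal is the change between two samples. The sketch below reads the counter from `/proc/net/snmp` by header name instead of by column position, since the set of columns varies between kernel versions. The function names and the 10-second default interval are illustrative.

```shell
#!/bin/sh
# Print the Udp RcvbufErrors value. $1 = file to read (default /proc/net/snmp).
udp_rcvbuf_errors() {
    awk '
        # First "Udp:" line is the header: remember the position of each column name.
        $1 == "Udp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        # Second "Udp:" line holds the values.
        $1 == "Udp:" { print $(col["RcvbufErrors"]); exit }
    ' "${1:-/proc/net/snmp}"
}

# Print how many datagrams were dropped during an interval ($1 seconds, default 10).
udp_drop_delta() {
    before=$(udp_rcvbuf_errors)
    sleep "${1:-10}"
    after=$(udp_rcvbuf_errors)
    echo $((after - before))
}
```

Running `udp_drop_delta 60` on the collector during a traffic burst and seeing anything other than 0 confirms that drops are happening now. The counter is host-wide, so it cannot say which listener lost the datagrams.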

Fixes

UDP socket buffer overflow

The default net.core.rmem_max is often 212992 bytes (about 208 KB) on upstream kernel defaults, though some distributions ship higher values. This is inadequate for collectors receiving high-volume syslog or trap streams. Production deployments should target 16 MB or higher.

# WARNING: These are live system changes affecting all UDP sockets on the host.
# They do not persist across reboot.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216

# Persistent: add to /etc/sysctl.d/ and run sysctl --system
# net.core.rmem_max=16777216
# net.core.rmem_default=16777216

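Raising net.core.rmem_max only lifts the ceiling. To use it, the collector must request a larger buffer by setting SO_RCVBUF on its listener socket, or be restarted so that the socket is created with the new net.core.rmem_default. Verify with ss -lun '( sport = :514 )' -m that the rb value in the skmem output reflects the new setting.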
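
The `skmem` field that `ss -m` prints can be reduced to the three numbers that matter here. This sketch assumes the layout printed by recent iproute2 releases, where `r` is the number of bytes currently queued, `rb` is the receive buffer limit, and `d` is the per-socket drop count. Older releases omit `d`, in which case the pattern matches nothing. The function name is hypothetical.

```shell
#!/bin/sh
# Reads `ss -lunm` output on stdin and prints one summary line per skmem entry.
skmem_summary() {
    sed -n 's/.*skmem:(r\([0-9]*\),rb\([0-9]*\),.*,d\([0-9]*\)).*/queued=\1 limit=\2 drops=\3/p'
}

# Sample line in the format ss prints; on a collector, pipe the live output instead:
#   ss -lunm '( sport = :514 )' | skmem_summary
echo 'skmem:(r0,rb16777216,t0,tb212992,f4096,w0,o0,bl0,d42)' | skmem_summary
# prints: queued=0 limit=16777216 drops=42
```

Linux doubles the value passed to SO_RCVBUF to account for bookkeeping overhead, so the limit shown can be twice what the collector requested. A `queued` value that sits close to `limit` means the listener is not draining the socket fast enough.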

Tradeoff: a larger buffer absorbs bursts but also increases memory consumption per socket. On a host running multiple UDP listeners (syslog, traps, flows), the memory cost compounds.

Device-side log buffer overflow

Devices with high log rates can overflow their own internal buffers before sending. The syslog stream shows a gap but the device was generating logs the entire time. Mitigation depends on the platform: increase the device’s logging buffer size, raise the severity threshold for buffered logs, or enable TCP-based syslog where supported to get flow control.

Collector resilience

If the collector restarts or loses its downstream connection, in-flight logs are lost unless disk-assisted queuing is configured. In rsyslog, configure the action.queue with disk persistence. In syslog-ng, enable disk buffer destinations. Without disk queues, a collector outage of even a few seconds creates a gap that is indistinguishable from tampering during later investigation.

Transport hardening

For environments where audit log integrity is regulated, move syslog from UDP 514 to TCP 6514 with TLS mutual authentication. Both rsyslog 8.x and syslog-ng 3.x and later support this. TLS provides transport-layer integrity and authentication, eliminating the spoofing and injection risks inherent to UDP syslog. SNMPv3 with authPriv should replace SNMPv1/v2c community-string transport for traps.

Tradeoff: TLS over TCP adds round-trip latency and requires certificate management on every device. For devices that do not support TLS syslog, use TCP syslog without TLS as an intermediate step: TCP adds retransmission and flow control that UDP lacks. Plain TCP syslog still has no application-level acknowledgment, so messages in flight when a connection breaks can be lost.

Deliberate suppression

If operational causes are excluded, treat the gap as a security event. Preserve all logs from the gap window, including from adjacent devices that may have logged traffic to or from the affected device. Diff the device’s running configuration against a known-good baseline. Check AAA logs for authentication events from unexpected sources. Escalate to the security team.
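
The configuration diff can be narrowed to the lines most relevant to log suppression. This sketch assumes both configurations have been exported to text files and that the platform uses Cisco-style keywords (`logging`, `snmp-server`, `access-list`). The keyword list and function name are illustrative and should be adapted per platform.

```shell
#!/bin/sh
# $1 = known-good baseline config, $2 = current running-config export.
# Prints added (+) and removed (-) lines that touch logging, traps, or ACLs.
logging_changes() {
    # sed drops the two "---"/"+++" file header lines that diff -u always emits first.
    diff -u "$1" "$2" | sed '1,2d' |
        grep -E '^[+-].*(logging|snmp-server|access-list)'
}

# Usage (hypothetical file names):
#   logging_changes baseline.cfg running.cfg
```

A clean diff does not by itself rule out tampering, since a change can be reverted before the running configuration is captured. The config-change and AAA logs still need review.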

Log clearing and logging suppression after exploitation of a vulnerability have been observed in the wild. Track CVEs that target logging subsystems or allow post-exploitation log manipulation on your specific device platforms.

Prevention

Enable heartbeat logging on critical devices. Many network devices can be configured to emit a periodic syslog message at a fixed interval. A missing heartbeat is the cleanest gap detection signal because it is independent of event-driven log volume.
Monitor Udp_RcvbufErrors continuously. This counter is system-wide across all UDP sockets and cumulative since boot, and it is the earliest indicator of silent loss. Alert on any increment on a syslog or trap collector.
Track per-device syslog and trap receive rates. A rate that drops to zero from a normally chatty device is the primary signal. Baseline per device, not globally. A device sending 5 messages per hour has a very different gap threshold than one sending 200.
Use disk-assisted queuing on the collector. This prevents gaps during collector restarts or downstream outages. Verify that queue state persists across process restarts.
Monitor NTP on monitored devices. Clock skew corrupts the timestamps that make gap detection possible. Alert on offset above 100 ms on critical devices. Check via hrSystemDate at .1.3.6.1.2.1.25.1.2.0 or vendor-specific NTP MIBs.
Log config changes with attribution. Every config change should produce a syslog event with user, source IP, and timestamp. Alert on changes outside maintenance windows. Correlate config changes with subsequent gaps.
Implement append-only storage for critical log streams. SIEM platforms that support write-once indexes or legal hold features prevent post-hoc modification of stored logs. Even if an attacker suppresses live transport, previously stored records remain intact.

How Netdata helps

UDP buffer drop monitoring. Netdata collects Udp_RcvbufErrors from /proc/net/snmp by default. Alert on any nonzero increment on a syslog or trap collector host.
Per-core CPU and softirq monitoring. A single core pinned at 100% from RSS misconfiguration can cause the parser thread to stall and the socket buffer to overflow. Netdata exposes per-core CPU metrics out of the box, including softirq time.
Socket buffer visibility. Correlate socket queue depth with incoming packet rate to detect a buffer approaching its limit before drops begin.
Process monitoring. Netdata tracks collector process CPU, memory, and file descriptors. An rsyslog or syslog-ng process consuming 100% CPU during a burst is the precursor to silent log loss.
Cross-signal correlation. During a gap investigation, correlate syslog receive rate, UDP buffer drops, collector CPU, and device ICMP/SNMP reachability in a single timeline. Combining device reachability, other log sources, and config changes in one view is what distinguishes operational loss from tampering.
