Audit log gaps: detecting syslog/trap tampering or loss
An audit log gap is any period where expected syslog messages or SNMP traps from a network device fail to arrive at the collector. UDP syslog on port 514 and SNMP traps on port 162 are fire-and-forget transports with no delivery guarantee. The kernel silently drops datagrams when socket buffers fill, and the application layer never sees the loss. TCP syslog can stall under collector backpressure. Most gaps are operational: network loss, device-side buffer overflow, or logging subsystem failure. The difficulty is distinguishing those from deliberate log suppression after compromise.
Tamper-evident logging requires cryptographic signing, and few network device platforms support this natively. Detection relies on indirect signals: per-device rate analysis, heartbeat monitoring, sequence number analysis where available, and cross-source correlation. The goal is to build enough independent evidence to either rule out operational causes or escalate to a security investigation.
What constitutes a gap
A gap is a statistically significant departure from the expected cadence for a specific device. A core switch generating 200 syslog messages per hour has a different gap signature than a distribution switch generating 5 per hour. Detection requires per-device baselines, not a single global threshold.
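One way to make "statistically significant" concrete is to ask how likely a given silence would be if the device were behaving normally. The sketch below assumes messages arrive as a Poisson process at the device's baseline rate. That is a simplification, since real syslog traffic is bursty. The function names and the 0.001 cutoff are illustrative, not part of any tool.

```python
import math

def silence_probability(baseline_per_hour: float, silent_minutes: float) -> float:
    """Probability of seeing zero messages for `silent_minutes`,
    assuming Poisson arrivals at `baseline_per_hour` (an assumption)."""
    expected = baseline_per_hour * silent_minutes / 60.0
    return math.exp(-expected)

def is_suspicious(baseline_per_hour: float, silent_minutes: float,
                  cutoff: float = 0.001) -> bool:
    # cutoff is a hypothetical significance level; tune it per environment
    return silence_probability(baseline_per_hour, silent_minutes) < cutoff

# Ten silent minutes on a core switch (200 msg/h) vs a distribution switch (5 msg/h)
print(is_suspicious(200, 10))  # about 33 messages were expected
print(is_suspicious(5, 10))    # fewer than 1 message was expected
```

Under this model the same ten minutes of silence is highly improbable for the 200-per-hour device and unremarkable for the 5-per-hour device, which is why a single global threshold does not work.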
Recommended thresholds for gap detection:
- Gap exceeding 2x the poll interval from any device generates a TICKET.
- Gap exceeding 1 hour from a critical device escalates to PAGE.
- Heartbeat missing for more than 5 minutes from a device that should be sending one is PAGE.
- Syslog rate dropping to 0 from a device for more than 30 minutes, while other devices still report, indicates the device is isolated or its syslog agent failed.
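The first three thresholds above can be sketched as a small classifier. The `DeviceSilence` record and its field names are hypothetical. The 30-minute rule is left out because it needs fleet-wide context (whether other devices are still reporting) that a per-device check does not have.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeviceSilence:
    silent_seconds: float                          # time since last syslog/trap
    poll_interval_seconds: float                   # expected reporting interval
    critical: bool = False
    heartbeat_age_seconds: Optional[float] = None  # None: no heartbeat configured

def classify(d: DeviceSilence) -> Optional[str]:
    """Return "PAGE", "TICKET", or None, checking the most severe rules first."""
    if d.heartbeat_age_seconds is not None and d.heartbeat_age_seconds > 5 * 60:
        return "PAGE"
    if d.critical and d.silent_seconds > 3600:
        return "PAGE"
    if d.silent_seconds > 2 * d.poll_interval_seconds:
        return "TICKET"
    return None

print(classify(DeviceSilence(silent_seconds=700, poll_interval_seconds=300)))
```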
After detection, the diagnostic questions are: was the device reachable during the gap, was the gap isolated or fleet-wide, did the collector process stay up, and did anything change on the device or in the transport path during the window.
The hardest case is the gap with no obvious cause. ICMP is healthy. SNMP responds. Other devices on the same collector report normally. The device simply stopped sending syslog or traps for a period and then resumed. This pattern requires checking UDP buffer drops, collector health, device-side logging configuration, and config-change logs.
flowchart TD
A["Gap detected in syslog/trap stream"] --> B{"Multiple devices affected?"}
B -->|"Yes"| C["Collector-side or transport issue"]
B -->|"No"| D{"Device reachable via ICMP and SNMP?"}
D -->|"No"| E["Network partition or device outage"]
D -->|"Yes"| F{"Udp_RcvbufErrors incrementing?"}
F -->|"Yes"| G["Socket buffer overflow"]
F -->|"No"| H{"Config changes during gap window?"}
H -->|"Yes"| I["Investigate: unauthorized or accidental change"]
H -->|"No"| J["Check device logging subsystem and buffer"]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| UDP socket buffer overflow | Rate drops during high-traffic windows; ICMP to device healthy; resumes after burst subsides | Udp_RcvbufErrors in /proc/net/snmp |
| Device-side log buffer overflow | Gap from one device; others normal; device was generating high log volume before the gap | Device logging buffer config; severity distribution before gap |
| Collector outage or restart | Gap from all devices simultaneously; collector process restarted or disk filled | Collector process state; disk queue persistence config |
| Transport partition (firewall or ACL change) | Gap from devices behind a specific network path; ICMP may or may not be affected | ICMP and SNMP to affected devices; firewall rule changes |
| Logging subsystem failure on device | Device reachable via ICMP and SNMP but produces no syslog or traps | Device logging config; local log buffer on device |
| Deliberate suppression (post-compromise) | Gap during or after incident window; config changes around same time; missing heartbeats with device reachable | Config change logs; AAA logs; sequence number analysis |
| Time skew corrupting timestamps | Data exists but appears in wrong time bucket in the time-series view | NTP offset on device; syslog header timestamps |
Quick checks
# Check system-wide UDP socket buffer drops (syslog 514, traps 162, flows all use UDP)
grep '^Udp:' /proc/net/snmp
nstat -az UdpRcvbufErrors
# Check syslog listener buffer fill and limits
ss -lun '( sport = :514 )' -m
# Check trap listener buffer fill
ss -lun '( sport = :162 )' -m
# Count messages per minute across all device logs; a gap shows up as minutes missing from the output
# Assumes ISO 8601 timestamps at the start of each line; adjust field parsing for your format
awk -F'[T:]' '{print $1"T"$2":"$3}' /var/log/network-devices/*.log | sort | uniq -c
# Verify syslog traffic is arriving on the wire
tcpdump -i eth0 -nn 'udp port 514' -c 100
# Verify trap traffic is arriving on the wire
tcpdump -i eth0 -nn 'udp port 162' -c 100
# Confirm the device answers SNMP now; sysUpTime also shows whether it rebooted during the gap window
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.1.3.0
# Check for config changes during the gap window (Cisco example)
grep -E '%SYS-5-CONFIG_I|configured from' /var/log/network-devices/*.log
# Check AAA logs for authentication events during the gap
grep -E 'fail|denied' /var/log/tacacs.log
grep -E 'authentication fail|login fail|priv-lvl' /var/log/network-devices/*.log
# Check rsyslog internal stats (if impstats module is loaded)
grep -i 'impstats\|discarded\|fromhost' /var/log/rsyslog.log | tail -50
All commands above are read-only and safe to run on a production collector. None modify the logging pipeline.
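A per-minute count built with uniq -c only lists minutes that contain at least one line, so empty minutes have to be spotted by eye. The sketch below makes them explicit by walking every minute in a window. It assumes ISO 8601 timestamps without a timezone and that you have already extracted them for a single device. The sample data is made up.

```python
from collections import Counter
from datetime import datetime, timedelta

def empty_minutes(timestamps, start, end):
    """Return each minute in [start, end) that contains no message."""
    seen = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    gaps = []
    minute = start.replace(second=0, microsecond=0)
    while minute < end:
        if seen[minute] == 0:
            gaps.append(minute)
        minute += timedelta(minutes=1)
    return gaps

# Hypothetical sample: timestamps taken from one device's log
lines = ["2024-05-01T10:00:12", "2024-05-01T10:00:48", "2024-05-01T10:03:05"]
stamps = [datetime.fromisoformat(s) for s in lines]
gaps = empty_minutes(stamps, datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 4))
print([g.strftime("%H:%M") for g in gaps])  # → ['10:01', '10:02']
```

Whether an empty minute matters still depends on the device's baseline: compare the result against that device's historical rate before raising an alert.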
How to diagnose it
Confirm the gap is real. Parse the device’s log stream into time buckets (per-minute or per-5-minute). A gap is a bucket with zero messages where the historical average is nonzero. Use the per-device baseline. A quiet device going quiet is not a gap. A chatty device going quiet is.
Check device reachability during the gap window. Was ICMP healthy? Did SNMP polls succeed? If both were down, the gap is network or device outage, not a logging problem. If both were healthy, the logging path is isolated from the data plane, which narrows the cause.
Scope the gap. Did it affect one device, a subset, or all devices on the collector? A single-device gap points to device-side issues or transport issues specific to that device. A fleet-wide gap points to collector-side failure or a broad transport change.
Check UDP buffer drops. Run nstat -az UdpRcvbufErrors and look for increments during the gap window. Any increment means the kernel discarded datagrams because the application was too slow to drain the socket buffer. This is the most common cause of silent syslog and trap loss.
Check collector health. Was the collector process running during the gap? If using rsyslog with disk-assisted queues, verify the queue did not overflow. If using syslog-ng with the stats() option enabled, query its internal counters via the unix socket for dropped messages. A collector restart without persistent queue state loses in-flight messages.
Check for config changes. Look at config-change syslog messages (for example, %SYS-5-CONFIG_I on Cisco) and AAA logs for the gap window. A config change to logging settings, ACLs, or management interfaces during the gap is a strong signal of either accidental or deliberate modification. Correlate every config change with a change ticket.
Check transport path changes. Firewall rule changes, routing changes, or NAT modifications can break the syslog or trap path without affecting ICMP or SNMP. Check firewall logs and routing tables for the gap window.
Evaluate tampering indicators. If all operational causes are excluded, escalate to a security investigation. Key indicators: the gap coincides with an incident window, config changes occurred during the gap without a change ticket, authentication events appear in AAA logs from unexpected sources, or the device’s running config has been modified to redirect or suppress logging. Sequence number analysis, where the syslog transport supports it, can distinguish a dropped message from a deleted one.
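Where a device stamps each message with a sequence number, holes in the observed sequence show that the device generated messages that never reached the collector, while an unbroken sequence across a silent window suggests the device generated nothing. The sketch below is a minimal illustration. It assumes the sequence numbers have already been parsed out and that the counter did not wrap or reset (for example after a reboot) inside the window.

```python
def missing_sequence_ranges(seq_numbers):
    """Return (first_missing, last_missing) ranges between consecutive
    observed sequence numbers. Assumes a monotonically increasing counter
    with no wraparound or reset inside the window."""
    ranges = []
    ordered = sorted(set(seq_numbers))
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            ranges.append((prev + 1, cur - 1))
    return ranges

print(missing_sequence_ranges([101, 102, 103, 107, 108, 110]))
# → [(104, 106), (109, 109)]
```

A missing range on its own does not say whether the loss was in transit or deliberate. It only establishes that messages existed, which is the starting point for the other checks.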
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Syslog message receipt rate per source device | A rate drop to zero from a normally chatty device is the primary gap signal | Rate = 0 for more than 30 minutes when other devices still report |
| SNMP trap receive rate per source device | Traps are UDP fire-and-forget; total silence from a normally noisy device means the trap path is broken | Rate = 0 from a device that previously sent traps regularly |
| UDP socket buffer drops (Udp_RcvbufErrors) | The only kernel-level signal for silent UDP datagram loss; applies to syslog 514, traps 162, and flow ports | Any nonzero increment; rate proportional to incoming packet rate means chronic undersize |
| Syslog source count | Number of distinct devices actively sending; a drop below the expected count is a fleet-wide gap signal | Count drops below expected baseline |
| Heartbeat receipt from devices | A missing heartbeat with the device reachable via ICMP and SNMP means logging is disabled or the heartbeat service failed | Heartbeat missing for more than 5 minutes |
| Collector process CPU and memory | Parser saturation causes backpressure that looks like device-side gaps | Collector CPU above 90% sustained; parser thread pool exhausted |
| NTP offset on monitored devices | Time skew corrupts timestamps, making gaps appear where data exists or hiding gaps in wrong time buckets | Offset above 100 ms sustained on a critical device |
| Config change events (syslog plus AAA) | Changes to logging configuration, ACLs, or management interfaces during a gap window are the primary tampering signal | Any config change without a change ticket, especially during an incident window |
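Because the kernel counter is cumulative since boot, the useful quantity is the difference between two samples. The sketch below parses the Udp header and value lines of /proc/net/snmp and computes that difference. The function names are illustrative, and it takes the file contents as text so it can be tried without a Linux host.

```python
def udp_counters(snmp_text: str) -> dict:
    """Parse the two 'Udp:' lines (header, values) of /proc/net/snmp."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]  # does not match 'UdpLite:'
    if len(rows) != 2:
        raise ValueError("expected a header line and a value line for Udp")
    names, values = rows
    return dict(zip(names, map(int, values)))

def rcvbuf_drops(before: str, after: str) -> int:
    """Datagrams dropped for lack of socket buffer between two samples."""
    return (udp_counters(after)["RcvbufErrors"]
            - udp_counters(before)["RcvbufErrors"])

# On a Linux collector, read the file twice around the window of interest:
#   before = open("/proc/net/snmp").read()
#   ... wait ...
#   after = open("/proc/net/snmp").read()
#   print(rcvbuf_drops(before, after))
```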
Fixes
UDP socket buffer overflow
The default net.core.rmem_max is often 212992 bytes (about 208 KB) on upstream kernel defaults, though some distributions ship higher values. This is inadequate for collectors receiving high-volume syslog or trap streams. Production deployments should target 16 MB or higher.
# WARNING: These are live system changes affecting all UDP sockets on the host.
# They do not persist across reboot.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216
# Persistent: add to /etc/sysctl.d/ and run sysctl --system
# net.core.rmem_max=16777216
# net.core.rmem_default=16777216
The collector must also explicitly set SO_RCVBUF on its listener socket to take advantage of the higher system limit. Verify with ss -lun '( sport = :514 )' -m that the buffer limit reflects the new setting.
Tradeoff: a larger buffer absorbs bursts but also increases memory consumption per socket. On a host running multiple UDP listeners (syslog, traps, flows), the memory cost compounds.
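For a collector you write yourself, requesting the buffer and then reading it back looks roughly like the sketch below. This is a minimal illustration, not how rsyslog or syslog-ng implement it. The function name is made up, and the demo uses port 0 so the OS picks a free port.

```python
import socket

def open_udp_listener(port: int, requested_bytes: int = 16 * 1024 * 1024,
                      host: str = "0.0.0.0"):
    """Bind a UDP socket and ask the kernel for a larger receive buffer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    sock.bind((host, port))
    # The kernel may silently cap the request at net.core.rmem_max, so
    # read the value back rather than trusting the request.
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted

if __name__ == "__main__":
    sock, granted = open_udp_listener(0)  # port 0: let the OS choose a free port
    print(f"requested 16777216 bytes, kernel reports {granted}")
    sock.close()
```

On Linux the value read back is normally double the effective request because the kernel reserves space for bookkeeping. A reported value well below twice the request indicates the request was capped by net.core.rmem_max.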
Device-side log buffer overflow
Devices with high log rates can overflow their own internal buffers before sending. The syslog stream shows a gap but the device was generating logs the entire time. Mitigation depends on the platform: increase the device’s logging buffer size, raise the severity threshold for buffered logs, or enable TCP-based syslog where supported to get flow control.
Collector resilience
If the collector restarts or loses its downstream connection, in-flight logs are lost unless disk-assisted queuing is configured. In rsyslog, configure the action.queue with disk persistence. In syslog-ng, enable disk buffer destinations. Without disk queues, a collector outage of even a few seconds creates a gap that is indistinguishable from tampering during later investigation.
Transport hardening
For environments where audit log integrity is regulated, move syslog from UDP 514 to TCP 6514 with TLS mutual authentication. Both rsyslog 8.x and syslog-ng 3.x and later support this. TLS provides transport-layer integrity and authentication, eliminating the spoofing and injection risks inherent to UDP syslog. SNMPv3 with authPriv should replace SNMPv1/v2c community-string transport for traps.
Tradeoff: TLS over TCP adds round-trip latency and requires certificate management on every device. For devices that do not support TLS syslog, use TCP syslog without TLS as an intermediate step: TCP at least provides acknowledged, in-order delivery and flow control that UDP does not, although messages in flight can still be lost when a connection breaks.
Deliberate suppression
If operational causes are excluded, treat the gap as a security event. Preserve all logs from the gap window, including from adjacent devices that may have logged traffic to or from the affected device. Diff the device’s running configuration against a known-good baseline. Check AAA logs for authentication events from unexpected sources. Escalate to the security team.
Log clearing after exploitation of known vulnerabilities has been observed in the wild. Track CVEs that target logging subsystems or allow post-exploitation log manipulation on your specific device platforms.
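The configuration diff can be narrowed to the lines most relevant to log suppression. The sketch below uses Python's difflib to compare a baseline against the running config and keeps only added or removed lines that mention a logging-related keyword. The keyword list and the IOS-style sample lines are illustrative assumptions, so adapt them to your platform's syntax.

```python
import difflib

def logging_changes(baseline: str, running: str,
                    keywords=("logging", "snmp-server")):
    """Return added ('+ ') and removed ('- ') config lines that mention
    one of the keywords. Keywords are illustrative, not exhaustive."""
    diff = difflib.ndiff(baseline.splitlines(), running.splitlines())
    return [line for line in diff
            if line.startswith(("+ ", "- "))
            and any(k in line for k in keywords)]

# Hypothetical sample configs
baseline = "hostname sw1\nlogging host 10.0.0.5\nlogging trap informational\n"
running = "hostname sw1\nlogging host 10.9.9.9\nlogging trap emergencies\n"
for line in logging_changes(baseline, running):
    print(line)
```

In this sample both the log destination and the severity filter changed, which is the kind of modification that redirects or suppresses logging while leaving the device otherwise reachable.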
Prevention
- Enable heartbeat logging on critical devices. Many network devices can be configured to emit a periodic syslog message at a fixed interval. A missing heartbeat is the cleanest gap detection signal because it is independent of event-driven log volume.
- Monitor Udp_RcvbufErrors continuously. This counter is system-wide across all UDP sockets and is the earliest indicator of silent loss. Alert on any nonzero value on a syslog or trap collector.
- Track per-device syslog and trap receive rates. A rate that drops to zero from a normally chatty device is the primary signal. Baseline per device, not globally. A device sending 5 messages per hour has a very different gap threshold than one sending 200.
- Use disk-assisted queuing on the collector. This prevents gaps during collector restarts or downstream outages. Verify that queue state persists across process restarts.
- Monitor NTP on monitored devices. Clock skew corrupts the timestamps that make gap detection possible. Alert on offset above 100 ms on critical devices. Check via hrSystemDate at .1.3.6.1.2.1.25.1.2.0 or vendor-specific NTP MIBs.
- Log config changes with attribution. Every config change should produce a syslog event with user, source IP, and timestamp. Alert on changes outside maintenance windows. Correlate config changes with subsequent gaps.
- Implement append-only storage for critical log streams. SIEM platforms that support write-once indexes or legal hold features prevent post-hoc modification of stored logs. Even if an attacker suppresses live transport, previously stored records remain intact.
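Correlating config changes with gaps, as recommended in the list above, can start as a simple time-window match. The sketch below assumes you already have config-change timestamps and the gap boundaries as datetime values. The 15-minute margin is an arbitrary illustrative default.

```python
from datetime import datetime, timedelta

def changes_near_gap(change_times, gap_start, gap_end,
                     margin=timedelta(minutes=15)):
    """Config-change timestamps inside the gap or within `margin` of it."""
    lo, hi = gap_start - margin, gap_end + margin
    return [t for t in change_times if lo <= t <= hi]

# Hypothetical data: three config changes and one gap from 02:00 to 02:40
changes = [datetime(2024, 5, 1, 1, 50), datetime(2024, 5, 1, 2, 10),
           datetime(2024, 5, 1, 9, 0)]
hits = changes_near_gap(changes, datetime(2024, 5, 1, 2, 0),
                        datetime(2024, 5, 1, 2, 40))
print(len(hits))  # → 2
```

Each hit still needs to be checked against a change ticket: a match only says the change and the gap were close in time.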
How Netdata helps
- UDP buffer drop monitoring. Netdata collects Udp_RcvbufErrors from /proc/net/snmp by default. Alert on any nonzero increment on a syslog or trap collector host.
- Per-core CPU and softirq monitoring. A single core pinned at 100% from RSS misconfiguration can cause the parser thread to stall and the socket buffer to overflow. Netdata exposes per-core CPU metrics out of the box, including softirq time.
- Socket buffer visibility. Correlate socket queue depth with incoming packet rate to detect a buffer approaching its limit before drops begin.
- Process monitoring. Netdata tracks collector process CPU, memory, and file descriptors. A rsyslog or syslog-ng process consuming 100% CPU during a burst is the precursor to silent log loss.
- Cross-signal correlation. During a gap investigation, correlate syslog receive rate, UDP buffer drops, collector CPU, and device ICMP/SNMP reachability in a single timeline. Combining device reachability, other log sources, and config changes in one view is what distinguishes operational loss from tampering.
Related guides
- Silent UDP flow data loss: why your NetFlow collector is dropping records
- Network monitoring checklist: the signals every production network needs
- NetFlow v9/IPFIX template desync: flows decoded wrong or dropped after a reboot
- Flow export-to-ingest latency: why your NetFlow data is minutes behind
- NetFlow storage sizing: how much disk your flow collector really needs







