Fleet observability: how to monitor thousands of edge Linux devices

How distributed, edge-resident architecture monitors robots, kiosks, EV chargers, and IoT gateways behind NAT, on cellular, and through outages — backed by measured numbers.
by Netdata Team · June 28, 2026

It feels less like managing devices and more like remote babysitting. You check the dashboard, everything is green, and then a customer in the field tells you a device has been down for two days. At a handful of servers, the rare failure is an event. Across thousands of distributed Linux endpoints — robots in warehouses, EV chargers across a city, kiosks in retail, IoT gateways in the field — the rare failure becomes a daily occurrence, and the tools built for a datacenter quietly stop telling you the truth.

Fleet observability is full per-device visibility across that population, rolled up so a small team can run a fleet far larger than itself. It is a different problem from monitoring a Kubernetes cluster, and it needs a different architecture: edge-resident instead of centralized, outbound-only instead of pull-based, and store-and-forward instead of always-connected.

This post walks through what makes fleet observability hard, the architecture that handles it, and the numbers you can expect. There are no comparison tables, just the failures fleets hit and how Netdata answers each.

At a glance

For operators who already know they need this and want the numbers:

  • Agent footprint: 80-200 MiB RAM, 1-2% of a CPU core per 1k metrics/s, and 1-5 KB/s egress, depending on the metrics collected and the architecture; each of these is tunable.
  • Streaming bandwidth: ~1 KB/s sustained per ~1k metrics on a device, scaling linearly with metric count.
  • Energy efficiency: the University of Amsterdam’s peer-reviewed ICSOC 2023 study found Netdata the most energy-efficient monitoring solution tested, even with per-second collection and edge machine learning.
  • Architecture: push-native, outbound-only streaming; store-and-forward with double-buffered RAM pages. Can keep its database entirely in RAM (zero time-series disk I/O) on flash-constrained devices, while still monitoring disk health.
  • Liveness model: every chart and aggregate shows how many nodes are contributing right now — “CPU is 47%, from 4,723 of 5,000 devices” — built on the NIDL data model, with no query to write.
  • Economics: open-source agent; per-node Cloud pricing with no cardinality dimension and no egress charge; fleet pricing for large deployments. Data stays on your network — Cloud is metadata-only.

What makes fleet observability different from datacenter monitoring

Datacenter monitoring assumes stable connectivity, central servers, and effectively unlimited per-host resources. Fleet observability assumes the opposite of all three. Four constraints define the workload.

Distribution. Hundreds to tens of thousands of endpoints, not a handful of servers. Every per-device cost is multiplied by fleet size. A 200 MB agent on 5,000 devices is a terabyte of fleet RAM spent on observability alone. An agent that doubles in memory is not a 2× problem — it is a 5,000× problem.
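The multiplication in the paragraph above can be checked in a few lines. This is only a sketch of the arithmetic: the function name is made up for illustration, and it assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes).

```python
MB = 10**6   # decimal megabyte (assumption)
TB = 10**12  # decimal terabyte (assumption)

def fleet_ram_bytes(per_device_bytes: int, devices: int) -> int:
    """Total RAM an agent consumes across the whole fleet."""
    return per_device_bytes * devices

# The figures from the text: a 200 MB agent on 5,000 devices.
total = fleet_ram_bytes(200 * MB, 5_000)
print(total / TB, "TB of fleet RAM")

# If the agent doubles in memory, the extra cost is paid on every device.
doubled = fleet_ram_bytes(400 * MB, 5_000)
print((doubled - total) / TB, "TB of additional fleet RAM")
```

Both lines print 1.0: the baseline is a terabyte, and doubling the agent costs another terabyte across the fleet.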

Intermittent connectivity. Devices sit behind NAT and carrier-grade NAT. They use cellular, satellite, or metered links. Many connect outbound only. Pull-based monitoring — the cloud-native default — simply cannot reach half the fleet. Local-first is mandatory, not a nice-to-have.

Resource-constrained hardware. ARM SBCs, industrial gateways, Jetsons, commodity x86. Per-endpoint RAM and CPU budgets are deliberately modest, and the people who run these fleets treat the observability agent as a guest that must not starve the workload it is there to watch.

Per-endpoint economics. Every architectural decision compounds across the fleet. Per-host SaaS pricing at thousands of devices produces annual bills that do not match the segment’s economics, and egress on cellular is a separate, real line item.

Two mental shifts follow from these constraints. First, a real fleet is always partially degraded — expect a few percent of devices to be offline or impaired at any moment. If your dashboard shows 100% green, suspect the observability before you trust it. Second, the fleet is a population, not a machine. You reason about it epidemiologically — outbreaks, index cases, chronic conditions. A single failing device is a data point; a pattern across devices is the signal.

Distributed fleet monitoring architecture: thousands of edge agents stream to regional Netdata Parents while only metadata reaches Netdata Cloud

The fleets this architecture is built for

The same distributed-fleet shape appears across industries that look nothing alike. The observability requirements are nearly identical:

  • Robotics — AMRs, warehouse robots, drones, ROS2 fleets. On-robot Linux compute, often Jetson or industrial SBC.
  • Kiosks and POS terminals — self-service and retail endpoints across many sites. Small Linux PC or SBC, frequently with a microcontroller peripheral.
  • EV chargers — charge-point networks speaking OCPP, on embedded Linux controllers with cellular or satellite uplinks.
  • IoT and industrial gateways — edge gateways, SCADA front-ends, environmental sensing on Raspberry Pi, Jetson, or industrial-grade hardware.
  • School-safety and field devices — embedded Linux endpoints on campuses or remote sites.
  • MSP-managed Linux estates — thousands of commodity Linux servers spread across many client environments.

The pattern is consistent: hundreds to tens of thousands of endpoints, behind NAT, on intermittent links, run by a small team that needs full per-device visibility and fleet-wide rollups at the same time.

Knowing what’s online: liveness is the first job

The most common failure mode in fleet observability isn’t a bad metric. It’s silence.

Dashboards show 100% green because 15% of the fleet stopped reporting and nobody noticed. The remaining 85% look fine. Alerts don’t fire because missing data isn’t treated as a signal. This is the “all green but sick” problem, and it is the main reason fleet operators lose trust in their monitoring — the device is green on the dashboard and dead in the field.

Netdata addresses this at the data-model level. Every chart, every dashboard, every aggregate — from “fleet CPU average” to “per-device memory” — shows how many nodes are contributing data, right on the chart, without writing a query. Not “CPU is 47%.” It’s “CPU is 47%, from 4,723 of 5,000 devices.” Because the data is edge-resident, the chart knows whether the full set is reporting or not — and the same holds whether that set is fifty devices or fifty thousand, because the architecture scales linearly as you add Parents.

Fleet liveness with the NIDL model: every chart shows how many of the fleet’s nodes are reporting, so silent devices are surfaced and grouped by label instead of being averaged out of the picture

This is the NIDL framework — Nodes, Instances, Dimensions, Labels — the data model under every Netdata visualization. Every dropdown on every chart shows the count of contributing time-series, their volume-contribution percentage, their anomaly rate, and min/avg/max over the visible window. You don’t write a query to find which devices are missing; you see it.

Each child agent’s connection is also tracked on the Parent as its own set of metrics. A per-child state series moves through running, replicating, waiting, offline, and archived; the age of the current state is tracked in seconds, and streaming traffic in and out is measured per child. If a group of devices goes dark, you know in seconds — and the picture is filterable by any label.

Group offline devices by location=warehouse-3 and you spot a site outage. Group by firmware=v1.2.1 and you catch a bad OTA. Group by device_type=wireless-bridge and you find the cellular tower that went down. The epidemiological reasoning fleet ops demands happens through the same dropdowns, not through custom queries. Netdata also ships a built-in node-disconnected health check; route it to your notification channel so a silent device pages you instead of waiting for a customer to call.
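The idea of an aggregate that carries its own contributor count, plus grouping silent devices by label, can be sketched as follows. This is not Netdata's implementation or API; the `Device` record, the function names, and the 10-second freshness threshold are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    labels: dict        # e.g. {"location": "warehouse-3", "firmware": "v1.2.1"}
    last_seen: float    # unix seconds of the newest sample received
    cpu_percent: float  # newest CPU reading received

def fleet_cpu(devices, now, max_age=10.0):
    """Mean CPU over devices that reported within max_age seconds, plus the counts."""
    live = [d for d in devices if now - d.last_seen <= max_age]
    mean = sum(d.cpu_percent for d in live) / len(live) if live else None
    return mean, len(live), len(devices)

def silent_by_label(devices, now, key, max_age=10.0):
    """Count silent devices per value of one label, such as location or firmware."""
    silent = [d for d in devices if now - d.last_seen > max_age]
    return Counter(d.labels.get(key, "unknown") for d in silent)

now = 1_000.0
fleet = [
    Device("a", {"location": "warehouse-3"}, now - 2, 40.0),
    Device("b", {"location": "warehouse-3"}, now - 600, 90.0),  # silent for 10 minutes
    Device("c", {"location": "warehouse-1"}, now - 1, 54.0),
]
mean, live, total = fleet_cpu(fleet, now)
print(f"CPU is {mean:.0f}%, from {live} of {total} devices")
print(silent_by_label(fleet, now, "location"))
```

The stale reading from device "b" is excluded from the average rather than silently dragging it, and the grouping shows which label value the silence clusters under.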

The disk problem that kills fleets

Flash storage failure is the most common long-term hardware failure in distributed fleets. The community-reported experience of “~500 SD cards fail even with industrial” out of a 7,000-device Raspberry Pi deployment is not theoretical — it’s an operator living it.

The mechanism is well understood. Embedded flash has finite program/erase cycles. Write amplification multiplies every byte written by 4–8× at the physical layer. A monitoring agent that writes to disk every second — metrics, logs, state files — continuously burns those cycles. Eventually the filesystem remounts read-only, the device still looks healthy on CPU, RAM, and link, and nobody knows until an application crashes on a failed write.
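A rough way to see how write amplification eats into flash life is a back-of-the-envelope endurance estimate. The model below is a deliberate simplification that assumes ideal wear leveling, and the card size, rated cycle count, and write rate are hypothetical placeholders rather than measurements; real devices usually fail earlier and less predictably.

```python
SECONDS_PER_YEAR = 365 * 86_400

def flash_lifetime_years(capacity_bytes, pe_cycles, host_bytes_per_s, write_amplification):
    """Years until the rated program/erase cycles are used up (ideal wear leveling)."""
    endurance_bytes = capacity_bytes * pe_cycles              # physical bytes the flash can absorb
    physical_bytes_per_s = host_bytes_per_s * write_amplification
    return endurance_bytes / physical_bytes_per_s / SECONDS_PER_YEAR

# Hypothetical device: 16 GB card, 3,000 rated cycles, 100 KB/s of steady host writes.
low_wa = flash_lifetime_years(16e9, 3_000, 100e3, 4)
high_wa = flash_lifetime_years(16e9, 3_000, 100e3, 8)
print(f"{low_wa:.1f} years at 4x amplification, {high_wa:.1f} years at 8x")
```

Whatever the real inputs are for a given device, the shape of the result holds: doubling write amplification halves the estimate, and removing a steady per-second writer extends it.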

Netdata can be configured to keep its database entirely in RAM, eliminating time-series disk writes altogether ([db].mode = ram). The agent buffers metrics in memory and streams them to a Parent, so the device that is most at risk from flash wear stops contributing to it — while Netdata keeps monitoring disk health (via disk statistics and SMART) regardless of where its own database lives. And for devices that do keep local history, Netdata stores it remarkably compactly — years of history in a few gigabytes, at roughly 0.6 bytes per sample.

Why flash wear kills fleet devices: an agent that writes every second burns finite program/erase cycles until the filesystem remounts read-only, while keeping the Netdata database in RAM removes time-series disk writes and still monitors disk health

Connectivity: works behind NAT, and offline

Fleet endpoints are behind NAT. They’re on cellular. They connect outbound only. This is not a misconfiguration; it’s the architecture of the network. “Never trust cellular connectivity; always assume you’ll lose the connection — local-first is mandatory” is how operators who run vehicle fleets describe it.

Most monitoring systems assume reachability. Scraping systems need each endpoint to expose a port and accept inbound connections. Even many push-based SaaS agents assume a stable link. Neither pattern survives the fleet reality: devices that appear, disappear, switch networks, drop to 2G, or go silent for days.

Netdata’s streaming is push-native and outbound-only. A child agent opens an outbound connection to a Netdata Parent. There is no inbound port and no scrape target, so NAT and carrier-grade NAT are not obstacles. The child can fall back between interfaces and switch providers and IPs; as long as it can reach a Parent, it streams. When a child reconnects, it backfills the gap from its local history, so your charts have no holes — by default it catches up the last day, and you can configure how far back it reaches.
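The store-and-forward behavior described above can be illustrated with a small buffer that keeps recording while the link is down and sends the undelivered samples once it returns. This is a conceptual sketch, not Netdata's replication protocol; the class, its method names, and the `send` callback are hypothetical, and the window is shortened to 5 seconds so the effect is visible.

```python
from collections import deque

class StoreAndForward:
    def __init__(self, backfill_window_s=86_400):
        self.backfill_window_s = backfill_window_s
        self.history = deque()   # (timestamp, value) pairs, oldest first
        self.sent_up_to = None   # timestamp of the newest delivered sample

    def record(self, ts, value):
        """Always record locally, whether or not the link is up."""
        self.history.append((ts, value))
        # Keep only the samples that still fall inside the backfill window.
        while self.history and self.history[0][0] < ts - self.backfill_window_s:
            self.history.popleft()

    def flush(self, send):
        """Deliver every sample not yet sent; call whenever the link is up."""
        pending = [(ts, v) for ts, v in self.history
                   if self.sent_up_to is None or ts > self.sent_up_to]
        for ts, v in pending:
            send(ts, v)
            self.sent_up_to = ts
        return len(pending)

received = []
agent = StoreAndForward(backfill_window_s=5)
for ts in range(3):
    agent.record(ts, ts * 10)
agent.flush(lambda ts, v: received.append(ts))   # link up: samples 0, 1, 2 delivered
for ts in range(3, 12):                          # link down: the agent keeps recording
    agent.record(ts, ts * 10)
agent.flush(lambda ts, v: received.append(ts))   # link back: the gap is backfilled
print(received)
```

With the 5-second window, samples 6 through 11 are backfilled and samples 3 through 5 are not, which is the trade a configurable backfill depth makes: a longer window closes longer gaps at the cost of more local history.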

Store-and-forward streaming: the edge agent keeps recording through a WAN outage and backfills the gap when connectivity returns, so the charts have no holes — a centralized agent loses the window entirely

Because a child streams continuously, the Parent notices the moment the data stops — a powered-off device or a dropped NAT mapping shows up right away as a child gone silent, not as a phantom connection that lingers. Operators choose their compression — ZSTD by default, with LZ4, Brotli, and GZIP available — to trade CPU against bandwidth.

Netdata also handles the reconnect storm: the moment after a regional cellular outage when thousands of devices come back simultaneously. On the agent side, reconnects use a randomized delay (jitter) within a fixed window and rotate across available Parents; on the Parent side, a waiting queue admits connections and replication in a controlled way. The central tier absorbs the surge instead of collapsing under it. The trade-off is a few minutes of reconnection latency in exchange for reliable recovery.
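The agent-side half of that behavior, a random delay inside a fixed window combined with rotation across Parents, might look like the sketch below. It illustrates the general technique only; the function name, the 60-second window, and the Parent hostnames are assumptions, not Netdata's actual values or code.

```python
import random

def reconnect_plan(parents, attempts, window_s=60.0, rng=None):
    """Return (delay_seconds, parent) pairs: uniform jitter in [0, window_s], parents rotated."""
    rng = rng or random.Random()
    plan = []
    for attempt in range(attempts):
        delay = rng.uniform(0.0, window_s)          # spreads a fleet's reconnects over the window
        parent = parents[attempt % len(parents)]    # next attempt goes to the next Parent
        plan.append((delay, parent))
    return plan

parents = ["parent-a.example", "parent-b.example"]  # hypothetical hostnames
plan = reconnect_plan(parents, attempts=4, rng=random.Random(42))
for delay, parent in plan:
    print(f"wait {delay:5.1f}s, then try {parent}")
```

Because each device draws its own delay, thousands of devices that lost the link at the same instant return spread across the window instead of in a single spike.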

The edge agent footprint

When the goal is to monitor a fleet of thousands of nodes, footprint is the master constraint. Every byte of RAM, every CPU percent, every bit of egress is multiplied by the device count.

Netdata is batteries-included by default: hundreds of collectors auto-detect at install, dashboards are auto-generated, and zero YAML is required to see the first chart. Here is what the Netdata Agent costs you per device, stated as a promise you can hold us to:

  • 80–200 MiB RAM, set by how many metrics you collect and how much history you keep on the device.
  • 1–2% of a CPU core per 1,000 metrics/s collected — and halving the collection frequency halves it.
  • 1–5 KB/s streamed to your Parent, scaling with the number of metrics — and all of it stays on your network.

Curate down to the few hundred metrics that actually matter and the footprint shrinks with them. It stays small even on 32-bit ARM hardware, where a large share of fleet devices live.
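The CPU figure above is a rate, so it can be turned into a quick estimate for a given metric count and collection interval. The helper below is a sketch of that arithmetic using only the 1-2% per 1,000 metrics/s range quoted in the list; the function name and the example metric count are illustrative.

```python
def cpu_percent_range(metrics: int, interval_s: float = 1.0):
    """Estimated (low, high) percent of one CPU core, at 1-2% per 1,000 metrics/s."""
    metrics_per_s = metrics / interval_s
    return (1.0 * metrics_per_s / 1_000, 2.0 * metrics_per_s / 1_000)

every_second = cpu_percent_range(3_000, interval_s=1.0)
every_two_seconds = cpu_percent_range(3_000, interval_s=2.0)
print(every_second)       # per-second collection
print(every_two_seconds)  # half the collection frequency
```

For 3,000 metrics this gives 3-6% of a core at per-second collection and 1.5-3% at a 2-second interval, matching the statement that halving the frequency halves the cost.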

The University of Amsterdam’s independent, peer-reviewed ICSOC 2023 study found Netdata to be the most energy-efficient monitoring solution tested, despite collecting significantly more data than any other solution evaluated. For battery-powered or power-constrained devices, that efficiency is the difference between a device that lasts two years and one that lasts ten — and at fleet scale, field visits are the dominant cost.

Streaming bandwidth, measured in the field

On cellular, bandwidth is something you pay for, so here is the promise, measured on a customer’s embedded router running in production: a device collecting around 1,000 metrics streams at about 1 KB/s, which adds up to roughly 2.6 GB a month if sustained around the clock. It scales linearly with metric count, so a simple kiosk streams less and a heavier gateway proportionally more. Compression is ZSTD by default (LZ4, Brotli, and GZIP are available), and because the stream goes to your own Parent, none of it leaves your network. State your metrics-per-device number and the bandwidth follows; that is the whole calculation.
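That calculation is short enough to write down. The sketch below assumes the roughly 1 KB/s per 1,000 metrics rate quoted above, a stream sustained around the clock, a 30-day month, and decimal units (1 KB = 1,000 bytes); the function name is illustrative.

```python
def monthly_bytes(metrics: int, bytes_per_s_per_1k: float = 1_000.0, days: int = 30) -> float:
    """Bytes streamed per month by one device, scaling linearly with metric count."""
    return metrics / 1_000 * bytes_per_s_per_1k * 86_400 * days

GB = 10**9
for metrics in (500, 1_000, 5_000):
    print(f"{metrics:>5} metrics: {monthly_bytes(metrics) / GB:.2f} GB/month")
```

At 1,000 metrics a sustained 1 KB/s comes to about 2.6 GB over 30 days, so on a metered link the metric count per device is the number to size the data plan against.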

What fleet monitoring costs at scale

Cost is where fleet economics diverge hardest from the datacenter. Two structural facts drive it.

First, your data stays on your network. Your devices stream to Parents that are your own infrastructure; only metadata reaches Netdata Cloud, never your samples or logs. What leaves your network is a trickle of metadata, so the cellular-egress line item that quietly inflates cloud-monitoring bills disappears. Pricing is per node, with no cardinality dimension in the bill and no egress charge, and there is a dedicated pricing structure for large fleets.

Second, per-host SaaS pricing was never designed for this scale. Per-host infrastructure monitoring bills by the host every month, so the meter runs in lockstep with the fleet — a 5,000-host fleet lands north of $1M a year before any negotiated discount. The point isn’t a vendor scoreboard; it’s that any per-host model turns into a seven-figure decision the moment a fleet has thousands of devices. The full per-node breakdown is in our detailed cost comparison.
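The seven-figure claim can be turned around to show what per-host price it implies. The sketch below uses only the 5,000-host and $1M figures from the paragraph above; it does not assume any particular vendor's price, and the function names are illustrative.

```python
def annual_cost(hosts: int, price_per_host_month: float) -> float:
    """Yearly bill under a flat per-host, per-month price."""
    return hosts * price_per_host_month * 12

def breakeven_price(hosts: int, annual_budget: float) -> float:
    """Per-host monthly price at which the yearly bill reaches annual_budget."""
    return annual_budget / (hosts * 12)

threshold = breakeven_price(5_000, 1_000_000)
print(f"${threshold:.2f} per host per month reaches $1M a year at 5,000 hosts")
```

Any per-host price above roughly $16.67 a month crosses the million-dollar line at that fleet size, and the bill grows in direct proportion to the host count from there.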

Fleet monitoring economics: per-node, edge-resident pricing stays nearly flat as the fleet grows, while shipping all telemetry to a central SaaS curves upward and becomes disqualifying at thousands of nodes

Diagnosing a device you can’t touch

When a fleet operator says “device X is slow,” the instinct is to SSH in, run top, and check logs. At thousands of devices that doesn’t scale, and on a device behind NAT on a cellular link you can’t SSH in at all. The alternative is rolling a truck.

Netdata’s Functions API provides live views into any child agent through the Parent or Cloud, without SSH and without inbound access:

  • Running processes — CPU, memory, open files per process
  • System logs — the systemd journal, explored remotely
  • Network connections — active sockets and listening services
  • systemd units — unit and service states
  • Mount points — the mount table and per-mount space
  • Containers — running container and cgroup state
  • IPMI sensors — hardware sensor readings on servers with a BMC

The flow is identical whether the device is next to you or a thousand miles away on a satellite link: the Parent or Cloud relays the function call to the child, and the child responds. Remote diagnosis becomes a browser tab instead of a truck roll. (Other hardware-health signals — SMART/storage wear, thermals, power, clock synchronization — are collected as metrics with alerts, rather than live functions.) Beyond functions, the NIDL dropdowns on any chart let you sort by anomaly rate or maximum value to find the single rogue instance — the one container pegged at 100% CPU, the one disk saturated — across thousands of instances in seconds.

Diagnosing an edge device without SSH: Netdata Functions relay through the Parent or Cloud to give live process, log, socket, and container views on a device behind NAT, with no inbound access and no truck roll

Central configuration at fleet scale

Changing monitoring configuration on a thousand devices normally means one of three things: SSH to each (doesn’t scale), bake a new image and redeploy (slow, risky), or never change the config at all (the most common outcome).

Netdata’s dynamic configuration (dyncfg) lets you change what a child collects from the Parent, without redeploying the agent, without SSH, and without touching the device. Enable a collector, disable one, change a collection job, adjust a health prototype. Each dyncfg-enabled configuration exposes its own JSON schema and a per-device configuration screen, so operators see exactly what each device is set to and apply changes selectively. It covers the configuration that matters for fleet operation — collector jobs and health rules — rather than every legacy file-based knob.

For OTA rollouts, the same labeling makes telemetry canaries straightforward: group devices by firmware version and watch the new cohort for regressions before the rollout completes.
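A telemetry canary of this kind reduces to grouping samples by a label and comparing the new cohort against the old one. The sketch below shows that comparison in plain Python; the sample format, the function names, and the 10% tolerance are assumptions for illustration and not a Netdata API.

```python
from collections import defaultdict
from statistics import mean

def cohort_means(samples, label="firmware"):
    """Group (labels, value) samples by one label and average the value per cohort."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels.get(label, "unknown")].append(value)
    return {cohort: mean(values) for cohort, values in groups.items()}

def canary_regressed(samples, baseline, candidate, tolerance=0.10, label="firmware"):
    """True if the candidate cohort's mean exceeds the baseline's by more than tolerance."""
    means = cohort_means(samples, label)
    return means[candidate] > means[baseline] * (1 + tolerance)

# Hypothetical CPU readings from devices on two firmware versions.
samples = [
    ({"firmware": "v1.2.0"}, 40.0),
    ({"firmware": "v1.2.0"}, 42.0),
    ({"firmware": "v1.2.1"}, 55.0),
    ({"firmware": "v1.2.1"}, 57.0),
]
print(cohort_means(samples))
print(canary_regressed(samples, baseline="v1.2.0", candidate="v1.2.1"))
```

Here the new cohort averages 56 against a baseline of 41, well past the tolerance, which is the signal to pause the rollout before it reaches the rest of the fleet.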

Seeing the whole site, not just the endpoints

Fleet sites aren’t only endpoints. They have managed switches, routers, access points, and sensors — the infrastructure that connects the devices. When a site goes dark, the first question is whether it’s the devices or the network.

A Netdata Parent collects that infrastructure alongside the child metrics: SNMP monitoring for switches, routers, and access points; SNMP trap collection for asynchronous events; and NetFlow, sFlow, and IPFIX collection for network-traffic visibility. One Parent per site gives full visibility — every endpoint, every switch, every link — through a single pane.

What Netdata is, and what it isn’t

Netdata is an observability platform, not a device-management system. It doesn’t provision devices, push OTA firmware, or manage A/B partitions. What it does is give you full visibility into the effect of those operations — monitoring OTA rollouts by firmware cohort, catching regressions before they cascade, and providing the telemetry canary that tells you whether to proceed or roll back. For team and multi-customer access control across an estate, role-based access is provided through Netdata Cloud.

It also integrates cleanly with OpenTelemetry. The OpenTelemetry plugin listens on OTLP/gRPC (the standard port 4317) and ingests both metrics and logs from any compatible source — collectors, SDKs, instrumented apps — into the same store-and-forward pipeline as everything else. Metrics land on charts with full alerting; logs are stored as journal files and explored through the Logs tab.

For the operational reality of fleet observability — “is each of my N-thousand devices healthy right now, and which ones are failing?” — Netdata is built for the job. The architecture was designed from day one for distributed, outbound-only, intermittently-connected operation. It wasn’t adapted from a datacenter model; it was built for the edge, and it distributes the code, not the data.

How to evaluate fleet observability for your fleet

If you’re comparing options, the test is whether the architecture matches your reality. A practical checklist:

  1. Test the footprint on your smallest device. Install on the most constrained endpoint you run — the Raspberry Pi, the industrial gateway, the 512 MB ARM box — and measure actual RAM, CPU, and bandwidth under your metric profile. The agent is open source and installs with one command; the defaults are tunable, so profile with the metrics you actually need.
  2. Check whether you can tell which devices are reporting, right now, without writing a query. If you can’t, your monitoring has a blind spot that grows with the fleet. This is the single highest-leverage capability in fleet observability.
  3. Verify the connectivity model matches your links. Push from agent, outbound only? Store-and-forward for intermittent links? Keepalive to catch dead NAT mappings? If the architecture assumes reachability, half your fleet will be invisible.
  4. Confirm clock and hardware-health signals are first-class. Clock-sync state, storage wear, thermal throttling, and power state are the months-long leading indicators that prevent field visits.
  5. Check the pricing model. Per-host SaaS at thousands of devices is a seven-figure conversation; per-endpoint economics dominate the decision.

For fleets of 500+ devices, you can walk through deployment at your specific scale, connectivity profile, and metric cardinality — book a fleet-scale demo.


For robots, kiosks, EV chargers, IoT gateways, and MSP-managed Linux estates at hundreds to tens of thousands of endpoints, Netdata runs an open-source agent on every device that streams to your own Netdata Parents; the Netdata Agent page covers deployment and footprint tuning, and the edge-computing architecture page covers the edge-resident data model behind everything above.

Frequently Asked Questions