Kubernetes monitoring checklist: the signals every production cluster needs

This article is a reference checklist for senior engineers who are wiring up, auditing, or hardening monitoring for a production Kubernetes cluster. It assumes you already understand the control plane architecture and focuses on what to collect, where to find it, and which symptoms matter. Use it during greenfield instrumentation, post-incident gap analysis, or routine health audits.

The signals are grouped by domain. Each entry leads with a short noun phrase, followed by one sentence explaining why it matters, and a concrete warning sign to alert on. Thresholds are drawn from upstream SLOs, kubelet defaults, and etcd operational limits documented in the Kubernetes source and production playbooks. If you run a managed service such as EKS, GKE, or AKS, treat control-plane metrics as provider-mediated; many etcd and API server internals are opaque in those environments.

Control plane and etcd

API server liveness. Confirms the kube-apiserver process is accepting connections. Warning sign: non-200 response or timeout greater than 5 seconds on /livez.

API server readiness. Verifies etcd connectivity, informer sync, and post-start hooks. Warning sign: /readyz fails while /livez passes, indicating initialization deadlock or etcd loss.

etcd leader stability. Raft leadership churn directly blocks writes and causes brief outages. Warning sign: etcd_server_leader_changes_seen_total increments more than once per hour without maintenance activity.

etcd WAL fsync latency. Every etcd write fsyncs to disk; slow storage cascades into API latency. Warning sign: etcd_disk_wal_fsync_duration_seconds p99 greater than 100 ms sustained.
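
Many warning signs in this checklist are p99 values derived from Prometheus histogram buckets. The following is a minimal sketch of how such a quantile can be estimated by linear interpolation inside a bucket, similar in spirit to PromQL's histogram_quantile. The function name and the sample bucket counts are illustrative assumptions, not output from a real cluster.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) sorted by
    upper bound, ending with the math.inf ("+Inf") bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful quantile
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket edge case (for example q == 0)
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical fsync buckets: (upper bound in seconds, cumulative count).
sample = [(0.008, 900), (0.032, 980), (0.128, 995), (0.512, 1000), (math.inf, 1000)]
p99 = estimate_quantile(0.99, sample)
print(f"p99 fsync latency: {p99 * 1000:.0f} ms, alert: {p99 > 0.100}")
```

Because the estimate interpolates within a bucket, its accuracy depends on bucket boundaries; a threshold that falls in the middle of a wide bucket is only approximated.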

etcd database size. Approaching the quota triggers a NOSPACE alarm and makes the cluster read-only. Warning sign: etcd_debugging_mvcc_db_total_size_in_bytes or etcd_mvcc_db_total_size_in_bytes greater than 80 percent of --quota-backend-bytes.
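
As a rough illustration of the 80 percent rule, the sketch below compares a database size sample against the backend quota. It assumes etcd's default quota of 2 GiB when --quota-backend-bytes is not set; the function names are hypothetical, and the size would come from whichever of the two metrics your etcd version exposes.

```python
# Assumption: etcd's default backend quota (2 GiB) applies unless overridden.
DEFAULT_QUOTA_BYTES = 2 * 1024**3

def etcd_quota_usage(db_size_bytes, quota_bytes=DEFAULT_QUOTA_BYTES):
    """Return the fraction of the backend quota currently in use."""
    return db_size_bytes / quota_bytes

def etcd_size_alert(db_size_bytes, quota_bytes=DEFAULT_QUOTA_BYTES, threshold=0.80):
    """True when database size exceeds the warning threshold from the checklist."""
    return etcd_quota_usage(db_size_bytes, quota_bytes) > threshold

print(etcd_size_alert(1_800_000_000))  # roughly 84 percent of the default quota
```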

API request latency by verb. Elevated mutating latency stalls controllers, kubectl, and CI pipelines. Warning sign: apiserver_request_duration_seconds p99 greater than 1 s for POST/PUT/PATCH sustained, or LIST p99 greater than 30 s.

Admission webhook latency. Synchronous webhook calls add directly to mutating request latency. Warning sign: apiserver_admission_webhook_admission_duration_seconds p99 greater than 200 ms for any webhook with failurePolicy: Fail.

API Priority and Fairness queue depth. Queued requests indicate a priority level is saturated and critical traffic may be delayed. Warning sign: apiserver_flowcontrol_current_inqueue_requests greater than 0 for the system or leader-election priority levels.

API server error rate. 5xx errors indicate etcd, webhook, or internal failures; 429s indicate APF throttling. Warning sign: apiserver_request_total with code=~"5.." or code="429" sustained above baseline.

Inflight requests. Approaching the hard limit causes 429 rejections and client retry storms. Warning sign: apiserver_current_inflight_requests greater than 80 percent of --max-requests-inflight or --max-mutating-requests-inflight.

Watch event throughput. High event rates indicate rapid cluster churn that can overload informers. Warning sign: apiserver_watch_events_total spiking greater than 10 times baseline.
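
Several entries alert on a multiple of baseline rather than on a fixed number. One way to sketch that comparison, assuming the baseline is the median of recent rate samples (an assumption; the checklist does not prescribe how to compute a baseline), is:

```python
from statistics import median

def is_spike(current_rate, history, factor=10.0):
    """True if current_rate exceeds factor times the median of history.

    history: recent rate samples (for example events per second).
    With no history, or a zero baseline, this sketch declines to alert.
    """
    if not history:
        return False
    baseline = median(history)
    return baseline > 0 and current_rate > factor * baseline

# Hypothetical watch event rates over earlier windows.
print(is_spike(5500, [400, 500, 450, 520, 480]))
```

A median is used here because it is less sensitive than a mean to earlier spikes in the history window.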

Node and kubelet health

Node Ready condition. The kubelet’s self-reported ability to run pods. Warning sign: Ready=False or Ready=Unknown for more than 1 minute.
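
As a sketch of this check, the function below scans a node list shaped like the output of kubectl get nodes -o json and reports nodes whose Ready condition has been False or Unknown for longer than a grace period. The function name is hypothetical, and the timestamp parsing assumes the second-precision UTC format the API normally returns.

```python
from datetime import datetime, timedelta, timezone

def not_ready_nodes(node_list, now, grace=timedelta(minutes=1)):
    """Return names of nodes not Ready for longer than grace."""
    flagged = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") != "Ready" or cond.get("status") == "True":
                continue
            # Assumed format: 2024-01-01T12:00:00Z
            since = datetime.strptime(
                cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            if now - since > grace:
                flagged.append(node["metadata"]["name"])
    return flagged
```

In practice the node list would come from json.load on the kubectl output, with now set to datetime.now(timezone.utc).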

PLEG relist latency. Slow container runtime queries precede NotReady transitions. Warning sign: kubelet_pleg_relist_duration_seconds p99 greater than 10 s or approaching the 3-minute unhealthy threshold.

Kubelet sync loop duration. Slow reconciliation delays pod creation, termination, and status updates. Warning sign: kubelet_sync_loop_duration_seconds p99 greater than 30 s sustained.

Container runtime connectivity. Runtime disconnection blocks all container operations on the node. Warning sign: crictl info hangs or fails, or the kubelet /healthz endpoint reports runtime unhealthy.

Image pull duration. Slow pulls extend pod startup time and scale-out response. Warning sign: kubelet_image_pull_duration_seconds p99 greater than 2 minutes for typical images.

Node memory pressure. Triggers pod eviction and kernel OOM kills. Warning sign: MemoryPressure=True or available memory below 100 Mi (default hard eviction threshold).

Node disk pressure. Blocks new image pulls and triggers pod eviction. Warning sign: DiskPressure=True, or nodefs/imagefs utilization above 85 percent.

PID pressure. Exhaustion prevents fork and blocks container startup. Warning sign: PIDPressure=True, or node PID usage above 80 percent of /proc/sys/kernel/pid_max.

Kubelet certificate expiration. Expired client certificates break API server authentication. Warning sign: kubelet_certificate_manager_client_ttl_seconds below 7 days, or any rotation error counter incrementing.

Kubelet error rate. Logs surface runtime, API, or certificate problems before they become node failures. Warning sign: greater than 50 errors per hour, or any panic message.

Workloads and scheduling

Pod phase distribution. Pending or Unknown pods indicate scheduling failures or lost nodes. Warning sign: pods in Pending longer than 10 minutes, or Unknown phase increasing.

CrashLoopBackOff. Persistent container crashes indicate application bugs, OOMs, or misconfiguration. Warning sign: any pod in CrashLoopBackOff for more than 5 minutes.

Container restart count. Rising restarts signal instability before CrashLoopBackOff. Warning sign: restart count increasing by more than 5 in 10 minutes for production workloads.

OOM kill events. Containers are killed when they exceed their memory limit or when the node itself runs out of memory. Warning sign: lastState.terminated.reason=OOMKilled, or kernel dmesg showing OOM kills.
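
A minimal sketch of the pod-status side of this check, assuming a pod object shaped like kubectl get pod -o json output (the function name is hypothetical):

```python
def oom_killed_containers(pod):
    """Return names of containers whose last termination reason was OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # lastState is an empty object when the container has never restarted.
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(cs["name"])
    return names
```

This only sees kills recorded in container status; OOM kills of processes that are not a container's main process may show up only in the kernel log.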

CPU throttling. CFS quota exhaustion degrades application latency. Warning sign: the ratio of container_cpu_cfs_throttled_periods_total to container_cpu_cfs_periods_total greater than 25 percent for latency-sensitive pods.
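
The throttling ratio can be sketched as the increase in throttled periods divided by the increase in total CFS periods over the same window. The denominator is assumed here to be cAdvisor's container_cpu_cfs_periods_total; the function name and sample numbers are illustrative.

```python
def throttle_ratio(throttled_delta, periods_delta):
    """Fraction of CFS periods in which the container was throttled.

    Both arguments are counter increases over the same time window.
    """
    if periods_delta <= 0:
        return 0.0  # no CFS periods elapsed, or the counter was reset
    return throttled_delta / periods_delta

# Hypothetical 5-minute window: 3000 periods elapsed, 900 of them throttled.
ratio = throttle_ratio(900, 3000)
print(f"throttled {ratio:.0%}, alert: {ratio > 0.25}")
```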

Scheduler pending pods. A growing unschedulable queue means capacity or constraint exhaustion. Warning sign: scheduler_pending_pods with queue="unschedulable" growing for more than 5 minutes.

Controller workqueue depth. Backlog indicates the control plane is falling behind on reconciliation. Warning sign: workqueue_depth greater than 100 sustained for core controllers such as deployment or node.

Deployment rollout health. Stuck rollouts leave workloads under-replicated. Warning sign: readyReplicas less than spec.replicas for more than 5 minutes, or a Progressing condition with reason ProgressDeadlineExceeded.
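
As a sketch, both symptoms can be read from a Deployment object shaped like kubectl get deployment -o json output. The function name is hypothetical, and the 5-minute duration is assumed to be enforced by whatever alerting layer calls it.

```python
def rollout_problems(deployment):
    """Return a list of human-readable rollout problems, empty if healthy."""
    problems = []
    # spec.replicas defaults to 1; readyReplicas is omitted when it is 0.
    desired = deployment.get("spec", {}).get("replicas", 1)
    status = deployment.get("status", {})
    ready = status.get("readyReplicas", 0)
    if ready < desired:
        problems.append(f"{ready}/{desired} replicas ready")
    for cond in status.get("conditions", []):
        if cond.get("type") == "Progressing" and cond.get("reason") == "ProgressDeadlineExceeded":
            problems.append("ProgressDeadlineExceeded")
    return problems
```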

Networking and DNS

kube-proxy sync duration. Long syncs leave Service rules stale and, in iptables mode, hold the iptables lock. Warning sign: kubeproxy_sync_proxy_rules_duration_seconds p99 greater than 10 s, or approaching the sync period.

Conntrack table utilization. A full table silently drops new connections across all node traffic. Warning sign: nf_conntrack_count greater than 90 percent of nf_conntrack_max, or the drop counter incrementing in conntrack -S.
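
On a Linux node the two conntrack values are exposed under /proc/sys/net/netfilter. A minimal sketch of the utilization check follows; the function names are hypothetical, and the reader assumes the nf_conntrack module is loaded so that both files exist.

```python
from pathlib import Path

def conntrack_ratio(count, limit):
    """Fraction of the conntrack table in use."""
    return count / limit if limit > 0 else 0.0

def read_conntrack(proc_root="/proc/sys/net/netfilter"):
    """Read (count, max) from procfs; raises FileNotFoundError if conntrack is absent."""
    root = Path(proc_root)
    count = int((root / "nf_conntrack_count").read_text().strip())
    limit = int((root / "nf_conntrack_max").read_text().strip())
    return count, limit

# Hypothetical sample values instead of a live read.
ratio = conntrack_ratio(245_000, 262_144)
print(f"conntrack {ratio:.1%} full, alert: {ratio > 0.90}")
```

On a node, conntrack_ratio(*read_conntrack()) gives the live value.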

kube-proxy healthz. Failure indicates API server disconnect or rule programming failure. Warning sign: /healthz returning 503 for more than 1 minute.

Service endpoint readiness. Zero endpoints means the Service has no healthy backends. Warning sign: zero available endpoints for a production Service, or endpoint count dropping more than 50 percent in 5 minutes.

CoreDNS health. DNS failures cascade to all inter-service traffic. Warning sign: CoreDNS pods not Ready, coredns_dns_responses_total with rcode="SERVFAIL" greater than 1 percent, or p99 latency greater than 500 ms.
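
The SERVFAIL percentage can be sketched as the share of responses carrying that rcode over a window. The input below is a hypothetical mapping from rcode label to counter increase; the function name is an assumption.

```python
def servfail_ratio(responses_by_rcode):
    """Share of DNS responses with rcode SERVFAIL in the sampled window."""
    total = sum(responses_by_rcode.values())
    if total == 0:
        return 0.0
    return responses_by_rcode.get("SERVFAIL", 0) / total

# Hypothetical counter increases over five minutes.
window = {"NOERROR": 9500, "NXDOMAIN": 350, "SERVFAIL": 150}
print(f"SERVFAIL share: {servfail_ratio(window):.1%}")
```

NXDOMAIN responses are counted in the denominator here; they are often inflated by search-path expansion, so a cluster with many of them will show a lower SERVFAIL share for the same absolute failure count.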

Network policy enforcement. Failing CNI plugins or misconfigured policies drop traffic silently. Warning sign: connection timeouts between pods that should be allowed, accompanied by CNI pod restarts or policy sync delays.

Storage and volumes

PVC binding status. Pending PVCs block pod startup. Warning sign: any production PVC in Pending for more than 5 minutes.

PV utilization. Full volumes cause write failures and potential corruption. Warning sign: kubelet_volume_stats_used_bytes greater than 90 percent of capacity, or inode usage greater than 90 percent.

Volume mount latency. Stuck mounts block pods in ContainerCreating indefinitely. Warning sign: storage_operation_duration_seconds greater than 2 minutes, or pods stuck ContainerCreating with mount events.

Security and certificates

Control plane certificate expiration. Expiry breaks all TLS communication between components. Warning sign: any control plane, etcd, or kubelet certificate with less than 30 days until expiration.
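
A minimal sketch of the remaining-lifetime calculation, assuming the notAfter value is available as a string in the format printed by openssl x509 -enddate (for example "Jun  1 12:00:00 2030 GMT"). The function name is hypothetical; ssl.cert_time_to_seconds is part of the Python standard library.

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now):
    """Days between now (timezone-aware) and a certificate's notAfter time."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400

# Hypothetical certificate end date checked against a fixed "now".
now = datetime(2030, 5, 22, 12, 0, 0, tzinfo=timezone.utc)
remaining = days_until_expiry("Jun  1 12:00:00 2030 GMT", now)
print(f"{remaining:.0f} days left, alert: {remaining < 30}")
```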

API server authentication failures. Mass 401s indicate certificate rotation failure or attack. Warning sign: apiserver_request_total with code="401" spiking greater than 10 times baseline.

RBAC modification rate. Unexpected privilege grants indicate compromise or misconfiguration. Warning sign: new cluster-admin bindings or clusterrolebindings created outside change windows.

Anonymous API requests. Successful anonymous access indicates misconfiguration or exposure. Warning sign: system:anonymous requests returning 200 for non-health endpoints.

Privileged container creation. Host-namespace access increases attack surface and escape risk. Warning sign: pods with privileged: true, hostNetwork: true, or hostPID: true outside system namespaces.
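
As an illustration, the check below inspects a pod object for the three settings named above. Note that privileged is set per container under securityContext, while hostNetwork and hostPID are pod-level fields. The set of system namespaces and the function name are assumptions to adapt; ephemeral containers are not covered by this sketch.

```python
# Assumption: namespaces where privileged system pods are expected.
SYSTEM_NAMESPACES = {"kube-system"}

def risky_settings(pod):
    """Return the risky settings found on a pod outside system namespaces."""
    if pod.get("metadata", {}).get("namespace") in SYSTEM_NAMESPACES:
        return []
    spec = pod.get("spec", {})
    findings = [key for key in ("hostNetwork", "hostPID") if spec.get(key) is True]
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for c in containers:
        if (c.get("securityContext") or {}).get("privileged") is True:
            findings.append(f"privileged:{c['name']}")
    return findings
```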

Audit log gaps. Missing logs break forensics and may indicate backend failure. Warning sign: unexplained gaps greater than 5 minutes in audit output during active cluster use.
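
Gap detection can be sketched as a scan over sorted event timestamps, flagging any interval longer than the allowed gap. The function name is hypothetical, and the timestamps are assumed to have been parsed from the audit events already.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive events are more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]

# Hypothetical audit event times with an 8-minute hole.
t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [t0, t0 + timedelta(minutes=1), t0 + timedelta(minutes=9), t0 + timedelta(minutes=10)]
for start, end in find_gaps(events):
    print(f"gap from {start} to {end}")
```

A gap is only suspicious during active cluster use, so a real check would also compare against API request volume for the same interval.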

How Netdata helps

  • Netdata collects kubelet, container, and node metrics in real time without central query latency.
  • It correlates API server latency with etcd disk latency and admission webhook latency on the same timeline.
  • It tracks cgroup-level CPU throttling, memory usage, and disk I/O per pod to pinpoint noisy neighbors.
  • It surfaces conntrack utilization, kube-proxy sync duration, and CoreDNS latency alongside workload health so you can move from symptom to cause without switching tools.