Kubernetes kubelet pprof troubleshooting: capturing heap and goroutine profiles
When a node starts flapping between Ready and NotReady, or kubelet memory climbs steadily while pod count stays flat, process-level metrics like go_goroutines and process_resident_memory_bytes will tell you that the kubelet process is struggling. They will not tell you whether the leak is in the PLEG relist path, the volume manager, or a probe goroutine pool. For that, you need a profile.
The kubelet exposes standard Go pprof endpoints on its authenticated HTTPS port. Because it runs as a node agent, not a Pod, the access pattern is different from profiling a containerized workload. You route through the Kubernetes API server proxy subresource or hit the node directly. This guide gives the exact commands and safety constraints for capturing heap, goroutine, CPU, and mutex profiles in production, analyzing them with go tool pprof, and avoiding the RBAC and performance pitfalls that extend incidents.
What this captures
The kubelet registers pprof handlers at /debug/pprof/ on port 10250. These endpoints expose the Go runtime’s internal profiling data.
- Heap profile (/debug/pprof/heap): A snapshot of live heap allocations and their call stacks. Use this to find memory leaks or unexpected object retention in caches, informers, or volume managers. Look for growth inside k8s.io/kubernetes/pkg/kubelet packages, CRI gRPC buffers, or volume manager state.
- Goroutine profile (/debug/pprof/goroutine): A dump of every goroutine’s stack trace. A healthy kubelet on a moderately busy node may run a few hundred goroutines. A count climbing into the thousands indicates a leak, often in the probe manager, pod workers, or watch handlers.
- CPU profile (/debug/pprof/profile): A sampling profile of on-CPU time. The default collection duration is 30 seconds. Use this to find hot paths during sync loops, PLEG relists, or cAdvisor housekeeping.
- Mutex profile (/debug/pprof/mutex): A contention profile for runtime mutexes. This is only available if the kubelet was started with mutex profiling enabled. When available, it reveals lock contention in the sync loop or CRI call paths.
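The four endpoints share one URL shape when reached through the API server proxy. The following is a minimal sketch of that mapping; the function name pprof_path is a hypothetical helper introduced here for illustration, not part of kubectl or the kubelet.

```shell
# Hypothetical helper: build the API server proxy path for one kubelet pprof endpoint.
# Usage: pprof_path NODE PROFILE   (PROFILE is heap, goroutine, profile, or mutex)
pprof_path() {
  case "$2" in
    heap|goroutine|profile|mutex) ;;
    *) echo "unknown profile: $2" >&2; return 1 ;;
  esac
  printf '/api/v1/nodes/%s/proxy/debug/pprof/%s\n' "$1" "$2"
}

# Example (not run here):
#   kubectl get --raw "$(pprof_path worker-03 heap)" > kubelet-heap-worker-03.prof
```

Rejecting unknown names up front avoids saving an error page under a .prof filename.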
Prerequisites
Before you collect profiles, confirm the following.
- RBAC for node proxy access. Your identity must be authorized to get on the nodes/proxy subresource. Run kubectl auth can-i get nodes/proxy. If the answer is no, you cannot reach the kubelet through the API server proxy.
- Debugging handlers enabled. The kubelet config field enableDebuggingHandlers must be true. In most distributions this is the default. If it is false, the /debug/pprof/ endpoints return an error such as 404 instead of profile data.
- Matching kubelet binary. Heap, goroutine, and CPU profiles record memory addresses. Kubelets built with Go 1.9 or later embed function names in the profile, so go tool pprof can usually resolve symbols on its own, but the exact kubelet binary that produced the profile is still needed for disassembly views and as a fallback when symbols are missing. Archive node binaries at deployment time. Without either embedded symbols or the binary, go tool pprof shows raw addresses.
- Go toolchain installed locally. You need go tool pprof on the workstation where you will analyze the profiles.
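The prerequisites that can be checked from the workstation could be scripted roughly as follows. This is a sketch under assumptions: preflight is a hypothetical function name, and it only verifies the RBAC answer, the reachability of the pprof index through the node proxy, and the presence of go on PATH.

```shell
# Hypothetical preflight check; prints one line per check.
# Usage: preflight NODE   (returns non-zero if any check fails)
preflight() {
  local node="$1" rc=0

  # kubectl auth can-i prints "yes" or "no".
  if [ "$(kubectl auth can-i get nodes/proxy 2>/dev/null)" = "yes" ]; then
    echo "ok: get on nodes/proxy is allowed"
  else
    echo "fail: get on nodes/proxy is denied"; rc=1
  fi

  # The pprof index is only served when the debugging handlers are enabled.
  if kubectl get --raw "/api/v1/nodes/${node}/proxy/debug/pprof/" >/dev/null 2>&1; then
    echo "ok: pprof index reachable on ${node}"
  else
    echo "fail: pprof index not reachable on ${node}"; rc=1
  fi

  # go tool pprof ships with the Go toolchain.
  if command -v go >/dev/null 2>&1; then
    echo "ok: go found on PATH"
  else
    echo "fail: go not found on PATH"; rc=1
  fi

  return "$rc"
}
```

Running every check instead of stopping at the first failure shows all missing prerequisites in one pass.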
Procedure
The examples below use the API server proxy path, which is the recommended production approach. Managed Kubernetes environments typically block direct access to kubelet port 10250. The API server path inherits your existing kubectl authentication and RBAC context.
    flowchart LR
        A[Operator workstation] --> B{kubectl access}
        B -->|API server proxy| C["kubectl get --raw<br/>/api/v1/nodes/NODE/proxy<br/>/debug/pprof/heap"]
        B -->|Direct node access| D["curl --cert ... https://NODE:10250<br/>/debug/pprof/heap"]
        C --> E[kubelet process]
        D --> E
        E --> F[Binary profile file]

Target the node and set the name.
    NODE_NAME=worker-03

Verify RBAC access.

    # Confirm you can proxy to the node
    kubectl auth can-i get nodes/proxy

Confirm pprof endpoints are reachable.

    # List available profiles
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/

If this returns a 404 or a similar "disabled" error, enableDebuggingHandlers is disabled on the kubelet. You cannot proceed without reconfiguring the kubelet and restarting it.

Capture a heap profile. Heap profiles are point-in-time snapshots and are safe to collect even on busy nodes.

    # Capture heap profile
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/heap > \
      kubelet-heap-${NODE_NAME}-$(date +%s).prof

Capture a goroutine profile. Goroutine profiles are also lightweight snapshots.

    # Capture goroutine profile
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/goroutine > \
      kubelet-goroutine-${NODE_NAME}-$(date +%s).prof

Capture a CPU profile. The handler blocks for the duration of the sample. The default is 30 seconds. Do not increase this duration on a saturated node during an active incident.

    # Capture CPU profile (blocks for ~30 seconds)
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/profile > \
      kubelet-cpu-${NODE_NAME}-$(date +%s).prof

Capture a mutex profile if available. This endpoint returns a profile only if mutex profiling was enabled at kubelet startup. If it returns an empty body or error, skip this step.

    # Capture mutex profile (requires runtime enablement)
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/mutex > \
      kubelet-mutex-${NODE_NAME}-$(date +%s).prof

Transfer the kubelet binary from the node. If you do not already have the exact binary that is running on the node, copy it now. Symbol resolution against the binary fails if the build does not match. The exact path depends on your distribution; common locations are /usr/bin/kubelet or /usr/local/bin/kubelet.

    # Copy the kubelet binary for local symbol resolution
    scp ${NODE_NAME}:/usr/bin/kubelet ./kubelet-${NODE_NAME} 2>/dev/null || \
      scp ${NODE_NAME}:/usr/local/bin/kubelet ./kubelet-${NODE_NAME}
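The capture steps follow one pattern, so they can be wrapped in a small function that also discards empty results. This is a sketch, not a definitive tool: capture_profile and capture_all are hypothetical names, and the file naming simply mirrors the commands in this procedure.

```shell
# Hypothetical wrapper around the capture commands in this procedure.
# Usage: capture_profile NODE ENDPOINT [OUTDIR]
# Prints the saved file path; removes the file and fails if the body is empty.
capture_profile() {
  local node="$1" endpoint="$2" dir="${3:-.}" label out
  # The CPU endpoint is named "profile"; label its file "cpu" as in the guide.
  case "$endpoint" in profile) label=cpu ;; *) label="$endpoint" ;; esac
  out="${dir}/kubelet-${label}-${node}-$(date +%s).prof"
  if kubectl get --raw "/api/v1/nodes/${node}/proxy/debug/pprof/${endpoint}" > "$out" \
      && [ -s "$out" ]; then
    echo "$out"
  else
    rm -f "$out"
    echo "capture failed: ${endpoint} on ${node}" >&2
    return 1
  fi
}

# Take the cheap snapshots first; the CPU request blocks for about 30 seconds.
# Usage: capture_all NODE [OUTDIR]
capture_all() {
  local p
  for p in heap goroutine profile; do
    capture_profile "$1" "$p" "${2:-.}" || return 1
  done
}
```

Ordering the snapshots before the CPU profile means a heap and goroutine capture already exist if the 30-second request is interrupted.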
Verifying it works
After the commands return, confirm the profile files contain data. Empty files indicate a connectivity issue or a disabled endpoint.
    # Check file type; Go writes profiles as gzip-compressed protobuf,
    # so valid files report as "gzip compressed data"
    file kubelet-heap-*.prof
    file kubelet-goroutine-*.prof
A valid heap profile will be several hundred kilobytes or larger on a busy node. A goroutine profile size scales with the number of active goroutines. If the file is zero bytes, verify that enableDebuggingHandlers is true and that your RBAC allows nodes/proxy.
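A stricter check than file size is to look at the first two bytes. Profiles fetched without a debug parameter are gzip-compressed protobuf, so they should start with the gzip magic bytes 1f 8b, while a saved error message will not. The sketch below assumes hypothetical helper names is_gzip and check_profiles.

```shell
# Hypothetical helpers: detect saved error pages or empty files among captures.
# Usage: is_gzip FILE   (succeeds if FILE starts with the gzip magic bytes 1f 8b)
is_gzip() {
  [ "$(od -An -tx1 -N2 "$1" 2>/dev/null | tr -d ' \n')" = "1f8b" ]
}

# Usage: check_profiles FILE...   (returns non-zero if any file is not gzip)
check_profiles() {
  local f rc=0
  for f in "$@"; do
    if is_gzip "$f"; then
      echo "ok: $f"
    else
      echo "not a profile: $f"; rc=1
    fi
  done
  return "$rc"
}

# Example (not run here):
#   check_profiles kubelet-*.prof
```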
Analyzing the profiles
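Use go tool pprof on the workstation where you saved the files. Pass the matching kubelet binary as the first argument: profiles from modern Go runtimes embed function names and often resolve without it, but the binary guarantees symbol resolution and enables disassembly views.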
    # Heap: interactive terminal
    go tool pprof ./kubelet-${NODE_NAME} ./kubelet-heap-*.prof
    (pprof) top
    # list takes a regular expression, so avoid the literal "(*Kubelet)" form
    (pprof) list Kubelet.*syncLoop

    # Heap: launch web UI
    go tool pprof -http=:8080 ./kubelet-${NODE_NAME} ./kubelet-heap-*.prof

    # Goroutine: find leaking stacks
    go tool pprof ./kubelet-${NODE_NAME} ./kubelet-goroutine-*.prof
    (pprof) top
    (pprof) traces

    # CPU: identify hot paths
    go tool pprof ./kubelet-${NODE_NAME} ./kubelet-cpu-*.prof
    (pprof) top
    (pprof) peek syncLoop
If you see hexadecimal addresses instead of function names, the profile carries no embedded symbols and the binary you passed does not match the running kubelet. In that case you cannot recover symbols after the fact unless you archived the exact binary.
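For leaks, two non-interactive views are often quicker than the interactive prompt: the difference between two heap profiles taken some time apart, and a count of goroutines grouped by stack. The sketch below is illustrative only. heap_growth and goroutine_groups are hypothetical names; the pprof flags used (-top, -nodecount, -sample_index, -base) are standard, and goroutine_groups expects the text dump that the goroutine endpoint returns when ?debug=1 is appended to the URL.

```shell
# Hypothetical helper: top in-use allocation sites that grew between two heap profiles.
# Usage: heap_growth EARLIER.prof LATER.prof [KUBELET_BINARY]
heap_growth() {
  go tool pprof -top -nodecount=20 -sample_index=inuse_space \
    -base "$1" ${3:+"$3"} "$2"
}

# Hypothetical helper: summarize a goroutine text dump read from stdin.
# Prints "COUNT FUNCTION" per group of identical stacks, largest first, where
# FUNCTION is the first frame outside the runtime package. Groups made up only
# of runtime frames are skipped.
# Usage: goroutine_groups [LIMIT] < goroutine-debug1.txt
goroutine_groups() {
  awk '
    /^[0-9]+ @/ { count = $1; want = 1; next }   # group header: "<count> @ <addresses>"
    want && /^#/ && $3 !~ /^runtime\./ {
      sub(/\+0x[0-9a-f]+$/, "", $3)              # drop the "+0x5a" offset suffix
      print count, $3
      want = 0
    }
  ' | sort -rn | head -n "${1:-10}"
}

# Example (not run here):
#   kubectl get --raw "/api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/goroutine?debug=1" \
#     | goroutine_groups 15
```

A single function that accounts for most of the goroutines in the summary is usually the place to start reading stacks.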
Common pitfalls
- RBAC denied on nodes/proxy. The API server proxy path is gated by RBAC. Standard cluster-admin bindings usually include this, but restricted roles do not. If kubectl auth can-i get nodes/proxy returns no, the API server will return 403 before the request reaches the kubelet.
- Port-forwarding to a Pod instead of the node. kubectl port-forward targets a Pod IP. The kubelet pprof endpoints live on the node itself. Use kubectl get --raw /api/v1/nodes/.../proxy/... or kubectl proxy combined with the node proxy subresource, not kubectl port-forward.
- Symbol resolution failure. If a profile lacks embedded symbols, you cannot map its addresses to functions without the exact kubelet binary. Save node binaries during image build or node provisioning. Managed Kubernetes control planes may not give you direct access to the kubelet binary at all, in which case you must rely on provider support channels.
- CPU profile duration risk. Collecting a CPU profile with ?seconds=120 keeps the handler blocked and the profiler's sampling overhead active for the full two minutes. On a kubelet that is already thrashing, this can push it over the edge. Stick to the 30-second default during incidents.
- Read-only port 10255. Some older clusters still expose the unauthenticated read-only port. Do not use it for pprof access in production. It is deprecated, unauthenticated, and may be disabled by default on managed clusters. Route through the API server proxy to the authenticated kubelet port 10250 instead.
- Managed provider restrictions. On GKE, EKS, and AKS, tenant workloads cannot reach kubelet port 10250 directly, and cloud-provider RBAC may prevent node proxy access entirely. If kubectl get --raw returns 403 at the API server level, your provider may be blocking node proxy access. Open a provider support ticket rather than attempting to bypass the control plane boundary.
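One way to guard against the duration pitfall is to cap the requested sample length before it reaches the kubelet. The sketch below assumes hypothetical helpers named clamp_seconds and cpu_profile; the seconds query parameter itself is the one discussed in the list above.

```shell
# Hypothetical guard: cap a requested CPU profile duration.
# Usage: clamp_seconds REQUESTED [MAX]   (MAX defaults to 30)
clamp_seconds() {
  local want="$1" max="${2:-30}"
  case "$want" in
    ''|*[!0-9]*) echo "seconds must be a whole number" >&2; return 1 ;;
  esac
  if [ "$want" -lt 1 ]; then
    echo "seconds must be at least 1" >&2
    return 1
  fi
  if [ "$want" -gt "$max" ]; then echo "$max"; else echo "$want"; fi
}

# Usage: cpu_profile NODE [SECONDS] > kubelet-cpu.prof
# Writes the profile to stdout; the request blocks for the clamped duration.
cpu_profile() {
  local node="$1" secs
  secs=$(clamp_seconds "${2:-30}") || return 1
  kubectl get --raw "/api/v1/nodes/${node}/proxy/debug/pprof/profile?seconds=${secs}"
}
```

With this wrapper, a request for 120 seconds is silently reduced to 30, which keeps incident runbooks from carrying a risky value forward.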
Signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Kubelet goroutine count | Rising goroutines are a leading indicator of leaks before OOM | go_goroutines > 500 or growing over days |
| Kubelet RSS memory | Memory pressure precedes eviction or OOM kill | process_resident_memory_bytes > 80% of the kubelet cgroup memory limit, if one is set |
| PLEG relist duration | Slow relist often correlates with goroutine or memory pressure | p99 relist duration > 10 seconds sustained |
| Kubelet CPU usage | High CPU may indicate contention or sync loop saturation | > 1 core sustained without corresponding pod churn |
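The first two signals can be spot-checked from the workstation without a monitoring stack, because the kubelet metrics endpoint is reachable through the same node proxy. This is a rough sketch: metric_value and kubelet_signals are hypothetical names, and it assumes the kubelet exports the standard go_goroutines and process_resident_memory_bytes samples without labels.

```shell
# Hypothetical helper: print the value of an unlabelled Prometheus sample.
# Usage: metric_value NAME < metrics.txt
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Hypothetical helper: fetch kubelet metrics once and print two leak signals.
# Usage: kubelet_signals NODE
kubelet_signals() {
  local metrics
  metrics=$(kubectl get --raw "/api/v1/nodes/$1/proxy/metrics") || return 1
  echo "goroutines=$(printf '%s\n' "$metrics" | metric_value go_goroutines)"
  echo "rss_bytes=$(printf '%s\n' "$metrics" | metric_value process_resident_memory_bytes)"
}
```

Comparing two readings taken a few hours apart gives a quick indication of whether a profile is worth capturing.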
How Netdata helps
- Netdata scrapes kubelet metrics such as go_goroutines, go_memstats_heap_inuse_bytes, and process_resident_memory_bytes, giving you the trend lines that justify capturing a pprof profile.
- Node-level cgroup charts show kubelet CPU throttling and memory usage against its cgroup limit, helping you distinguish a resource-starved kubelet from an internal leak.
- Correlating kubelet memory growth with MemAvailable and disk latency on the same node confirms whether the pressure originates inside the kubelet process or from the host environment.