Kubernetes kubelet pprof troubleshooting: capturing heap and goroutine profiles

When a node starts flapping between Ready and NotReady, or kubelet memory climbs steadily while pod count stays flat, node-level metrics like go_goroutines and process_resident_memory_bytes will tell you that the kubelet process is struggling. They will not tell you whether the leak is in the PLEG relist path, the volume manager, or a probe goroutine pool. For that, you need a profile.

The kubelet exposes standard Go pprof endpoints on its authenticated HTTPS port. Because it runs as a node agent, not a Pod, the access pattern is different from profiling a containerized workload. You route through the Kubernetes API server proxy subresource or hit the node directly. This guide gives the exact commands and safety constraints for capturing heap, goroutine, CPU, and mutex profiles in production, analyzing them with go tool pprof, and avoiding the RBAC and performance pitfalls that extend incidents.

What this captures

The kubelet registers pprof handlers at /debug/pprof/ on port 10250. These endpoints expose the Go runtime’s internal profiling data.

  • Heap profile (/debug/pprof/heap): A snapshot of live heap allocations and their call stacks. Use this to find memory leaks or unexpected object retention in caches, informers, or volume managers. Look for growth inside k8s.io/kubernetes/pkg/kubelet packages, CRI gRPC buffers, or volume manager state.
  • Goroutine profile (/debug/pprof/goroutine): A dump of every goroutine’s stack trace. A healthy kubelet on a moderately busy node may run a few hundred goroutines. A count climbing into the thousands indicates a leak, often in the probe manager, pod workers, or watch handlers.
  • CPU profile (/debug/pprof/profile): A sampling profile of on-CPU time. The default collection duration is 30 seconds. Use this to find hot paths during sync loops, PLEG relists, or cAdvisor housekeeping.
  • Mutex profile (/debug/pprof/mutex): A contention profile for runtime mutexes. This is only available if the kubelet was started with mutex profiling enabled. When available, it reveals lock contention in the sync loop or CRI call paths.

Prerequisites

Before you collect profiles, confirm the following.

  • RBAC for node proxy access. Your identity must be authorized for the get verb on the nodes/proxy subresource. Run kubectl auth can-i get nodes/proxy. If the answer is no, you cannot reach the kubelet through the API server proxy.
  • Debugging handlers enabled. The kubelet config field enableDebuggingHandlers must be true. In most distributions this is the default. If it is false, all /debug/pprof/ endpoints return 404.
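  • Matching kubelet binary. Profiles record program addresses. Recent Go runtimes also embed function names in the profile, so go tool pprof can often resolve symbols on its own, but the exact kubelet binary that produced the profile is the reliable fallback and is required for disassembly views. Archive node binaries at deployment time. If neither the profile nor a matching binary supplies symbols, go tool pprof shows raw addresses.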
  • Go toolchain installed locally. You need go tool pprof on the workstation where you will analyze the profiles.
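
If the RBAC check fails, a cluster administrator can grant the access with a narrowly scoped role. The following is a minimal sketch rather than a recommended policy. The role name kubelet-pprof-reader, the binding name, and the function name are hypothetical. Because nodes/proxy covers other kubelet API paths as well, bind it only to identities that need it.

```shell
# Sketch: grant one user "get" on the nodes/proxy subresource.
# The role and binding names below are hypothetical placeholders.
grant_kubelet_pprof_access() (
  user="${1:-}"
  if [ -z "$user" ]; then
    echo "usage: grant_kubelet_pprof_access USER" >&2
    exit 1
  fi
  # ClusterRole limited to the get verb on nodes/proxy
  kubectl create clusterrole kubelet-pprof-reader \
    --verb=get --resource=nodes/proxy || exit 1
  # Bind the role to the named user only
  kubectl create clusterrolebinding "kubelet-pprof-reader-${user}" \
    --clusterrole=kubelet-pprof-reader --user="$user"
)
```

After running grant_kubelet_pprof_access jane, an administrator could confirm the result with kubectl auth can-i get nodes/proxy --as=jane.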

Procedure

The examples below use the API server proxy path, which is the recommended production approach. Managed Kubernetes environments typically block direct access to kubelet port 10250. The API server path inherits your existing kubectl authentication and RBAC context.

flowchart LR
    A[Operator workstation] --> B{kubectl access}
    B -->|API server proxy| C["kubectl get --raw /api/v1/nodes/NODE/proxy/debug/pprof/heap"]
    B -->|Direct node access| D["curl --cert ... https://NODE:10250/debug/pprof/heap"]
    C --> E[kubelet process]
    D --> E
    E --> F[Binary profile file]

  1. Target the node and set the name.

    NODE_NAME=worker-03
    
  2. Verify RBAC access.

    # Confirm you can proxy to the node
    kubectl auth can-i get nodes/proxy
    
  3. Confirm pprof endpoints are reachable.

    # List available profiles
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/
    

    If this returns a 404, enableDebuggingHandlers is disabled on the kubelet. You cannot proceed without reconfiguring the kubelet and restarting it.

  4. Capture a heap profile. Heap profiles are point-in-time snapshots and are safe to collect even on busy nodes.

    # Capture heap profile
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/heap > \
      kubelet-heap-${NODE_NAME}-$(date +%s).prof
    
  5. Capture a goroutine profile. Goroutine profiles are also lightweight snapshots.

    # Capture goroutine profile
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/goroutine > \
      kubelet-goroutine-${NODE_NAME}-$(date +%s).prof
    
  6. Capture a CPU profile. The handler blocks for the duration of the sample. The default is 30 seconds. Do not increase this duration on a saturated node during an active incident.

    # Capture CPU profile (blocks for ~30 seconds)
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/profile > \
      kubelet-cpu-${NODE_NAME}-$(date +%s).prof
    
  7. Capture a mutex profile if available. This endpoint returns a profile only if mutex profiling was enabled at kubelet startup. If it returns an empty body or error, skip this step.

    # Capture mutex profile (requires runtime enablement)
    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/debug/pprof/mutex > \
      kubelet-mutex-${NODE_NAME}-$(date +%s).prof
    
  8. Transfer the kubelet binary from the node. If you do not already have the exact binary that is running on the node, copy it now. Symbol resolution fails if the build does not match. The exact path depends on your distribution; common locations are /usr/bin/kubelet or /usr/local/bin/kubelet.

    # Copy the kubelet binary for local symbol resolution
    scp ${NODE_NAME}:/usr/bin/kubelet ./kubelet-${NODE_NAME} 2>/dev/null || \
      scp ${NODE_NAME}:/usr/local/bin/kubelet ./kubelet-${NODE_NAME}
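
The capture steps above can be wrapped in a small helper so that all files from one incident land in a single timestamped directory. This is a sketch under assumptions: the function names, the directory naming, and the choice to capture heap, goroutine, and CPU (in that order) are illustrative and not part of the kubelet or kubectl.

```shell
# Sketch: capture heap, goroutine, and CPU profiles from one node.
# Function and directory names are hypothetical.

# Build the API server proxy path for a kubelet pprof endpoint.
# $1 = node name, $2 = profile (heap, goroutine, profile, mutex),
# $3 = optional query string such as "seconds=30"
pprof_proxy_path() {
  if [ -n "${3:-}" ]; then
    printf '/api/v1/nodes/%s/proxy/debug/pprof/%s?%s\n' "$1" "$2" "$3"
  else
    printf '/api/v1/nodes/%s/proxy/debug/pprof/%s\n' "$1" "$2"
  fi
}

capture_kubelet_profiles() (
  node="$1"
  outdir="kubelet-pprof-${node}-$(date +%s)"
  mkdir -p "$outdir" || exit 1
  # Cheap snapshots first, the blocking CPU sample last
  for profile in heap goroutine profile; do
    query=""
    if [ "$profile" = "profile" ]; then
      query="seconds=30"
    fi
    if ! kubectl get --raw "$(pprof_proxy_path "$node" "$profile" "$query")" \
        > "${outdir}/${profile}.prof"; then
      echo "capture failed: ${profile}" >&2
    fi
  done
  # Print the directory so callers can pass it to later steps
  echo "$outdir"
)
```

Running capture_kubelet_profiles "$NODE_NAME" prints the directory it created. The same path helper can request a plain-text goroutine dump, for example kubectl get --raw "$(pprof_proxy_path "$NODE_NAME" goroutine debug=2)", which is convenient for reading stacks without go tool pprof.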
    

Verifying it works

After the commands return, confirm the profile files contain data. Empty files indicate a connectivity issue or a disabled endpoint.

# Check file type; valid binary profiles report as gzip compressed data
file kubelet-heap-*.prof
file kubelet-goroutine-*.prof

A valid heap profile will be several hundred kilobytes or larger on a busy node. A goroutine profile size scales with the number of active goroutines. If the file is zero bytes, verify that enableDebuggingHandlers is true and that your RBAC allows nodes/proxy.
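
The same check can be scripted. The sketch below assumes the default binary pprof format, which Go writes as gzip-compressed protobuf, so a valid file is non-empty and starts with the gzip magic bytes 1f 8b. The helper name is_gzip_profile is hypothetical, and text dumps requested with a debug query parameter would not pass this check.

```shell
# Sketch: report whether a captured profile looks like gzip data.
# is_gzip_profile is a hypothetical helper name.
is_gzip_profile() {
  # Reject missing or zero-byte files
  [ -s "$1" ] || return 1
  # Read the first two bytes as hex; gzip streams start with 1f 8b
  magic="$(od -An -tx1 -N2 "$1" | tr -d '[:space:]')"
  [ "$magic" = "1f8b" ]
}

# Example use, run manually in the capture directory:
#   for f in kubelet-*.prof; do
#     is_gzip_profile "$f" || echo "suspect: $f"
#   done
```

A file that fails the check often contains an error message from the API server or kubelet instead of profile data, so printing it with cat is a reasonable next step.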

Analyzing the profiles

Use go tool pprof on the workstation where you saved the files. Always pass the matching kubelet binary as the first argument so addresses resolve to function names.

# Heap: interactive terminal
go tool pprof ./kubelet-${NODE_NAME} ./kubelet-heap-*.prof
(pprof) top
(pprof) list Kubelet.*syncLoop

# Heap: launch web UI
go tool pprof -http=:8080 ./kubelet-${NODE_NAME} ./kubelet-heap-*.prof

# Goroutine: find leaking stacks
go tool pprof ./kubelet-${NODE_NAME} ./kubelet-goroutine-*.prof
(pprof) top
(pprof) traces

# CPU: identify hot paths
go tool pprof ./kubelet-${NODE_NAME} ./kubelet-cpu-*.prof
(pprof) top
(pprof) peek syncLoop

If you see hexadecimal addresses instead of function names, your binary does not match the running kubelet. You cannot recover symbols after the fact unless you archived the exact binary.
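
For incident notes or tickets, non-interactive reports are often easier to share than an interactive session. The sketch below wraps two go tool pprof invocations. The function names are hypothetical, and the second function assumes you captured two heap profiles some minutes apart, which this guide does not otherwise require.

```shell
# Sketch: non-interactive pprof reports. Function names are hypothetical.

# Print the top 20 entries of a profile as text.
# $1 = kubelet binary, $2 = profile file
pprof_top() {
  go tool pprof -top -nodecount=20 "$1" "$2"
}

# Subtract an earlier heap snapshot from a later one, by in-use bytes.
# Entries with positive values grew between the two captures.
# $1 = kubelet binary, $2 = earlier heap profile, $3 = later heap profile
pprof_heap_growth() {
  go tool pprof -top -nodecount=20 -sample_index=inuse_space \
    -base "$2" "$1" "$3"
}
```

For example, pprof_heap_growth ./kubelet-worker-03 heap-early.prof heap-late.prof lists the call sites whose live heap grew the most, which narrows a slow leak faster than reading a single snapshot.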

Common pitfalls

  • RBAC denied on nodes/proxy. The API server proxy path is gated by RBAC. The built-in cluster-admin role grants this access, but restricted roles generally do not. If kubectl auth can-i get nodes/proxy returns no, the API server will return 403 before the request reaches the kubelet.
  • Port-forwarding to a Pod instead of the node. kubectl port-forward targets a Pod IP. The kubelet pprof endpoints live on the node itself. Use kubectl get --raw /api/v1/nodes/.../proxy/... or kubectl proxy combined with the node proxy subresource, not kubectl port-forward.
  • Symbol resolution failure. Profiles contain addresses. Without the exact kubelet binary, you cannot map addresses to functions. Save node binaries during image build or node provisioning. Managed Kubernetes control planes may not give you direct access to the kubelet binary at all, in which case you must rely on provider support channels.
  • CPU profile duration risk. Collecting a CPU profile with ?seconds=120 keeps the sampling profiler running and the request open for two minutes, and the profiler adds overhead for that entire window. On a kubelet that is already thrashing, this can push it over the edge. Stick to the 30-second default during incidents.
  • Read-only port 10255. Some older clusters still expose the unauthenticated read-only port. Do not use it for pprof access in production. It is deprecated, unauthenticated, and may be disabled by default on managed clusters. Route through the API server proxy instead, which reaches the kubelet on its authenticated port 10250.
  • Managed provider restrictions. On GKE, EKS, and AKS, tenant workloads cannot reach kubelet port 10250 directly, and cloud-provider RBAC may prevent node proxy access entirely. If kubectl get --raw returns 403 at the API server level, your provider may be blocking node proxy access. Open a provider support ticket rather than attempting to bypass the control plane boundary.
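
The kubectl proxy alternative mentioned above can be sketched as follows. The local port 8001 and the function name node_pprof_url are assumptions for illustration. The proxy reuses your kubectl credentials, so the same nodes/proxy RBAC requirement applies.

```shell
# Sketch: build a node pprof URL served through a local kubectl proxy.
# The default port (8001) and the function name are assumptions.
# $1 = node name, $2 = profile name, $3 = optional local proxy port
node_pprof_url() {
  printf 'http://127.0.0.1:%s/api/v1/nodes/%s/proxy/debug/pprof/%s\n' \
    "${3:-8001}" "$1" "$2"
}

# Typical use, run manually:
#   kubectl proxy --port=8001 &
#   curl -sS -o kubelet-heap.prof "$(node_pprof_url "$NODE_NAME" heap)"
#   kill %1
```

Stop the proxy as soon as the capture finishes, since it exposes your API credentials to any local process that can reach the port.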

Signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Kubelet goroutine count | Rising goroutines are a leading indicator of leaks before OOM | go_goroutines > 500 or growing over days |
| Kubelet RSS memory | Memory pressure precedes eviction or OOM kill | process_resident_memory_bytes > 80% of container limit |
| PLEG relist duration | Slow relist often correlates with goroutine or memory pressure | p99 relist duration > 10 seconds sustained |
| Kubelet CPU usage | High CPU may indicate contention or sync loop saturation | > 1 core sustained without corresponding pod churn |
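
A quick spot check of the first two signals is possible without a monitoring stack, because the kubelet serves its metrics through the same node proxy. The sketch below is a minimal example under assumptions: the helper names metric_value and exceeds are hypothetical, and the 500 threshold simply mirrors the warning sign listed above.

```shell
# Sketch: read one unlabeled metric from Prometheus text format on stdin.
# metric_value and exceeds are hypothetical helper names.
metric_value() {
  # Match lines whose first field is exactly the metric name
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Succeed when a numeric value is above a threshold.
# $1 = value, $2 = threshold; awk also parses exponent notation
exceeds() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Example use, run manually:
#   count="$(kubectl get --raw "/api/v1/nodes/${NODE_NAME}/proxy/metrics" \
#     | metric_value go_goroutines)"
#   exceeds "$count" 500 && echo "goroutines high: ${count}"
```

A single reading says little on its own. The trend over hours or days is what justifies capturing a profile.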

How Netdata helps

  • Netdata scrapes kubelet metrics such as go_goroutines, go_memstats_heap_inuse_bytes, and process_resident_memory_bytes, giving you the trend lines that justify capturing a pprof profile.
  • Node-level cgroup charts show kubelet CPU throttling and memory usage against its container limit, helping you distinguish a resource-starved kubelet from an internal leak.
  • Correlating kubelet memory growth with MemAvailable and disk latency on the same node confirms whether the pressure originates inside the kubelet process or from the host environment.