Kubernetes controller-manager leader election failures

Your Deployment has stopped scaling. Nodes cordoned hours ago are still draining. Garbage collection is paused, and orphaned volumes are not being cleaned up. The kube-controller-manager runs these reconciliation loops, and in an HA cluster only the leader performs work. When leader election fails, the controller-manager exits, and the control plane stops acting on desired state. Existing workloads keep running, but nothing new is managed.

The kube-controller-manager coordinates through a Lease object in the coordination.k8s.io API group. The leader retries renewal every --leader-elect-retry-period (default 2 seconds) and must succeed before --leader-elect-renew-deadline (default 10 seconds) elapses. Other instances treat the lease as expired once --leader-elect-lease-duration (default 15 seconds) passes without a renewal. If writes to etcd are too slow, the API server is saturated, RBAC permissions on the Lease are removed, or the election timing is misconfigured, the leader loses the lock, logs leaderelection lost, and exits. Until another instance acquires the lease, no controllers are reconciling.
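
To make the timing relationship concrete, the following Python sketch works through the arithmetic for the default values quoted above. The function and field names are illustrative and not part of any Kubernetes API, and the model deliberately ignores jitter and request latency.

```python
from dataclasses import dataclass


@dataclass
class ElectionBudget:
    renew_attempts: int     # retry ticks that fit inside the renew deadline
    expiry_margin_s: float  # lease duration minus renew deadline


def election_budget(lease_duration_s: float = 15.0,
                    renew_deadline_s: float = 10.0,
                    retry_period_s: float = 2.0) -> ElectionBudget:
    """Rough timing budget for one renewal cycle (no jitter, no latency)."""
    attempts = int(renew_deadline_s // retry_period_s)
    return ElectionBudget(attempts, lease_duration_s - renew_deadline_s)


budget = election_budget()
print(budget.renew_attempts, budget.expiry_margin_s)  # 5 5.0
```

With the defaults, roughly five retry ticks fit inside the renew deadline, and there are 5 seconds between the leader giving up and the lease expiring for the other instances.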

What this means

In an HA cluster, kube-controller-manager instances use the Lease named kube-controller-manager in the kube-system namespace. If the leader cannot renew, typically because a Lease update is blocked on slow etcd fsync or API server saturation, it logs:

failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded

It then logs leaderelection lost and exits. The container restarts and contests the election again. This is intentional: the leader prefers to die rather than continue with a potentially stale view. The cost is that a write-latency spike lasting longer than the renew deadline can trigger a full controller restart and a reconciliation pause.

A related failure mode is the leaderless window after a clean shutdown. Because the kube-controller-manager does not release the lease on exit, the old lease persists for its full duration, 15 seconds by default. During that window no leader is active, even if another instance is healthy.

flowchart TD
    A[etcd disk latency spike] --> B[API server mutating latency rises]
    B --> C[Lease PUT request times out]
    C --> D[Leader fails to renew before deadline]
    D --> E[Process exits with leaderelection lost]
    E --> F[No valid lease for ~15s TTL window]
    F --> G[Controllers stop reconciling]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| etcd or API server latency | context deadline exceeded in logs during lease renewal; elevated etcd fsync or API mutating latency | etcd_disk_wal_fsync_duration_seconds and API server p99 latency |
| Lease timing misconfiguration | Premature leadership loss under light load because the renew deadline is too close to the lease duration | Pod spec for --leader-elect-lease-duration and --leader-elect-renew-deadline values |
| RBAC denied on Lease operations | Forbidden errors in controller-manager or audit logs when updating the Lease | Default ClusterRole system:kube-controller-manager permissions on leases |
| Retrieving-lock timeout cascade | After one failed acquisition, the controller-manager cannot recover and enters CrashLoopBackOff | Pod restart count and logs for timed out waiting for the condition |
| Leaderless window after clean exit | A ~15 second gap in reconciliation immediately after a rolling restart or cordon of the leader | Lease renewTime relative to the event, compared against leaseDurationSeconds |

Quick checks

# Check the current Lease holder and freshness
kubectl get lease kube-controller-manager -n kube-system -o yaml
# Check controller-manager pod health and restart count
kubectl get pods -n kube-system -l component=kube-controller-manager
# Check logs for leader election failures
kubectl logs -n kube-system kube-controller-manager-<node> | grep -iE "leaderelection|renew lease|failed to acquire"
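# Check etcd WAL fsync latency (stacked or self-managed etcd). Adjust the URL to your
# etcd metrics listener; kubeadm clusters typically serve metrics on http://127.0.0.1:2381/metrics.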
curl -s http://localhost:2379/metrics | grep '^etcd_disk_wal_fsync_duration_seconds'
# Check API server mutating latency (lease renewals use PUT)
kubectl get --raw /metrics | grep 'apiserver_request_duration_seconds' | grep -E 'verb="(POST|PUT)"'
# Check default controller-manager ClusterRole for lease permissions
kubectl get clusterrole system:kube-controller-manager -o yaml | grep -C 3 leases
# Check configured leader-election flags
kubectl get pod -n kube-system -l component=kube-controller-manager -o yaml | grep -E 'leader-elect-lease-duration|leader-elect-renew-deadline'
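
When the logs are long, it can help to count how often each failure signature appears rather than read the grep output by eye. The sketch below is one hypothetical way to do that in Python. The patterns are only the message fragments quoted in this article, exact log wording can differ between Kubernetes versions, and the label names are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Ordered so that a line matching several fragments gets the most specific label.
PATTERNS = [
    ("rbac_denied", re.compile(r"forbidden", re.IGNORECASE)),
    ("renewal_timeout", re.compile(r"failed to renew lease", re.IGNORECASE)),
    ("acquisition_timeout", re.compile(r"timed out waiting for the condition", re.IGNORECASE)),
    ("leadership_lost", re.compile(r"leaderelection lost", re.IGNORECASE)),
]


def classify(line: str) -> Optional[str]:
    """Return the first matching label for a log line, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines: Iterable[str]) -> dict:
    """Count how many lines fall under each label."""
    counts = Counter()
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return dict(counts)


sample = [
    "failed to renew lease kube-system/kube-controller-manager: "
    "failed to tryAcquireOrRenew context deadline exceeded",
    "leaderelection lost",
    "an unrelated log line",
]
print(summarize(sample))  # {'renewal_timeout': 1, 'leadership_lost': 1}
```

A run dominated by renewal_timeout points toward the latency checks, while rbac_denied points toward the ClusterRole check.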

How to diagnose it

  1. Confirm the Lease state. Run kubectl get lease kube-controller-manager -n kube-system -o yaml. Check holderIdentity to see which instance is the leader. Check whether renewTime is within leaseDurationSeconds of the current time, allowing for clock skew. If the lease has not been renewed within its TTL, there is no active leader and reconciliation is stopped.
  2. Read the controller-manager logs. Look for failed to renew lease, leaderelection lost, or timed out waiting for the condition. The exact message separates a renewal timeout from an acquisition failure. Renewal timeouts point to API server or etcd latency; acquisition timeouts point to a stuck lock or extreme latency.
  3. Check etcd and API server latency. Lease renewals are synchronous writes through the API server to etcd. If etcd WAL fsync p99 is above 100 ms or API server mutating latency is elevated, the renewal can time out. See Kubernetes API server etcd latency: detection and cascading failures.
  4. Verify leader election flag margins. If the log shows premature loss even when the API server is healthy, compare --leader-elect-lease-duration and --leader-elect-renew-deadline. If they are set too close together, for example 25s and 20s, there is insufficient time for retries before expiration, causing repeated flapping.
  5. Check for CrashLoopBackOff from lock retrieval failure. If the pod is restarting repeatedly and logs show timed out waiting for the condition on retrieving the lock, the controller-manager may not recover automatically until the underlying latency is resolved. This matches the behavior described in GitHub issue #117922.
  6. Audit for RBAC denials. If logs or audit logs show 403 Forbidden on lease operations, verify that the controller-manager identity retains get, create, and update permissions on leases in kube-system.
  7. Map the gap to a control plane event. If the failure began during a rolling update of the control plane nodes, the leaderless window left by an unreleased lease may be the cause. Expect a reconciliation gap of up to the full lease duration (15 seconds by default) per leader transition, until the old lease expires.
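
The freshness check in step 1 can be sketched in Python as follows. This is a minimal example that assumes the Lease has been exported with -o json and loaded into a dict. The function names and the 2-second skew margin are assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def parse_micro_time(value: str) -> datetime:
    """Parse a UTC timestamp such as 2024-05-01T12:00:00.000000Z."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {value!r}")


def lease_age_report(lease: dict, now: datetime, skew_margin_s: float = 2.0) -> dict:
    """Compare renewTime against leaseDurationSeconds plus a skew margin."""
    spec = lease["spec"]
    age_s = (now - parse_micro_time(spec["renewTime"])).total_seconds()
    return {
        "holder": spec.get("holderIdentity"),
        "age_s": age_s,
        "stale": age_s > spec["leaseDurationSeconds"] + skew_margin_s,
    }


# Hypothetical Lease contents, shaped like the output of kubectl get lease -o json.
lease = {"spec": {"holderIdentity": "cp-1_abc",
                  "leaseDurationSeconds": 15,
                  "renewTime": "2024-05-01T12:00:00.000000Z"}}
now = datetime(2024, 5, 1, 12, 0, 20, tzinfo=timezone.utc)
print(lease_age_report(lease, now))
# {'holder': 'cp-1_abc', 'age_s': 20.0, 'stale': True}
```

In practice you would pass datetime.now(timezone.utc) as now. The result is only as trustworthy as the clock on the machine running the check.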

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Lease object freshness | A stale lease means no leader is reconciling state | renewTime older than leaseDurationSeconds plus a clock skew margin |
| Controller-manager pod restarts | The leader exits on renewal failure | More than 1 restart in 5 minutes |
| etcd WAL fsync latency | Every lease renewal waits on etcd fsync | p99 > 100 ms sustained |
| API server mutating request latency | Lease updates are PUT operations through the API server | p99 > 500 ms sustained |
| Controller workqueue depth | If the leader was already falling behind, depth grows before the exit | Depth > 100 sustained for core queues |
| Audit log 403 rate on leases | RBAC denial blocks renewal | Any sustained 403 on coordination.k8s.io/v1/namespaces/kube-system/leases |
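
The p99 thresholds above refer to histogram metrics, which the /metrics endpoint exposes as cumulative buckets rather than as a ready-made percentile. The sketch below estimates a quantile from raw bucket lines by linear interpolation, similar in spirit to what a monitoring system does. The sample bucket values are invented, and because the counters are cumulative since process start, a single scrape reflects lifetime latency. For a recent window you would need to subtract two scrapes.

```python
import math
import re

BUCKET_RE = re.compile(r'_bucket\{[^}]*le="([^"]+)"[^}]*\}\s+([0-9.eE+-]+)')


def parse_buckets(text: str) -> list:
    """Extract (upper_bound, cumulative_count) pairs, sorted by bound."""
    buckets = []
    for line in text.splitlines():
        match = BUCKET_RE.search(line)
        if match:
            buckets.append((float(match.group(1)), float(match.group(2))))
    return sorted(buckets)


def estimate_quantile(buckets: list, q: float):
    """Interpolate linearly inside the bucket that contains the target rank."""
    if not buckets or buckets[-1][1] == 0:
        return None
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate inside the +Inf bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented sample values for illustration only.
sample = """
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 900
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 980
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 1000
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1000
"""
p99 = estimate_quantile(parse_buckets(sample), 0.99)
print(f"{p99 * 1000:.0f} ms")  # 192 ms
```

In this invented sample the estimate lands above the 100 ms warning threshold from the table.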

Fixes

If the cause is etcd or API server latency

Fix the storage or API server bottleneck first. Reduce etcd disk I/O contention, defragment etcd during a maintenance window if needed, and ensure the API server is not saturated by admission webhooks or APF throttling. Do not restart the controller-manager until the underlying latency drops; restarting into the same slow environment produces a CrashLoopBackOff.

If the cause is lease timing misconfiguration

Adjust the flags with safe margins. The renew deadline must be strictly less than the lease duration, with enough room for the retry period to fire multiple times. Avoid aggressive custom values. If you must change them, keep the renew deadline well under the lease duration and test under production load before rolling out.
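
Before rolling out custom values, a quick sanity check like the sketch below can catch obviously unsafe combinations. The first two rules mirror the validation that, to my understanding, client-go's leaderelection package applies (lease duration greater than renew deadline, and renew deadline greater than retry period times a jitter factor of 1.2). The min_attempts headroom rule is an arbitrary assumption of this example, not a Kubernetes requirement.

```python
JITTER_FACTOR = 1.2  # assumed to match client-go's leaderelection.JitterFactor


def check_election_flags(lease_duration_s: float,
                         renew_deadline_s: float,
                         retry_period_s: float,
                         min_attempts: int = 3) -> list:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if lease_duration_s <= renew_deadline_s:
        problems.append("lease duration must be greater than renew deadline")
    if renew_deadline_s <= JITTER_FACTOR * retry_period_s:
        problems.append("renew deadline must be greater than retry period * 1.2")
    # Headroom heuristic (illustrative): how many retry ticks fit in the deadline.
    attempts = int(renew_deadline_s // retry_period_s) if retry_period_s > 0 else 0
    if attempts < min_attempts:
        problems.append(f"only {attempts} renewal attempt(s) fit in the renew deadline")
    return problems


print(check_election_flags(15, 10, 2))  # []
print(check_election_flags(15, 4, 2))
# ['only 2 renewal attempt(s) fit in the renew deadline']
```

An empty list does not prove the values are safe under production load. It only rules out the combinations that are wrong on paper.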

If the cause is RBAC drift

Restore get, create, and update permissions on leases in the kube-system namespace for the controller-manager identity. Restart the controller-manager after fixing the Role or ClusterRoleBinding.
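
To confirm which verbs a role actually grants, you can inspect its rules programmatically. The sketch below is a simplified evaluation over the rules list of a ClusterRole exported with -o json. It is not the real Kubernetes authorizer, it ignores bindings and aggregation, and the helper names are hypothetical. It assumes that a rule restricted by resourceNames cannot authorize create, since the object name is not known at authorization time.

```python
REQUIRED_VERBS = ("get", "create", "update")


def _matches(values, wanted: str) -> bool:
    """True if the rule field lists the wanted value or a wildcard."""
    return "*" in values or wanted in values


def lease_verbs_granted(rules: list, lease_name: str = "kube-controller-manager") -> set:
    """Collect which required verbs the rules grant on the named Lease."""
    granted = set()
    for rule in rules:
        if not _matches(rule.get("apiGroups", []), "coordination.k8s.io"):
            continue
        if not _matches(rule.get("resources", []), "leases"):
            continue
        names = rule.get("resourceNames") or []
        for verb in REQUIRED_VERBS:
            if not _matches(rule.get("verbs", []), verb):
                continue
            # A name-restricted rule cannot grant create and must list our lease.
            if names and (verb == "create" or lease_name not in names):
                continue
            granted.add(verb)
    return granted


def missing_lease_verbs(rules: list) -> list:
    granted = lease_verbs_granted(rules)
    return [verb for verb in REQUIRED_VERBS if verb not in granted]


# Hypothetical rules, shaped like the "rules" field of a ClusterRole.
rules = [
    {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"], "verbs": ["create"]},
    {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"],
     "resourceNames": ["kube-controller-manager"], "verbs": ["get", "update"]},
]
print(missing_lease_verbs(rules))      # []
print(missing_lease_verbs(rules[:1]))  # ['get', 'update']
```

A non-empty result identifies which verbs to restore. Whether the role is actually bound to the controller-manager identity still needs a separate check.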

If the cause is a retrieving-lock timeout cascade

If the controller-manager is in CrashLoopBackOff after repeated acquisition timeouts, resolve the root latency first, then restart the controller-manager instance. Simply restarting without fixing API server or etcd latency will not help because the new process will hit the same timeout.

If the cause is a leaderless window during maintenance

There is no user-configurable fix for lease release on clean exit in current stable releases. The ControllerManagerReleaseLeaderElectionLockOnExit feature gate exists but remains alpha and defaults to off. Run multiple controller-manager instances so failover is possible, and schedule control plane restarts during periods of low cluster mutation. The gap equals the full lease duration, 15 seconds by default.

Prevention

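  • Do not tune --leader-elect-lease-duration and --leader-elect-renew-deadline to values that leave no headroom. The defaults (15-second lease duration, 10-second renew deadline, 2-second retry period) leave room for several renewal attempts before the deadline.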
  • Treat etcd disk latency and API server mutating latency as leading indicators for control plane health. Alert on them before controller-manager restarts begin.
  • Monitor controller-manager pod restart count in kube-system as a binary signal of leader distress.
  • Protect the controller-manager RBAC bindings from accidental changes during cluster upgrades or policy audits.
  • Maintain HA with at least two controller-manager instances so a single leader exit does not leave the cluster entirely without reconciliation.

How Netdata helps

  • Correlate etcd WAL fsync latency with controller-manager pod restarts in kube-system to identify the cascade.
  • Track API server request latency and error rates to catch saturation before lease renewals fail.
  • Monitor control plane node disk I/O and CPU to surface resource pressure that slows etcd.
  • Alert on sustained increases in controller workqueue depth as an early sign that the leader is falling behind.
  • Surface audit log anomalies, including 403 responses on Lease operations, for RBAC-related leader election failures.