Kubernetes API server rate limiting: APF priority levels and starvation

Your API server is running. /healthz returns 200. /readyz passes. Yet nodes drop to NotReady, the scheduler stops placing pods, and controller logs fill with context deadline exceeded. The cluster is not down, but it is frozen. This pattern often points to API Priority and Fairness (APF) starvation: low-priority traffic consumes the API server’s concurrency budget, and critical control plane requests queue or get rejected.

APF is enabled by default in Kubernetes 1.20+. It classifies every API request into a priority level via FlowSchema rules, then schedules requests against a per-level concurrency limit. When a priority level exhausts its seats, requests queue. If the queue fills, the server returns HTTP 429. When the queue grows in system or leader-election, kubelets cannot renew leases, controllers cannot write status, and the cluster degrades from the inside out. This guide shows how to confirm APF starvation, identify the culprit, and fix the allocation without turning the API server into a free-for-all.

What this means

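APF builds on the older global --max-requests-inflight and --max-mutating-requests-inflight limits rather than discarding them: with APF enabled, their sum becomes the server's total concurrency limit, which APF divides among priority levels using fair queuing. Two built-in API resources in the flowcontrol.apiserver.k8s.io group control APF: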

  • FlowSchema assigns incoming requests to a priority level based on user, verb, resource, namespace, or source.
  • PriorityLevelConfiguration defines the concurrency share and queue size for each level.

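The default levels include exempt (no limits; used for system:masters), system (requests from nodes in the system:nodes group), node-high (node health updates), leader-election (leader election by built-in controllers), workload-high (other requests from built-in controllers), workload-low (requests from other service accounts), global-default (other authenticated traffic), and catch-all (anything that matches no other FlowSchema).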

Each non-exempt level receives an effective concurrency limit proportional to its nominalConcurrencyShares relative to the total shares across all levels. For example, in a cluster where the server concurrency limit is 600 and total shares are 100, a level with 10 shares gets an effective limit of 60 concurrent requests.
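
The arithmetic above can be checked with a short shell sketch. It assumes the server rounds the proportional share up to a whole seat, and that the server concurrency limit is the sum of --max-requests-inflight and --max-mutating-requests-inflight. The function name nominal_limit is hypothetical.

```shell
# nominal_limit SERVER_LIMIT LEVEL_SHARES TOTAL_SHARES
# Prints ceil(SERVER_LIMIT * LEVEL_SHARES / TOTAL_SHARES) using integer math.
# Assumption: the API server rounds the proportional share up.
nominal_limit() {
  server_limit=$1
  level_shares=$2
  total_shares=$3
  echo $(( (server_limit * level_shares + total_shares - 1) / total_shares ))
}

# Example from the text: 600 seats in total, a level holding 10 of 100 shares.
nominal_limit 600 10 100   # prints 60
```

Because the result depends on the total, adding shares to one level shrinks every other level's result unless the server limit grows too.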

Starvation happens when lower-priority traffic, such as a runaway operator or CI pipeline, consumes its own seats plus any available headroom, leaving system or leader-election without capacity. Those critical requests then sit in queue until they time out or are rejected. The symptoms look like a slow control plane, but the root cause is distribution, not total volume.

Common causes

Cause | What it looks like | First thing to check
Runaway controller or operator | workload-low executing at 100% of its limit; system queue depth rising | apiserver_flowcontrol_current_executing_requests by priority_level
Insufficient concurrency for critical levels | leader-election or system queues grow during normal load | prioritylevelconfigurations shares and total cluster shares
Misconfigured FlowSchema | Kubelet or controller traffic classified into catch-all | flowschemas matching rules for critical users
Thundering herd after recovery | All priority levels show queue spikes simultaneously | Request rate by flow schema
Global API server saturation | apiserver_current_inflight_requests near the hard limit; 429s across all levels | Global inflight vs --max-requests-inflight

Quick checks

# Check APF queue depth by priority level
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests

# Check concurrency utilization per priority level
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_executing_requests

# Check APF rejected requests by priority level and flow schema
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total

# Check 429 rate from the API server
kubectl get --raw /metrics | grep 'apiserver_request_total.*code="429"'

# Check global inflight requests
kubectl get --raw /metrics | grep apiserver_current_inflight_requests

# View current APF configuration
kubectl get prioritylevelconfigurations -o custom-columns=NAME:.metadata.name,CONCURRENCY:.spec.limited.nominalConcurrencyShares
kubectl get flowschemas

A healthy cluster shows zero sustained queue depth in system and leader-election, a 429 rate near zero, and inflight requests well below the hard limit.
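
The raw metrics output lists one sample per flow schema, which makes per-level totals hard to read. The following is a minimal sketch that folds those samples into one number per priority level. It reads Prometheus text from standard input and assumes each sample line ends with its value and carries a priority_level label. The helper name sum_by_priority_level is hypothetical.

```shell
# sum_by_priority_level METRIC_NAME
# Reads Prometheus text on stdin and prints "<priority_level> <sum>" per level.
sum_by_priority_level() {
  metric=$1
  awk -v m="$metric" '
    # Keep only samples of the requested metric (skips # HELP and # TYPE lines).
    index($0, m "{") == 1 {
      level = "unknown"
      if (match($0, /priority_level="[^"]*"/)) {
        # Drop the 16-character prefix priority_level=" and the closing quote.
        level = substr($0, RSTART + 16, RLENGTH - 17)
      }
      sum[level] += $NF
    }
    END { for (l in sum) printf "%s %d\n", l, sum[l] }
  ' | sort
}

# Usage against a live cluster (not run here):
#   kubectl get --raw /metrics |
#     sum_by_priority_level apiserver_flowcontrol_current_inqueue_requests
```

The same helper works for apiserver_flowcontrol_current_executing_requests, since it also carries the priority_level label.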

How to diagnose it

  1. Confirm APF is actively throttling. Check apiserver_flowcontrol_rejected_requests_total and apiserver_request_total{code="429"}. If both are rising, APF is rejecting requests and is the likely bottleneck. If both are flat, look at etcd latency or admission webhooks instead.

  2. Identify which priority levels are queuing. Check apiserver_flowcontrol_current_inqueue_requests by priority_level. Any sustained queue depth in system or leader-election is critical. Queueing in workload-low or catch-all is expected under load and is APF working as designed.

  3. Find the level consuming all concurrency. Compare apiserver_flowcontrol_current_executing_requests against apiserver_flowcontrol_request_concurrency_limit for each priority level (newer releases expose the limit as apiserver_flowcontrol_nominal_limit_seats instead). If workload-low is at 100% while system is queuing, a noisy neighbor is starving critical traffic.

  4. Pinpoint the specific client or flow. Use apiserver_flowcontrol_rejected_requests_total broken down by flow_schema, or inspect audit logs for the user-agent and username generating the flood. A single flow schema dominating the request count indicates a runaway controller, aggressive CI job, or misconfigured operator.

  5. Distinguish local saturation from global overload. Check apiserver_current_inflight_requests, which is reported separately for read-only and mutating requests. If both are well below their limits (--max-requests-inflight and --max-mutating-requests-inflight) but APF is rejecting traffic, the issue is share misallocation. If inflight is at the global limit, the server is universally overloaded.

  6. Correlate with downstream impact. Check node Ready conditions and controller logs. If kubelets miss heartbeats or the scheduler times out on leader election, APF starvation is already causing cluster-wide degradation. This confirms urgency.
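
Step 3 can be sketched as a filter over the /metrics output. The sketch below sums executing requests per priority level and divides by that level's limit. It accepts either apiserver_flowcontrol_request_concurrency_limit or apiserver_flowcontrol_nominal_limit_seats, whichever the server exposes. The helper name level_utilization is hypothetical. On servers that let levels borrow seats, a level may run above its nominal limit, so treat values over 100% as a signal rather than an error.

```shell
# level_utilization
# Reads Prometheus text on stdin and prints
# "<priority_level> <executing>/<limit> <percent>%" per level.
level_utilization() {
  awk '
    function level_of(line) {
      if (match(line, /priority_level="[^"]*"/))
        return substr(line, RSTART + 16, RLENGTH - 17)
      return "unknown"
    }
    # Executing requests are reported per flow schema; sum them per level.
    /^apiserver_flowcontrol_current_executing_requests[{]/ {
      exec[level_of($0)] += $NF
    }
    # The limit is reported once per level under either metric name.
    /^apiserver_flowcontrol_(nominal_limit_seats|request_concurrency_limit)[{]/ {
      limit[level_of($0)] = $NF
    }
    END {
      for (l in limit)
        if (limit[l] > 0)
          printf "%s %d/%d %d%%\n", l, exec[l], limit[l], 100 * exec[l] / limit[l]
    }
  ' | sort
}

# Usage against a live cluster (not run here):
#   kubectl get --raw /metrics | level_utilization
```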

flowchart TD
    A[Runaway controller floods workload-low] --> B[APF concurrency exhausted]
    B --> C[System and leader-election requests queue]
    C --> D[Kubelet heartbeats delayed]
    D --> E[Nodes marked NotReady]
    C --> F[Controller updates timeout]
    F --> G[Scheduling and reconciliation lag]
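
To find the first box in this chain, step 4 of the diagnosis ranks flow schemas by rejections. A minimal sketch follows. It assumes the rejection counter carries a flow_schema label, and the helper name rejections_by_flow_schema is hypothetical. The counter is cumulative since server start, so comparing two snapshots taken a minute apart shows who is being rejected now rather than historically.

```shell
# rejections_by_flow_schema
# Reads Prometheus text on stdin and prints "<total> <flow_schema>",
# largest first, summed across priority levels and rejection reasons.
rejections_by_flow_schema() {
  awk '
    /^apiserver_flowcontrol_rejected_requests_total[{]/ {
      fs = "unknown"
      if (match($0, /flow_schema="[^"]*"/)) {
        # Drop the 13-character prefix flow_schema=" and the closing quote.
        fs = substr($0, RSTART + 13, RLENGTH - 14)
      }
      total[fs] += $NF
    }
    END { for (f in total) printf "%d %s\n", total[f], f }
  ' | sort -rn
}

# Usage against a live cluster (not run here):
#   kubectl get --raw /metrics | rejections_by_flow_schema | head -5
```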

Metrics and signals to monitor

Signal | Why it matters | Warning sign
APF queue depth (system / leader-election) | Critical control plane traffic is waiting instead of executing | Queue depth > 0 sustained for more than 30 seconds
APF rejected requests (system / leader-election) | Critical traffic is being dropped with 429 | Any non-zero rate in these levels
APF concurrency utilization per level | How close each level is to its effective limit | > 80% of limit sustained
429 response rate | Active throttling by APF | > 5% of total API requests
Inflight requests (mutating / read-only) | Global API server saturation | > 80% of --max-requests-inflight
Controller timeout errors | Downstream impact of queue delays | context deadline exceeded in controller or kubelet logs
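
As a rough check against the 5% threshold for 429 responses, the share of throttled requests can be computed from a single /metrics snapshot. This is only a sketch: apiserver_request_total is cumulative since server start, so the number is a lifetime share rather than a current rate. A monitoring system that computes rates over a time window gives a better signal. The helper name throttled_share is hypothetical.

```shell
# throttled_share
# Reads Prometheus text on stdin and prints the percentage of
# apiserver_request_total samples that carry code="429".
throttled_share() {
  awk '
    /^apiserver_request_total[{]/ {
      all += $NF
      if ($0 ~ /code="429"/) throttled += $NF
    }
    END {
      if (all > 0) printf "%.2f\n", 100 * throttled / all
      else print "0.00"
    }
  '
}

# Usage against a live cluster (not run here):
#   kubectl get --raw /metrics | throttled_share
```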

Fixes

If the cause is a runaway controller or operator

Identify the offending client from audit logs or the flow_schema label on rejected requests. Throttle the client at the source: add client-side rate limits, reduce polling frequency, or fix the reconciliation loop. Do not raise APF limits to absorb bad behavior; the client will keep growing until it hits the next ceiling.

If the cause is insufficient concurrency shares

Edit the PriorityLevelConfiguration for system and leader-election to increase nominalConcurrencyShares. Remember that shares are relative to the total across all levels; increasing shares for one level reduces the effective limit of others unless you also raise the server concurrency limit via --max-requests-inflight and --max-mutating-requests-inflight. Ensure the API server’s CPU, memory, and etcd backing can handle the additional load before raising global limits.
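
One way to make that edit is a merge patch. The sketch below only builds the patch body, and the level name and share count in the usage comment are example values, not recommendations. It also sets the apf.kubernetes.io/autoupdate-spec annotation to "false", on the assumption that the API server would otherwise restore the default spec of its built-in objects. On API versions older than v1beta3 the field is named assuredConcurrencyShares. The helper name shares_patch is hypothetical.

```shell
# shares_patch SHARES
# Prints a JSON merge patch that sets nominalConcurrencyShares and marks the
# object as manually managed so the edit is not reverted (assumption above).
shares_patch() {
  printf '{"metadata":{"annotations":{"apf.kubernetes.io/autoupdate-spec":"false"}},"spec":{"limited":{"nominalConcurrencyShares":%d}}}\n' "$1"
}

# Usage against a live cluster (not run here; example values):
#   kubectl patch prioritylevelconfiguration leader-election \
#     --type merge -p "$(shares_patch 20)"
```

After patching, re-check the per-level limits, since every other level's effective limit shrinks when the total share count grows.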

If the cause is misconfigured flow schemas

Ensure that kubelet, controller-manager, and scheduler traffic match dedicated high-priority flow schemas. The default schemas cover built-in components, but custom controllers or infrastructure agents often fall into catch-all. Create specific FlowSchema resources for these components, matching on their service account or user group, and assign them to workload-high or a custom high-priority level.
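
A dedicated FlowSchema for one service account might look like the manifest printed by the sketch below. The name, namespace, service account, and matchingPrecedence value are illustrative assumptions. The apiVersion shown is the GA group version, and older clusters may need a beta version instead. Lower matchingPrecedence values are evaluated first, so the value has to be lower than that of whichever schema currently captures the traffic.

```shell
# render_flowschema NAME NAMESPACE SERVICE_ACCOUNT PRIORITY_LEVEL
# Prints a FlowSchema manifest that routes all requests from one service
# account to the given priority level. All arguments are caller-supplied.
render_flowschema() {
  cat <<EOF
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: $1
spec:
  priorityLevelConfiguration:
    name: $4
  matchingPrecedence: 1000   # illustrative; lower values match first
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: $3
            namespace: $2
      resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          namespaces: ["*"]
          clusterScope: true
EOF
}

# Usage against a live cluster (not run here; hypothetical names):
#   render_flowschema infra-agent monitoring infra-agent workload-high |
#     kubectl apply -f -
```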

If the cause is a thundering herd

If the traffic is legitimate but bursty, add jitter to client retry logic and ensure exponential backoff respects 429 responses. Temporarily increasing concurrency shares can provide relief, but the permanent fix is client behavior.
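
For shell-based automation, jittered exponential backoff can be sketched as below. This uses the "full jitter" variant, where the delay is a random whole number of seconds between zero and an exponentially growing ceiling. The function name and default values are assumptions. If the 429 response carries a Retry-After header, honoring it should take precedence over the computed delay.

```shell
# backoff_delay ATTEMPT [BASE_SECONDS] [CAP_SECONDS]
# Prints a random delay in [0, min(CAP, BASE * 2^ATTEMPT)] seconds.
backoff_delay() {
  attempt=$1
  base=${2:-1}
  cap=${3:-60}
  awk -v n="$attempt" -v base="$base" -v cap="$cap" -v pid="$$" 'BEGIN {
    ceiling = base * 2 ^ n
    if (ceiling > cap) ceiling = cap
    srand()                 # seed from the clock
    seed = srand()          # read that clock-based seed back
    srand(seed + pid + n)   # mix in PID and attempt so clients diverge
    printf "%d\n", int(rand() * (ceiling + 1))
  }'
}

# Usage in a retry loop (not run here; do_request is a hypothetical command
# that fails when the API server answers 429):
#   attempt=0
#   until do_request; do
#     sleep "$(backoff_delay "$attempt")"
#     attempt=$((attempt + 1))
#   done
```

The random spread is what prevents a fleet of clients from retrying in lockstep after a recovery.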

If cluster stability is at risk

As a last resort, you can temporarily move a critical service account to the exempt priority level. This bypasses all queuing and can destabilize the API server if the client floods requests. Revert immediately after recovery. Long-term exemptions defeat the purpose of APF.

Prevention

  • Review APF configuration quarterly and after adding major operators. New controllers change the request mix.
  • Monitor system and leader-election queue depth as a leading indicator, not a lagging one.
  • Ensure every critical controller has a dedicated FlowSchema resource. Do not let important traffic fall into catch-all.
  • Document which service accounts and user groups each FlowSchema matches. Stale selectors silently reclassify traffic after deployments change.
  • Set client-side rate limits and backoff on all custom controllers and automation.
  • Test APF behavior under load. A deployment of 1,000 replicas should not push workload-low into a state that starves leader-election.

How Netdata helps

  • Correlate APF queue depth with API server request latency to distinguish queuing delays from etcd latency.
  • Alert on sustained queue depth in system or leader-election before nodes transition to NotReady.
  • Track 429 spikes alongside etcd disk latency and webhook latency to isolate the true bottleneck.
  • Visualize per-priority-level concurrency utilization to spot noisy neighbors before they cause cluster-wide impact.
  • Monitor controller workqueue depth as a downstream signal that APF throttling delays reconciliation.