Kubernetes priority class evictions: how preemption actually works
Scheduler preemption and kubelet node-pressure eviction both use Pod Priority, but they follow different rules, interact differently with PodDisruptionBudgets, and produce different failure signatures. Misunderstanding the two mechanisms leads to critical workloads being terminated unexpectedly, high-priority pods staying pending while lower-priority pods run, or system pods being preempted after a PriorityClass change.
The scheduler preempts lower-priority pods to make room for a higher-priority pending pod. The kubelet evicts pods when a node is under resource pressure. A PodDisruptionBudget protects against voluntary disruptions, but the scheduler treats PDB violations during preemption as best-effort, not guaranteed.
What this means
- The scheduler evaluates preemption candidates per node. It does not perform cross-node preemption. If a pending pod would need to evict a pod on a different node to fit, that node is excluded.
- Victim selection order: the scheduler sorts potential victims by decreasing priority, then attempts to reprieve pods whose removal would violate a PDB. If every candidate on a node would violate a PDB, preemption proceeds anyway and removes lower-priority pods.
- Kubelet eviction ranks victims first by whether their usage of the starved resource exceeds requests, then by priority, then by usage relative to requests. In practice, BestEffort pods and Burstable pods running above requests go first; Guaranteed pods and Burstable pods within requests are evicted last, ordered by priority.
- preemptionPolicy: Never places a pod ahead in the scheduling queue but forbids it from evicting other pods. It can still be preempted by a higher-priority pod.
- Only one PriorityClass can have globalDefault: true. Adding one does not retroactively change existing pod priorities.
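The per-node victim selection described above can be sketched as a small model. This is a simplified illustration, not the scheduler's code: it assumes a single resource (memory requests in MiB), and the names `Pod`, `pdb_blocked`, and `select_victims` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    priority: int
    request_mib: int
    pdb_blocked: bool = False  # True if evicting this pod would violate a PDB


def select_victims(node_capacity_mib, running, preemptor):
    """Return (victims, pdb_violations) for one node, or None if the node cannot help."""
    lower = [p for p in running if p.priority < preemptor.priority]
    keep = [p for p in running if p.priority >= preemptor.priority]
    used = sum(p.request_mib for p in keep) + preemptor.request_mib
    if used > node_capacity_mib:
        return None  # removing every lower-priority pod still is not enough

    # Highest priority first, so the most important pods are reprieved first.
    lower.sort(key=lambda p: p.priority, reverse=True)
    victims = []
    # PDB-protected pods get the first chance at a reprieve.
    for group in ([p for p in lower if p.pdb_blocked],
                  [p for p in lower if not p.pdb_blocked]):
        for pod in group:
            if used + pod.request_mib <= node_capacity_mib:
                used += pod.request_mib  # reprieved: the pod stays on the node
            else:
                victims.append(pod)
    violations = sum(1 for p in victims if p.pdb_blocked)
    return victims, violations


if __name__ == "__main__":
    running = [
        Pod("web-a", 100, 400, pdb_blocked=True),
        Pod("web-b", 100, 400, pdb_blocked=True),
        Pod("batch", 10, 200),
    ]
    victims, violations = select_victims(1000, running, Pod("critical", 1000, 300))
    print([p.name for p in victims], violations)  # ['web-b'] 1
```

In this example the unprotected batch pod frees too little memory on its own, so a PDB-protected pod is preempted anyway. The real scheduler then compares candidate nodes and prefers the one with the fewest PDB violations.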
flowchart TD
A[High-priority pod Pending] --> B{Scheduler evaluates nodes}
B --> C[Select victims by decreasing priority]
C --> D{Would removal violate PDB?}
D -->|Prefer non-violating victim| E[Reprieve violating pods]
D -->|No clean victim exists| F[Preempt anyway]
E --> G[Set nominatedNodeName]
F --> G
G --> H[Pod binds to node]
I[Node under memory/disk pressure] --> J{Kubelet eviction manager}
J --> K{Pod QoS and usage vs requests}
K -->|BestEffort or over-request Burstable| L[Rank by priority, then relative usage]
K -->|Within-request or Guaranteed| M[Lower eviction risk]
L --> N[Evict victim pod]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Scheduler preemption | Pod terminated with reason=Preempted; higher-priority pod appears on the same node | kubectl get events --field-selector reason=Preempted |
| Kubelet node-pressure eviction | Pod evicted with reason=Evicted and node condition MemoryPressure or DiskPressure | kubectl describe node <node> and kubelet eviction signals |
| PDB violation during preemption | Pods preempted even though a PDB exists; disruptionsAllowed is zero | kubectl get pdb <name> and check status fields |
| globalDefault misconfiguration | New low-priority class preempts existing critical workloads that had no explicit priority | kubectl get priorityclasses for globalDefault: true |
| Inter-pod affinity blocking candidates | High-priority pod stays Pending with no preemption events even though nodes seem to have capacity | Pod spec podAffinity rules and scheduler logs |
Quick checks
Run these commands to distinguish preemption, eviction, and configuration traps.
# Check for preemption events in the cluster
kubectl get events --all-namespaces --field-selector reason=Preempted \
--sort-by='.lastTimestamp'
# Check a pod's priority class and preemption policy
kubectl get pod <pod-name> -o jsonpath='{.spec.priorityClassName}{"\n"}{.spec.preemptionPolicy}'
# Check if a pending pod has been nominated to a specific node
kubectl get pod <pod-name> -o jsonpath='{.status.nominatedNodeName}'
# Check node pressure conditions (MemoryPressure, DiskPressure, PIDPressure)
kubectl describe node <node-name> | grep -A 8 "Conditions:"
# Check PDB status and recent disruptions
kubectl get pdb <pdb-name> -o json | jq '.status'
# List PriorityClasses and identify any global default
kubectl get priorityclasses -o custom-columns=NAME:.metadata.name,GLOBAL:.globalDefault,VALUE:.value
# Check scheduler logs for preemption decisions (requires control plane access)
kubectl logs -n kube-system kube-scheduler-<node> | grep -i preempt
# Check kubelet logs for eviction ranking (on the node)
journalctl -u kubelet --since "10 minutes ago" | grep -i "eviction\|rank"
How to diagnose it
1. Determine whether the termination was preemption or eviction.
   Check the terminated pod status. Preemption sets reason: Preempted and emits a Kubernetes event. Eviction sets reason: Evicted with a message naming the starved resource. The fixes differ completely.
2. If preemption, identify the preempting pod and its priority.
   The preemption event names the preempting pod. Check its priorityClassName, priority value, and preemptionPolicy. A pod with preemptionPolicy: Never cannot preempt others; look for a different higher-priority pod.
3. Check the victim pod’s priority relative to the cluster.
   Confirm the victim was actually lower priority. A common mistake is creating a globalDefault: true class with a low value after critical workloads are already running. Those workloads default to priority 0 and can be preempted by any pod with a positive priority.
4. Inspect the PodDisruptionBudget state.
   Check kubectl get pdb <name> -o json. Inspect status.currentHealthy, status.desiredHealthy, and status.disruptionsAllowed. The scheduler attempts to avoid PDB violations, but if every candidate on the node would violate the PDB, it preempts anyway.
5. If eviction, verify node pressure and resource accounting.
   Check node conditions for MemoryPressure, DiskPressure, or PIDPressure. Then verify the evicted pod’s QoS class and resource usage. BestEffort pods are evicted first. Burstable pods with usage above requests are next. Burstable pods within requests and Guaranteed pods are evicted only after all higher-risk pods. Priority breaks ties within each QoS group.
6. Look for nominatedNodeName on pending pods.
   A preempting pod gets status.nominatedNodeName set to reserve the node. If the pod remains pending despite a nominated node, a higher-priority pod may have claimed the node afterward and cleared the nomination.
7. Check for inter-pod affinity and anti-affinity constraints.
   Inter-pod affinity toward a lower-priority pod on a node excludes that node from preemption candidates. The scheduler cannot evict the pod the pending workload requires. Review podAffinity rules if preemption seems impossible but nodes appear to have capacity.
8. Correlate with cluster-level resource pressure.
   If pending pods and node pressure coincide, compare node allocatable against scheduled requests (kubectl get nodes -o json). Requests near allocatable mean preemption and eviction are symptoms of capacity exhaustion, not misconfiguration.
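The comparison of allocatable memory against scheduled requests can be scripted. The sketch below assumes node and pod objects shaped like `kubectl get ... -o json` output; `parse_memory` and `memory_request_ratio` are hypothetical helpers. The parser handles only the common memory suffixes, and init containers and pod overhead are ignored.

```python
# Common Kubernetes memory suffixes; exponent and milli forms are not handled.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity):
    """Convert a quantity string such as '512Mi' or '2G' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def memory_request_ratio(node, pods):
    """Sum container memory requests of the given pods over node allocatable memory."""
    allocatable = parse_memory(node["status"]["allocatable"]["memory"])
    requested = 0
    for pod in pods:
        for container in pod["spec"]["containers"]:
            requests = container.get("resources", {}).get("requests", {})
            if "memory" in requests:
                requested += parse_memory(requests["memory"])
    return requested / allocatable


if __name__ == "__main__":
    # Illustrative objects, not real cluster data.
    node = {"status": {"allocatable": {"memory": "8Gi"}}}
    pods = [
        {"spec": {"containers": [
            {"resources": {"requests": {"memory": "3Gi"}}},
            {"resources": {"requests": {"memory": "512Mi"}}},
        ]}},
        {"spec": {"containers": [{"resources": {"requests": {"memory": "3Gi"}}}]}},
    ]
    print(f"{memory_request_ratio(node, pods):.2%}")
```

A ratio close to 1.0 on most nodes suggests the cluster is out of schedulable memory, in which case preemption and eviction are expected behavior rather than a misconfiguration.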
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| scheduler_pending_pods gauge | Scheduling backlog and preemption pressure | Unschedulable queue growing |
| Node MemoryPressure / DiskPressure | Triggers kubelet eviction; priority only affects how victims are ranked, not whether eviction happens | Any condition True |
| Pod eviction events | Direct evidence of node-pressure eviction | reason=Evicted events increasing |
| Pod restart count / CrashLoopBackOff | Preempted or evicted pods restart; high restart counts indicate instability | Restart count > 0 for critical pods |
| Container OOM kills | Precede or accompany memory-pressure eviction | lastState.terminated.reason=OOMKilled |
| apiserver_request_total rate | Scheduler and disruption controller rely on the API server; latency or errors can stall PDB updates | Elevated 5xx or 429 from control plane components |
| Controller workqueue depth | A lagging disruption controller causes stale PDB state seen by the scheduler | workqueue_depth for endpoint or disruption controllers growing |
| Node allocatable vs requested | Capacity exhaustion drives both preemption and eviction | Requests > 80% of allocatable for memory |
Fixes
If the cause is scheduler preemption
- Reduce priority spread. If the gap between your highest and lowest PriorityClass values is extreme, lower-priority pods become trivial victims. Narrow the range or consolidate classes.
- Use preemptionPolicy: Never for batch or spot workloads. This prevents them from evicting other pods while still allowing them to be scheduled ahead of lower-priority work in the queue.
- Set globalDefault carefully. If you must use a global default, set it on a neutral class with a middle value, and never apply it before explicit PriorityClasses are assigned to critical pods. Better yet, avoid globalDefault entirely and set priorityClassName explicitly in every production manifest.
- Plan capacity for high-priority burstable workloads. If a high-priority pod repeatedly preempts lower-priority pods on the same node, the node is undersized for the workload mix.
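As an illustration of a non-preempting class, the sketch below builds a PriorityClass manifest as JSON, which kubectl accepts alongside YAML. The helper `priority_class`, the class name `batch-no-preempt`, and the value 1000 are assumptions chosen for the example, not recommendations.

```python
import json

MAX_USER_PRIORITY = 1_000_000_000  # larger values are reserved for system classes


def priority_class(name, value, preemption_policy="PreemptLowerPriority", description=""):
    """Build a scheduling.k8s.io/v1 PriorityClass manifest as a plain dict."""
    if preemption_policy not in ("PreemptLowerPriority", "Never"):
        raise ValueError(f"unknown preemptionPolicy: {preemption_policy}")
    if value > MAX_USER_PRIORITY:
        raise ValueError("value exceeds the range allowed for user-defined classes")
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,  # set priorityClassName explicitly on pods instead
        "preemptionPolicy": preemption_policy,
        "description": description,
    }


if __name__ == "__main__":
    manifest = priority_class(
        "batch-no-preempt",  # hypothetical class name
        1000,                # hypothetical value
        preemption_policy="Never",
        description="Batch work: queues ahead of lower priorities, never preempts.",
    )
    print(json.dumps(manifest, indent=2))
```

If the block is saved as a script, its output can be piped to `kubectl apply -f -`, ideally in a staging cluster first.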
If the cause is kubelet node-pressure eviction
- Set Guaranteed QoS for critical pods. Guaranteed pods are evicted last. BestEffort pods are always the first victims.
- Increase node capacity or reduce density. If eviction is frequent, the node is overcommitted on actual usage, not just requests.
- Tune kubelet eviction thresholds. If the default thresholds are too aggressive for your workload, adjust the --eviction-hard and --eviction-soft flags (or the evictionHard and evictionSoft fields in the kubelet configuration file). Be aware that letting the node run closer to exhaustion before eviction starts, for example with a lower memory.available threshold, increases the risk of a system OOM kill before the kubelet acts.
- Fix application memory leaks. Eviction often targets the same pod repeatedly because its usage grows until it exceeds requests. Address the leak rather than raising limits indefinitely.
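To see why staying within requests matters more than priority alone, here is a minimal model of the kubelet's memory-pressure ranking as the Kubernetes node-pressure eviction documentation describes it: pods over their requests first, then lower priority, then larger usage above requests. `PodUsage` and `eviction_order` are hypothetical names, and the figures are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodUsage:
    name: str
    priority: int
    request_mib: int  # 0 for a BestEffort pod
    usage_mib: int


def eviction_order(pods):
    """Return pods sorted from first-evicted to last-evicted under memory pressure."""
    def rank(pod):
        over_request = pod.usage_mib > pod.request_mib
        excess = pod.usage_mib - pod.request_mib
        # False sorts before True, so over-request pods come first; then lower
        # priority first; then the largest excess over requests first.
        return (not over_request, pod.priority, -excess)
    return sorted(pods, key=rank)


if __name__ == "__main__":
    pods = [
        PodUsage("guaranteed-db", priority=1000, request_mib=1000, usage_mib=900),
        PodUsage("burstable-worker", priority=0, request_mib=500, usage_mib=400),
        PodUsage("burstable-api", priority=1000, request_mib=500, usage_mib=700),
        PodUsage("best-effort-cache", priority=0, request_mib=0, usage_mib=300),
    ]
    for pod in eviction_order(pods):
        print(pod.name)
    # best-effort-cache, burstable-api, burstable-worker, guaranteed-db
```

Note that the high-priority burstable-api pod is ranked ahead of the priority-0 burstable-worker pod because only the former exceeds its requests.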
If the cause is PDB violation during preemption
- Accept that PDB enforcement during preemption is best-effort. Do not rely on PDBs to prevent scheduler preemption. Use PDBs for voluntary disruptions such as node drains and rolling updates, and overprovision capacity for critical pods.
- Increase minAvailable or use maxUnavailable cautiously. A tight PDB on a low-priority workload is more likely to be violated during preemption. Ensure critical workloads have enough replicas spread across failure domains.
- Monitor disruption controller lag. If PDB violations spike during periods of high API server latency, the disruption controller may be behind. Reduce control plane load or scale the controller manager resources.
If the cause is inter-pod affinity or node exclusion
- Remove unnecessary affinity rules. If a pending pod’s affinity references a lower-priority pod on a node, consider whether that affinity is strictly required. If not, remove it to allow preemption on that node.
- Use requiredDuringSchedulingIgnoredDuringExecution sparingly. Hard affinity constraints reduce the scheduler’s flexibility and amplify the impact of preemption.
Prevention
- Assign PriorityClasses before you need them. Create your critical PriorityClasses and assign them to all system and production pods before introducing any low-priority or batch PriorityClass.
- Require resource requests in all namespaces. Use a LimitRange to enforce default requests and limits. This protects pods from kubelet eviction and makes scheduling decisions predictable.
- Monitor pending pods by priority. Track scheduler_pending_pods and break it down by PriorityClass if possible. A growing queue for high-priority pods signals capacity or preemption problems before they become critical.
- Test PriorityClass changes in a staging cluster. Apply the class, create a higher-priority pod, and verify which pods are preempted. Do not test PriorityClass changes in production for the first time.
- Maintain capacity headroom. Keep node requests below 80% of allocatable and maintain pod count headroom. Preemption and eviction are both symptoms of running too close to the edge.
How Netdata helps
Correlate these signals in Netdata to distinguish the root cause:
- Node memory and disk pressure charts with kubelet eviction events to confirm resource-driven eviction versus scheduler preemption.
- Container restart and OOM kill counts that spike after preemption or eviction.
- Pending pod counts and API server latency to catch scheduling backlogs.
- Controller workqueue depth to detect disruption controller lag.
Related guides
- Kubernetes API server rate limiting: APF priority levels and starvation
- Kubernetes API server memory pressure: OOM cycle and tuning
- Kubernetes Deployment rollout stuck: stalled rollouts and ready replicas
- Kubernetes DaemonSet pods Pending: scheduling and tolerations
- Kubernetes DNS resolution failures inside pods
- Kubernetes conntrack exhaustion: dropped connections under load






