Kubernetes priority class evictions: how preemption actually works

Scheduler preemption and kubelet node-pressure eviction both use Pod Priority, but they follow different rules, interact differently with PodDisruptionBudgets, and produce different failure signatures. Misunderstanding the two mechanisms leads to critical workloads being terminated unexpectedly, high-priority pods staying pending while lower-priority pods run, or system pods being preempted after a PriorityClass change.

The scheduler preempts lower-priority pods to make room for a higher-priority pending pod. The kubelet evicts pods when a node is under resource pressure. A PodDisruptionBudget protects against voluntary disruptions, but the scheduler treats PDB violations during preemption as best-effort, not guaranteed.

What this means

  • The scheduler evaluates preemption candidates one node at a time and does not perform cross-node preemption. If the pending pod would fit on a node only after a pod on a different node is removed (for example, because of zone-wide anti-affinity), that node is not considered for preemption.
  • Victim selection: on each candidate node the scheduler considers only pods with lower priority than the pending pod. It then tries to reprieve as many of them as possible, starting with pods whose removal would violate a PDB and working from the highest priority down. If the pending pod cannot fit without violating a PDB, preemption proceeds anyway.
  • Kubelet eviction does not rank by QoS class directly. It ranks pods by whether their usage of the starved resource exceeds requests, then by priority, then by usage relative to requests. In effect, BestEffort pods and over-request Burstable pods go first, while Guaranteed pods and Burstable pods within their requests go last.
  • preemptionPolicy: Never places a pod ahead of lower-priority pods in the scheduling queue but forbids it from preempting other pods. It can still be preempted by a higher-priority pod.
  • Only one PriorityClass can have globalDefault: true. Adding one does not retroactively change existing pod priorities.
flowchart TD
    A[High-priority pod Pending] --> B{Scheduler evaluates nodes}
    B --> C[Select victims by decreasing priority]
    C --> D{Would removal violate PDB?}
    D -->|Prefer non-violating victim| E[Reprieve violating pods]
    D -->|No clean victim exists| F[Preempt anyway]
    E --> G[Set nominatedNodeName]
    F --> G
    G --> H[Pod binds to node]

    I[Node under memory/disk pressure] --> J{Kubelet eviction manager}
    J --> K{Pod QoS and usage vs requests}
    K -->|BestEffort or over-request Burstable| L[Rank by priority, then relative usage]
    K -->|Within-request or Guaranteed| M[Lower eviction risk]
    L --> N[Evict victim pod]
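
The kubelet branch of the flowchart can be approximated offline. The sketch below is a simplified model, not the kubelet's code: it assumes a hypothetical whitespace-separated input of pod name, priority, memory usage, and memory request (both in Mi), and orders pods by the documented criteria: usage above request first, then lower priority, then larger overage.

```shell
# rank_eviction: simplified model of kubelet memory-eviction ordering.
# Input (stdin), one pod per line: NAME PRIORITY USAGE_MI REQUEST_MI
# Output: pod names, most likely victim first.
rank_eviction() {
  awk '{
    over = $3 - $4                 # usage above request (negative if within)
    exceeds = (over > 0) ? 0 : 1   # 0 sorts first: pods exceeding requests
    print exceeds, $2, over, $1
  }' |
    LC_ALL=C sort -k1,1n -k2,2n -k3,3nr |   # group, low priority, big overage
    awk '{ print $4 }'
}
```

Feeding it a BestEffort pod (request 0) alongside an over-request Burstable pod of the same priority shows that the pod with the larger overage is ranked first, regardless of QoS class.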

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Scheduler preemption | Pod terminated with reason=Preempted; higher-priority pod appears on the same node | kubectl get events --field-selector reason=Preempted |
| Kubelet node-pressure eviction | Pod evicted with reason=Evicted and node condition MemoryPressure or DiskPressure | kubectl describe node <node> and kubelet eviction signals |
| PDB violation during preemption | Pods preempted even though a PDB exists; disruptionsAllowed is zero | kubectl get pdb <name> and check status fields |
| globalDefault misconfiguration | New low-priority class preempts existing critical workloads that had no explicit priority | kubectl get priorityclasses and look for globalDefault: true |
| Inter-pod affinity blocking candidates | High-priority pod stays Pending with no preemption events even though nodes seem to have capacity | Pod spec podAffinity rules and scheduler logs |

Quick checks

Run these commands to distinguish preemption, eviction, and configuration traps.

# Check for preemption events in the cluster
kubectl get events --all-namespaces --field-selector reason=Preempted \
  --sort-by='.lastTimestamp'

# Check a pod's priority class, resolved priority value, and preemption policy
kubectl get pod <pod-name> -o jsonpath='{.spec.priorityClassName}{"\n"}{.spec.priority}{"\n"}{.spec.preemptionPolicy}{"\n"}'

# Check if a pending pod has been nominated to a specific node
kubectl get pod <pod-name> -o jsonpath='{.status.nominatedNodeName}'

# Check node pressure conditions (MemoryPressure, DiskPressure, PIDPressure)
kubectl describe node <node-name> | grep -A 8 "Conditions:"

# Check PDB status and recent disruptions
kubectl get pdb <pdb-name> -o json | jq '.status'

# List PriorityClasses and identify any global default
kubectl get priorityclasses -o custom-columns=NAME:.metadata.name,GLOBAL:.globalDefault,VALUE:.value

# Check scheduler logs for preemption decisions (requires control plane access)
kubectl logs -n kube-system kube-scheduler-<node> | grep -i preempt

# Check kubelet logs for eviction ranking (on the node)
journalctl -u kubelet --since "10 minutes ago" | grep -i "eviction\|rank"
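
Several of the per-pod checks above can be combined into one listing. This is a minimal sketch: the function name pod_priority_table is hypothetical, and it assumes kubectl is already pointed at the right cluster. The columns are standard pod fields.

```shell
# pod_priority_table NAMESPACE: list pods with the fields that influence
# preemption and eviction order, lowest priority first.
pod_priority_table() {
  kubectl get pods -n "$1" \
    --sort-by='.spec.priority' \
    -o custom-columns='NAME:.metadata.name,PRIORITY:.spec.priority,CLASS:.spec.priorityClassName,POLICY:.spec.preemptionPolicy,QOS:.status.qosClass,NODE:.spec.nodeName'
}
```

For example, pod_priority_table production shows at a glance which pods on a node sit at the bottom of the priority order and which QoS class each one received.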

How to diagnose it

  1. Determine whether the termination was preemption or eviction.
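    Check the victim pod’s events and status. Scheduler preemption deletes the victim and emits an event with reason Preempted, so the event is usually the only trace once the pod is gone. Kubelet eviction leaves the pod in phase Failed with reason: Evicted and a message naming the starved resource. The fixes differ completely.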

  2. If preemption, identify the preempting pod and its priority.
    The preemption event names the preempting pod. Check its priorityClassName, priority value, and preemptionPolicy. A pod with preemptionPolicy: Never cannot preempt others; look for a different higher-priority pod.

  3. Check the victim pod’s priority relative to the cluster.
    Confirm the victim was actually lower priority. A common mistake is creating a globalDefault: true class with a low value after critical workloads are already running. Those workloads default to priority 0 and can be preempted by any pod with a positive priority.

  4. Inspect the PodDisruptionBudget state.
    Check kubectl get pdb <name> -o json. Inspect status.currentHealthy, status.desiredHealthy, and status.disruptionsAllowed. The scheduler attempts to avoid PDB violations, but if every candidate on the node would violate the PDB, it preempts anyway.

  5. If eviction, verify node pressure and resource accounting.
    Check node conditions for MemoryPressure, DiskPressure, or PIDPressure. Then verify the evicted pod’s QoS class and resource usage. The kubelet does not rank by QoS class directly: it first considers pods whose usage of the starved resource exceeds their requests, which in practice includes BestEffort pods and any over-request Burstable pod, and orders them by priority, then by how far usage exceeds requests. Burstable pods within requests and Guaranteed pods are evicted last, also in priority order.

  6. Look for nominatedNodeName on pending pods.
    A preempting pod gets status.nominatedNodeName set to reserve the node. If the pod remains pending despite a nominated node, the victims may still be terminating gracefully, or a higher-priority pod may have claimed the node afterward and cleared the nomination.

  7. Check for inter-pod affinity and anti-affinity constraints.
    Inter-pod affinity toward a lower-priority pod on a node excludes that node from preemption candidates. The scheduler cannot evict the pod the pending workload requires. Review podAffinity rules if preemption seems impossible but nodes appear to have capacity.

  8. Correlate with cluster-level resource pressure.
    If pending pods and node pressure coincide, compare node allocatable against scheduled requests (kubectl get nodes -o json). Requests near allocatable mean preemption and eviction are symptoms of capacity exhaustion, not misconfiguration.
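
The capacity comparison in step 8 is simple arithmetic. A minimal sketch, assuming you have already read the requested and allocatable figures for one resource from the node (for example from the Allocated resources section of kubectl describe node) and converted them to the same unit. The function names are hypothetical, and the default 80% threshold is the warning level used elsewhere in this guide.

```shell
# request_pct REQUESTED ALLOCATABLE: requested share of allocatable, in
# whole percent. Both arguments must use the same unit (for example Mi).
request_pct() {
  awk -v r="$1" -v a="$2" 'BEGIN { printf "%d\n", (r * 100) / a }'
}

# headroom_status REQUESTED ALLOCATABLE [THRESHOLD_PCT]
# Prints WARN when the requested share is above the threshold (default 80).
headroom_status() {
  pct=$(request_pct "$1" "$2")
  if [ "$pct" -gt "${3:-80}" ]; then
    echo "WARN ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}
```

For a node with 15360 Mi allocatable memory and 13107 Mi requested, headroom_status 13107 15360 reports WARN 85%.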

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| scheduler_pending_pods gauge | Scheduling backlog and preemption pressure | Unschedulable queue growing |
| Node MemoryPressure / DiskPressure | Triggers kubelet eviction; priority affects only the order in which pods are evicted | Any condition True |
| Pod eviction events | Direct evidence of node-pressure eviction | reason=Evicted events increasing |
| Pod restart count / CrashLoopBackOff | Preempted or evicted pods are recreated by their controllers; repeated restarts indicate instability | Restart count > 0 for critical pods |
| Container OOM kills | Precede or accompany memory-pressure eviction | lastState.terminated.reason=OOMKilled |
| apiserver_request_total rate | Scheduler and disruption controller rely on the API server; latency or errors can stall PDB updates | Elevated 5xx or 429 from control plane components |
| Controller workqueue depth | A lagging disruption controller causes stale PDB state seen by the scheduler | workqueue_depth for endpoint or disruption controllers growing |
| Node allocatable vs requested | Capacity exhaustion drives both preemption and eviction | Requests > 80% of allocatable for memory |

Fixes

If the cause is scheduler preemption

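  • Consolidate priority levels. Preemption depends only on whether the pending pod’s priority is higher than the victim’s, not on the size of the gap, so narrowing the numeric range alone changes nothing. Merge classes that do not need to preempt each other so that fewer workloads are eligible victims.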
  • Use preemptionPolicy: Never for batch or spot workloads. This prevents them from evicting other pods while still allowing them to be scheduled ahead of lower-priority work in the queue.
  • Set globalDefault carefully. If you must use a global default, set it on a neutral class with a middle value, and never apply it before explicit PriorityClasses are assigned to critical pods. Better yet, avoid globalDefault entirely and set priorityClassName explicitly in every production manifest.
  • Plan capacity for high-priority burstable workloads. If a high-priority pod repeatedly preempts lower-priority pods on the same node, the node is undersized for the workload mix.
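
A non-preempting class for batch work might look like the following. This is a sketch, not a recommended configuration: the helper priorityclass_manifest and the class name and value in the usage line are placeholders, while the manifest fields are the standard scheduling.k8s.io/v1 PriorityClass fields.

```shell
# priorityclass_manifest NAME VALUE POLICY: print a PriorityClass manifest.
# POLICY is PreemptLowerPriority (the Kubernetes default) or Never.
priorityclass_manifest() {
  cat <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: $1
value: $2
preemptionPolicy: $3
globalDefault: false
description: "Illustrative class; name and value are placeholders."
EOF
}
```

To validate it without changing the cluster, pipe the output to a client-side dry run: priorityclass_manifest batch-no-preempt 1000 Never | kubectl apply --dry-run=client -f -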

If the cause is kubelet node-pressure eviction

  • Set Guaranteed QoS for critical pods. Guaranteed pods, like Burstable pods that stay within their requests, are evicted last. BestEffort pods have no requests, so any usage puts them in the first group of victims.
  • Increase node capacity or reduce density. If eviction is frequent, the node is overcommitted on actual usage, not just requests.
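  • Tune kubelet eviction thresholds. If the default thresholds are too aggressive for your workload, adjust --eviction-hard and --eviction-soft (or the evictionHard and evictionSoft fields in the kubelet configuration file). Be aware that lowering a threshold so the kubelet acts later increases the risk of a system OOM kill before the kubelet evicts anything.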
  • Fix application memory leaks. Eviction often targets the same pod repeatedly because its usage grows until it exceeds requests. Address the leak rather than raising limits indefinitely.
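
When thresholds are set through the kubelet configuration file rather than flags, the fragment might look like this. The values are placeholders chosen for illustration, not recommendations, and the helper name kubelet_eviction_config is hypothetical. Each soft threshold needs a matching grace period.

```shell
# kubelet_eviction_config: print an illustrative KubeletConfiguration
# fragment with hard and soft eviction thresholds.
kubelet_eviction_config() {
  cat <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
evictionMaxPodGracePeriod: 60
EOF
}
```

With values like these, the kubelet would start a soft eviction after available memory stays below 500Mi for 90 seconds and evict immediately once it drops below 200Mi.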

If the cause is PDB violation during preemption

  • Accept that PDB enforcement during preemption is best-effort. Do not rely on PDBs to prevent scheduler preemption. Use PDBs for voluntary disruptions such as node drains and rolling updates, and overprovision capacity for critical pods.
  • Increase minAvailable or use maxUnavailable cautiously. A tight PDB on a low-priority workload is more likely to be violated during preemption. Ensure critical workloads have enough replicas spread across failure domains.
  • Monitor disruption controller lag. If PDB violations spike during periods of high API server latency, the disruption controller may be behind. Reduce control plane load or scale the controller manager resources.
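
To see which budgets currently leave no room for disruption, and are therefore the ones preemption would have to violate, a listing like the following may help. It is a sketch that assumes cluster-wide read access to PDBs; the function name is hypothetical.

```shell
# pdbs_without_headroom: list PDBs whose status.disruptionsAllowed is 0.
pdbs_without_headroom() {
  kubectl get pdb --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed,HEALTHY:.status.currentHealthy,DESIRED:.status.desiredHealthy' |
    awk '$3 == 0 { print $1 "/" $2 " healthy=" $4 " desired=" $5 }'
}
```

A PDB that appears here while its workload is low priority is a likely candidate for a best-effort violation the next time a higher-priority pod needs room.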

If the cause is inter-pod affinity or node exclusion

  • Remove unnecessary affinity rules. If a pending pod’s affinity references a lower-priority pod on a node, consider whether that affinity is strictly required. If not, remove it to allow preemption on that node.
  • Use requiredDuringSchedulingIgnoredDuringExecution sparingly. Hard affinity constraints reduce the scheduler’s flexibility and amplify the impact of preemption.
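
To find pending pods that carry hard pod-affinity rules, one option is to list the topology keys of their required affinity terms. This is a sketch under the assumption that kubectl prints <none> for pods without such rules; the function name is hypothetical.

```shell
# pending_with_required_affinity: pending pods that declare required
# (hard) inter-pod affinity, with priority and nominated node.
pending_with_required_affinity() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,PRIORITY:.spec.priority,NOMINATED:.status.nominatedNodeName,REQUIRED_AFFINITY_KEYS:.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution[*].topologyKey' |
    awk '$5 != "<none>"'
}
```

A high-priority pod that shows up here with no nominated node is worth checking against the affinity case described above.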

Prevention

  • Assign PriorityClasses before you need them. Create your critical PriorityClasses and assign them to all system and production pods before introducing any low-priority or batch PriorityClass.
  • Require resource requests in all namespaces. Use a LimitRange to enforce default requests and limits. Pods that stay within their requests rank last for kubelet eviction, and explicit requests make scheduling decisions predictable.
  • Monitor pending pods by priority. Track scheduler_pending_pods and break it down by PriorityClass if possible. A growing queue for high-priority pods signals capacity or preemption problems before they become critical.
  • Test PriorityClass changes in a staging cluster. Apply the class, create a higher-priority pod, and verify which pods are preempted. Do not test PriorityClass changes in production for the first time.
  • Maintain capacity headroom. Keep node requests below 80% of allocatable and maintain pod count headroom. Preemption and eviction are both symptoms of running too close to the edge.
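
A namespace default for requests and limits could be sketched as follows. The helper name, the LimitRange name, and the CPU and memory values are placeholders for illustration only.

```shell
# limitrange_manifest NAMESPACE: print a LimitRange that gives containers
# default requests and limits when the pod spec omits them.
limitrange_manifest() {
  cat <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
  namespace: $1
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 256Mi
EOF
}
```

Note that containers receiving these defaults end up Burstable, because requests and limits differ. Critical pods that need Guaranteed QoS still have to set equal requests and limits explicitly.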

How Netdata helps

Correlate these signals in Netdata to distinguish the root cause:

  • Node memory and disk pressure charts with kubelet eviction events to confirm resource-driven eviction versus scheduler preemption.
  • Container restart and OOM kill counts that spike after preemption or eviction.
  • Pending pod counts and API server latency to catch scheduling backlogs.
  • Controller workqueue depth to detect disruption controller lag.