Kubernetes ResourceQuota exceeded: detection and remediation

A Deployment looks healthy in kubectl get deployment but the new ReplicaSet has zero pods. A Job is accepted but never creates a pod. A CI/CD pipeline fails with opaque 403 errors from the API server. These symptoms often point to ResourceQuota exhaustion. Confirm that the quota is the blocker, identify the exhausted resource, and fix it.

What this means

ResourceQuota is a namespace-scoped object, enforced by the ResourceQuota admission controller, that sets hard aggregate limits on resource consumption. When a tracked resource hits its hard limit, the API server rejects subsequent create requests with HTTP 403 Forbidden and a message containing exceeded quota: <quota-name>. Existing pods keep running; quota is enforced at admission time, not by terminating workloads.

Quota tracks requested resources, not actual usage. If a namespace quota covers requests.cpu or requests.memory, every new pod must specify a request for that resource; the same applies to limits when the quota covers limits.cpu or limits.memory. A pod that omits a tracked request or limit is rejected even if the quota has remaining capacity. The pods quota counts only pods that are not in a terminal phase: Pending, Running, and Unknown pods count against it, while completed Job pods in the Succeeded or Failed phase do not.
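
As a rough illustration of the request requirement, the sketch below prints a ResourceQuota manifest that tracks requests.cpu and requests.memory, plus a pod manifest that declares both requests and would therefore pass that check. The object names (compute-quota, quota-demo), the namespace team-a, and all numeric values are hypothetical, not taken from any real cluster.

```shell
#!/bin/sh
# quota_manifest NAMESPACE
# Print a ResourceQuota manifest (hypothetical name and limits). Because it
# tracks requests.cpu and requests.memory, pods created in the namespace
# afterwards must declare requests for both resources.
quota_manifest() {
  ns=$1
  cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: ${ns}
spec:
  hard:
    pods: "20"
    requests.cpu: "4"
    requests.memory: 8Gi
EOF
}

# pod_manifest NAMESPACE
# Print a pod manifest that declares both tracked requests.
pod_manifest() {
  ns=$1
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: quota-demo
  namespace: ${ns}
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
EOF
}

# Usage (requires cluster access):
#   quota_manifest team-a | kubectl apply -f -
#   pod_manifest team-a | kubectl apply -f -
```

Removing the resources block from the pod manifest and applying it again should reproduce the rejection described above, even though the quota is nearly empty.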

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Rolling update with no surge headroom | Deployment rollout stalls; new ReplicaSet creates zero pods | kubectl get rs -n <namespace> and compare maxSurge to quota slack |
| Jobs silently retrying without creating pods | Job object exists but no pods appear; no visible errors in kubectl get jobs | kubectl describe job <name> -n <namespace> for FailedCreate events |
| Missing resource requests on new pods | Pod create rejected even though total usage seems low | Pod template for resources.requests |
| Operator or CI/CD object leaks | Quota consumed by Secrets, ConfigMaps, or PVCs rather than pods | Resource breakdown in kubectl describe resourcequota |
| Storage or PVC quota exhaustion | New PVC create rejected with exceeded quota; workloads that need the claim cannot start | kubectl get pvc -n <namespace> and the requests.storage and persistentvolumeclaims quota |

Quick checks

# Check all quotas and their current usage across namespaces
kubectl get resourcequota -A

# Describe a specific quota to see which resource is exhausted
kubectl describe resourcequota -n <namespace> <quota-name>

# Get structured used vs hard values
kubectl get resourcequota -n <namespace> -o json | \
  jq '.items[] | {name: .metadata.name, hard: .status.hard, used: .status.used}'

# Find recent quota-related creation failures
kubectl get events -n <namespace> --field-selector reason=FailedCreate

# Check if a Deployment rollout is stuck
kubectl rollout status deployment/<name> -n <namespace>

# Check if a Job is silently retrying without creating pods
kubectl describe job <name> -n <namespace>

# Verify pods have requests set when cpu/memory quota exists
kubectl get pod <name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'

How to diagnose it

  1. Confirm the quota block. Look for FailedCreate events in the namespace. The event message contains exceeded quota: followed by the quota name and the exhausted resource. If events have expired, check kubectl describe resourcequota directly.

  2. Identify the exhausted resource. kubectl describe resourcequota shows a table with Resource, Used, and Hard. Any resource where Used equals Hard is the blocker. Common culprits: pods, requests.cpu, requests.memory, limits.cpu, limits.memory, requests.storage, persistentvolumeclaims, services, secrets, or configmaps.

  3. Check non-terminal pods. The pods quota counts Pending, Running, and Unknown pods. If the pods quota is exhausted but few pods are Running, look for Pending pods stuck unschedulable or pods in the Unknown phase because their node is unreachable.

  4. Inspect rolling-update behavior. If the quota is sized for steady-state capacity and a Deployment uses RollingUpdate, the new ReplicaSet cannot create pods until old pods terminate. The Deployment appears healthy but the rollout stalls. Check ReplicaSet pod counts and compare maxSurge against available quota headroom.

  5. Evaluate Job behavior. The Job object is accepted but its pods are rejected. The controller retries silently. Check the Job’s events for quota failures.

  6. Verify LimitRange interaction. If a pod omits requests and the LimitRange does not provide defaults for a resource tracked by the quota, admission rejects the pod even if the quota is not exhausted.

flowchart TD
    A[Pod create failed or rollout stalled] --> B{Events show exceeded quota?}
    B -->|Yes| C[Describe ResourceQuota]
    B -->|No| D[Check FailedCreate events]
    D --> C
    C --> E{Which resource is at hard limit?}
    E -->|pods| F[Check maxSurge and Pending/Unknown pods]
    E -->|cpu/memory| G[Check pod requests and LimitRange defaults]
    E -->|storage/PVC| H[Check PVC claims and storage quota]
    E -->|secrets/configmaps| I[Audit operator or CI/CD object creation]
    F --> J[Adjust quota, reduce surge, or delete stuck pods and unused objects]
    G --> J
    H --> J
    I --> J
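
Step 2 compares Used against Hard by eye, which gets awkward when the two values use different units (for example 500m against 2). The sketch below is one possible way to automate that comparison in plain sh and awk. The function names quota_pct and flag_near_limit are hypothetical helpers, and the suffix table covers only the common CPU and memory units, so treat it as a starting point rather than a full Kubernetes quantity parser.

```shell
#!/bin/sh
# quota_pct USED HARD
# Print USED as a rounded percentage of HARD. Handles plain numbers, the CPU
# "m" suffix, and the common decimal (k, M, G, T) and binary (Ki, Mi, Gi, Ti)
# suffixes. Any other suffix, or a HARD of zero, makes the function fail.
quota_pct() {
  awk -v used="$1" -v hard="$2" '
    function base(q,    n, s) {
      n = q; sub(/[A-Za-z]+$/, "", n)   # numeric part
      s = q; sub(/^[0-9.]+/, "", s)     # unit suffix
      if (!(s in mult)) { bad = 1; return 0 }
      return n * mult[s]
    }
    BEGIN {
      mult[""] = 1; mult["m"] = 0.001
      mult["k"] = 1e3; mult["M"] = 1e6; mult["G"] = 1e9; mult["T"] = 1e12
      mult["Ki"] = 1024; mult["Mi"] = 1024 ^ 2
      mult["Gi"] = 1024 ^ 3; mult["Ti"] = 1024 ^ 4
      u = base(used); h = base(hard)
      if (bad || h == 0) exit 1
      printf "%.0f\n", 100 * u / h
    }'
}

# flag_near_limit THRESHOLD
# Read "resource used hard" triples on stdin and print the ones at or above
# THRESHOLD percent. Lines with unparseable quantities are skipped.
flag_near_limit() {
  threshold=$1
  while read -r resource used hard; do
    pct=$(quota_pct "$used" "$hard") || continue
    if [ "$pct" -ge "$threshold" ]; then
      printf '%s %s/%s (%s%%)\n' "$resource" "$used" "$hard" "$pct"
    fi
  done
}

# Usage (assumes cluster access and jq; the jq filter emits the triples):
#   kubectl get resourcequota -n <namespace> -o json |
#     jq -r '.items[] | .status as $s | ($s.hard | keys[]) as $k
#            | "\($k) \($s.used[$k] // "0") \($s.hard[$k])"' |
#     flag_near_limit 80
```

A threshold of 100 lists only the resources that are already blocking admission; a lower threshold such as 80 also lists resources that are getting close.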

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| kube_resourcequota used vs hard | Tracks utilization of every quota resource type | Any resource > 80% of hard limit |
| FailedCreate event rate | Direct evidence of admission blocking pod creation | Sustained nonzero rate in a namespace |
| Pending pods count | Pending pods hold quota without doing work; pods rejected by quota are never created, so they do not show up here | Pending pods increasing while scheduler is healthy |
| Deployment ready vs desired replicas | Detects rollout stalls from lack of quota headroom | readyReplicas < spec.replicas for extended period |
| Job active pod count vs completions | Jobs silently back off when quota blocks pod creation | Active pods stuck at zero with completions desired |
| Namespace object count by type | Identifies which object type is consuming count quota | Rapid growth in ConfigMaps, Secrets, or PVCs |

Fixes

If the namespace is genuinely overcommitted

Delete unnecessary pods, scale down Deployments, or remove unused PVCs, Secrets, and ConfigMaps. If the workload is legitimate and persistent, raise the ResourceQuota hard value or move the workload to a less constrained namespace.

If the cause is rolling-update surge

Size the pods quota to steady_state_pods + maxSurge, or reduce maxSurge so the total pod count during rollout stays under quota. If you cannot change quota, switch the Deployment strategy to Recreate, accepting downtime during updates.
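
The sizing rule can be worked through with shell arithmetic. The sketch below resolves maxSurge to a pod count (Kubernetes rounds a percentage maxSurge up) and adds it to the replica count. The function names surge_pods and required_pods_quota are hypothetical, and the result covers a single Deployment only; other workloads in the namespace need their own allowance.

```shell
#!/bin/sh
# surge_pods REPLICAS MAX_SURGE
# Resolve maxSurge to an absolute pod count. MAX_SURGE is either an integer
# ("2") or a percentage ("25%"); percentages are rounded up.
surge_pods() {
  replicas=$1
  case $2 in
    *%) pct=${2%\%}
        echo $(( (replicas * pct + 99) / 100 )) ;;
    *)  echo "$2" ;;
  esac
}

# required_pods_quota REPLICAS MAX_SURGE
# Peak pod count for one Deployment during a rolling update.
required_pods_quota() {
  echo $(( $1 + $(surge_pods "$1" "$2") ))
}

# Example: 10 replicas with maxSurge 25% surge by 3 pods, so the rollout
# peaks at 13 pods.
required_pods_quota 10 25%
```

If the pods quota for this namespace were 10, the rollout in the example would stall with three surge pods rejected; a quota of at least 13 or a smaller maxSurge avoids that.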

If the cause is non-terminal pods holding quota

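Delete stuck Pending or Unknown pods manually; they hold pod, CPU, and memory quota until they are removed. Succeeded and Failed pods do not count against the pods quota, so removing them does not free pod slots. Setting ttlSecondsAfterFinished on Jobs still garbage collects finished Jobs and their pods promptly, which helps when an object-count quota such as count/jobs.batch is the one that is full.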
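
One way to find the pods that are holding quota is to filter by phase with a field selector. The helper below is a minimal sketch; list_quota_holding_pods is a hypothetical name, and it only lists pods so that each one can be reviewed before anything is deleted.

```shell
#!/bin/sh
# list_quota_holding_pods NAMESPACE
# List pods in the Pending and Unknown phases, which still count against
# quota even though they are not running.
list_quota_holding_pods() {
  ns=$1
  for phase in Pending Unknown; do
    kubectl get pods -n "$ns" \
      --field-selector "status.phase=$phase" --no-headers
  done
}

# Usage (requires cluster access):
#   list_quota_holding_pods <namespace>
#   kubectl delete pod <name> -n <namespace>   # after reviewing each pod
```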

If the cause is LimitRange interaction

Ensure every pod template specifies resources.requests for every resource tracked by the quota, or configure a LimitRange to supply defaults. Without defaults, admission rejects pods that omit requests even if the quota is not full.
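
For illustration, a LimitRange that supplies defaults could look like the manifest printed below. The name default-requests and every value are hypothetical and should be sized to the workloads in the namespace; defaultRequest fills in missing requests and default fills in missing limits for each container.

```shell
#!/bin/sh
# limitrange_manifest NAMESPACE
# Print a LimitRange manifest (hypothetical name and values) that gives
# containers default cpu and memory requests and limits when they omit them.
limitrange_manifest() {
  ns=$1
  cat <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
  namespace: ${ns}
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 256Mi
EOF
}

# Usage (requires cluster access):
#   limitrange_manifest <namespace> | kubectl apply -f -
```

Defaults are applied at admission, so they affect only pods created after the LimitRange exists; they also count against the quota like any explicit request.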

If the cause is a transient admission race

During rolling updates, the quota controller’s informer cache may lag, causing status.used to drift and sporadic rejections. If intermittent quota-exceeded errors resolve within seconds, retry the operation. Persistent errors are not transient and require the fixes above.
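
A retry for the transient case might be sketched as follows. The function retry_on_quota and the RETRY_DELAY variable are hypothetical; the wrapper retries only when the command output contains "exceeded quota", doubles the delay between attempts, and gives up after a fixed number of tries so that a persistent quota problem still surfaces as a failure.

```shell
#!/bin/sh
# retry_on_quota MAX_ATTEMPTS COMMAND [ARGS...]
# Run COMMAND; if it fails with "exceeded quota" in its output, retry with
# exponential backoff up to MAX_ATTEMPTS. Any other failure is returned
# immediately. RETRY_DELAY sets the first delay in seconds (default 1).
retry_on_quota() {
  max=$1; shift
  attempt=1
  delay=${RETRY_DELAY:-1}
  while :; do
    if out=$("$@" 2>&1); then
      printf '%s\n' "$out"
      return 0
    fi
    printf '%s\n' "$out" >&2
    case $out in
      *"exceeded quota"*) ;;      # possibly transient: retry
      *) return 1 ;;              # unrelated failure: do not retry
    esac
    if [ "$attempt" -ge "$max" ]; then
      return 1                    # still failing: treat as persistent
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Usage (requires cluster access; pod.yaml is a placeholder file name):
#   retry_on_quota 4 kubectl apply -f pod.yaml
```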

Prevention

  • Account for surge in quota sizing. Set pods quota to (max replicas) + maxSurge for the largest Deployment, plus headroom for DaemonSets and standalone pods.
  • Alert before the wall. Monitor kube_resourcequota and alert when any resource exceeds 80% of its hard limit. Do not wait for 100%.
  • Clean up completed Jobs. Use ttlSecondsAfterFinished on Jobs, and successfulJobsHistoryLimit and failedJobsHistoryLimit on CronJobs, to prevent finished Jobs and their pods from lingering.
  • Audit namespace object growth. Secrets, ConfigMaps, and PVCs created by operators or CI/CD pipelines can silently exhaust count quotas.
  • Use LimitRange defaults. Pair ResourceQuota with a LimitRange that sets default cpu/memory requests so pods are not rejected for omitting them.

How Netdata helps

Netdata surfaces kube_resourcequota metrics from kube-state-metrics. Use them to:

  • Chart used against hard per namespace and resource type to spot approaching limits before failures occur.
  • Correlate spikes in FailedCreate event rates with quota utilization to confirm the bottleneck is admission, not scheduling or image pulls.
  • Overlay Deployment replica counts and pending pod counts to distinguish rollout stalls caused by quota exhaustion from application errors.