Kubernetes Deployment rollout stuck: stalled rollouts and ready replicas
A Deployment rollout that stalls is a silent capacity leak. The old ReplicaSet scales down, the new ReplicaSet stops halfway, and kubectl rollout status blocks until the progress deadline passes, then exits with a non-zero status. Kubernetes does not recover automatically: the controller sets ProgressDeadlineExceeded only after progressDeadlineSeconds elapses, and takes no corrective action. You need to distinguish between a Pod lifecycle blockage, a readiness probe or gate failure, and a rare controller bug that freezes the rollout entirely.
A Pod can be running and passing liveness probes but remain unready because of a missing readiness gate or a failing readiness probe. The controller tracks this, but will not fix it. This guide shows how to isolate the cause.
What this means
A Deployment rollout creates a new ReplicaSet and shifts replicas from the old one. A Pod counts toward readyReplicas only when its Ready condition is True, which requires every container to pass its readiness probe and every condition listed in spec.readinessGates to report True in .status.conditions. Terminating Pods (those with a deletionTimestamp set; Terminating is a kubectl display status, not a formal Pod phase) are no longer counted in availableReplicas, but they continue to consume node resources until fully removed.
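The readiness gate rule can be checked mechanically. The following is a minimal Python sketch, assuming the Pod has been loaded as a dict from kubectl get pod <pod> -o json; the function name and the gate type example.com/load-balancer-attached are hypothetical and exist only for illustration.

```python
def unmet_readiness_gates(pod: dict) -> list:
    """Return readiness gate condition types that are absent or not True."""
    gates = [g["conditionType"] for g in pod.get("spec", {}).get("readinessGates", [])]
    statuses = {c["type"]: c["status"] for c in pod.get("status", {}).get("conditions", [])}
    # A gate blocks readiness if its condition is missing or has any status other than "True".
    return [g for g in gates if statuses.get(g) != "True"]


# Hypothetical Pod: containers are ready, but the gate condition was never written.
example_pod = {
    "spec": {"readinessGates": [{"conditionType": "example.com/load-balancer-attached"}]},
    "status": {
        "conditions": [
            {"type": "ContainersReady", "status": "True"},
            {"type": "Ready", "status": "False"},
        ]
    },
}

print(unmet_readiness_gates(example_pod))
```

A non-empty result for a Pod whose ContainersReady condition is True points at the external controller that should be writing the gate condition, not at the application.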
progressDeadlineSeconds (default 600s) is the controller’s patience timer. If the rollout does not make progress within this window, the Progressing condition becomes False with reason ProgressDeadlineExceeded. Kubernetes does not roll back or restart Pods automatically. If the Deployment is paused, the deadline is not evaluated. After a rollout completes, the condition stays True with reason NewReplicaSetAvailable indefinitely, even if ready replicas later crash or become unschedulable. The controller will not fire ProgressDeadlineExceeded for post-rollout replica shortfall.
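To make the condition semantics concrete, here is a rough Python sketch that classifies a Deployment loaded from kubectl get deployment <name> -o json. The function name and the returned labels are assumptions chosen for this example, not Kubernetes terminology.

```python
def rollout_state(deployment: dict) -> str:
    """Classify a Deployment dict by its Progressing condition and replica counts."""
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    if spec.get("paused"):
        return "paused"  # the progress deadline is not evaluated while paused
    progressing = next(
        (c for c in status.get("conditions", []) if c["type"] == "Progressing"), None
    )
    if progressing is None:
        return "unknown"
    reason = progressing.get("reason")
    if progressing["status"] == "False" and reason == "ProgressDeadlineExceeded":
        return "stalled"
    if reason == "NewReplicaSetAvailable":
        # The rollout finished; any later shortfall is not reported by this condition.
        desired = spec.get("replicas", 1)
        available = status.get("availableReplicas", 0)
        return "complete" if available >= desired else "degraded-after-rollout"
    return "in-progress"


# Hypothetical Deployment: rollout finished earlier, one replica has since been lost.
example_deployment = {
    "spec": {"replicas": 3},
    "status": {
        "availableReplicas": 2,
        "conditions": [
            {"type": "Progressing", "status": "True", "reason": "NewReplicaSetAvailable"}
        ],
    },
}

print(rollout_state(example_deployment))
```

The degraded-after-rollout branch is the case the controller never flags on its own, which is why it needs a separate alert.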
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Readiness probe failure or misconfiguration | New Pods are Running but Ready=False; events show probe failures | kubectl describe pod for probe events and lastState |
| Missing readiness gate condition | ContainersReady=True, Ready=False; no matching custom condition in Pod status | Pod spec for readinessGates and status for corresponding condition types |
| Resource exhaustion or scheduling block | New Pods stuck in Pending; no nodes pass predicates | kubectl describe pod events and node allocatable resources |
| Stale ReplicaSet annotation during scale (maxSurge=0) | Deployment frozen mid-rollout; no error events | Active ReplicaSet annotation deployment.kubernetes.io/desired-replicas vs Deployment spec.replicas |
| Restrictive maxUnavailable / maxSurge | Rollout advances one Pod at a time or halts entirely | kubectl get deployment -o jsonpath='{.spec.strategy.rollingUpdate}' |
| Post-rollout replica loss | Rollout previously succeeded; availableReplicas dropped below spec.replicas | kubectl get deployment for availableReplicas and Pod status |
Quick checks
# Check Deployment conditions (type, status, reason)
kubectl get deployment <name> -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
# Check ready, updated, and unavailable replica counts
kubectl get deployment <name>
# Check Pod readiness and phase for the new ReplicaSet
kubectl get pods -l pod-template-hash=<new-hash> -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,READY:.status.conditions[?(@.type=="Ready")].status'
# Check for readiness gate conditions
kubectl get pod <pod> -o json | jq '.status.conditions[] | {type: .type, status: .status}'
# Print each ReplicaSet's desired-replicas annotation (column 2) and its own replica count (column 3)
kubectl get rs -l app=<label> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.deployment\.kubernetes\.io/desired-replicas}{"\t"}{.spec.replicas}{"\n"}{end}'
# Compare column 2 with the Deployment's spec.replicas. A mismatch on the active ReplicaSet indicates the stale-annotation bug.
# Check events (most rollout events are on Pods and ReplicaSets, not the Deployment)
kubectl get events --field-selector involvedObject.name=<pod-or-rs-name> --sort-by='.lastTimestamp'
# Verify rollout strategy limits
kubectl get deployment <name> -o jsonpath='{"maxUnavailable: "}{.spec.strategy.rollingUpdate.maxUnavailable}{" maxSurge: "}{.spec.strategy.rollingUpdate.maxSurge}{"\n"}'
How to diagnose it
1. Confirm the stall pattern. Run kubectl get deployment <name>. If readyReplicas is below spec.replicas and updatedReplicas is not increasing, the rollout is stalled. If unavailableReplicas is non-zero, Pods are failing to become ready.
2. Check the Progressing condition. Run kubectl get deployment <name> -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}'. If status is False and reason is ProgressDeadlineExceeded, the controller has marked the rollout as stuck. If the condition is still True, the deadline has not yet elapsed.
3. Inspect the new ReplicaSet's Pods. Identify the new pod-template-hash and list those Pods. If they are Pending, the issue is scheduling or image pulling. If they are Running but not Ready, the issue is probes, gates, or container startup.
4. Differentiate container readiness from Pod readiness. Run kubectl get pod <pod> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'. If ContainersReady is True but Ready is False, examine .spec.readinessGates. Then check .status.conditions for the matching gate type. If the gate condition is absent, the Pod will never become ready.
5. Evaluate readiness probes. In kubectl describe pod, look for Unhealthy events with Readiness probe failed. Verify that the probe port, path, and scheme match the listening interface inside the container. If the application has a slow startup, initialDelaySeconds or a startupProbe may be needed.
6. Check for the stale-annotation bug. If the Deployment uses maxSurge=0 and was scaled during the rollout, compare the active ReplicaSet's deployment.kubernetes.io/desired-replicas annotation with the Deployment's spec.replicas. If they differ, the controller is in an infinite loop and the rollout is frozen.
7. Validate strategy arithmetic. maxUnavailable rounds down when computed as a percentage, and maxSurge rounds up. The API rejects a strategy in which both are explicitly zero, and if a percentage maxUnavailable rounds down to zero while maxSurge is zero, the controller treats maxUnavailable as 1, so the rollout advances one Pod at a time. The more common trap is maxUnavailable resolving to zero with a non-zero maxSurge: old Pods are removed only after surge Pods become available, so a single unready or unschedulable surge Pod halts the rollout. Ensure the strategy allows movement.
8. Distinguish rollout stalls from post-rollout decay. If updatedReplicas equals spec.replicas and the Progressing condition shows NewReplicaSetAvailable, the rollout is complete. If availableReplicas then drops, this is a workload health or node problem, not a rollout stall. The controller will not surface ProgressDeadlineExceeded for this.
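The rounding in the strategy-arithmetic step is easy to get wrong by hand. The sketch below reproduces only the rounding rule described above (maxSurge rounds up, maxUnavailable rounds down), with the Kubernetes defaults of 25% for both. The function names are hypothetical, and the sketch does not model any other controller behavior.

```python
def scaled_value(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute int or an 'N%' string against the replica count."""
    if isinstance(value, int):
        return value
    product = int(str(value).rstrip("%")) * replicas
    # Integer arithmetic avoids floating-point surprises: ceil for surge, floor for unavailable.
    return -(-product // 100) if round_up else product // 100


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%"):
    """Return (surge, unavailable) as whole Pods after rounding."""
    surge = scaled_value(max_surge, replicas, round_up=True)
    unavailable = scaled_value(max_unavailable, replicas, round_up=False)
    return surge, unavailable


for replicas, surge, unavailable in [(4, "25%", "25%"), (3, "25%", "25%"), (5, 0, "10%")]:
    print(replicas, surge, unavailable, rollout_bounds(replicas, surge, unavailable))
```

With 3 replicas and the defaults, maxUnavailable resolves to zero and the rollout depends entirely on the single surge Pod becoming ready. With 5 replicas, maxSurge 0, and maxUnavailable 10%, both values round to zero, which is the edge case the strategy-arithmetic step describes.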
flowchart TD
A[Deployment readyReplicas < spec.replicas] --> B{Pod phase?}
B -->|Pending| C[Check scheduling, resources, PVC]
B -->|Running, not Ready| D{ContainersReady?}
D -->|False| E[Check readiness probes, crashes]
D -->|True| F[Check readiness gates]
B -->|Terminating| G[Wait or check terminationGracePeriodSeconds]
C --> H[Fix capacity, taints, or image pull]
E --> I[Fix probe config or app health]
F --> J[Restore external controller or remove gate]
G --> K[Check maxSurge=0 stale annotation bug]
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| status.readyReplicas vs spec.replicas | Direct measure of rollout completion | Gap persists longer than normal startup time |
| status.availableReplicas vs spec.replicas | Tracks Pods that are ready for minReadySeconds | Drop after rollout completes indicates silent degradation |
| Progressing condition reason and status | Only automatic controller signal for stalled rollouts | ProgressDeadlineExceeded or condition stuck at ReplicaSetUpdated |
| Pod Ready false with ContainersReady true | Indicates readiness gate blockage | Any Pod in this state for more than 2 minutes |
| Container restart count | Repeated restarts prevent readiness | Restart count increasing across new ReplicaSet Pods |
| Pod phase Pending | Scheduling, image, or volume stall | Pending duration exceeding 5 minutes |
| kube-controller-manager workqueue_depth | Reconciliation backlog in the controller | Depth increasing while rollout is active |
| Node MemoryPressure or DiskPressure | Pressure evictions kill new Pods before they become ready | Pressure condition true on nodes hosting new Pods |
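Two of the warning signs in the table (Pending for more than 5 minutes, and Ready false with ContainersReady true for more than 2 minutes) can be evaluated directly from Pod JSON. This is a minimal sketch under a few assumptions: the Pod dict comes from kubectl get pod -o json, timestamps use the whole-second UTC format shown, the gate-blocked duration is measured from the ContainersReady transition time, and the function and label names are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def parse_time(ts: str) -> datetime:
    """Parse a Kubernetes-style timestamp such as 2024-01-01T12:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def pod_warnings(pod: dict, now: datetime) -> list:
    """Return warning labels for a Pod, using the thresholds from the table."""
    warnings = []
    status = pod.get("status", {})
    conditions = {c["type"]: c for c in status.get("conditions", [])}

    created = parse_time(pod["metadata"]["creationTimestamp"])
    if status.get("phase") == "Pending" and now - created > timedelta(minutes=5):
        warnings.append("pending-too-long")

    ready = conditions.get("Ready")
    containers_ready = conditions.get("ContainersReady")
    if ready and containers_ready:
        if ready["status"] == "False" and containers_ready["status"] == "True":
            # Assumption: measure from when the containers became ready.
            since = parse_time(containers_ready["lastTransitionTime"])
            if now - since > timedelta(minutes=2):
                warnings.append("readiness-gate-blocked")
    return warnings


now = datetime(2024, 1, 1, 12, 10, 0, tzinfo=timezone.utc)
pending_pod = {
    "metadata": {"creationTimestamp": "2024-01-01T12:00:00Z"},
    "status": {"phase": "Pending", "conditions": []},
}
gated_pod = {
    "metadata": {"creationTimestamp": "2024-01-01T12:00:00Z"},
    "status": {
        "phase": "Running",
        "conditions": [
            {"type": "ContainersReady", "status": "True", "lastTransitionTime": "2024-01-01T12:05:00Z"},
            {"type": "Ready", "status": "False", "lastTransitionTime": "2024-01-01T12:00:30Z"},
        ],
    },
}
print(pod_warnings(pending_pod, now), pod_warnings(gated_pod, now))
```

The thresholds are the ones listed in the table; tune them to your workload's normal scheduling and startup times.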
Fixes
If the cause is readiness probe misconfiguration
Edit the container spec to correct the probe endpoint, increase timeoutSeconds, or add a startupProbe to cover slow initialization. If the application startup is legitimately longer than progressDeadlineSeconds, increase progressDeadlineSeconds in the Deployment spec.
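Before changing probe settings, it helps to work out how long a startupProbe actually tolerates. The sketch below uses the documented probe defaults (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3) and approximates the budget as initialDelaySeconds + failureThreshold × periodSeconds. The function names and the cold-start figures are hypothetical.

```python
def startup_budget_seconds(probe: dict) -> int:
    """Approximate time a startupProbe allows before the container is restarted."""
    return (
        probe.get("initialDelaySeconds", 0)
        + probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10)
    )


def check_startup_fit(cold_start_s: int, startup_probe: dict, progress_deadline_s: int = 600) -> dict:
    """Compare a measured cold-start time with the probe budget and progress deadline."""
    return {
        "probe_budget_s": startup_budget_seconds(startup_probe),
        "probe_covers_cold_start": startup_budget_seconds(startup_probe) >= cold_start_s,
        "deadline_covers_cold_start": progress_deadline_s > cold_start_s,
    }


# Hypothetical application that needs about 4 minutes to initialize.
probe = {"failureThreshold": 30, "periodSeconds": 10}
print(check_startup_fit(240, probe))
# The same probe with a 15-minute cold start fits neither the probe budget nor the default deadline.
print(check_startup_fit(900, probe))
```

If the probe budget is too small, the container restarts before it can become ready; if only the deadline is too small, the rollout is reported as stuck even though Pods would eventually start.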
If the cause is a missing readiness gate condition
Identify the external controller that writes the condition (for example, an ingress controller or service mesh). If that controller is down, restore it. If the gate is not essential, remove it from the Pod template as a workaround.
If the cause is resource or scheduling pressure
Add node capacity, reduce resource requests, or resolve taints that block the new Pods. If the stall is due to an unschedulable Pod, kubectl describe pod will show the specific predicate failure.
If the cause is the stale-annotation bug (maxSurge=0)
On affected Kubernetes versions, if scaling during a rolling update with maxSurge=0 triggers the infinite loop, work around it by forcing the annotation to match: scale the Deployment to a different value and back, or patch the ReplicaSet annotation directly.
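To confirm the mismatch before applying a workaround, you can compare the annotation on each active ReplicaSet with the Deployment's spec.replicas. The following is a sketch, assuming both objects were loaded from kubectl get ... -o json and that a ReplicaSet with zero replicas counts as inactive. The function name and the ReplicaSet names are hypothetical.

```python
ANNOTATION = "deployment.kubernetes.io/desired-replicas"


def stale_replicasets(deployment: dict, replicasets: list) -> list:
    """Return names of active ReplicaSets whose annotation disagrees with spec.replicas."""
    desired = str(deployment["spec"]["replicas"])  # annotation values are strings
    stale = []
    for rs in replicasets:
        if rs.get("spec", {}).get("replicas", 0) == 0:
            continue  # assumption: fully scaled-down ReplicaSets are not active
        value = rs["metadata"].get("annotations", {}).get(ANNOTATION)
        if value is not None and value != desired:
            stale.append(rs["metadata"]["name"])
    return stale


# Hypothetical state: the Deployment was scaled from 4 to 5 during the rollout.
example_deploy = {"spec": {"replicas": 5}}
example_rs = [
    {"metadata": {"name": "web-new", "annotations": {ANNOTATION: "4"}}, "spec": {"replicas": 3}},
    {"metadata": {"name": "web-old", "annotations": {ANNOTATION: "5"}}, "spec": {"replicas": 2}},
    {"metadata": {"name": "web-older", "annotations": {ANNOTATION: "4"}}, "spec": {"replicas": 0}},
]
print(stale_replicasets(example_deploy, example_rs))
```

An empty result means the annotations agree and the freeze has a different cause.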
If the cause is a too-restrictive strategy
Adjust maxUnavailable or maxSurge so that at least one of them is non-zero. For a single-replica Deployment, maxUnavailable: 0 and maxSurge: 1 is a common safe pattern. For larger Deployments, ensure maxUnavailable does not round down to zero unless maxSurge compensates.
If the cause is post-rollout replica loss
This is not fixed by rollout parameters. Investigate Pod crashes, node evictions, or CSI volume failures. Cordon failing nodes and trigger a new rollout only after the underlying issue is resolved.
Prevention
- Set progressDeadlineSeconds slightly above your application's known cold-start time instead of accepting the 600s default, which may be too short or too long for your workload.
- Validate readiness probes in a staging environment before promotion.
- Monitor the kube-controller-manager workqueue_depth to detect reconciliation lag before it shows up as stalled replicas.
- If you must use maxSurge=0 on Kubernetes versions that predate the fix for the stale-annotation issue, avoid scaling Deployments during active rollouts.
- Set explicit resource requests and limits to prevent scheduling stalls and evictions.
- If you use readiness gates, separately monitor the health of the external controllers that maintain those conditions.
- Alert on availableReplicas dropping below spec.replicas independently of rollout status, because Kubernetes does not re-fire ProgressDeadlineExceeded after a rollout finishes.
How Netdata helps
- Correlates node-level CPU, memory, and disk pressure with Pod scheduling failures and evictions that stall rollouts.
- Surfaces container restart loops and OOM kills that prevent new replicas from reaching the ready state.
- Tracks network latency and conntrack utilization to identify infrastructure-level causes of readiness probe timeouts.
- Brings together control-plane signals like API server latency and controller workqueue depth so you can distinguish a controller backlog from an application-level stall.
- Provides per-Pod resource usage to validate whether resource limits are causing startup delays.
Related guides
- If API server latency is causing controllers to lag, see Kubernetes API server slow or unresponsive: causes and fixes.
- For dropped connections that affect readiness probes, see Kubernetes conntrack exhaustion: dropped connections under load.
- If node pressure is evicting new Pods, see Kubernetes eviction cascade: when one node failure takes down the cluster.
- For DNS-related readiness failures, see Kubernetes DNS resolution failures inside pods.
- If scheduling is the blocker, see Kubernetes DaemonSet pods Pending: scheduling and tolerations.






