Back to all scenarios
Scenario #53
Cluster Management
K8s v1.19, Azure AKS
Unresponsive Cluster After Large-Scale Deployment
Cluster became unresponsive after deploying a large number of pods in a single batch.
Find this helpful?
What Happened
The cluster became unresponsive after deploying a batch of 500 pods in a single operation, causing resource exhaustion.
Diagnosis Steps
- 1Checked cluster logs and found that the control plane was overwhelmed with API requests.
- 2Observed resource limits on the nodes, which were maxed out.
Root Cause
The large-scale deployment exhausted the cluster’s available resources, causing a spike in API server load.
Fix/Workaround
• Implemented gradual pod deployment using rolling updates instead of a batch deployment.
• Increased the node resource capacity to handle larger loads.
Lessons Learned
Gradual deployments and resource planning are necessary when deploying large numbers of pods.
How to Avoid
- 1Use rolling updates or deploy in smaller batches.
- 2Monitor cluster resources and scale nodes accordingly.