Back to all scenarios
Scenario #53
Cluster Management
K8s v1.19, Azure AKS

Unresponsive Cluster After Large-Scale Deployment

Cluster became unresponsive after deploying a large number of pods in a single batch.

Find this helpful?
What Happened

The cluster became unresponsive after deploying a batch of 500 pods in a single operation, causing resource exhaustion.

Diagnosis Steps
  • 1Checked cluster logs and found that the control plane was overwhelmed with API requests.
  • 2Observed resource limits on the nodes, which were maxed out.
Root Cause

The large-scale deployment exhausted the cluster’s available resources, causing a spike in API server load.

Fix/Workaround
• Implemented gradual pod deployment using rolling updates instead of a batch deployment.
• Increased the node resource capacity to handle larger loads.
Lessons Learned

Gradual deployments and resource planning are necessary when deploying large numbers of pods.

How to Avoid
  • 1Use rolling updates or deploy in smaller batches.
  • 2Monitor cluster resources and scale nodes accordingly.