Scenario #461
Scaling & Load
Kubernetes v1.25, AWS EKS
Node Failure During Pod Scaling Up
A pod scale-up failed when a node was unexpectedly terminated, leaving the new pods unschedulable.
What Happened
During an autoscaling event, a node was unexpectedly terminated due to a cloud infrastructure issue. The newly created pods failed to schedule because no remaining node had sufficient free resources.
Diagnosis Steps
1. Checked the node status and found that the node had been terminated by AWS.
2. Observed that no remaining node had the resources required by the new pods.
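The steps above can be sketched with standard kubectl commands; the pod name below is a placeholder for one of the pending pods:

```shell
# List nodes; a terminated node shows NotReady or disappears from the list
kubectl get nodes -o wide

# Inspect the pending pod's events to confirm why scheduling failed
# (<pending-pod> is a placeholder)
kubectl describe pod <pending-pod>

# Or list FailedScheduling events across the cluster
kubectl get events --field-selector reason=FailedScheduling --all-namespaces
```

A typical event message here is "0/N nodes are available: insufficient cpu/memory", which distinguishes a capacity shortfall from taints or affinity mismatches.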
Root Cause
Unexpected node failure during the scaling process.
Fix/Workaround
• Configured the Cluster Autoscaler to provision replacement nodes quickly and to maintain spare headroom that absorbs a sudden node failure.
• Ensured the cloud provider's infrastructure health was regularly monitored.
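One common way to keep headroom is the overprovisioning pattern from the Cluster Autoscaler FAQ: low-priority placeholder pods reserve capacity, get evicted the moment real workloads need it, and in turn trigger the autoscaler to add a node. A minimal sketch (replica count and resource requests are assumptions to tune per cluster):

```yaml
# Negative-priority class so placeholder pods are evicted first
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods reserving headroom for real workloads"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                     # assumption: size headroom to your failure domain
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"        # assumption: roughly one workload pod's footprint
              memory: 512Mi
```

When a node dies, displaced workload pods preempt these pause pods immediately instead of waiting the full node-provisioning time.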
Lessons Learned
Autoscaling should anticipate infrastructure issues such as node failure to avoid disruptions.
How to Avoid
1. Set up proactive monitoring for cloud infrastructure and integrate it with Kubernetes scaling mechanisms.
2. Ensure the Cluster Autoscaler is tuned to handle unexpected node failures quickly.
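"Tuned to handle node failures quickly" mostly comes down to a few Cluster Autoscaler flags; the values below are illustrative assumptions, not recommendations for every cluster:

```shell
# Illustrative cluster-autoscaler flags (tune values per cluster)
--scan-interval=10s                  # re-evaluate unschedulable pods frequently
--max-node-provision-time=10m        # stop waiting on a node group that fails to come up
--balance-similar-node-groups=true   # spread replacement capacity across AZs
--expander=least-waste               # choose the node group that wastes the least capacity
```

On EKS, pairing these with Auto Scaling groups spread over multiple Availability Zones means a single-AZ node loss can be backfilled from another zone.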