Scenario #461
Scaling & Load
Kubernetes v1.25, AWS EKS

Node Failure During Pod Scaling Up

Pod scale-up failed when a node was unexpectedly terminated, leaving the new pods unschedulable.

What Happened

During an autoscaling event, a node was unexpectedly terminated due to a cloud infrastructure issue. The new pods then failed to schedule because no remaining node had sufficient resources.

Diagnosis Steps
  1. Checked the node status and found that the node had been terminated by AWS.
  2. Observed that there were no available nodes with the required resources for new pods.
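The diagnosis steps above can be reproduced with kubectl; the node name below is illustrative:

```shell
# 1. Confirm node status -- a terminated node shows NotReady and
#    eventually disappears from the list entirely.
kubectl get nodes -o wide

# 2. Inspect the affected node for termination-related conditions and
#    events (node name is an example).
kubectl describe node ip-10-0-1-23.ec2.internal

# 3. Find pods stuck in Pending and see why the scheduler rejected them.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl get events --all-namespaces \
  --field-selector=reason=FailedScheduling \
  --sort-by=.lastTimestamp

# 4. Compare requested vs. allocatable resources on the remaining nodes.
kubectl describe nodes | grep -A 5 "Allocated resources"
```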
Root Cause

Unexpected node failure during the scaling process.

Fix/Workaround
• Configured the Cluster Autoscaler to provision additional nodes and keep spare capacity to absorb sudden node failures.
• Ensured the cloud provider's infrastructure health was monitored continuously.
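One common way to keep spare capacity with the Cluster Autoscaler is overprovisioning: low-priority placeholder pods hold a node's worth of headroom, and the scheduler evicts them instantly when real workloads arrive. This is a minimal sketch; all names, replica counts, and resource sizes below are assumptions to be tuned per cluster:

```shell
# Reserve headroom with low-priority placeholder pods. The Cluster
# Autoscaler scales the node group up to fit them, so when a node dies
# there is already capacity for the displaced/real workloads.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning          # illustrative name
value: -10                        # lower than any real workload
globalDefault: false
description: "Placeholder pods that reserve spare capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation      # illustrative name
  namespace: kube-system
spec:
  replicas: 2                     # roughly one node's worth of headroom
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
EOF
```

Sizing the placeholder requests to match the largest workload pod ensures an evicted placeholder always frees enough room for a real pod.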
Lessons Learned

Autoscaling should anticipate infrastructure issues such as node failure to avoid disruptions.

How to Avoid
  1. Set up proactive monitoring for cloud infrastructure and integrate it with Kubernetes scaling mechanisms.
  2. Ensure the Cluster Autoscaler is tuned to recover quickly from unexpected node failures.
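On EKS specifically, one way to integrate infrastructure signals with the cluster is the AWS Node Termination Handler, which watches EC2 interruption and maintenance notices and drains the node before it disappears. A sketch of the standard Helm install from the eks-charts repo (the value flags shown are assumptions; check the chart's documented values):

```shell
# Install the AWS Node Termination Handler so EC2 spot-interruption and
# scheduled-maintenance notices trigger a cordon + drain of the node
# before it is terminated, letting pods reschedule gracefully.
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm upgrade --install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true
```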