Scenario #21
Cluster Management
K8s v1.24, AWS EKS with Cluster Autoscaler

Cluster Autoscaler Continuously Spawning and Deleting Nodes

The cluster scaled nodes up and down in rapid succession, destabilizing workloads.

What Happened

A misconfigured deployment had a readiness probe that failed intermittently, causing its pods to flap between Ready and NotReady. The Cluster Autoscaler treated the resulting pending pods as unschedulable and provisioned new nodes; once the pods reported healthy again, it scaled the extra nodes back down, and the cycle repeated.
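For illustration, a probe along these lines can produce exactly that flapping. The endpoint, port, and timings below are hypothetical, but the pattern (a tight timeout combined with failureThreshold: 1) is the classic culprit:

  # Hypothetical probe excerpt showing the kind of configuration that flaps.
  # A 1s timeout plus failureThreshold: 1 marks the pod NotReady on any
  # transiently slow response.
  readinessProbe:
    httpGet:
      path: /healthz        # assumed health endpoint
      port: 8080
    periodSeconds: 5
    timeoutSeconds: 1       # too aggressive: a GC pause or slow dependency fails the check
    failureThreshold: 1     # a single slow response flips the pod to NotReady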

Diagnosis Steps
  1. Monitored Cluster Autoscaler logs (kubectl -n kube-system logs -l app=cluster-autoscaler).
  2. Identified repeated scale-up and scale-down messages.
  3. Traced the churn back to a specific deployment's readiness probe (the commands after this list sketch the workflow).
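A minimal sketch of that workflow. The label selector matches the one above; the grep patterns and the Unhealthy event reason are assumptions based on typical Cluster Autoscaler and kubelet messages, and exact wording varies by version:

  # Step 1: follow the autoscaler's log stream.
  kubectl -n kube-system logs -l app=cluster-autoscaler --tail=200 -f

  # Step 2: filter for scaling decisions (message text varies by CA version).
  kubectl -n kube-system logs -l app=cluster-autoscaler --tail=1000 \
    | grep -iE 'scale[-_ ]?(up|down)'

  # Step 3: find pods failing readiness probes to identify the deployment.
  kubectl get events --all-namespaces --field-selector reason=Unhealthy \
    | grep -i 'readiness probe'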
Root Cause

A flaky readiness probe made healthy pods appear unready, producing phantom unschedulable pods that repeatedly triggered scale-up followed by scale-down.

Fix/Workaround
• Fixed the readiness probe so it accurately reflects pod health.
• Tuned the scale-down-delay-after-add and scale-down-unneeded-time settings to dampen the cycle (see the sketch after this list).
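A sketch of both changes, with illustrative values only. The probe timings are assumptions; the two flags are standard Cluster Autoscaler options, shown here as container arguments on the autoscaler deployment:

  # Corrected probe: allow slower responses and require consecutive failures
  # before marking the pod NotReady.
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3     # three consecutive failures before NotReady

  # Cluster Autoscaler arguments (values illustrative): wait longer after a
  # scale-up before considering scale-down, and require a node to be unneeded
  # for longer before removing it.
  command:
    - ./cluster-autoscaler
    - --scale-down-delay-after-add=15m
    - --scale-down-unneeded-time=15m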
Lessons Learned

Readiness probe results feed directly into Cluster Autoscaler decisions; a single flaky probe can destabilize node scaling for the whole cluster.

How to Avoid
  1. Validate all probes before production deployments.
  2. Use Autoscaler logging to audit scaling activity (see the audit commands below).
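Beyond raw logs, the autoscaler publishes its state to a status ConfigMap (named cluster-autoscaler-status by default) and emits events, both of which can be audited. The event-reason patterns below are assumptions and vary by version:

  # Self-reported autoscaler status (default ConfigMap name).
  kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml

  # Scaling-related events emitted by the autoscaler.
  kubectl get events --all-namespaces | grep -iE 'TriggeredScaleUp|ScaleDown'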