Node Drain Race Condition During Scale Down

Node drain raced with pod termination, causing pod loss.

What Happened

Pods were terminated while the node was still draining, leading to data loss.

Diagnosis Steps

Root Cause

The scale-down process did not wait for node draining to complete before pods were terminated.

Fix/Workaround

• Adjusted terminationGracePeriodSeconds for pods.
• Introduced node draining delay in scaling policy.
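One way to reason about how the two fixes above relate: the drain delay should be at least as long as the longest pod grace period on the node. The sketch below is illustrative only. The helper name min_drain_timeout, the 15-second buffer, and the sample pod specs are assumptions, not values from this incident. The one Kubernetes fact it relies on is that terminationGracePeriodSeconds defaults to 30 when unset.

```python
# Kubernetes applies a 30 s grace period when the field is not set.
DEFAULT_GRACE_SECONDS = 30


def min_drain_timeout(pod_specs, buffer_seconds=15):
    """Return a drain wait (seconds) that covers every pod's grace period.

    pod_specs: list of dicts shaped like a pod's `spec` section.
    buffer_seconds: hypothetical safety margin, not a Kubernetes setting.
    """
    graces = [
        spec.get("terminationGracePeriodSeconds", DEFAULT_GRACE_SECONDS)
        for spec in pod_specs
    ]
    # With no pods on the node, only the buffer remains.
    return max(graces, default=0) + buffer_seconds


# Hypothetical pods scheduled on the node being removed.
pods_on_node = [
    {"terminationGracePeriodSeconds": 120},  # slow shutdown, e.g. flushing data
    {},                                      # falls back to the default
]
print(min_drain_timeout(pods_on_node))
```

If the scaling policy's drain delay is shorter than this value, pods with long grace periods can still be cut off mid-shutdown.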

Lessons Learned

Node draining should be synchronized with pod termination.
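The ordering this lesson calls for can be sketched as a small control loop: evict, wait until the node is empty, and only then terminate it. This is a minimal sketch under stated assumptions, not the autoscaler's actual logic. The callables list_pods, evict_pod, and terminate_node are hypothetical stand-ins for whatever cluster API is in use, and the timeout and poll interval are placeholder values.

```python
import time


def drain_then_terminate(list_pods, evict_pod, terminate_node,
                         timeout_seconds=300, poll_seconds=5,
                         sleep=time.sleep, clock=time.monotonic):
    """Terminate a node only after every pod has left it.

    list_pods, evict_pod and terminate_node are hypothetical callables
    supplied by the caller. Returns True if the node was terminated,
    False if the drain timed out and the node was left running.
    """
    # Ask every pod currently on the node to shut down gracefully.
    for pod in list_pods():
        evict_pod(pod)

    # Block until the node is empty or the deadline passes.
    deadline = clock() + timeout_seconds
    while list_pods():
        if clock() >= deadline:
            return False  # do not terminate a node that still has pods
        sleep(poll_seconds)

    terminate_node()
    return True
```

Returning False on timeout, rather than terminating anyway, keeps the failure visible instead of silently losing pods.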

How to Avoid