Scenario #128
Cluster Management
K8s v1.19, GKE

Node Pool Draining Timeout Due to Slow Pod Termination

The node pool draining process timed out during upgrades due to pods taking longer than expected to terminate.

What Happened

During a node pool upgrade, nodes took longer than expected to drain because several pods had long graceful termination periods, and the upgrade process timed out as a result.

Diagnosis Steps
  1. Observed that kubectl get pods showed several pods stuck in the Terminating state for extended periods.
  2. Checked the pod logs and found the pods were waiting on a cleanup process to complete during termination (see the commands sketched below).
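
A minimal sketch of how this could be confirmed from the CLI; jq and cluster access are assumed, and the pod/namespace names are placeholders rather than values from the incident:

```sh
# List pods that kubectl shows as Terminating (i.e. deletionTimestamp is set),
# together with each pod's configured grace period.
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.metadata.deletionTimestamp != null)
  | "\(.metadata.namespace)/\(.metadata.name) grace=\(.spec.terminationGracePeriodSeconds)s"'

# Follow one slow pod's logs to see what its shutdown path is waiting on.
kubectl logs <pod-name> -n <namespace> --follow
```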
Root Cause

Resource cleanup tasks in the pods' shutdown path made termination slow, which in turn delayed node draining beyond the upgrade's timeout.

Fix/Workaround
• Reduced the pods' graceful termination period (terminationGracePeriodSeconds) so drains give up on slow shutdowns sooner (a sketch follows this list).
• Optimized the resource cleanup tasks in the pods to shorten termination times.
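
For the first fix, a minimal sketch of lowering the grace period on a workload; the Deployment name web and the 30-second value are assumptions for illustration, not from the incident:

```sh
# Lower the pod template's grace period so drains move on from slow cleanup
# sooner. Note: anything still running when the grace period expires is
# SIGKILLed, so only reduce this after confirming cleanup can finish in time
# (or is safe to cut short).
kubectl patch deployment web --type merge -p '
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
'
```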
Lessons Learned

Pod termination times should be minimized to avoid delays during node drains or upgrades.
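
One common pattern for keeping termination fast is to handle SIGTERM explicitly in the container entrypoint and keep cleanup bounded; a minimal sketch, where the server path and cleanup steps are placeholders:

```sh
#!/bin/sh
# Sketch of an entrypoint that reacts to SIGTERM promptly. /app/server and
# the cleanup body are hypothetical stand-ins for the real workload.

cleanup() {
  echo "SIGTERM received, starting bounded cleanup"
  # Keep this well under terminationGracePeriodSeconds; anything slower is
  # cut off by SIGKILL when the grace period expires.
  kill -TERM "$app_pid" 2>/dev/null
  wait "$app_pid"
  exit 0
}

# Run the workload in the background so the shell can receive signals.
/app/server &
app_pid=$!
trap cleanup TERM
wait "$app_pid"
```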

How to Avoid
  • 1Optimize pod termination logic and cleanup tasks to ensure quicker pod termination.
  • 2Regularly test node draining during cluster maintenance to identify potential issues.
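
A drain rehearsal might look like the following; the node name and timeout are assumptions, and --delete-local-data is the flag name in the v1.19-era kubectl this scenario uses:

```sh
# Drain a single node during a maintenance window with an explicit timeout.
# If this times out, some workload on the node terminates slower than 5m
# and would also stall an automated node pool upgrade.
kubectl drain gke-pool-node-1 --ignore-daemonsets --delete-local-data --timeout=5m
```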