Scenario #1
Cluster Management
K8s v1.23, On-prem bare metal, Systemd cgroups
Zombie Pods Causing Node Drain to Hang
Node drain stuck indefinitely due to an unresponsive terminating pod.
What Happened
A pod with a custom finalizer never completed termination, blocking kubectl drain. Even after the pod was marked for deletion, the API server kept the Pod object around because the finalizer was never removed.
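To see this state directly, check the stuck pod's deletionTimestamp and remaining finalizers; this is a minimal sketch, and <pod-name> and <namespace> are placeholders for the affected pod:

# deletionTimestamp is set, but the Pod object persists until the finalizers list is empty
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'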
Diagnosis Steps
1. Checked kubectl get pods --all-namespaces -o wide to find lingering pods (see the command sketch after this list).
2. Found the pod stuck in Terminating state for over 20 minutes.
3. Used kubectl describe pod <pod> to identify the presence of a custom finalizer.
4. Investigated the logs of the controller responsible for the finalizer; the controller had crashed.
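The list and log checks map roughly to the commands below (the finalizer check itself is shown earlier); the controller deployment name and namespace are assumptions and will differ per cluster:

# Steps 1-2: find pods stuck in Terminating across the cluster
kubectl get pods --all-namespaces -o wide | grep Terminating

# Step 4: inspect the controller that owns the finalizer; since it had crashed, --previous shows the last run's logs
kubectl logs deployment/<finalizer-controller> -n <controller-namespace> --previous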
Root Cause
Finalizer logic was never executed because its controller was down, leaving the pod undeletable.
Fix/Workaround
kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":[]}}' --type=merge
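Once the finalizers list is cleared, the API server removes the Pod object almost immediately and the drain can proceed. Note that force-clearing finalizers skips whatever cleanup the controller was supposed to perform; if the controller can be restarted quickly, letting it finish is the safer path. A quick verification sketch (names are placeholders; the drain flags shown are the common ones for K8s v1.23):

# The pod should now be gone, and the drain should complete
kubectl get pod <pod-name> -n <namespace>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data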
Lessons Learned
Finalizers should include timeout or fail-safe logic so that a crashed controller cannot block deletion indefinitely.
How to Avoid
1. Avoid finalizers unless absolutely necessary.
2. Add monitoring for stuck Terminating pods (see the sketch below).
3. Implement retry/timeout logic in finalizer controllers.
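For step 2, a simple option without extra tooling is a periodic check (cron or a scheduled job) for pods whose deletionTimestamp is older than a threshold. The sketch below assumes GNU date and jq are available; the 20-minute cutoff mirrors the incident above:

# List pods that have been Terminating for more than 20 minutes (ISO 8601 UTC timestamps compare lexicographically)
kubectl get pods --all-namespaces -o json | jq -r --arg cutoff "$(date -u -d '-20 minutes' +%Y-%m-%dT%H:%M:%SZ)" \
  '.items[] | select(.metadata.deletionTimestamp != null and .metadata.deletionTimestamp < $cutoff) | "\(.metadata.namespace)/\(.metadata.name)"'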