Scenario #26
Cluster Management
K8s v1.22, managed AKS
Taints and Tolerations Mismatch Prevented Workload Scheduling
Workloads failed to schedule on new nodes that had a taint the workloads didn’t tolerate.
What Happened
The platform team added a new node pool tainted with node-role.kubernetes.io/gpu:NoSchedule but did not add a matching toleration to the GPU workloads.
Diagnosis Steps
1. kubectl describe pod showed the reason: “0/3 nodes are available: node(s) had taints”.
2. Checked node taints via kubectl get nodes -o json.
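The raw JSON from step 2 is verbose, so a small helper can condense it to one taint list per node. The following is a minimal sketch, assuming the output of kubectl get nodes -o json has already been parsed into a dict; the function name taints_by_node is hypothetical and not part of any Kubernetes tooling.

```python
def taints_by_node(node_list: dict) -> dict[str, list[str]]:
    """Map each node name to its taints, rendered as key[=value]:effect."""
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        rendered = []
        # spec.taints is absent (or null) on nodes that carry no taints.
        for taint in node.get("spec", {}).get("taints") or []:
            value = taint.get("value")
            pair = f"{taint['key']}={value}" if value else taint["key"]
            rendered.append(f"{pair}:{taint['effect']}")
        result[name] = rendered
    return result
```

One way to use it is to pipe kubectl get nodes -o json into a script that calls taints_by_node(json.load(sys.stdin)) and prints the result; nodes from the new pool should then show node-role.kubernetes.io/gpu:NoSchedule.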
Root Cause
The taint on the new node pool was not matched by any toleration in the GPU pods' specs.
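The matching rule behind this root cause can be sketched in a few lines. This is an illustrative approximation of how a toleration is compared against a taint (effect, then key, then operator), not the scheduler's actual code; the function names tolerates and blocking_taints are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # An empty toleration effect matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    # An empty toleration key matches every taint key.
    if toleration.get("key") and toleration["key"] != taint.get("key"):
        return False
    operator = toleration.get("operator") or "Equal"  # Equal is the default
    if operator == "Exists":
        return True
    if operator == "Equal":
        return (toleration.get("value") or "") == (taint.get("value") or "")
    return False


def blocking_taints(tolerations: list, taints: list) -> list:
    """Return the hard taints (NoSchedule/NoExecute) that no toleration matches."""
    return [
        taint for taint in taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in tolerations)
    ]
```

With an empty tolerations list, as in the affected GPU workloads, blocking_taints returns the gpu taint, which is why the scheduler could not place the pods on the new nodes.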
Fix/Workaround
• Added proper tolerations to workloads:
```yaml
tolerations:
  - key: "node-role.kubernetes.io/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
Lessons Learned
Node taints should be rolled out together with the tolerations of the workloads that are meant to run on those nodes.
How to Avoid
1. Use preset toleration templates in CI/CD pipelines.
2. Test new node pools with dummy workloads.
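A dummy workload for a new pool can be generated rather than hand-written. The sketch below builds a throwaway Pod manifest as JSON, which kubectl apply -f accepts. Everything specific in it is an assumption for illustration: the helper name smoke_test_pod, the pod name, the busybox image, and the agentpool: gpupool node label (check the real label with kubectl get nodes --show-labels).

```python
import json


def smoke_test_pod(name: str, taint_key: str, pool_label: dict) -> dict:
    """Build a throwaway Pod that tolerates the pool's taint and targets the pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # A toleration only permits scheduling onto tainted nodes;
            # the nodeSelector is what steers the pod to the new pool.
            "nodeSelector": pool_label,
            "tolerations": [
                {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "probe",
                    "image": "busybox",  # assumed image, any small one works
                    "command": ["sh", "-c", "echo scheduled"],
                }
            ],
        },
    }


manifest = smoke_test_pod(
    "gpu-pool-smoke-test",            # hypothetical pod name
    "node-role.kubernetes.io/gpu",    # taint key from this scenario
    {"agentpool": "gpupool"},         # hypothetical node pool label
)
print(json.dumps(manifest, indent=2))
```

Piping the output into kubectl apply -f - and then checking kubectl get pod gpu-pool-smoke-test -o wide shows whether the pod lands on a node in the new pool. If it stays Pending, the events in kubectl describe pod point to the taint or label that does not match.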