Scenario #2
Cluster Management
K8s v1.24, GKE, heavy use of custom controllers
API Server Crash Due to Excessive CRD Writes
The API server crashed after a malfunctioning controller flooded it with custom resource creations.
What Happened
A bug in a custom controller created thousands of Custom Resources (CRs) in a tight reconciliation loop. The write flood overwhelmed etcd, slowing all writes, and the API server eventually became unresponsive.
Diagnosis Steps
1. API latency increased, leading to 504 Gateway Timeout errors in kubectl.
2. Counted custom resource instances of the suspect type (e.g. kubectl get <crd-plural> -A --no-headers | wc -l, since kubectl get crds | wc -l only counts the CRD definitions); a client-go sketch for counting instances per CRD follows this list.
3. Analyzed controller logs and found an infinite reconcile loop on a specific CR type.
4. etcd disk I/O was saturated.
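For reference, here is a minimal client-go sketch of the counting step: it enumerates installed CRDs and counts the instances of each custom type. It is an illustration rather than the exact tooling used in the incident; it assumes default kubeconfig access, picks the first declared version of each CRD, and full List calls like these add load to an already stressed API server, so paginate or fall back to kubectl where possible.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	apiext "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	crdClient := apiext.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)

	// Enumerate installed CRDs, then count instances of each custom type.
	crds, err := crdClient.ApiextensionsV1().CustomResourceDefinitions().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, crd := range crds.Items {
		gvr := schema.GroupVersionResource{
			Group:    crd.Spec.Group,
			Version:  crd.Spec.Versions[0].Name, // simplification: first declared version
			Resource: crd.Spec.Names.Plural,
		}
		// Listing across all namespaces is heavy; on a stressed API server,
		// prefer paginated lists (ListOptions.Limit) or targeted kubectl calls.
		list, err := dyn.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Printf("%s.%s: list failed: %v\n", gvr.Resource, gvr.Group, err)
			continue
		}
		fmt.Printf("%s.%s: %d instances\n", gvr.Resource, gvr.Group, len(list.Items))
	}
}
```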
Root Cause
Bad logic in the reconcile loop: Create was called unconditionally on every pass, regardless of the object's current state, so each reconcile added more resources and flooded the cluster.
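Below is a hedged reconstruction of the failure mode, since the report only states that Create was called unconditionally. The controller name, resource kind, and GenerateName detail are hypothetical, but the shape of the bug is the same: every reconcile pass creates another object, and the resulting watch events can keep the loop spinning.

```go
package controllers

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WidgetReconciler is a hypothetical stand-in for the controller in this
// scenario; the real CRD and controller names are not given in the report.
type WidgetReconciler struct {
	client.Client
}

// Anti-pattern: Create is called unconditionally on every reconcile.
func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	child := &unstructured.Unstructured{}
	child.SetAPIVersion("example.com/v1") // hypothetical group/version
	child.SetKind("WidgetChild")          // hypothetical kind
	child.SetNamespace(req.Namespace)
	// A generated name means Create never fails with AlreadyExists, so every
	// pass adds one more CR instead of converging on a single desired object.
	child.SetGenerateName(fmt.Sprintf("%s-child-", req.Name))

	// BUG: no check of existing state before writing.
	if err := r.Create(ctx, child); err != nil {
		return ctrl.Result{}, err
	}
	// The create also emits a watch event, which can trigger the next
	// reconcile and keep the loop running.
	return ctrl.Result{}, nil
}
```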
Fix/Workaround
• Scaled the controller to 0 replicas.
• Manually deleted thousands of stale CRs using batch deletion (one way to script this is sketched below).
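The report does not say which batch-deletion mechanism was used; one option is a dynamic-client DeleteCollection call, which removes all matching instances with a single API request per namespace instead of thousands of individual deletes. The group/version/resource, namespace, and label selector below are hypothetical. The same effect is available from the command line with kubectl delete <crd-plural> --all -n <namespace>.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)

	// Hypothetical GVR for the flooded custom resource type.
	gvr := schema.GroupVersionResource{
		Group:    "example.com",
		Version:  "v1",
		Resource: "widgetchildren",
	}

	// One server-side call per namespace deletes every instance matching the
	// selector, avoiding thousands of individual DELETE requests from the client.
	err = dyn.Resource(gvr).Namespace("default").DeleteCollection(
		context.TODO(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "app.kubernetes.io/managed-by=widget-controller"}, // hypothetical label
	)
	if err != nil {
		panic(err)
	}
}
```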
Lessons Learned
Always test reconcile logic in a sandboxed cluster.
How to Avoid
1. Implement create/update guards in the reconciliation loop (see the sketch after this list).
2. Add a Prometheus alert for abnormally high CR counts.
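A sketch of what a create/update guard can look like, using controller-runtime's controllerutil.CreateOrUpdate helper; the resource names are the same hypothetical ones as in the earlier anti-pattern sketch. The helper performs a Get first, creates the object only if it is missing, and issues an Update only when the mutate function changes something, so a steady-state reconcile produces no writes at all.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// WidgetReconciler is the same hypothetical controller as in the
// anti-pattern sketch under "Root Cause".
type WidgetReconciler struct {
	client.Client
}

// Guarded version: a deterministic child name plus CreateOrUpdate, which
// gets the object first, creates it only when absent, and updates it only
// when the mutate function actually changes something.
func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	child := &unstructured.Unstructured{}
	child.SetAPIVersion("example.com/v1") // hypothetical group/version
	child.SetKind("WidgetChild")          // hypothetical kind
	child.SetNamespace(req.Namespace)
	child.SetName(req.Name + "-child") // deterministic name, not GenerateName

	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, child, func() error {
		// Declare the desired spec here and keep it idempotent: if nothing
		// changes, CreateOrUpdate issues no write at all.
		return unstructured.SetNestedField(child.Object, "desired", "spec", "state")
	})
	return ctrl.Result{}, err
}
```

Pairing a guard like this with the alert on CR counts (the second item above) catches regressions that slip past the guard before they can overwhelm etcd.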