Scenario #2
Cluster Management
K8s v1.24, GKE, heavy use of custom controllers
API Server Crash Due to Excessive CRD Writes
The API server crashed after a malfunctioning controller flooded it with custom resource creations.
What Happened
A bug in a custom controller created thousands of Custom Resources (CRs) in a tight reconciliation loop. The write flood overwhelmed etcd, slowing all writes, and the API server eventually became unresponsive.
Diagnosis Steps
1. API latency increased, leading to 504 Gateway Timeout errors in kubectl.
2. Counted custom resource instances of the suspect type (e.g. kubectl get <crd-plural> -A --no-headers | wc -l, since kubectl get crds | wc -l only counts the CRD definitions); a client-go sketch for counting instances per CRD follows this list.
3. Analyzed controller logs and found an infinite reconcile loop on a specific CR type.
4. etcd disk I/O was saturated.
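For reference, here is a minimal client-go sketch of the counting step: it enumerates installed CRDs and counts the instances of each custom type. It is an illustration rather than the exact tooling used in the incident; it assumes default kubeconfig access, picks the first declared version of each CRD, and full List calls like these add load to an already stressed API server, so paginate or fall back to kubectl where possible.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	apiext "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	crdClient := apiext.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)

	// Enumerate installed CRDs, then count instances of each custom type.
	crds, err := crdClient.ApiextensionsV1().CustomResourceDefinitions().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, crd := range crds.Items {
		gvr := schema.GroupVersionResource{
			Group:    crd.Spec.Group,
			Version:  crd.Spec.Versions[0].Name, // simplification: first declared version
			Resource: crd.Spec.Names.Plural,
		}
		// Listing across all namespaces is heavy; on a stressed API server,
		// prefer paginated lists (ListOptions.Limit) or targeted kubectl calls.
		list, err := dyn.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Printf("%s.%s: list failed: %v\n", gvr.Resource, gvr.Group, err)
			continue
		}
		fmt.Printf("%s.%s: %d instances\n", gvr.Resource, gvr.Group, len(list.Items))
	}
}
```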
Root Cause
Bad logic in the reconcile loop: Create was called unconditionally on every pass, regardless of the object's current state, so each reconcile added more resources and flooded the cluster.
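Below is a hedged reconstruction of the failure mode, since the report only states that Create was called unconditionally. The controller name, resource kind, and GenerateName detail are hypothetical, but the shape of the bug is the same: every reconcile pass creates another object, and the resulting watch events can keep the loop spinning.

```go
package controllers

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WidgetReconciler is a hypothetical stand-in for the controller in this
// scenario; the real CRD and controller names are not given in the report.
type WidgetReconciler struct {
	client.Client
}

// Anti-pattern: Create is called unconditionally on every reconcile.
func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	child := &unstructured.Unstructured{}
	child.SetAPIVersion("example.com/v1") // hypothetical group/version
	child.SetKind("WidgetChild")          // hypothetical kind
	child.SetNamespace(req.Namespace)
	// A generated name means Create never fails with AlreadyExists, so every
	// pass adds one more CR instead of converging on a single desired object.
	child.SetGenerateName(fmt.Sprintf("%s-child-", req.Name))

	// BUG: no check of existing state before writing.
	if err := r.Create(ctx, child); err != nil {
		return ctrl.Result{}, err
	}
	// The create also emits a watch event, which can trigger the next
	// reconcile and keep the loop running.
	return ctrl.Result{}, nil
}
```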
Fix/Workaround
• Scaled the controller to 0 replicas.
• Manually deleted thousands of stale CRs using batch deletion (one way to script this is sketched below).
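The report does not say which batch-deletion mechanism was used; one option is a dynamic-client DeleteCollection call, which removes all matching instances with a single API request per namespace instead of thousands of individual deletes. The group/version/resource, namespace, and label selector below are hypothetical. The same effect is available from the command line with kubectl delete <crd-plural> --all -n <namespace>.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)

	// Hypothetical GVR for the flooded custom resource type.
	gvr := schema.GroupVersionResource{
		Group:    "example.com",
		Version:  "v1",
		Resource: "widgetchildren",
	}

	// One server-side call per namespace deletes every instance matching the
	// selector, avoiding thousands of individual DELETE requests from the client.
	err = dyn.Resource(gvr).Namespace("default").DeleteCollection(
		context.TODO(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "app.kubernetes.io/managed-by=widget-controller"}, // hypothetical label
	)
	if err != nil {
		panic(err)
	}
}
```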
Lessons Learned
Always test reconcile logic in a sandboxed cluster.
How to Avoid
1. Implement create/update guards in the reconciliation loop (see the sketch after this list).
2. Add a Prometheus alert for abnormally high CR counts.
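A sketch of what a create/update guard can look like, using controller-runtime's controllerutil.CreateOrUpdate helper; the resource names are the same hypothetical ones as in the earlier anti-pattern sketch. The helper performs a Get first, creates the object only if it is missing, and issues an Update only when the mutate function changes something, so a steady-state reconcile produces no writes at all.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// WidgetReconciler is the same hypothetical controller as in the
// anti-pattern sketch under "Root Cause".
type WidgetReconciler struct {
	client.Client
}

// Guarded version: a deterministic child name plus CreateOrUpdate, which
// gets the object first, creates it only when absent, and updates it only
// when the mutate function actually changes something.
func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	child := &unstructured.Unstructured{}
	child.SetAPIVersion("example.com/v1") // hypothetical group/version
	child.SetKind("WidgetChild")          // hypothetical kind
	child.SetNamespace(req.Namespace)
	child.SetName(req.Name + "-child") // deterministic name, not GenerateName

	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, child, func() error {
		// Declare the desired spec here and keep it idempotent: if nothing
		// changes, CreateOrUpdate issues no write at all.
		return unstructured.SetNestedField(child.Object, "desired", "spec", "state")
	})
	return ctrl.Result{}, err
}
```

Pairing a guard like this with the alert on CR counts (the second item above) catches regressions that slip past the guard before they can overwhelm etcd.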