Scenario #2
Cluster Management
K8s v1.24, GKE, heavy use of custom controllers

API Server Crash Due to Excessive CRD Writes

The API server crashed after being flooded by a malfunctioning controller that created too many custom resources.

What Happened

A bug in a controller created thousands of Custom Resources (CRs) in a tight reconciliation loop. etcd was flooded, writes slowed, and the API server eventually became unresponsive.

Diagnosis Steps
  1. API latency increased, leading to 504 Gateway Timeout errors in kubectl.
  2. Counted instances of the suspect CR type with kubectl get <cr-type> -A --no-headers | wc -l (kubectl get crds lists only the CRD definitions, not the flood of instances); a programmatic count is sketched below.
  3. Analyzed controller logs – found an infinite reconcile loop on a specific CR type.
  4. etcd disk I/O was maxed out.
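For step 2, the same count can be taken programmatically when scripting the check. This is a sketch under assumptions, not the exact tooling from the incident: it presumes a hypothetical widgets.example.com/v1 CRD (resource name widgets) and a kubeconfig at the default path, and it pages through the list with Limit/Continue so thousands of objects are never held in memory at once.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumed to be at the default ~/.kube/config path).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// GVR of the suspect custom resource type (hypothetical names).
	gvr := schema.GroupVersionResource{Group: "example.com", Version: "v1", Resource: "widgets"}

	// Page through the list in chunks so a flood of CRs is never loaded at once.
	count, cont := 0, ""
	for {
		list, err := dyn.Resource(gvr).List(context.TODO(),
			metav1.ListOptions{Limit: 500, Continue: cont})
		if err != nil {
			log.Fatal(err)
		}
		count += len(list.Items)
		cont = list.GetContinue()
		if cont == "" {
			break
		}
	}
	fmt.Printf("found %d instances of %s.%s\n", count, gvr.Resource, gvr.Group)
}
```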
Root Cause

Faulty logic in the reconcile loop: Create was called unconditionally on every pass, regardless of the resource's current state, so each reconcile added more objects and flooded etcd with writes.
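Below is a minimal controller-runtime sketch of the guard that was missing; WidgetReconciler and childConfigMapFor are placeholder names, and a ConfigMap stands in for whatever object the real controller was creating. The point is only that Create runs when a Get confirms the object is absent, so a re-queued reconcile no longer produces a new object on every pass.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WidgetReconciler is a placeholder for the real controller in this incident.
type WidgetReconciler struct {
	client.Client
}

func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Desired child object for this parent resource (placeholder helper).
	desired := childConfigMapFor(req.NamespacedName)

	// Guard: look the object up first and only create it when it is actually missing.
	var existing corev1.ConfigMap
	err := r.Get(ctx, types.NamespacedName{Namespace: desired.Namespace, Name: desired.Name}, &existing)
	switch {
	case apierrors.IsNotFound(err):
		// Missing: create exactly once.
		if err := r.Create(ctx, desired); err != nil {
			return ctrl.Result{}, err
		}
	case err != nil:
		// Transient API error: surface it and let the workqueue retry with backoff.
		return ctrl.Result{}, err
	default:
		// Already present: update fields in place if needed; never create again.
	}
	return ctrl.Result{}, nil
}

// childConfigMapFor stands in for whatever object the real controller created per reconcile.
func childConfigMapFor(owner types.NamespacedName) *corev1.ConfigMap {
	cm := &corev1.ConfigMap{}
	cm.Namespace = owner.Namespace
	cm.Name = owner.Name + "-child"
	return cm
}
```

controller-runtime's controllerutil.CreateOrUpdate packages the same get-then-create-or-update pattern into a single helper, which makes it harder to reintroduce this class of bug.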

Fix/Workaround
• Scaled the controller to 0 replicas.
• Manually deleted the thousands of stale CRs using batch deletion (sketched below).
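A hedged sketch of the batch deletion, again assuming the hypothetical widgets.example.com/v1 CRD, a default kubeconfig, and the default namespace; a single DeleteCollection call per namespace removes every instance without issuing one DELETE request per CR from the client.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// GVR and namespace of the flooded custom resource type (hypothetical names).
	gvr := schema.GroupVersionResource{Group: "example.com", Version: "v1", Resource: "widgets"}

	// One DeleteCollection call removes every instance in the namespace,
	// avoiding one DELETE request per CR from the client side.
	if err := dyn.Resource(gvr).Namespace("default").DeleteCollection(
		context.TODO(), metav1.DeleteOptions{}, metav1.ListOptions{}); err != nil {
		log.Fatal(err)
	}
}
```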
Lessons Learned

Always test reconcile logic in a sandboxed cluster.

How to Avoid
  1. Implement create/update guards in reconciliation.
  2. Add a Prometheus alert for high CR count.