Back to all scenarios
Scenario #57
Cluster Management
K8s v1.21, AWS EKS
Cluster-wide Service Outage Due to Missing ClusterRoleBinding
Cluster-wide service outage occurred after an automated change removed a critical ClusterRoleBinding.
Find this helpful?
What Happened
A misconfigured automation pipeline accidentally removed a ClusterRoleBinding, which was required for certain critical services to function.
Diagnosis Steps
- 1Analyzed service logs and found permission-related errors.
- 2Checked the RBAC configuration and found the missing ClusterRoleBinding.
Root Cause
Automated pipeline incorrectly removed the ClusterRoleBinding, causing service permissions to be revoked.
Fix/Workaround
• Restored the missing ClusterRoleBinding.
• Manually verified that affected services were functioning correctly.
Lessons Learned
Automation changes must be reviewed and tested to prevent accidental permission misconfigurations.
How to Avoid
- 1Use automated tests and checks for RBAC changes.
- 2Implement safeguards and approval workflows for automated configuration changes.