Scenario #356
Storage
Kubernetes v1.23, Ceph CSI
Ceph RBD Volume Crashes Pods Under IOPS Saturation
Under heavy I/O, Ceph volumes became unresponsive, leading to kernel-level I/O errors in pods.
What Happened
The application workload generated sustained random writes until the Ceph cluster's IOPS limit was reached; RBD requests then stalled and pods began hitting kernel-level I/O errors.
Diagnosis Steps
1. dmesg logs: blk_update_request: I/O error.
2. Pod logs: database fsync errors.
3. Ceph health: HEALTH_WARN: slow ops (these checks can be scripted; see the sketch after this list).
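A minimal Python sketch of the triage checks above, assuming node-level access to dmesg and a ceph CLI with admin credentials; the health-check key names (e.g. SLOW_OPS) can vary between Ceph releases, so treat the field lookups as assumptions to verify.

```python
#!/usr/bin/env python3
"""Rough triage helper: scan dmesg for block-layer I/O errors and list any
slow-ops health checks reported by Ceph. Intended to be run on a node with
the ceph admin keyring; adjust for your environment."""
import json
import subprocess


def kernel_io_errors() -> list[str]:
    """Return dmesg lines matching the block-layer I/O error seen in this incident."""
    out = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if "blk_update_request: I/O error" in line]


def ceph_slow_ops() -> dict:
    """Return Ceph health checks whose name mentions SLOW (e.g. SLOW_OPS)."""
    out = subprocess.run(
        ["ceph", "health", "detail", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    health = json.loads(out)
    return {k: v for k, v in health.get("checks", {}).items() if "SLOW" in k}


if __name__ == "__main__":
    errors = kernel_io_errors()
    print(f"kernel I/O error lines in dmesg: {len(errors)}")
    for check, detail in ceph_slow_ops().items():
        print(check, detail.get("summary", {}).get("message", ""))
```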
Root Cause
The Ceph RBD pool was under-provisioned for the workload's IOPS requirements.
Fix/Workaround
• Migrated the workload to SSD-backed Ceph pools.
• Throttled application write concurrency (one approach is sketched below).
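For the throttling workaround, a minimal sketch of one approach, assuming the application's write path can be wrapped: a semaphore caps in-flight writes so the workload cannot push more concurrent I/O than the pool can absorb. MAX_INFLIGHT_WRITES and write_record() are hypothetical placeholders, not the incident's actual code.

```python
"""Sketch: bound concurrent writes with a semaphore to keep RBD IOPS demand
below the pool's sustainable rate."""
import os
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_INFLIGHT_WRITES = 8          # tune against the pool's sustainable IOPS
_write_gate = threading.Semaphore(MAX_INFLIGHT_WRITES)


def write_record(path: str, payload: bytes) -> None:
    """Placeholder for the real write path (e.g. a DB insert followed by fsync)."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def throttled_write(path: str, payload: bytes) -> None:
    """Block until a write slot is free, keeping in-flight writes bounded."""
    with _write_gate:
        write_record(path, payload)


if __name__ == "__main__":
    # Many producer threads, but at most MAX_INFLIGHT_WRITES writes hit disk at once.
    with ThreadPoolExecutor(max_workers=64) as pool:
        for i in range(1000):
            pool.submit(throttled_write, "/data/app.log", f"row-{i}\n".encode())
```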
Lessons Learned
Distributed storage systems can fail silently under stress; here the first visible symptoms were kernel-level I/O errors in the pods rather than alerts from Ceph.
How to Avoid
1. Benchmark storage IOPS and latency before rollout (see the sketch after this list).
2. Alert on high RBD latency.
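A pre-rollout benchmark can be as simple as running fio against a test PVC provisioned from the candidate StorageClass and gating on the results. The sketch below assumes fio is installed, a test volume is mounted at /mnt/test-pvc, and the IOPS/latency targets shown; all three are placeholders, and the JSON field names assume a recent fio release.

```python
"""Sketch: run a random-write fio job against a test PVC and fail if the
measured IOPS or p99 completion latency miss the workload's targets."""
import json
import subprocess

TEST_FILE = "/mnt/test-pvc/fio.dat"   # path on a PVC from the candidate StorageClass
REQUIRED_IOPS = 5000                  # expected sustained random-write IOPS
MAX_P99_LAT_MS = 50                   # acceptable p99 completion latency


def run_fio() -> dict:
    cmd = [
        "fio", "--name=randwrite", "--filename=" + TEST_FILE,
        "--rw=randwrite", "--bs=4k", "--iodepth=32", "--direct=1",
        "--ioengine=libaio", "--size=1G", "--runtime=60", "--time_based",
        "--output-format=json",
    ]
    return json.loads(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)


def check(result: dict) -> bool:
    write = result["jobs"][0]["write"]
    iops = write["iops"]
    p99_ms = write["clat_ns"]["percentile"]["99.000000"] / 1e6
    print(f"randwrite: {iops:.0f} IOPS, p99 latency {p99_ms:.1f} ms")
    return iops >= REQUIRED_IOPS and p99_ms <= MAX_P99_LAT_MS


if __name__ == "__main__":
    raise SystemExit(0 if check(run_fio()) else 1)
```

For the alerting item, the Ceph manager's Prometheus module can expose OSD (and, when per-pool RBD stats are enabled) RBD latency metrics, which a standard Prometheus latency alert rule can consume.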