Scenario #356
Category: Storage
Environment: Kubernetes v1.23, Ceph CSI

Ceph RBD Volume Crashes Pods Under IOPS Saturation

Under heavy I/O, Ceph volumes became unresponsive, leading to kernel-level I/O errors in pods.

What Happened

The application workload generated sustained random writes that pushed the Ceph cluster to its IOPS limit.

Diagnosis Steps
  1. dmesg on the node showed repeated blk_update_request: I/O error entries.
  2. Pod logs showed database fsync errors.
  3. ceph health reported HEALTH_WARN: slow ops (see the command sketch below).
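The following commands reproduce these checks; the pod and namespace names are placeholders.

    # Kernel-level I/O errors on the node running the affected pod
    dmesg -T | grep -i "blk_update_request"

    # Application-level symptoms (fsync failures) in the pod itself
    kubectl logs <pod-name> -n <namespace> | grep -i fsync

    # Cluster health and slow-op details
    ceph health detail
    ceph -s

    # Per-OSD commit/apply latency to spot saturated OSDs
    ceph osd perf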
Root Cause

The Ceph RBD pool was under-provisioned for the workload's sustained random-write IOPS (the checks below show one way to confirm this).
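One way to confirm that the pool itself is the bottleneck (the pool name is a placeholder): rados bench writes synthetic objects directly into the pool, so it measures the pool's ceiling independently of the application and the CSI layer.

    # Client I/O rates per pool and utilisation per OSD/device class
    ceph osd pool stats <pool-name>
    ceph osd df

    # 60-second synthetic write benchmark against the pool
    rados bench -p <pool-name> 60 write --no-cleanup
    rados -p <pool-name> cleanup

If the benchmark tops out near the throughput the application was seeing, the pool is the limit rather than the CSI driver or the network.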

Fix/Workaround
• Migrated the workload to SSD-backed Ceph pools (StorageClass sketch below).
• Throttled application write concurrency to keep I/O within the pool's capacity.
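A sketch of a ceph-csi RBD StorageClass pointing at an SSD-backed pool. It assumes a replicated pool named ssd-rbd already exists and the usual ceph-csi secrets are deployed; the clusterID, pool, secret names, and namespace are placeholders for your environment.

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ceph-rbd-ssd
    provisioner: rbd.csi.ceph.com
    parameters:
      clusterID: <ceph-cluster-fsid>            # fsid of the Ceph cluster
      pool: ssd-rbd                             # SSD-backed pool (assumed name)
      imageFeatures: layering
      csi.storage.k8s.io/fstype: ext4
      csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
      csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
      csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
      csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi
      csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
      csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
    reclaimPolicy: Delete
    allowVolumeExpansion: true

On the Ceph side, the pool is typically pinned to the ssd device class with a CRUSH rule (for example: ceph osd crush rule create-replicated ssd-rule default host ssd) and that rule is then set on the pool.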
Lessons Learned

Distributed storage systems can degrade quietly under sustained load; the first visible symptoms were I/O errors inside application pods rather than a hard failure reported by the storage layer.

How to Avoid
  1. Benchmark storage before rollout (fio sketch below).
  2. Alert on high RBD latency (example rule below).
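A minimal fio sketch for the benchmarking step, assuming fio runs inside a test pod with the candidate PVC mounted at /data (the path, size, and runtime are placeholders); the 4k random-write pattern matches the workload that caused this incident.

    fio --name=randwrite --directory=/data --rw=randwrite --bs=4k \
        --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4 \
        --size=2G --runtime=120 --time_based --group_reporting

For the alerting step, a sketch of a Prometheus rule. Per-image RBD latency metrics usually require enabling per-pool RBD stats in the ceph-mgr prometheus module, so this example uses average OSD write latency as a proxy; the metric names assume that exporter and may differ between Ceph releases, and the 100 ms threshold is an assumption to tune for your cluster.

    groups:
      - name: ceph-rbd-latency
        rules:
          - alert: CephOSDWriteLatencyHigh
            expr: |
              rate(ceph_osd_op_w_latency_sum[5m])
                / rate(ceph_osd_op_w_latency_count[5m]) > 0.1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Average Ceph OSD write latency above 100 ms for 10 minutes"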