Scenario #4
Cluster Management
K8s v1.25, Bare-metal cluster
Etcd Disk Full Causing API Server Timeout
etcd ran out of disk space, making the API server unresponsive.
What Happened
The cluster started failing API requests. Etcd logs showed disk space errors, and API server logs showed failed storage operations.
Diagnosis Steps
1. Ran df -h on the etcd nodes and confirmed the disk was full.
2. Inspected /var/lib/etcd and found an excessive number of WAL and snapshot files.
3. Used etcdctl to check the database size.
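The checks above can be sketched as a few shell commands. This is a minimal sketch, assuming the default data directory /var/lib/etcd and that etcdctl already has its endpoint and TLS settings (for example through the ETCDCTL_ENDPOINTS, ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY environment variables). The helper name disk_used_pct is hypothetical and not part of any tool.

```bash
# Print the percentage of space used on the filesystem that holds a path.
# `df -P` forces the POSIX output format, so the data is always on line 2
# and the capacity is always in column 5.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Typical use on an etcd node (paths assume the default data directory):
#   disk_used_pct /var/lib/etcd
#   du -sh /var/lib/etcd/member/wal /var/lib/etcd/member/snap
#   etcdctl endpoint status --write-out=table   # the DB SIZE column shows the database size
```

The du line separates WAL growth from snapshot growth, which helps decide whether the database itself or the files around it are using the space.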
Root Cause
Without regular compaction and snapshotting, historical revisions and WAL files accumulated until the disk filled up.
Fix/Workaround
```bash
etcdctl compact <rev>
etcdctl defrag
```
- Cleaned up old logs and snapshots, and temporarily increased disk space.
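The `<rev>` placeholder is usually the current revision reported by the cluster. The following is a hedged sketch of how it could be obtained and fed into the two commands. The function names extract_revision and compact_and_defrag are hypothetical, etcdctl is assumed to be configured for the v3 API with working endpoint and TLS settings, and the final alarm disarm step only matters if etcd raised a NOSPACE alarm, which the incident notes do not state.

```bash
# Read `etcdctl endpoint status --write-out=json` output on stdin and
# print the first revision number found in it.
extract_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Compact history up to the current revision, then defragment to return
# the freed space to the filesystem.
compact_and_defrag() {
  local rev
  rev=$(etcdctl endpoint status --write-out=json | extract_revision)
  if [ -z "$rev" ]; then
    echo "could not determine the current revision" >&2
    return 1
  fi
  etcdctl compact "$rev" || return 1
  etcdctl defrag || return 1
  # Assumption: clears a NOSPACE alarm if one was raised.
  etcdctl alarm disarm
}
```

Compaction only marks old revisions as free inside the database file; defrag is what shrinks the file on disk. Defrag acts on the endpoints etcdctl is pointed at, so it has to be run against each member.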
Lessons Learned
etcd requires periodic maintenance: compaction and defragmentation must be run or automated for disk usage to stay bounded.
How to Avoid
1. Enable automatic compaction.
2. Monitor disk space usage of etcd volumes.
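Both measures can be sketched briefly. The etcd flags shown in the comments are real options, but the one-hour retention is an assumed value to tune per cluster. The check_disk_below function is a hypothetical helper that a cron job or monitoring agent might call, and it is not a substitute for proper alerting.

```bash
# etcd flags that enable automatic compaction (set them wherever etcd is
# launched, such as a systemd unit or a static pod manifest):
#   --auto-compaction-mode=periodic
#   --auto-compaction-retention=1h    # assumed value, tune per cluster

# Return 0 if the filesystem holding path $1 is below $2 percent used,
# otherwise print a warning to stderr and return 1.
check_disk_below() {
  local path=$1 limit=$2 used
  used=$(df -P "$path" | awk 'NR == 2 { gsub("%", "", $5); print $5 }')
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $path is ${used}% full (limit ${limit}%)" >&2
    return 1
  fi
  echo "OK: $path is ${used}% full"
}

# Example: check_disk_below /var/lib/etcd 80
```

A threshold well below 100% leaves room to run compaction and defragmentation before etcd stops accepting writes.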