Scenario #4
Cluster Management
K8s v1.25, Bare-metal cluster

Etcd Disk Full Causing API Server Timeout

etcd ran out of disk space, making the API server unresponsive.

What Happened

The cluster started failing API requests. Etcd logs showed disk space errors, and API server logs showed failed storage operations.

Diagnosis Steps
  1. Used df -h on the etcd nodes, which confirmed the disk was full.
  2. Reviewed /var/lib/etcd and found an excessive number of WAL and snapshot files.
  3. Used etcdctl to check the size of the etcd database.
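The three diagnosis steps can be collected into one read-only report. The sketch below assumes the standard etcd data-directory layout (WAL files under member/wal, snapshots and the database file "db" under member/snap). It also assumes that etcdctl, if installed, is version 3.4 or later (v3 API by default) and reads its endpoints and TLS settings from ETCDCTL_* environment variables. report_etcd_disk is a hypothetical helper name.

```shell
#!/bin/sh
# report_etcd_disk [DATA_DIR]: hypothetical read-only helper for the diagnosis steps.
# DATA_DIR defaults to /var/lib/etcd, the path used in this scenario.
report_etcd_disk() {
  dir=${1:-/var/lib/etcd}

  # Step 1: how full is the filesystem holding the data dir?
  echo "== filesystem =="
  df -h "$dir" || return 1

  # Step 2: how much space do the WAL and snapshot directories take?
  # (member/snap also holds the database file "db".)
  echo "== WAL and snapshot directories =="
  du -sh "$dir/member/wal" "$dir/member/snap" 2>/dev/null || true

  # Step 3: DB size as etcd itself reports it (DB SIZE column of the table).
  echo "== etcd endpoint status =="
  if command -v etcdctl >/dev/null 2>&1; then
    etcdctl endpoint status --write-out=table || true
  else
    echo "etcdctl not found; skipping" >&2
  fi
}
```

On an etcd node you would run it as report_etcd_disk, or as report_etcd_disk /path/to/data-dir if the data directory lives elsewhere.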
Root Cause

Without regular compaction and snapshotting, the disk filled up with historical key revisions and WAL files.

Fix/Workaround
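etcdctl compact <rev>    # drop key history older than revision <rev>
etcdctl defrag           # return the space freed by compaction to the filesystem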
• Cleaned up logs and old snapshots, and temporarily increased disk space.
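The compact-then-defrag sequence can be scripted as shown below. This sketch follows the maintenance procedure in the etcd documentation: read the current revision from endpoint status, compact up to it, defragment, and then disarm any NOSPACE alarm. It assumes etcdctl 3.4 or later, with endpoints and TLS settings supplied through ETCDCTL_ENDPOINTS, ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY. revision_from_status and compact_and_defrag are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the first "revision" value found in the JSON
# produced by `etcdctl endpoint status --write-out=json` (passed as $1).
revision_from_status() {
  printf '%s\n' "$1" | grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
}

# Hypothetical helper: compact to the current revision, defragment, clear alarms.
compact_and_defrag() {
  status=$(etcdctl endpoint status --write-out=json) || return 1
  rev=$(revision_from_status "$status")
  if [ -z "$rev" ]; then
    echo "could not determine the current revision" >&2
    return 1
  fi
  # Discard all key history older than the current revision.
  etcdctl compact "$rev" || return 1
  # Compaction frees pages inside the DB file; defrag returns them to the filesystem.
  etcdctl defrag || return 1
  # Clear a NOSPACE alarm if one was raised (etcd raises it when the
  # backend exceeds its space quota), so the cluster accepts writes again.
  etcdctl alarm disarm
}
```

Defragmentation blocks reads and writes on the member being defragmented. It also typically needs free space for a temporary copy of the database, so freeing or adding disk space first, as in the workaround above, makes it more likely to succeed.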
Lessons Learned

etcd requires periodic maintenance: compacting old revisions and defragmenting the database to reclaim the freed space.

How to Avoid
  1. Enable automatic compaction.
  2. Monitor disk space usage on the etcd volumes.
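Here is a minimal sketch of both prevention items, assuming etcd is configured through its command-line flags. The compaction and retention flags named in the comments are real etcd options, but the one-hour retention is an illustrative assumption, not a recommendation. check_etcd_disk and its threshold argument are hypothetical, intended for a check you might run from cron or a monitoring agent.

```shell
#!/bin/sh
# 1. Automatic compaction: add flags like these to the etcd command line
#    (the retention value is an assumption; pick one that fits your workload):
#      --auto-compaction-mode=periodic --auto-compaction-retention=1h
#    If WAL or snapshot files pile up, also check that --max-wals and
#    --max-snapshots (default 5 each; 0 means unlimited) are not set to 0.
#
# 2. Disk monitoring. Usage: check_etcd_disk PATH THRESHOLD_PERCENT
#    Prints a warning and returns 1 when the filesystem holding PATH is at or
#    above THRESHOLD_PERCENT full; returns 2 if usage cannot be read; else 0.
check_etcd_disk() {
  used=$(df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  if [ -z "$used" ]; then
    echo "cannot read disk usage for $1" >&2
    return 2
  fi
  if [ "$used" -ge "$2" ]; then
    echo "WARNING: $1 is ${used}% full (threshold ${2}%)" >&2
    return 1
  fi
  return 0
}
```

For example, a scheduled job could run check_etcd_disk /var/lib/etcd 80 and raise an alert on a non-zero exit status, leaving time to compact and defragment before the disk fills.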