What is etcd?
etcd is the single source of truth for your entire Kubernetes cluster. Every object you create — Pods, Services, ConfigMaps, Secrets, RBAC roles — is serialized and stored in etcd. If you lose etcd without a backup, you lose your cluster. Period.
It is a distributed, strongly consistent key-value store built on the Raft consensus algorithm. The Kubernetes API server is the only component that talks to etcd directly; everything else reads and writes through the API server.
"etcd is to Kubernetes what a database is to a web application — except losing the database means losing the entire system state."
Kubernetes control plane architecture — etcd cluster (right) is accessed exclusively by the API server.
How Kubernetes Uses etcd
All cluster state lives under the /registry key prefix. Every resource type gets its own hierarchical path. Here's a snapshot of what lives in etcd:
| etcd Key | What It Stores |
|---|---|
| /registry/pods/{namespace}/{name} | Pod spec and status |
| /registry/services/specs/{ns}/{name} | Service definitions |
| /registry/secrets/{ns}/{name} | Secrets (base64, optionally encrypted) |
| /registry/configmaps/{ns}/{name} | ConfigMaps |
| /registry/deployments/{ns}/{name} | Deployment specs |
| /registry/leases/kube-system/{component} | Leader election leases |
You can inspect raw etcd data directly using etcdctl:
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
get /registry/pods --prefix --keys-only
The Raft Consensus Algorithm
etcd uses Raft to guarantee linearizable consistency across all members. Every write goes through the elected leader; followers only replicate and vote.
Raft write path: the leader replicates to followers, waits for a quorum of ACKs, then commits and replies to the client.
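You can observe the Raft state directly, including which member currently holds the leadership, with etcdctl (the TLS paths assume a kubeadm layout, as in the earlier examples):

```shell
# Show each member's role, Raft term, and DB size; the IS LEADER
# column identifies the current Raft leader.
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```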
Why Odd Node Counts?
Raft commits a write only after a majority (quorum) of members acknowledge it: quorum = floor(n/2) + 1. A 3-node cluster has a quorum of 2 and tolerates one failure. A 4-node cluster needs 3 votes and still tolerates only one failure, so the extra node adds cost and complexity without adding fault tolerance. Always use 3 or 5 members in production.
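The arithmetic is easy to check for yourself: quorum is floor(n/2) + 1, and fault tolerance is whatever is left over.

```shell
# Quorum for an n-member Raft cluster is floor(n/2) + 1;
# the cluster tolerates n - quorum simultaneous failures.
for n in 1 2 3 4 5 6 7; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerates=$(( n - quorum ))"
done
```

Note that even member counts never gain a failure over the odd count below them, which is exactly why 4 is a worse deal than 3.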
Backup and Restore
Critical: Always backup etcd before cluster upgrades, node additions, or any destructive operation. Automate this with a CronJob — test restores regularly.
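A minimal CronJob sketch for the automated daily snapshot might look like the following; the image tag, schedule, node selector, and host paths are assumptions to adapt to your environment:

```yaml
# Hypothetical backup CronJob: runs on a control-plane node and snapshots
# the local etcd member to a hostPath directory. Adjust paths and image.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"          # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: snapshot
              image: registry.k8s.io/etcd:3.5.9-0   # match your etcd version
              command:
                - /bin/sh
                - -c
                - >
                  ETCDCTL_API=3 etcdctl snapshot save
                  /backup/etcd-snapshot-$(date +%Y%m%d).db
                  --endpoints=https://127.0.0.1:2379
                  --cacert=/etc/kubernetes/pki/etcd/ca.crt
                  --cert=/etc/kubernetes/pki/etcd/server.crt
                  --key=/etc/kubernetes/pki/etcd/server.key
              volumeMounts:
                - name: pki
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: pki
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /var/backups/etcd
          restartPolicy: OnFailure
```

Pair this with retention pruning and an off-node copy; a snapshot that lives only on the control-plane disk is not a backup.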
Taking a Snapshot
ETCDCTL_API=3 etcdctl snapshot save \
/backup/etcd-snapshot-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Verifying the Snapshot
etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table
Restoring from Snapshot
# 1. Stop API server
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# 2. Restore to a new data dir
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restored \
--name=master \
--initial-cluster=master=https://127.0.0.1:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://127.0.0.1:2380
# 3. Update etcd manifest --data-dir, then restore API server
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
Encryption at Rest
By default, Secrets stored in etcd are only base64-encoded — not encrypted. Anyone with etcd access can read them. Enable encryption at rest with an EncryptionConfiguration file passed to the API server:
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}
Pass this config to the API server and re-encrypt existing secrets:
# API server flag
--encryption-provider-config=/etc/kubernetes/encryption-config.yaml
# Re-encrypt all existing secrets
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
Performance Tuning
etcd is highly sensitive to disk I/O latency — it writes a WAL entry to disk on every committed write. On a slow disk, the Raft heartbeat can be missed, causing leader re-elections and degraded performance.
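Before tuning flags, it is worth confirming the disk is actually fast enough. etcd's hardware guidance suggests benchmarking fdatasync latency with fio; the invocation below (sizes chosen to roughly mimic etcd's WAL write pattern) is a commonly used starting point, assuming fio is installed and /var/lib/etcd-fio-test is on the same disk as the etcd data dir:

```shell
# Benchmark fdatasync latency with a WAL-like write pattern.
# etcd wants the 99th-percentile fdatasync latency under ~10ms.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-fio-test --size=22m --bs=2300 \
    --name=etcd-wal-test
```

Look at the fsync/fdatasync percentiles in the output; if p99 is above 10ms, no amount of flag tuning will make etcd happy on that disk.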
| Parameter | Default | Recommended |
|---|---|---|
| --heartbeat-interval | 100ms | 250ms |
| --election-timeout | 1000ms | 1250ms |
| --quota-backend-bytes | 2GB | 8GB |
| --auto-compaction-retention | 0 (off) | 1h |
# Compact old revisions
REV=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
etcdctl compact $REV
# Defragment (during low-traffic windows)
etcdctl defrag --endpoints=https://127.0.0.1:2379 [tls-flags]
Monitoring etcd
etcd serves Prometheus metrics at /metrics: on the client port (2379) by default, or on a dedicated port such as 2381 when --listen-metrics-urls is set (kubeadm configures this). These are the key metrics to alert on:
| Metric | What to Watch For |
|---|---|
| etcd_server_leader_changes_seen_total | Should be near 0 — frequent re-elections = disk/network issue |
| etcd_disk_wal_fsync_duration_seconds | p99 < 10ms — high latency = slow disk |
| etcd_disk_backend_commit_duration_seconds | p99 < 25ms |
| etcd_network_peer_round_trip_time_seconds | < 10ms within same datacenter |
| etcd_mvcc_db_total_size_in_bytes | Alert at 80% of quota |
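The thresholds above translate fairly directly into Prometheus alerting rules. A sketch, with windows and thresholds as starting points rather than gospel (etcd_server_quota_backend_bytes reports the configured quota):

```yaml
groups:
  - name: etcd
    rules:
      - alert: EtcdFrequentLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
        for: 5m
        annotations:
          summary: "etcd is re-electing leaders; check disk and network latency"
      - alert: EtcdSlowWALFsync
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
        for: 10m
        annotations:
          summary: "etcd WAL fsync p99 above 10ms; disk is too slow"
      - alert: EtcdDBNearQuota
        expr: etcd_mvcc_db_total_size_in_bytes > 0.8 * etcd_server_quota_backend_bytes
        for: 10m
        annotations:
          summary: "etcd DB at 80% of quota; compact and defragment"
```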
Common Failure Scenarios
Cluster Loses Quorum
If a majority of etcd members are down (two of three, for example), the cluster loses quorum and stops accepting writes. Kubernetes cannot schedule new Pods or update any objects; existing workloads keep running, but the control plane is effectively frozen. Repair or replace the failed members, or restore from a snapshot.
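To see which members are still reachable and what the cluster believes its membership is, query a surviving member (kubeadm certificate paths assumed):

```shell
# Per-member health; unreachable members report an error instead of "true".
ETCDCTL_API=3 etcdctl endpoint health --cluster --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Current membership list, including any member that must be
# removed and re-added during repair.
ETCDCTL_API=3 etcdctl member list --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```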
NOSPACE Alarm
Symptom: etcdserver: mvcc: database space exceeded. etcd has raised a NOSPACE alarm and rejects all writes because the keyspace exceeded the storage quota. To recover:
# 1. Compact old revisions
etcdctl compact $(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
# 2. Defrag all members
etcdctl defrag --endpoints=...
# 3. Disarm the alarm
etcdctl alarm disarm --endpoints=...
Security Checklist
- TLS certificates for client-to-server and peer-to-peer communication
- Separate PKI for etcd — don't reuse the Kubernetes CA
- Encryption at rest enabled for Secrets (and ConfigMaps)
- etcd port 2379 not reachable outside the control plane network
- Automated daily backups with tested restore runbooks
- Monitoring and alerting on leader elections and disk latency
- Storage quota configured with auto-compaction enabled
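Several of these items can be spot-checked from a control-plane node. For example, you can verify encryption at rest is actually active by writing a throwaway Secret and reading its raw bytes from etcd (kubeadm certificate paths assumed):

```shell
# Write a probe Secret, then dump its raw stored bytes from etcd.
kubectl create secret generic enc-test --from-literal=probe=1

ETCDCTL_API=3 etcdctl get /registry/secrets/default/enc-test \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key | hexdump -C | head

# An encrypted value starts with the provider prefix k8s:enc:aescbc:v1:.
# If "probe" is readable in plaintext, encryption at rest is NOT active.
kubectl delete secret enc-test
```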