What is etcd?
etcd is the single source of truth for your entire Kubernetes cluster. Every object you create — Pods, Services, ConfigMaps, Secrets, RBAC roles — is serialized and stored in etcd. If you lose etcd without a backup, you lose your cluster. Period.
It is a distributed, strongly consistent key-value store built on the Raft consensus algorithm. The Kubernetes API server is the only component that talks to etcd directly; everything else reads and writes through the API server.
"etcd is to Kubernetes what a database is to a web application — except losing the database means losing the entire system state."
Kubernetes control plane architecture — etcd cluster (right) is accessed exclusively by the API server.
How Kubernetes Uses etcd
All cluster state lives under the /registry key prefix. Every resource type gets its own hierarchical path. Here's a snapshot of what lives in etcd:
| etcd Key | What It Stores |
|---|---|
| /registry/pods/{namespace}/{name} | Pod spec and status |
| /registry/services/specs/{ns}/{name} | Service definitions |
| /registry/secrets/{ns}/{name} | Secrets (base64, optionally encrypted) |
| /registry/configmaps/{ns}/{name} | ConfigMaps |
| /registry/deployments/{ns}/{name} | Deployment specs |
| /registry/leases/kube-system/{component} | Leader election leases |
You can inspect raw etcd data directly using etcdctl:
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
get /registry/pods --prefix --keys-only
The Raft Consensus Algorithm
etcd uses Raft to guarantee linearizable consistency across all members. Every write goes through the elected leader; followers only replicate and vote.
Raft write path: the leader replicates to followers, waits for a quorum of ACKs, then commits and replies to the client.
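You can observe the Raft state directly, including which member currently holds the leadership, with etcdctl (the TLS paths assume a kubeadm layout, as in the earlier examples):

```shell
# Show each member's role, Raft term, and DB size; the IS LEADER
# column identifies the current Raft leader.
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```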
Why Odd Node Counts?
Raft commits a write only after a majority (quorum) of members acknowledge it: quorum = floor(n/2) + 1. A 3-node cluster has a quorum of 2 and tolerates one failure. A 4-node cluster needs 3 votes and still tolerates only one failure, so the extra node adds cost and complexity without adding fault tolerance. Always use 3 or 5 members in production.
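The arithmetic is easy to check for yourself: quorum is floor(n/2) + 1, and fault tolerance is whatever is left over.

```shell
# Quorum for an n-member Raft cluster is floor(n/2) + 1;
# the cluster tolerates n - quorum simultaneous failures.
for n in 1 2 3 4 5 6 7; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerates=$(( n - quorum ))"
done
```

Note that even member counts never gain a failure over the odd count below them, which is exactly why 4 is a worse deal than 3.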
Backup and Restore
Critical: Always backup etcd before cluster upgrades, node additions, or any destructive operation. Automate this with a CronJob — test restores regularly.
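A minimal CronJob sketch for the automated daily snapshot might look like the following; the image tag, schedule, node selector, and host paths are assumptions to adapt to your environment:

```yaml
# Hypothetical backup CronJob: runs on a control-plane node and snapshots
# the local etcd member to a hostPath directory. Adjust paths and image.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"          # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: snapshot
              image: registry.k8s.io/etcd:3.5.9-0   # match your etcd version
              command:
                - /bin/sh
                - -c
                - >
                  ETCDCTL_API=3 etcdctl snapshot save
                  /backup/etcd-snapshot-$(date +%Y%m%d).db
                  --endpoints=https://127.0.0.1:2379
                  --cacert=/etc/kubernetes/pki/etcd/ca.crt
                  --cert=/etc/kubernetes/pki/etcd/server.crt
                  --key=/etc/kubernetes/pki/etcd/server.key
              volumeMounts:
                - name: pki
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: pki
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /var/backups/etcd
          restartPolicy: OnFailure
```

Pair this with retention pruning and an off-node copy; a snapshot that lives only on the control-plane disk is not a backup.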
Taking a Snapshot
ETCDCTL_API=3 etcdctl snapshot save \
/backup/etcd-snapshot-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Verifying the Snapshot
etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table
Restoring from Snapshot
# 1. Stop API server
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# 2. Restore to a new data dir
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restored \
--name=master \
--initial-cluster=master=https://127.0.0.1:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://127.0.0.1:2380
# 3. Update etcd manifest --data-dir, then restore API server
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
Encryption at Rest
By default, Secrets stored in etcd are only base64-encoded — not encrypted. Anyone with etcd access can read them. Enable encryption at rest with an EncryptionConfiguration file passed to the API server:
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}
Pass this config to the API server and re-encrypt existing secrets:
# API server flag
--encryption-provider-config=/etc/kubernetes/encryption-config.yaml
# Re-encrypt all existing secrets
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
Performance Tuning
etcd is highly sensitive to disk I/O latency — it writes a WAL entry to disk on every committed write. On a slow disk, the Raft heartbeat can be missed, causing leader re-elections and degraded performance.
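Before tuning flags, it is worth confirming the disk is actually fast enough. etcd's hardware guidance suggests benchmarking fdatasync latency with fio; the invocation below (sizes chosen to roughly mimic etcd's WAL write pattern) is a commonly used starting point, assuming fio is installed and /var/lib/etcd-fio-test is on the same disk as the etcd data dir:

```shell
# Benchmark fdatasync latency with a WAL-like write pattern.
# etcd wants the 99th-percentile fdatasync latency under ~10ms.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-fio-test --size=22m --bs=2300 \
    --name=etcd-wal-test
```

Look at the fsync/fdatasync percentiles in the output; if p99 is above 10ms, no amount of flag tuning will make etcd happy on that disk.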
| Parameter | Default | Recommended |
|---|---|---|
| --heartbeat-interval | 100ms | 250ms |
| --election-timeout | 1000ms | 1250ms |
| --quota-backend-bytes | 2GB | 8GB |
| --auto-compaction-retention | 0 (off) | 1h |
# Compact old revisions
REV=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
etcdctl compact $REV
# Defragment (during low-traffic windows)
etcdctl defrag --endpoints=https://127.0.0.1:2379 [tls-flags]
Monitoring etcd
etcd serves Prometheus metrics at /metrics: on the client port (2379) by default, or on a dedicated port such as 2381 when --listen-metrics-urls is set (kubeadm configures this). These are the key metrics to alert on:
| Metric | What to Watch For |
|---|---|
| etcd_server_leader_changes_seen_total | Should be near 0 — frequent re-elections = disk/network issue |
| etcd_disk_wal_fsync_duration_seconds | p99 < 10ms — high latency = slow disk |
| etcd_disk_backend_commit_duration_seconds | p99 < 25ms |
| etcd_network_peer_round_trip_time_seconds | < 10ms within same datacenter |
| etcd_mvcc_db_total_size_in_bytes | Alert at 80% of quota |
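The thresholds above translate fairly directly into Prometheus alerting rules. A sketch, with windows and thresholds as starting points rather than gospel (etcd_server_quota_backend_bytes reports the configured quota):

```yaml
groups:
  - name: etcd
    rules:
      - alert: EtcdFrequentLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
        for: 5m
        annotations:
          summary: "etcd is re-electing leaders; check disk and network latency"
      - alert: EtcdSlowWALFsync
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
        for: 10m
        annotations:
          summary: "etcd WAL fsync p99 above 10ms; disk is too slow"
      - alert: EtcdDBNearQuota
        expr: etcd_mvcc_db_total_size_in_bytes > 0.8 * etcd_server_quota_backend_bytes
        for: 10m
        annotations:
          summary: "etcd DB at 80% of quota; compact and defragment"
```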
Common Failure Scenarios
Cluster Loses Quorum
If a majority of etcd members are down (two of three, for example), the cluster loses quorum and stops accepting writes. Kubernetes cannot schedule new Pods or update any objects; existing workloads keep running, but the control plane is effectively frozen. Repair or replace the failed members, or restore from a snapshot.
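To see which members are still reachable and what the cluster believes its membership is, query a surviving member (kubeadm certificate paths assumed):

```shell
# Per-member health; unreachable members report an error instead of "true".
ETCDCTL_API=3 etcdctl endpoint health --cluster --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Current membership list, including any member that must be
# removed and re-added during repair.
ETCDCTL_API=3 etcdctl member list --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```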
NOSPACE Alarm
Symptom: etcdserver: mvcc: database space exceeded. etcd has raised a NOSPACE alarm and rejects all writes because the keyspace exceeded the storage quota. To recover:
# 1. Compact old revisions
etcdctl compact $(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
# 2. Defrag all members
etcdctl defrag --endpoints=...
# 3. Disarm the alarm
etcdctl alarm disarm --endpoints=...
Security Checklist
- TLS certificates for client-to-server and peer-to-peer communication
- Separate PKI for etcd — don't reuse the Kubernetes CA
- Encryption at rest enabled for Secrets (and ConfigMaps)
- etcd port 2379 not reachable outside the control plane network
- Automated daily backups with tested restore runbooks
- Monitoring and alerting on leader elections and disk latency
- Storage quota configured with auto-compaction enabled
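Several of these items can be spot-checked from a control-plane node. For example, you can verify encryption at rest is actually active by writing a throwaway Secret and reading its raw bytes from etcd (kubeadm certificate paths assumed):

```shell
# Write a probe Secret, then dump its raw stored bytes from etcd.
kubectl create secret generic enc-test --from-literal=probe=1

ETCDCTL_API=3 etcdctl get /registry/secrets/default/enc-test \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key | hexdump -C | head

# An encrypted value starts with the provider prefix k8s:enc:aescbc:v1:.
# If "probe" is readable in plaintext, encryption at rest is NOT active.
kubectl delete secret enc-test
```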