System Design Interview: Configuration Management Service (etcd / Consul)

What Is a Configuration Management Service?

A configuration management service stores application configuration (database connection strings, feature toggle values, service endpoints, rate limit thresholds) in a centralized, distributed key-value store. Services read config from this store at startup and receive live updates when config changes, with no restart required. Kubernetes uses etcd as its coordination and configuration backbone; Kafka has historically relied on Zookeeper for the same role; Consul embeds its own Raft-based store. This question tests knowledge of distributed consensus, consistency models, and real-time update propagation.

Requirements

  • Store key-value configuration with namespacing (e.g., /production/database/url)
  • Strong consistency: all clients read the same value after a write is acknowledged
  • Watch notifications: clients subscribe to a key prefix and receive notifications when values change
  • High availability: survive minority node failures without service interruption
  • Version history: store previous values for audit and rollback
  • Access control: different services can only read/write their own config namespace

Raft Consensus

Strong consistency across a distributed system requires consensus: all nodes must agree on the same sequence of writes. etcd uses the Raft consensus algorithm. A cluster of N nodes (typically 3 or 5) elects one leader. All writes go through the leader: the leader appends the write to its log, replicates it to followers, and only acknowledges the write to the client after a majority of the cluster (floor(N/2) + 1 nodes, counting the leader itself) confirms. This ensures: (1) Durability: an acknowledged write survives the failure of any minority of nodes. (2) Ordering: all writes are totally ordered by the leader. (3) Safety: no two leaders can exist simultaneously (split-brain is prevented because a candidate needs votes from a majority to become leader, and there can be only one majority).
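The quorum arithmetic above can be sketched in a few lines. This is an illustrative simulation under simplifying assumptions, not real Raft: the `Leader` and `Follower` names are hypothetical, and terms, elections, and log matching are omitted. It shows only the majority-acknowledgement rule.

```python
# Illustrative sketch of Raft-style majority acknowledgement.
# Real Raft also tracks terms, runs elections, and enforces log matching.

def majority(n: int) -> int:
    """Quorum size for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

class Follower:
    def __init__(self, up: bool = True):
        self.up = up
        self.log = []

    def replicate(self, entry) -> bool:
        """Accept a replicated entry if the node is reachable."""
        if self.up:
            self.log.append(entry)
            return True
        return False

class Leader:
    def __init__(self, followers):
        self.log = []              # totally ordered sequence of writes
        self.followers = followers

    def write(self, entry) -> bool:
        """Append, replicate, and ack only once a majority has the entry."""
        self.log.append(entry)
        # The leader counts itself toward the quorum.
        confirmations = 1 + sum(f.replicate(entry) for f in self.followers)
        cluster_size = 1 + len(self.followers)
        return confirmations >= majority(cluster_size)
```

With a 5-node cluster, two failed followers still leave a quorum of 3, so writes are acknowledged; a third failure drops the cluster below quorum and writes stop being acknowledged, which is exactly the "survive minority failures" property from the requirements.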

Watch Notifications

Watch is the killer feature of etcd and Zookeeper — clients register to receive a notification when a key or key prefix changes. Implementation: each client opens a long-lived gRPC stream to the config service. When a client calls Watch(prefix), the server registers the watch and stores the client stream and prefix in a watch registry. When any key under that prefix is written, the server fans out to all watches matching the prefix, sending the changed key, new value, and version number over the registered streams. This push model eliminates polling — services react to config changes in milliseconds without flooding the server with requests. Watches survive temporary disconnections: the client stores the last received revision, and on reconnect sends this revision so the server replays any missed events.
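The fan-out step can be sketched with a small in-process registry. The names here (`WatchRegistry`, plain callbacks standing in for long-lived gRPC streams) are hypothetical, and real etcd does considerably more (interval trees for prefix matching, event batching, revision replay on reconnect); this shows only prefix matching and revision tagging.

```python
# Sketch of a prefix-based watch registry with push-style fan-out.
# Callbacks stand in for the per-client gRPC streams described above.

class WatchRegistry:
    def __init__(self):
        self.revision = 0
        self.watches = []   # list of (prefix, callback) registrations

    def watch(self, prefix: str, callback) -> None:
        """Register interest in every key under `prefix`."""
        self.watches.append((prefix, callback))

    def put(self, key: str, value: str) -> int:
        """Write a key and fan out an event to every matching watch."""
        self.revision += 1  # global, monotonically increasing
        event = {"key": key, "value": value, "revision": self.revision}
        for prefix, callback in self.watches:
            if key.startswith(prefix):
                callback(event)
        return self.revision
```

A client watching /production/db/ receives events only for writes under that prefix; the revision number carried on each event is what lets a reconnecting client ask for replay of anything it missed.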

Versioning and History

etcd assigns a monotonically increasing global revision number to every write. Each key stores its current value plus the revision at which it was last modified. The full write history is persisted in a bbolt (a BoltDB fork) B+tree, enabling: (1) Consistent reads at a specific revision — a service can read the config as it was at revision 1000, even if later writes have occurred. (2) Transactional reads — a batch of reads at the same revision sees a consistent snapshot of the config store. (3) Rollback — revert to a previous config version by looking up the value at that revision and writing it as a new value. (4) Audit log — query the history of any key to see all changes, who made them (via the write origin stored as metadata), and when.

Hot Config Reloading

Services use the watch API to reload config without restarting. Pattern: at startup, load all config from the config service (using Get with the current revision). Register a watch on the config namespace. In a background goroutine (or thread), process incoming watch events: when a key changes, validate the new value (type-check, range-check) and atomically swap it into the running config. Atomicity matters — a service reading config during a partial update could see inconsistent values. Use a read-write lock: writers hold the write lock while updating the entire config struct; readers hold the read lock and block only when a writer is active. Failed validation (invalid type, out-of-range value) keeps the old config and alerts the operator — never crash on a bad config update.
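The validate-then-swap pattern can be sketched as follows. Python has no read-write lock in its standard library, so this sketch substitutes atomic replacement of an immutable snapshot, which gives readers the same guarantee the text asks for: they never observe a half-updated config. The `LiveConfig` name and validator scheme are illustrative, not a real library API.

```python
# Sketch of hot config reload: validate each watch event, then atomically
# swap a complete snapshot. Readers never see a partial update.
import threading

class LiveConfig:
    def __init__(self, initial: dict, validators: dict):
        self._snapshot = dict(initial)
        self._validators = validators   # key -> predicate on the new value
        self._lock = threading.Lock()   # serializes concurrent writers

    def get(self, key):
        # Readers just dereference the current snapshot; the reference
        # swap below is atomic, so no reader-side locking is needed here.
        return self._snapshot.get(key)

    def apply_event(self, key, value) -> bool:
        """Handle one watch event: type/range-check, then swap atomically."""
        check = self._validators.get(key)
        if check is not None and not check(value):
            # Keep the old config; in production, alert the operator.
            return False
        with self._lock:
            updated = dict(self._snapshot)  # copy, never mutate in place
            updated[key] = value
            self._snapshot = updated        # atomic reference swap
        return True
```

In Go, the same effect is typically achieved with sync.RWMutex around the config struct, as the text describes; the invariant in both cases is that a reader sees either the whole old config or the whole new one.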

Access Control

Different services must only access their own config namespace. etcd RBAC: roles are defined with rules (key prefix pattern + allowed operations: read, write, delete). Role bindings assign roles to users or service accounts. Each service authenticates with a service account certificate (mutual TLS) and is assigned to a role that only allows access to /production/service-name/. The config service validates the identity and permissions on every request. Secrets (passwords, API keys) stored in the config service should be encrypted at rest (envelope encryption: each value encrypted with a data encryption key, which is encrypted with a master key stored in a KMS) and never logged or exposed in watch events to unauthorized clients.
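The permission check reduces to matching the caller's role rules against the requested key. The role names, binding table, and rule shape below are hypothetical (etcd's actual RBAC schema differs); the sketch shows only the prefix-plus-operation check that runs on every request.

```python
# Sketch of prefix-based RBAC: roles hold (key-prefix, allowed-ops) rules,
# bindings map an authenticated identity (e.g. a mTLS cert CN) to a role.

roles = {
    "billing-rw": [("/production/billing/", {"read", "write"})],
    "search-ro":  [("/production/search/",  {"read"})],
}
bindings = {"billing-svc": "billing-rw", "search-svc": "search-ro"}

def allowed(identity: str, op: str, key: str) -> bool:
    """Grant access only if some rule of the identity's role matches."""
    role = bindings.get(identity)
    if role is None:
        return False   # unauthenticated or unbound identities get nothing
    return any(key.startswith(prefix) and op in ops
               for prefix, ops in roles.get(role, []))
```

Deny-by-default falls out naturally: an identity with no binding, or a key outside every rule's prefix, fails the check before any read or write is attempted.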

Interview Tips

  • Raft consensus is the expected answer for strong consistency — know leader election and majority quorum
  • Watch notifications via long-lived gRPC streams is the push model — better than polling
  • Revision-based versioning enables consistent snapshots and rollback
  • Hot config reload requires atomic swap with a RW lock — partial updates corrupt state
  • This question often comes up in context of Kubernetes internals — etcd stores all cluster state

Frequently Asked Questions

How does etcd achieve strong consistency using Raft?

etcd uses the Raft consensus algorithm to ensure all nodes agree on the same sequence of writes. A cluster of N nodes (typically 3 or 5) elects one leader through a vote. All writes are routed to the leader: the leader appends the write to its log, sends it to followers, and waits for a majority (N/2 + 1) to confirm before acknowledging to the client. A write acknowledged by the leader is guaranteed to survive the failure of any minority of nodes — the surviving majority will always have the write in their logs and will elect a new leader with it. No two leaders can coexist simultaneously, because becoming leader requires votes from a majority, and only one group can constitute a majority. This gives linearizability: every read sees all writes that were acknowledged before it.

How do applications receive live configuration updates without restarting?

Applications use the Watch API provided by etcd, Consul, or similar config services. On startup, the app reads its full configuration and registers a watch on its config key prefix (e.g., /production/my-service/). The config service returns a stream (gRPC streaming in etcd) over which it pushes events whenever any watched key changes. The app receives a watch event containing the changed key, the new value, and the global revision number. In a background goroutine, the app validates the new value (type-check, range-check), then atomically swaps it into the live config using a read-write lock (readers take a read lock; the config-updater takes a write lock). If validation fails, the old value is kept and an alert is fired — the app never crashes on a bad config push. This enables operators to change rate limits, feature flags, and connection parameters in production without service restarts.

What is the difference between etcd and Zookeeper?

Both etcd and Zookeeper are distributed key-value stores providing strong consistency via consensus (Raft for etcd, ZAB for Zookeeper), watch notifications, and leader election primitives. Key differences: (1) API simplicity: etcd has a simple key-value API (get, put, delete, watch, lease); Zookeeper has a hierarchical node (znode) API with data, children, and watches — more powerful but more complex. (2) Performance: etcd generally sustains higher write throughput; Zookeeper was designed primarily for low-volume coordination rather than high-volume config serving. (3) Language and ecosystem: etcd is written in Go with a native gRPC API; Zookeeper is Java-based. (4) Kubernetes adoption: Kubernetes uses etcd as its sole data store for all cluster state — etcd has become the de facto standard for cloud-native infrastructure. Zookeeper remains prevalent in JVM-centric stacks (Kafka, HBase, Hadoop).

