Low Level Design: Configuration Management System

Config Data Model

A configuration management system stores each configuration entry as a structured record with the following fields: key (the config identifier), value (the config value as a string), value_type (string, integer, boolean, JSON, YAML), environment (dev, staging, prod), service (which microservice owns this config), version (monotonically incrementing integer), created_by (user or service account), and created_at (timestamp).
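The record described above can be sketched as a plain data class. This is a minimal illustration of the field layout, not any specific product's schema; the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ConfigEntry:
    key: str            # config identifier, e.g. "payment.timeout_ms"
    value: str          # stored as a string; parsed according to value_type
    value_type: str     # one of: string, integer, boolean, json, yaml
    environment: str    # dev, staging, prod
    service: str        # owning microservice
    version: int        # monotonically increasing integer per key
    created_by: str     # user or service account
    created_at: datetime

entry = ConfigEntry(
    key="payment.timeout_ms",
    value="3000",
    value_type="integer",
    environment="prod",
    service="payment",
    version=7,
    created_by="alice@example.com",
    created_at=datetime.now(timezone.utc),
)
```

Keeping `value` as a string and parsing it according to `value_type` keeps the storage layer uniform while still allowing typed validation at write time.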

The system enforces a hierarchical override model. At the lowest precedence is the global default, which applies to all services in all environments. Above that is the service default, which applies to a specific service across all environments. Next is the environment override, which applies to a specific service in a specific environment. At the highest precedence is the instance override, which applies to a single running instance identified by hostname or pod name. When a client reads a config key, the resolver walks this hierarchy from highest to lowest and returns the first match found. This allows a production environment to diverge from development without any code changes, and allows a single misbehaving instance to be reconfigured without affecting others.

Versioning

Every change to a config key creates a new version record rather than overwriting the old value. The version record stores the new value, the author, a commit message describing why the change was made, and the timestamp. The previous version record is retained indefinitely, making the full history queryable.

A current_version pointer per key (or per service namespace) tracks which version is active. When a change is applied, a new version row is inserted and the current_version pointer is updated atomically in a single transaction. This prevents clients from reading a half-applied update. Version history is queryable per key (show all versions of payment.timeout_ms) or per service (show all config changes to the payment service in the last 30 days). Version numbers are monotonically increasing integers, not timestamps, to avoid clock skew issues across distributed writers.
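The atomic commit described above can be sketched with SQLite standing in for the real database; the table and column names are illustrative assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE config_versions (
    key TEXT, version INTEGER, value TEXT, author TEXT, message TEXT,
    PRIMARY KEY (key, version)
);
CREATE TABLE current_version (key TEXT PRIMARY KEY, version INTEGER);
""")

def apply_change(key, value, author, message):
    # One transaction: insert the new version row and move the pointer.
    # A reader never observes the new row without the updated pointer.
    with db:
        row = db.execute(
            "SELECT COALESCE(MAX(version), 0) FROM config_versions WHERE key = ?",
            (key,),
        ).fetchone()
        new_version = row[0] + 1
        db.execute(
            "INSERT INTO config_versions VALUES (?, ?, ?, ?, ?)",
            (key, new_version, value, author, message),
        )
        db.execute(
            "INSERT INTO current_version VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET version = excluded.version",
            (key, new_version),
        )
    return new_version

apply_change("payment.timeout_ms", "3000", "alice", "initial value")
apply_change("payment.timeout_ms", "5000", "bob", "raise timeout")
```

Note that old rows in `config_versions` are never updated or deleted, which is what makes the full history queryable.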

Real-Time Push

Clients subscribe to a config namespace (e.g., all keys for the payment service in prod) over a persistent connection. Two common transports are gRPC server-side streaming, where the client opens a stream and the config server pushes ConfigChangeEvent messages containing the key, the new value, and the new version; and Server-Sent Events (SSE), which is simpler to implement and works through HTTP/2 proxies.

When a config change is committed, the server identifies all subscribed clients for the affected namespace and pushes a delta containing only the changed keys. Clients apply the delta to their in-memory config map without a full reload. On disconnect or reconnect, the client sends its last known version number; the server replays all changes since that version so no update is lost. If the config server is unreachable, clients continue operating on the last known config and log a warning rather than failing — stale config is almost always preferable to a service crash. For Kubernetes workloads, a Kubernetes ConfigMap populated by the config server is a container-native alternative, with pods reloading on ConfigMap update via a sidecar or init container.
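The client-side behavior above can be sketched as follows: an in-memory config map plus a high-water version mark that drives replay on reconnect. The event shape (key, value, version) is an assumption based on the description above.

```python
class ConfigClient:
    def __init__(self):
        self.config = {}       # in-memory key -> value map
        self.last_version = 0  # highest version seen; replay resumes from here

    def apply_delta(self, events):
        # Each pushed event carries only a changed key, never a full snapshot.
        for ev in events:
            if ev["version"] <= self.last_version:
                continue  # duplicate from an at-least-once replay; skip it
            self.config[ev["key"]] = ev["value"]
            self.last_version = ev["version"]

    def reconnect_request(self):
        # Sent on reconnect so the server can replay changes since last_version.
        return {"since_version": self.last_version}

client = ConfigClient()
client.apply_delta([{"key": "payment.timeout_ms", "value": "3000", "version": 7}])
```

Skipping events at or below `last_version` makes delta application idempotent, which is what lets the server replay conservatively after a reconnect without corrupting client state.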

Environment Override Resolution

The resolution algorithm runs at read time, not write time. When a client requests the value of key K for service S running as instance I in environment E, the resolver executes these lookups in order and returns the first result found:

  1. Instance override: key=K, service=S, environment=E, instance=I
  2. Environment override: key=K, service=S, environment=E
  3. Service default: key=K, service=S
  4. Global default: key=K

If none match, the key is undefined and the client falls back to a hardcoded default in the application binary. This four-level hierarchy means production can have a different database pool size than staging without any code branching, and a single canary instance can receive experimental config values. The hierarchy is enforced at read time so retroactively adding an override at a higher level immediately takes effect without republishing lower-level values.
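The four lookups above can be sketched directly. Here the store is modeled as a dict keyed by (key, service, environment, instance), with None standing for "any"; that representation is an illustrative assumption.

```python
def resolve(store, key, service, environment, instance, default=None):
    lookups = [
        (key, service, environment, instance),  # 1. instance override
        (key, service, environment, None),      # 2. environment override
        (key, service, None, None),             # 3. service default
        (key, None, None, None),                # 4. global default
    ]
    for scope in lookups:
        if scope in store:
            return store[scope]
    return default  # undefined: caller falls back to a hardcoded default

store = {
    ("db.pool_size", None, None, None): "10",         # global default
    ("db.pool_size", "payment", "prod", None): "50",  # prod override
}
print(resolve(store, "db.pool_size", "payment", "prod", "pod-7"))  # "50"
print(resolve(store, "db.pool_size", "payment", "dev", "pod-1"))   # "10"
```

The prod lookup stops at the environment override, while the dev lookup falls all the way through to the global default, matching the resolution order in the list above.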

Config Validation

The config service enforces schema validation on every write. Each key is registered with a schema that specifies its value_type (integer, float, boolean, string, JSON object, YAML), optional range constraints (min/max for numerics), and an allowed_values list for enum-style keys. A write that violates the schema is rejected with a descriptive error before any version record is created.
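A write-time validator along these lines might look as follows. The schema registry contents and key names are made up for illustration.

```python
import json

# Hypothetical schema registry: each registered key declares a type plus
# optional range or enum constraints.
SCHEMAS = {
    "payment.timeout_ms": {"type": "integer", "min": 100, "max": 60000},
    "payment.mode":       {"type": "string", "allowed_values": ["live", "test"]},
    "payment.retry":      {"type": "json"},
}

def validate(key, raw_value):
    # Reject before any version record is created.
    schema = SCHEMAS.get(key)
    if schema is None:
        raise ValueError(f"unregistered key: {key}")
    t = schema["type"]
    if t == "integer":
        n = int(raw_value)  # raises ValueError on non-numeric input
        if not schema.get("min", n) <= n <= schema.get("max", n):
            raise ValueError(f"{key}={n} outside [{schema['min']}, {schema['max']}]")
    elif t == "json":
        json.loads(raw_value)  # malformed JSON is caught here, not in clients
    elif "allowed_values" in schema and raw_value not in schema["allowed_values"]:
        raise ValueError(f"{key}={raw_value!r} not in {schema['allowed_values']}")

validate("payment.timeout_ms", "3000")  # passes
validate("payment.mode", "live")        # passes
```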

A dry-run mode lets operators simulate a config change: the system validates the new value, identifies all subscribers that would receive the push, and returns a summary of impact without committing. For structured config values (keys whose value_type is JSON or YAML), the service parses and validates the content rather than treating it as an opaque string — a malformed JSON object is caught at write time, not at runtime in every client. Production config changes can be gated behind a reviewer approval flow: the change is submitted as a pending proposal, a designated approver reviews and approves it, and only then does the commit happen. This is enforced by the service, not by convention.
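The approval flow above can be sketched as a small state machine. The states and the rule that an author cannot approve their own change are illustrative assumptions about how such a flow might be enforced.

```python
class Proposal:
    """A pending config change that must be approved before commit."""

    def __init__(self, key, new_value, author):
        self.key, self.new_value, self.author = key, new_value, author
        self.state = "pending"

    def approve(self, reviewer):
        if reviewer == self.author:
            raise PermissionError("author cannot approve their own change")
        if self.state != "pending":
            raise ValueError(f"cannot approve from state {self.state!r}")
        self.state = "approved"

    def commit(self):
        # Enforced by the service: no approval, no commit.
        if self.state != "approved":
            raise ValueError("only approved proposals can be committed")
        self.state = "committed"  # a real system writes the version row here

p = Proposal("payment.timeout_ms", "5000", "alice")
p.approve("bob")
p.commit()
```

Because the transitions are checked in code rather than by convention, an automated pipeline cannot skip the review step.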

Rollback

Rollback is a first-class operation exposed via API: POST /configs/{key}/rollback with a target version number. The service atomically updates the current_version pointer to point at the target version. All subscribed clients receive a push event containing the reverted value, identical to a forward change event. The rollback is itself recorded as a new version entry in the history (so the history is append-only and the rollback is auditable), with the rollback reason stored in the commit message field.
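The append-only rollback semantics can be sketched as follows, with per-key history modeled as a list of (version, value, message) tuples; that shape is an illustrative assumption.

```python
def rollback(history, target_version, reason):
    # Find the value that was active at the target version...
    target_value = next(value for version, value, _ in history
                        if version == target_version)
    # ...and re-record it as a NEW version. History is never rewritten.
    new_version = history[-1][0] + 1
    history.append((new_version, target_value,
                    f"rollback to v{target_version}: {reason}"))
    return new_version

history = [(1, "3000", "initial"), (2, "5000", "raise timeout")]
rollback(history, 1, "timeout regression")
# history now ends with version 3 carrying the version-1 value "3000"
```

Because the rollback is just another version entry, subscribers handle it with the same push path as a forward change, and a later audit can see exactly when and why the revert happened.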

Time-based rollback is also supported: POST /configs/{service}/rollback?timestamp=2024-01-15T14:30:00Z reverts the entire service namespace to the state it was in at that timestamp. The system determines which version of each key was active at that time and issues individual rollback operations for each key that has changed since then. This is useful for recovering from a bad deployment where config and code were changed simultaneously.
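Determining "which version of each key was active at that time" can be sketched like this; the record shape (key, version, value, created_at) is an assumption based on the data model above.

```python
from datetime import datetime

def state_at(records, ts):
    """Return {key: (version, value)} as of timestamp ts."""
    state = {}
    # Walk versions in ascending order; later qualifying versions overwrite
    # earlier ones, so each key ends at its last version created before ts.
    for key, version, value, created_at in sorted(records, key=lambda r: r[1]):
        if created_at <= ts:
            state[key] = (version, value)
    return state

records = [
    ("payment.timeout_ms", 1, "3000", datetime(2024, 1, 10)),
    ("payment.timeout_ms", 2, "5000", datetime(2024, 1, 16)),
    ("payment.retries",    1, "3",    datetime(2024, 1, 12)),
]
snapshot = state_at(records, datetime(2024, 1, 15, 14, 30))
# version 2 of payment.timeout_ms postdates the target, so it is excluded
```

The service would then issue a per-key rollback (as above) for every key whose current version differs from the snapshot.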

Audit Logging

Every write to the config system generates an audit log entry recording: who made the change (user identity or service account), which key was changed, the old value, the new value, when the change occurred, from which IP address, and via which interface (API, UI, automated pipeline). Read access to sensitive keys is also logged. Audit logs are append-only and stored in a separate system from the config data itself, so that someone with only config write access cannot tamper with the audit trail.

The audit system triggers change notifications: service owners receive an email or Slack message when any key in their service namespace is modified, so they are not surprised by a config change they did not make. The system also monitors the rate of change: if more than N config keys change in a short window (a "config storm"), it fires an alert. This catches runaway automation that is bulk-updating configs incorrectly. Compliance use cases (SOC 2, PCI) require audit logs to be retained for at least 12 months.
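The "config storm" detector above can be sketched as a sliding window over change timestamps; the thresholds here are illustrative, not recommendations.

```python
from collections import deque

class StormDetector:
    def __init__(self, max_changes=20, window_seconds=60):
        self.max_changes = max_changes
        self.window = window_seconds
        self.events = deque()  # timestamps of recent config changes

    def record_change(self, now):
        """Record one change at time `now`; return True if an alert fires."""
        self.events.append(now)
        # Drop changes that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.max_changes

detector = StormDetector(max_changes=3, window_seconds=10)
alerts = [detector.record_change(t) for t in [0, 1, 2, 3, 30]]
print(alerts)  # [False, False, False, True, False]
```

Four changes inside ten seconds exceed the threshold of three and fire the alert; the change at t=30 lands in an empty window, so the alarm clears on its own.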

Secret Integration

Sensitive values — database passwords, API keys, TLS certificates — must not be stored in config history as plaintext. Instead, the config service stores a secret reference: a pointer to the secret’s location in a secret management system such as HashiCorp Vault or AWS Secrets Manager. The value stored in the config DB looks like vault:secret/payment/db_password rather than the actual credential.

When a client requests a key whose value is a secret reference, the config service resolves the reference at read time by calling the secret management system and returning the actual value to the client over an encrypted channel. The secret value itself never appears in config history, audit logs show only the reference string, and secret rotation in Vault automatically propagates to all config consumers without any config change: the reference stays the same, the resolved value changes. This cleanly separates the concerns of config management (what key maps to what secret location) from secret management (what the actual credential value is).
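The read-time resolution step can be sketched as follows. The secret fetcher is injected so the example stays self-contained; a real deployment would call Vault or AWS Secrets Manager at that point.

```python
SECRET_PREFIX = "vault:"

def resolve_value(stored_value, fetch_secret):
    """Return the real value for a config entry, resolving secret references."""
    if stored_value.startswith(SECRET_PREFIX):
        path = stored_value[len(SECRET_PREFIX):]
        # The actual credential is fetched here and never persisted by the
        # config service; history and audit logs only ever see the reference.
        return fetch_secret(path)
    return stored_value

# Stand-in for a secret backend, keyed by path.
fake_vault = {"secret/payment/db_password": "s3cr3t"}

resolve_value("vault:secret/payment/db_password", fake_vault.__getitem__)
resolve_value("plain-value", fake_vault.__getitem__)  # non-secrets pass through
```

Because rotation changes only what `fetch_secret` returns for a given path, no config write is needed when a credential rotates, exactly as described above.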
