What is a Configuration Management System?
A configuration management system externalizes application settings (database URLs, feature flags, API keys, timeouts, thresholds) so they can be changed without redeploying code. Such systems are used at Netflix (Archaius), Atlassian (LaunchDarkly), and virtually every large-scale system. Key properties: atomic updates (a change is applied as a unit, and all services converge on the new config within a short, bounded window), versioning (roll back a bad config), audit trail (who changed what), and fast reads (every service reads config on startup and then serves it from a local copy that is refreshed periodically).
Requirements
- Store key-value configs with namespacing (service/env/key)
- Read config: <5ms latency (services poll or subscribe)
- Push updates to all subscribers within 5 seconds of a change
- Version history and rollback to any previous version
- Encrypt sensitive values (API keys, DB passwords)
- Access control: developers can update dev configs; only ops can update prod configs
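The access-control requirement can be sketched as a pattern match on the namespace. This is a minimal illustration, not a full authorization model: the role names, the `WRITE_RULES` table, and both helper functions are hypothetical, and the namespace is assumed to have the form service/env.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping of role -> namespace patterns that role may write to.
WRITE_RULES = {
    "developer": ["*/dev"],
    "ops": ["*/dev", "*/prod"],
}

def can_write(role: str, namespace: str) -> bool:
    """Return True if the role may update configs in the given namespace."""
    return any(fnmatchcase(namespace, pattern)
               for pattern in WRITE_RULES.get(role, []))

def split_path(path: str) -> tuple[str, str, str]:
    """Split a full 'service/env/key' path into its three parts."""
    service, env, key = path.split("/", 2)
    return service, env, key
```

Unknown roles get no write access, so a missing rule fails closed rather than open.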
Data Model
CREATE TABLE ConfigItem (
    config_id    UUID PRIMARY KEY,
    namespace    VARCHAR NOT NULL,      -- 'payment-service/prod'
    key          VARCHAR NOT NULL,
    value        TEXT,                  -- plaintext or encrypted blob
    is_encrypted BOOL DEFAULT false,
    version      INT NOT NULL,
    created_by   UUID,
    created_at   TIMESTAMPTZ,
    UNIQUE (namespace, key)             -- one active value per key
);
CREATE TABLE ConfigVersion (
    version_id   UUID PRIMARY KEY,
    namespace    VARCHAR,
    key          VARCHAR,
    value        TEXT,
    version      INT,
    changed_by   UUID,
    changed_at   TIMESTAMPTZ,
    change_note  VARCHAR
);
-- append-only history; ConfigItem holds the current value
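To show how the two tables work together, here is a rough sketch of a write path and a rollback. It uses an in-memory SQLite database as a stand-in for the real store, keeps only the columns needed for the illustration, and assumes `version` is a namespace-wide revision counter; `open_store`, `set_config`, and `rollback` are hypothetical names.

```python
import sqlite3

def open_store() -> sqlite3.Connection:
    """Create an in-memory store with trimmed-down versions of both tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE ConfigItem (
            namespace TEXT NOT NULL, key TEXT NOT NULL, value TEXT,
            version INTEGER NOT NULL, UNIQUE (namespace, key));
        CREATE TABLE ConfigVersion (
            namespace TEXT, key TEXT, value TEXT, version INTEGER,
            changed_by TEXT, change_note TEXT);
    """)
    return db

def set_config(db, namespace, key, value, user, note=""):
    """Append a history row and replace the current value in one transaction."""
    with db:  # commits on success, rolls back if an exception is raised
        row = db.execute(
            "SELECT COALESCE(MAX(version), 0) FROM ConfigVersion WHERE namespace = ?",
            (namespace,)).fetchone()
        version = row[0] + 1  # next namespace-wide revision
        db.execute("INSERT INTO ConfigVersion VALUES (?, ?, ?, ?, ?, ?)",
                   (namespace, key, value, version, user, note))
        db.execute(
            "INSERT OR REPLACE INTO ConfigItem (namespace, key, value, version) "
            "VALUES (?, ?, ?, ?)",
            (namespace, key, value, version))
    return version

def rollback(db, namespace, key, to_version, user):
    """Restore an old value by writing it as a NEW version; history is never edited."""
    row = db.execute(
        "SELECT value FROM ConfigVersion "
        "WHERE namespace = ? AND key = ? AND version = ?",
        (namespace, key, to_version)).fetchone()
    if row is None:
        raise KeyError(f"no version {to_version} for {namespace}/{key}")
    return set_config(db, namespace, key, row[0], user, f"rollback to v{to_version}")
```

With several concurrent writers, the revision counter would need a lock or a database sequence; the single-connection sketch sidesteps that.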
Read Path: Local Cache + Long Polling
import threading
import time

# config_service is the client for the config server (not shown)
local_cache = {}  # module-level so the watcher thread updates the same dict

# Service startup: load all configs for namespace
def init_config(namespace):
    configs = config_service.get_all(namespace)
    local_cache.update({c.key: c.value for c in configs})
    # version is assumed to be a namespace-wide, monotonically increasing revision
    last_version = max((c.version for c in configs), default=0)
    # Start background thread for updates
    threading.Thread(target=watch_configs,
                     args=(namespace, last_version), daemon=True).start()
    return local_cache

# Long-polling: server holds the request open until a change occurs
def watch_configs(namespace, since_version):
    while True:
        try:
            # Server blocks until version > since_version or timeout (30s)
            response = config_service.watch(namespace, since_version, timeout=30)
            if response.has_changes:
                for change in response.changes:
                    local_cache[change.key] = change.value
                since_version = response.latest_version
        except Exception:
            time.sleep(5)  # retry on failure
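The server side of the long poll is not shown above. One way it might work, sketched here for a single process and a single namespace, is a condition variable that blocked watchers wait on and that every write notifies. `NamespaceWatcher` and its methods are hypothetical, and a real server would also bound the size of the change log.

```python
import threading

class NamespaceWatcher:
    """In-memory change log for one namespace with a blocking watch() call."""

    def __init__(self):
        self._cond = threading.Condition()
        self._changes = []  # (version, key, value) tuples, ascending by version
        self._latest = 0

    def publish(self, key, value):
        """Record a change and wake every blocked watcher."""
        with self._cond:
            self._latest += 1
            self._changes.append((self._latest, key, value))
            self._cond.notify_all()
            return self._latest

    def watch(self, since_version, timeout=30.0):
        """Block until a version newer than since_version exists or the timeout expires."""
        with self._cond:
            self._cond.wait_for(lambda: self._latest > since_version, timeout=timeout)
            changes = [c for c in self._changes if c[0] > since_version]
            return self._latest, changes
```

A watcher that is already behind returns immediately, because the predicate is checked before waiting; that is what lets a client catch up after a reconnect.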
Push Architecture (Alternative)
For faster propagation: when a config changes, publish to a Kafka topic or Redis Pub/Sub channel. Services subscribe to their namespace channel and update local cache immediately:
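# kafka is a stand-in client object, not a specific library API.
# Kafka topic names may only contain letters, digits, '.', '_' and '-',
# so the '/' in the namespace is mapped to '.'.
topic = 'config-updates.' + namespace.replace('/', '.')

# On config change (server side)
kafka.produce(topic, {
    'key': key, 'value': new_value, 'version': new_version
})

# Service side
for message in kafka.consume(topic):
    local_cache[message['key']] = message['value']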
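Kafka or Pub/Sub typically propagates a change in under a second. Long polling is also close to immediate while a request is open, because the server answers as soon as a change occurs; the 30s figure is only the hold timeout. Its real worst case is the gap after a failed request, when the client sleeps (5s in the loop above) before reconnecting.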
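The messages above carry a version, but the consumer loop ignores it. With push delivery, duplicates and out-of-order messages are possible, so a consumer would likely want to guard each update. A minimal sketch, where `applied_version` and `apply_update` are hypothetical names and the message is assumed to have the same three fields as in the producer:

```python
local_cache = {}      # key -> current value
applied_version = {}  # key -> version of the value currently cached

def apply_update(message: dict) -> bool:
    """Apply a change message unless an equal or newer version is already cached."""
    key, version = message["key"], message["version"]
    if version <= applied_version.get(key, 0):
        return False  # duplicate or out-of-order delivery: ignore it
    local_cache[key] = message["value"]
    applied_version[key] = version
    return True
```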
Secret Management
API keys and DB passwords need encryption at rest and rotation support:
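# Pseudocode: kms, AES256, base64 and base64_decode stand in for real client calls.
# Encryption: envelope encryption
data_key = kms.generate_data_key()  # AWS KMS or Vault
encrypted_value = AES256.encrypt(secret, data_key.plaintext)
stored_value = {
    'ciphertext': base64(encrypted_value),
    'encrypted_data_key': base64(data_key.ciphertext_blob)
}
# The plaintext data key is discarded after use; only the encrypted copy is stored.

# Decryption: decode the stored fields, decrypt the data key with KMS, then decrypt the value
data_key_plaintext = kms.decrypt(base64_decode(stored_value['encrypted_data_key']))
secret = AES256.decrypt(base64_decode(stored_value['ciphertext']), data_key_plaintext)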
For production: use AWS Secrets Manager or HashiCorp Vault — they handle encryption, rotation, and audit logs out of the box.
Key Design Decisions
- Local cache in every service: config reads are O(1) from memory, with no network call per request
- Versioned config history: essential for rollback when a bad config causes an incident
- Long-polling or Pub/Sub for updates: changes arrive sooner than with fixed-interval polling, and comparing versions on reconnect lets a service catch up on anything it missed
- Namespace hierarchy (service/env/key): prevents key collisions across services, enables per-environment configs
- Envelope encryption for secrets: rotating the master key only requires re-encrypting the small data keys, not every stored value
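The local-cache decision interacts with the atomic-update property: within one process, a reader should never see half of a multi-key change. One possible approach, sketched below, is to treat the cache as an immutable snapshot and swap the whole snapshot on each update. `ConfigCache` is a hypothetical class, not part of any library named in this article.

```python
import threading
from types import MappingProxyType

class ConfigCache:
    """Local cache whose readers always see one complete, read-only snapshot."""

    def __init__(self, initial=None):
        self._snapshot = MappingProxyType(dict(initial or {}))
        self._write_lock = threading.Lock()  # serializes writers only

    def get(self, key, default=None):
        # Readers take no lock; they just read whichever snapshot is current.
        return self._snapshot.get(key, default)

    def snapshot(self):
        """Return the current read-only view, for consistent multi-key reads."""
        return self._snapshot

    def apply(self, changes: dict):
        """Copy the current values, apply the changes, then swap the reference."""
        with self._write_lock:
            updated = dict(self._snapshot)
            updated.update(changes)
            self._snapshot = MappingProxyType(updated)
```

Code that needs several related keys (for example a host and its port) can call `snapshot()` once and read both from the same view. The copy on every write is cheap as long as configs are small and changes are infrequent.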
Frequently Asked Questions
How do services get notified of config changes in real time?
Two approaches: (1) Long polling: the service sends a GET request that blocks until a config change occurs or a timeout (30s) is reached. On change, the server responds immediately; the client processes the update and immediately sends another long poll. (2) Pub/Sub: the config service publishes changes to a Kafka topic or Redis channel; services subscribe and receive updates in under 1 second. Long polling is simpler to implement; Pub/Sub is faster and more scalable for many subscribers.
How do you store secrets (API keys, DB passwords) securely in a config system?
Use envelope encryption: generate a data encryption key (DEK), encrypt the secret with AES-256, then encrypt the DEK with a master key from KMS (AWS KMS, Google Cloud KMS, or HashiCorp Vault). Store the encrypted secret and the encrypted DEK. To decrypt: call KMS to decrypt the DEK, then decrypt the secret locally. The master key never leaves KMS. In production, use AWS Secrets Manager or HashiCorp Vault, which handle encryption, automatic rotation, and access audit logs out of the box.
How do you roll back a bad configuration change?
Store every config change in an immutable ConfigVersion table with the changed_by, changed_at, and change_note fields. The rollback operation: find the ConfigVersion entry for the desired previous state, write that value back as a new ConfigVersion entry (don't overwrite; maintain history), and push the update to all subscribers. This approach maintains a complete audit trail and allows rolling back to any point in history.
How do you handle configuration for multiple environments (dev/staging/prod)?
Use namespace hierarchy: {service}/{environment}/{key}. Example: payment-service/prod/db_url. Services read their namespace at startup based on an environment variable (APP_ENV=prod). Access control is applied at the namespace level: developers can write to */dev and */staging; only ops or CI/CD can write to */prod. Default values: services can define defaults in code; config service overrides take precedence. This prevents missing-config crashes in new environments.
How does Netflix Archaius or LaunchDarkly handle config at scale?
Netflix Archaius uses a polling model: services reload config from a property source (Zookeeper, DynamoDB, or a custom URL) every 60 seconds. Changes propagate in up to 60 seconds. Services cache the config locally and use atomic CAS updates to avoid thread safety issues. LaunchDarkly uses a streaming architecture (SSE/WebSocket) for real-time flag updates with sub-second propagation, plus a local SDK cache that falls back to cached values if the connection is lost, so applications keep running even if LaunchDarkly's service is unreachable.
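The multi-environment answer can be illustrated with a small sketch: the namespace is derived from APP_ENV, and values from the config service override in-code defaults. The `DEFAULTS` table and both function names are hypothetical, and falling back to dev when APP_ENV is unset is an assumption made for this example.

```python
import os

# Hypothetical in-code defaults for a service.
DEFAULTS = {"timeout_ms": "500", "max_retries": "3"}

def namespace_for(service: str, environ=os.environ) -> str:
    """Build '{service}/{environment}' from APP_ENV (assumed to default to dev)."""
    return f"{service}/{environ.get('APP_ENV', 'dev')}"

def effective_config(defaults: dict, overrides: dict) -> dict:
    """Merge so that values from the config service win over in-code defaults."""
    return {**defaults, **overrides}
```

A key that exists only in the defaults survives the merge, which is what avoids a missing-config crash in a freshly created environment.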