Configuration Management System Low-Level Design

What is a Configuration Management System?

A configuration management system externalizes application settings (database URLs, feature flags, API keys, timeouts, thresholds) so they can be changed without redeploying code. Such systems are used at Netflix (Archaius), Atlassian (LaunchDarkly), and most large-scale services. Key properties: atomic updates (a change is applied as a whole and reaches all services within a short, bounded window), versioning (roll back a bad config), audit trail (who changed what and when), and fast reads (every service reads config on startup and then keeps it refreshed).

Requirements

  • Store key-value configs with namespacing (service/env/key)
  • Read config: <5ms latency (services poll or subscribe)
  • Push updates to all subscribers within 5 seconds of a change
  • Version history and rollback to any previous version
  • Encrypt sensitive values (API keys, DB passwords)
  • Access control: developers can update dev configs; only ops can update prod configs

Data Model

ConfigItem(
    config_id   UUID PRIMARY KEY,
    namespace   VARCHAR NOT NULL,  -- 'payment-service/prod'
    key         VARCHAR NOT NULL,
    value       TEXT,              -- plaintext or encrypted blob
    is_encrypted BOOL DEFAULT false,
    version     INT NOT NULL,
    created_by  UUID,
    created_at  TIMESTAMPTZ,
    UNIQUE (namespace, key)       -- one active value per key
)

ConfigVersion(
    version_id  UUID PRIMARY KEY,
    namespace   VARCHAR,
    key         VARCHAR,
    value       TEXT,
    version     INT,
    changed_by  UUID,
    changed_at  TIMESTAMPTZ,
    change_note VARCHAR
)
-- append-only history; ConfigItem holds the current value
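
The split between a current-value table and an append-only history can be sketched as follows. This is a minimal illustration using SQLite with a trimmed-down version of the schema above; the table names `config_item` and `config_version`, the helper names `open_store`, `set_config`, and `rollback`, and the per-key version counter are all assumptions made for the example, not part of the design as stated.

```python
import sqlite3

def open_store():
    """Create an in-memory store with simplified versions of the two tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE config_item (
            namespace TEXT NOT NULL, key TEXT NOT NULL,
            value TEXT, version INTEGER NOT NULL,
            UNIQUE (namespace, key));
        CREATE TABLE config_version (
            namespace TEXT, key TEXT, value TEXT,
            version INTEGER, changed_by TEXT, change_note TEXT);
    """)
    return db

def set_config(db, namespace, key, value, changed_by, note=""):
    """Update the current value and append a history row in one transaction."""
    with db:  # commits on success, rolls back if an exception is raised
        row = db.execute(
            "SELECT version FROM config_item WHERE namespace = ? AND key = ?",
            (namespace, key)).fetchone()
        version = (row[0] if row else 0) + 1  # assumed: per-key counter
        # The UNIQUE constraint makes this replace the current row, if any
        db.execute(
            "INSERT OR REPLACE INTO config_item (namespace, key, value, version) "
            "VALUES (?, ?, ?, ?)",
            (namespace, key, value, version))
        db.execute(
            "INSERT INTO config_version VALUES (?, ?, ?, ?, ?, ?)",
            (namespace, key, value, version, changed_by, note))
    return version

def rollback(db, namespace, key, to_version, changed_by):
    """Re-apply an old value as a NEW version so history stays append-only."""
    row = db.execute(
        "SELECT value FROM config_version "
        "WHERE namespace = ? AND key = ? AND version = ?",
        (namespace, key, to_version)).fetchone()
    if row is None:
        raise KeyError(f"no version {to_version} for {namespace}/{key}")
    return set_config(db, namespace, key, row[0], changed_by,
                      note=f"rollback to v{to_version}")
```

Rolling back by writing a new version, rather than deleting newer rows, keeps the audit trail intact: the history shows both the bad change and the rollback that reverted it.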

Read Path: Local Cache + Long Polling

import threading
import time

# Shared in-process cache; config_service is the client for the config server
local_cache = {}

# Service startup: load all configs for namespace
def init_config(namespace):
    configs = config_service.get_all(namespace)
    local_cache.update({c.key: c.value for c in configs})
    # Assumes versions increase monotonically across the whole namespace;
    # default=0 covers a namespace that has no configs yet
    last_version = max((c.version for c in configs), default=0)

    # Start background thread for updates
    threading.Thread(target=watch_configs,
                     args=(namespace, last_version), daemon=True).start()
    return local_cache

# Long-polling: server holds the request open until a change occurs
def watch_configs(namespace, since_version):
    while True:
        try:
            # Server blocks until version > since_version or timeout (30s)
            response = config_service.watch(namespace, since_version, timeout=30)
            if response.has_changes:
                for change in response.changes:
                    local_cache[change.key] = change.value
                since_version = response.latest_version
        except Exception:
            time.sleep(5)  # back off, then retry on failure
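
The client loop above relies on the server holding each request open until something changes. One way the server side of that could look is sketched below with a condition variable. The class `NamespaceWatcher` and its `publish` method are hypothetical names for this example, the version counter is assumed to be namespace-wide, and the in-memory change log is unbounded, which a real service would need to trim.

```python
import threading

class NamespaceWatcher:
    """Tracks changes for one namespace and lets callers block until news arrives."""

    def __init__(self):
        self._cond = threading.Condition()
        self._version = 0
        self._log = []  # list of (version, key, value), oldest first

    def publish(self, key, value):
        """Record a change and wake every waiting long-poll request."""
        with self._cond:
            self._version += 1
            self._log.append((self._version, key, value))
            self._cond.notify_all()
            return self._version

    def watch(self, since_version, timeout=30.0):
        """Block until a version newer than since_version exists, or time out.

        Returns (latest_version, changes); changes is empty on timeout.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._version > since_version,
                                timeout=timeout)
            changes = [(k, v) for ver, k, v in self._log if ver > since_version]
            return self._version, changes
```

Because the caller passes the last version it has seen, a client that reconnects after a failure gets every change it missed in a single response instead of silently skipping them.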

Push Architecture (Alternative)

For faster propagation: when a config changes, publish to a Kafka topic or Redis Pub/Sub channel. Services subscribe to their namespace channel and update local cache immediately:

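# Kafka topic names allow only letters, digits, '.', '_' and '-',
# so map 'payment-service/prod' to 'config-updates.payment-service.prod'
topic = 'config-updates.' + namespace.replace('/', '.')

# On config change (server side)
kafka.produce(topic, {
    'key': key, 'value': new_value, 'version': new_version
})

# Service side
for message in kafka.consume(topic):
    local_cache[message['key']] = message['value']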

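Kafka/Pub-Sub typically propagates a change in under a second (<1s). Long-polling is close behind while a request is open, because the server answers as soon as a change occurs; the 30s figure is only the poll timeout. The slow case for long-polling is a client that is reconnecting after a failure, which adds the retry delay (5s in the loop above).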
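
Since each update message carries a version, a subscriber can use it to discard messages that arrive twice or out of order, which can happen when a broker redelivers after a consumer restart. The sketch below shows that guard in isolation; `VersionedCache` is a hypothetical name, and it assumes versions increase per key and start above zero.

```python
import threading

class VersionedCache:
    """Local config cache that ignores stale or duplicate update messages."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}    # key -> value
        self._versions = {}  # key -> version of the value currently held

    def apply(self, message):
        """Apply an update if it is newer than what is cached; return True if applied."""
        key, version = message["key"], message["version"]
        with self._lock:
            if version <= self._versions.get(key, 0):
                return False  # redelivered or out-of-order message
            self._values[key] = message["value"]
            self._versions[key] = version
            return True

    def get(self, key, default=None):
        """Read the cached value for key, or default if it is not cached."""
        with self._lock:
            return self._values.get(key, default)
```

In the consumer loop, calling `cache.apply(message)` in place of the direct dictionary write would keep an older message from overwriting a newer value.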

Secret Management

API keys and DB passwords need encryption at rest and rotation support:

# Encryption: envelope encryption
data_key = kms.generate_data_key()           # AWS KMS or Vault
encrypted_value = AES256.encrypt(secret, data_key.plaintext)
stored_value = {
    'ciphertext': base64(encrypted_value),
    'encrypted_data_key': base64(data_key.ciphertext_blob)
}

# Decryption: decrypt data key with KMS, then decrypt value
# (undo the base64 encoding applied at storage time first)
data_key_plaintext = kms.decrypt(base64_decode(stored_value['encrypted_data_key']))
secret = AES256.decrypt(base64_decode(stored_value['ciphertext']), data_key_plaintext)

For production: use AWS Secrets Manager or HashiCorp Vault — they handle encryption, rotation, and audit logs out of the box.

Key Design Decisions

  • Local cache in every service: config reads are O(1) from memory; no network call per request
  • Versioned config history: essential for rollback when a bad config causes an incident
  • Long-polling or Pub/Sub for updates: both deliver changes sooner than fixed-interval polling, and tracking the last seen version lets a reconnecting service catch up on changes it missed
  • Namespace hierarchy (service/env/key): prevents key collisions across services, enables per-environment configs
  • Envelope encryption for secrets: the master key stays in KMS and can be rotated without re-encrypting every stored value
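
The namespace hierarchy also gives access control a natural place to attach: the requirement that developers can update dev configs while only ops can update prod configs can be expressed as patterns over the namespace. The sketch below is one possible shape for that check; the role names, the `WRITE_RULES` table, and the `can_write` helper are assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical role-to-pattern table; patterns match 'service/env' namespaces.
WRITE_RULES = {
    "developer": ["*/dev"],
    "ops": ["*/dev", "*/prod"],
}

def can_write(role, namespace):
    """Return True if the role may update configs in the given namespace."""
    return any(fnmatchcase(namespace, pattern)
               for pattern in WRITE_RULES.get(role, []))
```

With this table, `can_write("developer", "payment-service/prod")` is false, and an unknown role is denied by default because it has no patterns.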

Frequently Asked Questions

How do services get notified of config changes in real time?

Two approaches: (1) Long polling: the service sends a GET request that blocks until a config change occurs or a timeout (30s) is reached. On change, the server responds immediately; the client processes the update and immediately sends another long poll. (2) Pub/Sub: the config service publishes changes to a Kafka topic or Redis channel; services subscribe and receive updates in under 1 second. Long polling is simpler to implement; Pub/Sub is faster and more scalable for many subscribers.

How do you store secrets (API keys, DB passwords) securely in a config system?

Use envelope encryption: generate a data encryption key (DEK), encrypt the secret with AES-256, then encrypt the DEK with a master key from KMS (AWS KMS, Google Cloud KMS, or HashiCorp Vault). Store the encrypted secret and the encrypted DEK. To decrypt: call KMS to decrypt the DEK, then decrypt the secret locally. The master key never leaves KMS. In production, use AWS Secrets Manager or HashiCorp Vault; they handle encryption, automatic rotation, and access audit logs out of the box.

How do you roll back a bad configuration change?

Store every config change in an immutable ConfigVersion table with the changed_by, changed_at, and change_note fields. The rollback operation: find the ConfigVersion entry for the desired previous state, write that value back as a new ConfigVersion entry (don’t overwrite; maintain history), and push the update to all subscribers. This approach maintains a complete audit trail and allows rolling back to any point in history.

How do you handle configuration for multiple environments (dev/staging/prod)?

Use namespace hierarchy: {service}/{environment}/{key}. Example: payment-service/prod/db_url. Services read their namespace at startup based on an environment variable (APP_ENV=prod). Access control is applied at the namespace level: developers can write to */dev and */staging; only ops or CI/CD can write to */prod. Default values: services can define defaults in code; config service overrides take precedence. This prevents missing-config crashes in new environments.

How does Netflix Archaius or LaunchDarkly handle config at scale?

Netflix Archaius uses a polling model: services reload config from a property source (Zookeeper, DynamoDB, or a custom URL) every 60 seconds. Changes propagate in up to 60 seconds. Services cache the config locally and use atomic CAS updates to avoid thread safety issues. LaunchDarkly uses a streaming architecture (SSE/WebSocket) for real-time flag updates with sub-second propagation, plus a local SDK cache that falls back to cached values if the connection is lost, so services keep running even if LaunchDarkly’s service is unreachable.
