What is a Distributed Configuration Service?
A configuration service manages application settings, feature flags, and operational parameters across a distributed system. Without it, configuration is hard-coded, buried in environment variables, or requires service restarts to update. With a config service: configs are stored centrally, updated in real time, and pushed to all service instances without deployment.
Requirements
- Store key-value configuration with versioning and audit trail
- Push config updates to all service instances within 5 seconds
- Namespace configs by environment (dev/staging/prod) and service
- Support feature flags: enable/disable features for % of users or specific user IDs
- High availability: configs must remain readable even when the config service itself is down
Data Model
ConfigEntry(config_id, namespace, key, value TEXT, value_type ENUM(STRING,JSON,BOOL,INT),
version INT, created_by, updated_at, description)
ConfigAudit(audit_id, config_id, old_value, new_value, changed_by, changed_at, reason)
FeatureFlag(flag_id, namespace, key, enabled BOOL, rollout_percent INT,
allowlist_users[], denylist_users[], conditions JSON)
Config Distribution
Two approaches:
- Pull model: clients poll the config service every 30s. Simple but 30s lag on updates. Good for config that rarely changes.
- Push model: clients maintain a long-polling connection or WebSocket to the config service. On config change, the service pushes to all connected clients immediately. Good for feature flags and operational toggles.
Hybrid (used by etcd/Consul): clients fetch the full config on startup, then watch for changes using a change_index or revision. Long-poll: GET /config?wait_index=N. The server blocks until config_version > N (or a timeout elapses), then returns the new config. The client updates its local cache and re-polls with the new index. This provides near-real-time updates without a persistent WebSocket.
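The watch loop above can be sketched as follows. The transport is injected as a `fetch(wait_index)` callable — an assumption standing in for `GET /config?wait_index=N` — which is presumed to block server-side until `config_version > wait_index` and then return a payload like `{"revision": R, "config": {...}}`, or `None` on a long-poll timeout:

```python
class ConfigWatcher:
    """Long-poll watch loop (sketch). `fetch` abstracts the HTTP call so the
    loop logic is independent of the transport."""

    def __init__(self):
        self.revision = 0   # highest config revision seen so far
        self.cache = {}     # local view of the config

    def poll_once(self, fetch):
        payload = fetch(self.revision)
        if payload is None:              # long-poll timed out: nothing changed
            return False                 # caller just re-polls with the same index
        self.revision = payload["revision"]
        self.cache.update(payload["config"])
        return True
```

A real client would call `poll_once` in a loop (with backoff on transport errors); the key point is that each response carries the new revision, so the next poll resumes exactly where the last one left off and no update is missed.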
Local Cache and Fallback
Every service instance maintains a local in-memory cache of all config values. On startup: fetch full config, populate cache. On change notification: update specific keys in cache. If the config service is unreachable: serve stale cached values — never fail. The config service is in the read path of every service; if it’s a hard dependency and goes down, every service goes down. The local cache breaks this dependency. Persist the cache to disk (JSON file) so the service can restart even if the config service is down.
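A minimal sketch of that cache-with-disk-fallback, assuming a JSON file path and an injected `fetch_full_config` callable (both illustrative):

```python
import json
import os

class LocalConfigCache:
    """In-memory config cache with disk persistence (sketch). Never raises
    when the config service is unreachable — it serves stale values instead."""

    def __init__(self, path="config_cache.json"):
        self.path = path
        self.values = {}

    def load_from_disk(self):
        # Startup path when the config service is down: reuse the last snapshot.
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.values = json.load(f)

    def refresh(self, fetch_full_config):
        # fetch_full_config() calls the config service; on any failure we
        # keep the stale values — availability over freshness.
        try:
            self.values = fetch_full_config()
            with open(self.path, "w") as f:
                json.dump(self.values, f)  # persist for restarts while the service is down
        except Exception:
            pass

    def get(self, key, default=None):
        return self.values.get(key, default)
```

In production you would also track the snapshot's age and alert when it grows stale, but the core property is visible here: a failed `refresh` leaves the previous values untouched.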
Feature Flags
Feature flags (feature toggles) enable deploying code without activating features. Implementation:
```python
import hashlib

class FeatureFlagClient:
    def is_enabled(self, flag_key, user_id=None):
        flag = self.cache.get(flag_key)
        if not flag:
            return False  # unknown flag: fail closed
        if not flag.enabled:
            return False  # master kill switch
        if user_id and user_id in flag.allowlist:
            return True
        if user_id and user_id in flag.denylist:
            return False
        if flag.rollout_percent == 100:
            return True
        if flag.rollout_percent == 0:
            return False
        if user_id is None:
            return False  # can't bucket anonymous traffic deterministically
        # Consistent hash: same user always gets same assignment. Use md5,
        # not the builtin hash(), which is salted per process (PYTHONHASHSEED)
        # and would give different buckets on different hosts.
        digest = hashlib.md5(f"{user_id}:{flag_key}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < flag.rollout_percent
```
Gradual rollout: increase rollout_percent from 0 → 1% → 10% → 50% → 100% while monitoring error rates. If something goes wrong, set rollout_percent=0 immediately (kill switch).
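Gradual rollout works only because bucketing is deterministic: a user whose bucket is below 10 is necessarily below 50 and 100, so users already enabled stay enabled as the percentage grows. A standalone sketch of the bucketing (helper name is illustrative):

```python
import hashlib

def bucket(user_id: str, flag_key: str) -> int:
    """Deterministic 0-99 bucket. md5 is stable across processes and hosts,
    unlike Python's salted builtin hash()."""
    digest = hashlib.md5(f"{user_id}:{flag_key}".encode()).hexdigest()
    return int(digest, 16) % 100

# A user is enabled iff bucket(user, flag) < rollout_percent, so raising the
# percentage only adds users — it never flips an enabled user back to off.
```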
Versioning and Rollback
Every config update increments the version. Store full history in ConfigAudit table. Rollback: copy the old ConfigAudit.old_value back to ConfigEntry and increment version. The rollback itself is a new version (with a note in the reason field) — never delete config history. This enables: auditing who changed what and when, debugging config-related incidents, and restoring known-good configs after a bad change.
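The rollback flow can be sketched with dict-shaped rows mirroring the ConfigEntry and ConfigAudit tables above (the helper itself is illustrative, not a prescribed API):

```python
def rollback(entry: dict, audit_record: dict, changed_by: str) -> dict:
    """Restore a prior value from an audit record. The rollback is itself a
    new version with its own audit row — history is never deleted."""
    new_audit = {
        "config_id": entry["config_id"],
        "old_value": entry["value"],
        "new_value": audit_record["old_value"],
        "changed_by": changed_by,
        "reason": f"rollback of change recorded in audit {audit_record['audit_id']}",
    }
    entry["value"] = audit_record["old_value"]
    entry["version"] += 1  # rollback bumps the version like any other write
    return new_audit
```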
Namespacing
Namespace: {env}/{service}/{key}. Examples: prod/order-service/payment_timeout_ms = 5000, prod/global/maintenance_mode = false. Services fetch only their namespace + global namespace. Inheritance: service-specific config overrides global config for the same key. The client SDK handles namespace resolution transparently.
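The override rule the SDK applies can be sketched as a plain merge, service-specific keys winning over global (helper name is hypothetical):

```python
def effective_config(global_ns: dict, service_ns: dict) -> dict:
    """Merge global and service-scoped config; service-specific keys take
    precedence over global for the same key."""
    merged = dict(global_ns)      # start from the global namespace
    merged.update(service_ns)     # service overrides win on collisions
    return merged
```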
Key Design Decisions
- Local in-memory cache + disk fallback: config service availability must not affect service availability
- Long-polling watch: near-real-time updates without persistent WebSocket complexity
- Feature flags with consistent hashing: same user always sees same experience across requests/services
- Full audit trail: every config change is logged with who, what, when, and why