Low Level Design: Configuration Diff and Change Management Service

Configuration drift is one of the most common root causes of production incidents. A config diff and change management service enforces a proposal-review-rollout workflow that prevents unreviewed changes from reaching production and gives operators a one-click rollback when things go wrong.

Change Proposal Schema

CREATE TABLE config_changes (
  id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  namespace           VARCHAR(128) NOT NULL,   -- e.g. "payments-service"
  key                 VARCHAR(256) NOT NULL,
  old_value           TEXT,
  new_value           TEXT NOT NULL,
  proposer_id         UUID NOT NULL REFERENCES users(id),
  status              TEXT NOT NULL DEFAULT 'draft'
                      CHECK (status IN ('draft','pending_approval','approved','rolling_out','applied','rejected','rolled_back')),
  rollout_percentage  SMALLINT NOT NULL DEFAULT 0 CHECK (rollout_percentage BETWEEN 0 AND 100),
  required_approvals  SMALLINT NOT NULL DEFAULT 1,
  created_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
  applied_at          TIMESTAMPTZ
);

CREATE TABLE config_change_approvals (
  change_id   UUID NOT NULL REFERENCES config_changes(id),
  approver_id UUID NOT NULL REFERENCES users(id),
  decision    TEXT NOT NULL CHECK (decision IN ('approved','rejected')),
  comment     TEXT,
  decided_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (change_id, approver_id)
);

Diff Computation

The diff engine selects its algorithm based on the value type detected at proposal creation:

  • JSON values — parse both old and new as JSON objects and produce a structural diff listing added keys, removed keys, and changed leaf values. This makes it obvious whether a database timeout threshold changed from 5000 ms to 10000 ms, rather than showing a raw string diff.
  • Raw text / multi-line config — apply a Myers diff algorithm (same as git diff) and render a unified diff with three lines of context around each change hunk.
  • Scalar values — show old value and new value side-by-side with no diff rendering needed.

The computed diff is stored alongside the change proposal so that approvers see the exact diff at approval time, not a recomputed version that could differ if the base config changed in the meantime.
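
The three diff modes above can be sketched in Python as follows. This is a minimal illustration, not the service's actual engine: the function names (json_diff, text_diff, compute_diff) are hypothetical, and difflib.unified_diff stands in for a Myers implementation. difflib uses a different matching algorithm, so hunks may not match git diff output exactly.

```python
import difflib
import json


def json_diff(old, new, path=""):
    """Structural diff of two parsed JSON values, keyed by dotted path."""
    added, removed, changed = {}, {}, {}
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}" if path else key
            if key not in old:
                added[child] = new[key]
            elif key not in new:
                removed[child] = old[key]
            else:
                sub = json_diff(old[key], new[key], child)
                added.update(sub["added"])
                removed.update(sub["removed"])
                changed.update(sub["changed"])
    elif old != new:
        # Leaf values (or mismatched types) that differ count as a change.
        changed[path] = (old, new)
    return {"added": added, "removed": removed, "changed": changed}


def text_diff(old, new):
    """Unified diff with three lines of context (difflib, not Myers)."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", n=3, lineterm="",
    )
    return "\n".join(lines)


def compute_diff(old_value, new_value):
    """Pick a diff mode from the detected value type."""
    old_text = old_value or ""  # old_value is NULL for a brand-new key
    try:
        old_json, new_json = json.loads(old_text), json.loads(new_value)
        if isinstance(old_json, dict) and isinstance(new_json, dict):
            return {"kind": "json", "diff": json_diff(old_json, new_json)}
    except ValueError:
        pass  # not JSON on at least one side; fall through to text/scalar
    if "\n" in old_text or "\n" in new_value:
        return {"kind": "text", "diff": text_diff(old_text, new_value)}
    return {"kind": "scalar", "diff": {"old": old_value, "new": new_value}}
```

The returned dictionary is what would be serialized and stored with the proposal, so the approver sees the diff exactly as it was computed at creation time.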

Approval Workflow

Each namespace defines a required_approvals count (default: 1, high-risk namespaces: 2+). The proposer cannot approve their own change. When a change proposal enters pending_approval status, the system notifies the designated approver group via webhook or email. Approvers review the stored diff and cast an approved or rejected decision. Once the approval count reaches required_approvals with no rejections, the change transitions to approved and becomes eligible for rollout. A single rejection from any required approver moves the change to rejected and notifies the proposer.
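
The approval rules above reduce to a small decision function. The sketch below is illustrative only: the Approval class and the cast_decision and next_status helpers are hypothetical names, and the one-decision-per-approver rule is inferred from the (change_id, approver_id) primary key in the schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    approver_id: str
    decision: str  # "approved" or "rejected"


def cast_decision(approvals, proposer_id, approval):
    """Return a new approvals list with one more decision recorded."""
    if approval.approver_id == proposer_id:
        raise PermissionError("proposer cannot approve their own change")
    if approval.decision not in ("approved", "rejected"):
        raise ValueError(f"unknown decision: {approval.decision}")
    if any(a.approver_id == approval.approver_id for a in approvals):
        # Mirrors PRIMARY KEY (change_id, approver_id): one row per approver.
        raise ValueError("approver has already decided on this change")
    return approvals + [approval]


def next_status(approvals, required_approvals):
    """Derive the change status from the decisions recorded so far."""
    if any(a.decision == "rejected" for a in approvals):
        return "rejected"  # a single rejection ends the review
    approved = sum(1 for a in approvals if a.decision == "approved")
    if approved >= required_approvals:
        return "approved"
    return "pending_approval"
```

Deriving the status from the recorded decisions, rather than mutating a counter, keeps the approvals table the single source of truth for the audit trail.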

Staged Rollout

Approved changes do not apply to all instances at once. The rollout controller steps through configurable percentage gates:

  1. 10% canary. Apply the new config value to 10% of instances. Monitor error rate and latency for 5 minutes. If metrics breach thresholds, auto-rollback and alert on-call.
  2. 50% half-fleet. Expand if canary gate passes. Hold for 10 minutes.
  3. 100% full fleet. Complete rollout. Set status = 'applied' and record applied_at.
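
The gate sequence above can be sketched as a simple loop. Everything here is an assumption for illustration: the run_rollout function, its callback parameters (apply_percentage, metrics_healthy, rollback), and the choice to run a health check after every hold, including a final check at 100%. On resume, the sketch conservatively re-runs the gate recorded in rollout_percentage rather than skipping it.

```python
import time

# (percentage, hold time in seconds), taken from the gates listed above.
GATES = [(10, 300), (50, 600), (100, 0)]


def run_rollout(apply_percentage, metrics_healthy, rollback,
                start_percentage=0, sleep=time.sleep):
    """Step through the gates and return the final status string.

    apply_percentage(p): push the new value to p% of instances (idempotent).
    metrics_healthy():   True if error rate and latency are within thresholds.
    rollback():          revert all instances and alert on-call.
    """
    for percentage, hold_seconds in GATES:
        if percentage < start_percentage:
            continue  # gate already completed before a controller restart
        apply_percentage(percentage)
        sleep(hold_seconds)
        if not metrics_healthy():
            rollback()
            return "rolled_back"
    return "applied"
```

Passing sleep as a parameter keeps the loop testable without waiting through real hold times.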

Instance targeting uses a stable (deterministic) hash of the instance ID modulo 100, so the same instance always lands in the same percentage bucket. This matters for stateful config (e.g. feature flags that affect in-flight sessions), and it means each gate is a superset of the previous one: an instance in the 10% canary is also in the 50% stage. The rollout_percentage column tracks the current gate so restarts resume from the correct stage.
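
A minimal sketch of the bucketing, assuming SHA-256 as the hash (the source does not name one) and hypothetical helper names bucket and in_rollout. Python's built-in hash() is unsuitable here because string hashes are salted per process by default, so buckets would change across restarts.

```python
import hashlib


def bucket(instance_id: str) -> int:
    """Map an instance ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(instance_id: str, rollout_percentage: int) -> bool:
    """True if this instance should receive the new value at this gate."""
    return bucket(instance_id) < rollout_percentage
```

Because membership is a simple threshold on a fixed bucket, raising rollout_percentage only ever adds instances and never removes ones that already received the change.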

Change History and Rollback

Rows in config_changes are never deleted, and old_value and new_value are never updated after creation; only workflow fields such as status, rollout_percentage, and applied_at change. This means the full history of every config key is queryable by filtering on namespace and key, ordered by applied_at.

One-click rollback creates a new change proposal with old_value and new_value swapped from the change being reversed. It inherits the same required_approvals threshold — rollbacks are not exempt from review, because an incorrect rollback can be just as harmful as the original bad change. Emergency rollback with a single approver override is available and writes a mandatory incident reference to the audit log.
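
The swap described above can be sketched as a pure function over a change record. The ConfigChange class mirrors a subset of the schema columns; the build_rollback helper, the incident_ref field, and the choice to start the rollback in pending_approval are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConfigChange:
    namespace: str
    key: str
    old_value: Optional[str]
    new_value: str
    proposer_id: str
    required_approvals: int = 1
    status: str = "draft"
    incident_ref: Optional[str] = None  # hypothetical audit field


def build_rollback(change, requested_by, emergency=False, incident_ref=None):
    """Create a new proposal that reverses `change`; the original is untouched."""
    if change.old_value is None:
        # new_value is NOT NULL in the schema, so a key with no prior value
        # cannot be reverted by a plain swap.
        raise ValueError("change has no previous value to roll back to")
    if emergency and not incident_ref:
        raise ValueError("emergency rollback requires an incident reference")
    return ConfigChange(
        namespace=change.namespace,
        key=change.key,
        old_value=change.new_value,  # swapped
        new_value=change.old_value,  # swapped
        proposer_id=requested_by,
        required_approvals=1 if emergency else change.required_approvals,
        status="pending_approval",
        incident_ref=incident_ref if emergency else None,
    )
```

Because the rollback is itself a new row, the history for the key shows both the original change and its reversal.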

Compliance and Change Freeze

The compliance report endpoint accepts a time window and returns all applied changes grouped by namespace, including proposer, approvers, diff, and applied timestamp. This satisfies SOC 2 change management controls without manual record-keeping.
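
As a rough sketch, the report is a filter and group-by over applied changes. The compliance_report function and the dictionary shape of each change are hypothetical, and the half-open window [start, end) is an assumption; in practice this would more likely be a SQL query against config_changes joined to config_change_approvals.

```python
from collections import defaultdict
from datetime import datetime, timezone


def compliance_report(changes, start, end):
    """Group applied changes with start <= applied_at < end by namespace."""
    report = defaultdict(list)
    for change in changes:
        applied_at = change.get("applied_at")
        if (change["status"] == "applied"
                and applied_at is not None
                and start <= applied_at < end):
            report[change["namespace"]].append(change)
    for entries in report.values():
        entries.sort(key=lambda c: c["applied_at"])  # chronological order
    return dict(report)


# Example window: January 2025 (UTC).
window_start = datetime(2025, 1, 1, tzinfo=timezone.utc)
window_end = datetime(2025, 2, 1, tzinfo=timezone.utc)
```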

Change freeze periods (e.g. during a major product launch or end-of-quarter financial processing window) are stored in a freeze_windows table. The approval workflow checks for active freeze windows before allowing a change to proceed to rollout. Emergency changes bypass the freeze with mandatory CTO-level approval and an incident ticket reference stored in the proposal record.
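
The freeze check can be sketched as a guard evaluated before rollout begins. The may_start_rollout function and its parameters are hypothetical, and freeze windows are modeled as (start, end) pairs standing in for rows of the freeze_windows table, whose columns the source does not specify.

```python
from datetime import datetime, timezone


def may_start_rollout(now, freeze_windows, emergency=False,
                      cto_approved=False, incident_ticket=None):
    """Return True if a change may proceed to rollout at time `now`.

    freeze_windows: iterable of (start, end) datetimes, end exclusive.
    """
    frozen = any(start <= now < end for start, end in freeze_windows)
    if not frozen:
        return True
    # During a freeze, only emergency changes with CTO-level approval
    # and an incident ticket reference may proceed.
    return emergency and cto_approved and bool(incident_ticket)
```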

Interview Talking Points

  • How do you handle config that differs between environments? Use separate namespaces per environment (payments-service.prod, payments-service.staging) with independent approval chains.
  • What if an instance is down during rollout? The rollout controller retargets only running instances; downed instances pick up the current applied config on restart by querying the service at startup.
  • How do you prevent a bad rollout from taking down the whole fleet? The percentage gate with automatic metric-based rollback limits blast radius to the canary slice.

Frequently Asked Questions

What is a configuration diff and change management service?

A configuration diff and change management service provides versioned storage and controlled delivery of application configuration, with tooling to compare versions (diff), approve changes before rollout, progressively deploy changes to a subset of hosts or traffic, and roll back instantly if a change causes a regression. It replaces ad-hoc config edits (SSH into a box and edit a file) with an auditable, peer-reviewed workflow similar to code review. Core components include a versioned config store (each write creates an immutable version), a diff engine (structured diff of JSON/YAML showing added/changed/removed keys), an approval workflow, a rollout engine for staged delivery, and a rollback mechanism that can reactivate any previous version in seconds.

How does an approval workflow prevent unsafe config changes?

The approval workflow gates promotion of a change from draft through pending_approval to approved; only an approved change is eligible for rollout. When a change is submitted, the service generates a structured diff and routes a review request to designated approvers (configured per config namespace or risk tier). Approvers see exactly what changed (key name, old value, new value) rather than the full config, reducing review cognitive load. Automated pre-checks run in parallel: schema validation (the new config matches the expected JSON schema), linting rules (e.g., no negative TTLs, no missing required fields), and policy checks (certain high-risk keys require two approvers or a change window). Approval is recorded with approver identity and timestamp for the audit log. Once approved, the version can only be promoted by an authorized principal; the service rejects any attempt to push a non-approved version to production. Emergency break-glass overrides relax the normal approval threshold (for example, a single approver override with an incident reference) but are logged, alerted, and require post-incident review.
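
The linting pre-check mentioned above might look like the following sketch. The lint_config function is hypothetical, and treating any key whose name contains "ttl" as a TTL is a naming convention assumed purely for illustration.

```python
def lint_config(config, required_fields):
    """Return a list of human-readable lint errors (empty if clean)."""
    errors = [
        f"missing required field: {name}"
        for name in required_fields
        if name not in config
    ]
    for key, value in config.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if "ttl" in key.lower() and is_number and value < 0:
            errors.append(f"negative TTL: {key}={value}")
    return errors
```

A proposal with a non-empty error list would be blocked before it reaches approvers, so reviewers only spend time on changes that are at least structurally valid.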

How does staged rollout work for configuration changes?

Staged rollout delivers a new config version to an increasing percentage of the fleet over time, with health checks between stages. A typical progression: 1% of hosts (canary) → 10% → 50% → 100%, with a configurable soak time at each stage (e.g., 15 minutes) during which automated metrics (error rate, latency p99, business KPIs) are compared against baseline. The rollout engine tracks which hosts or service instances have received which config version by polling a version endpoint or using a push-based config distribution protocol (etcd watch, Consul watches, or a gRPC streaming API). If any soak-period health check fails, the rollout pauses automatically and alerts on-call. Rollout can be targeted by arbitrary dimensions: datacenter, availability zone, canary host group, or user traffic percentage (for configs consumed client-side). Each stage transition is recorded in the audit log with the metric snapshot that passed or failed.
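
A soak-period check that compares current metrics against baseline could be sketched as below. The soak_check function, the metric names, and the default thresholds (0.5 percentage points of error rate, 20% p99 latency growth) are all hypothetical values chosen for the example, not recommendations from the source.

```python
def soak_check(baseline, current,
               max_error_rate_delta=0.005, max_p99_ratio=1.2):
    """Return the names of metrics that regressed beyond their threshold."""
    failures = []
    if current["error_rate"] - baseline["error_rate"] > max_error_rate_delta:
        failures.append("error_rate")
    if current["latency_p99_ms"] > baseline["latency_p99_ms"] * max_p99_ratio:
        failures.append("latency_p99_ms")
    return failures
```

An empty list lets the rollout advance to the next stage; a non-empty list pauses it, and the list itself can be stored in the audit log as the failing metric snapshot.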

How do you implement one-click rollback for a config change?

Because every config write creates an immutable versioned snapshot, rollback never destroys data: it re-proposes a previous known-good value as a new change. The rollback flow: (1) The on-call engineer selects a known-good version from the version history UI; (2) The service creates a new version record pointing to the previous snapshot’s content, which preserves audit continuity because the rollback is itself a versioned event; (3) The rollback goes through the same required_approvals threshold as the original change, or, in an emergency, a single approver override with a mandatory incident reference; (4) Once approved, the new version is distributed to hosts via the existing push/pull distribution mechanism, and hosts apply the config within seconds of receiving it; (5) The rollback event is logged with the engineer’s identity, the target version, and the reason. To make rollback fast, the distribution path should be a push-based low-latency channel (e.g., etcd watch, long-poll endpoint) rather than a scheduled poll, so that a config change propagates across the fleet in under 30 seconds.

