Feature Flag Rollout System Low-Level Design: Percentage Rollout, Metrics-Gated Progression, and Automated Rollback

Rollout Schema

A feature flag with gradual rollout has the following core fields:

{
  flag_id: "checkout-v2",
  rollout_plan: [
    {percentage: 1,   duration: "1h"},
    {percentage: 10,  duration: "4h"},
    {percentage: 50,  duration: "24h"},
    {percentage: 100}
  ],
  metrics_gates: [
    {metric: "error_rate",    threshold: 0.01, comparison: "lt"},
    {metric: "p99_latency_ms", threshold: 500, comparison: "lt"}
  ],
  targeting_rules: [],
  current_percentage: 0,
  status: "INACTIVE" | "ROLLING" | "PAUSED" | "COMPLETE" | "ROLLED_BACK"
}

The rollout_plan array defines the staged progression. metrics_gates define the conditions that must hold at each stage before advancing.

Percentage Rollout with Consistent Hashing

Determining which users are in a rollout must be consistent: the same user must always see the same experience for a given percentage. Using a random coin flip per request would cause users to see a feature flicker on and off.

The standard approach: hash(user_id + flag_id) mod 100, using a stable, deterministic hash function (e.g., MurmurHash3) rather than a per-process salted hash. If the result is less than current_percentage, the flag is enabled for that user. The flag ID is included in the hash input so that different flags produce independent bucket assignments — a user in the top 10% for one flag is not automatically in the top 10% for all flags.
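As a minimal sketch of this bucketing, using Python's stdlib `hashlib.md5` as the deterministic hash (a production system might prefer a faster non-cryptographic hash such as MurmurHash3):

```python
import hashlib

def bucket(user_id: str, flag_id: str) -> int:
    """Deterministic bucket in [0, 99], stable across processes and restarts.
    Python's built-in hash() is salted per process, so a real hash is required."""
    digest = hashlib.md5(f"{user_id}:{flag_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(user_id: str, flag_id: str, current_percentage: int) -> bool:
    # Same user + flag always yields the same bucket, so the experience is stable
    # as current_percentage grows: once a user is in, they stay in.
    return bucket(user_id, flag_id) < current_percentage
```

Because the bucket is monotone with respect to the percentage, raising current_percentage only ever adds users to the treatment cohort; it never flips an enabled user back off.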

Automated Progression

A scheduler runs every few minutes and evaluates each ROLLING flag:

  1. Check whether the rollout has held at the current percentage for at least the stage's duration
  2. Evaluate all metrics_gates — compare current metric values for the treatment cohort against thresholds
  3. If both conditions pass: advance current_percentage to the next step in rollout_plan
  4. If the final step is reached: set status = COMPLETE
  5. If any metrics gate fails: set status = PAUSED, send alert to on-call
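The steps above can be sketched as a single scheduler pass. This assumes durations have been pre-parsed to seconds (`duration_s`) and that the flag record carries two bookkeeping fields beyond the schema shown earlier: `current_step` and `step_started_at`. Gates are checked first here so a failing gate pauses the rollout even mid-soak:

```python
def evaluate_rollout(flag: dict, metric_values: dict, now: float) -> dict:
    """One scheduler pass over a flag. metric_values maps metric name to the
    current value observed for the treatment cohort."""
    if flag["status"] != "ROLLING":
        return flag

    # Step 5: any failed gate pauses the rollout (alerting happens out of band).
    for gate in flag["metrics_gates"]:
        value = metric_values[gate["metric"]]
        ok = value < gate["threshold"] if gate["comparison"] == "lt" else value > gate["threshold"]
        if not ok:
            flag["status"] = "PAUSED"
            return flag

    # Step 1: has the current stage soaked long enough?
    plan = flag["rollout_plan"]
    idx = flag["current_step"]
    if now - flag["step_started_at"] < plan[idx].get("duration_s", 0):
        return flag

    # Steps 3-4: advance to the next stage, or complete at the final one.
    if idx + 1 < len(plan):
        flag["current_step"] = idx + 1
        flag["current_percentage"] = plan[idx + 1]["percentage"]
        flag["step_started_at"] = now
    else:
        flag["status"] = "COMPLETE"
    return flag
```

Persisting `current_step` and `step_started_at` in a durable store means a scheduler restart resumes the soak rather than resetting it.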

Metrics Monitoring During Rollout

For each ROLLING flag, the metrics service continuously computes metrics for two cohorts:

  • Treatment: users where hash(user_id + flag_id) mod 100 < current_percentage
  • Control: all other users

Metrics include: error rate, p99 latency, conversion rate, and any business-specific KPIs defined on the flag. The comparison is treatment vs. control, not treatment vs. a historical baseline — this accounts for time-of-day and seasonal effects. Metrics are pulled from the observability platform (Datadog, Prometheus) via API.
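One way to make the treatment-vs.-control comparison concrete: if application servers tag each emitted event with the user's cohort, the metrics service can aggregate both cohorts over the same time window. A sketch, assuming events arrive as simple (cohort, is_error) pairs:

```python
from collections import defaultdict

def cohort_error_rates(events) -> dict:
    """events: iterable of (cohort, is_error) pairs, where cohort is
    "treatment" or "control". Returns error rate per cohort; a metrics
    gate then compares the treatment rate against the control rate
    observed over the same window."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cohort, is_error in events:
        totals[cohort] += 1
        errors[cohort] += int(is_error)
    return {c: errors[c] / totals[c] for c in totals}
```

Comparing cohorts over the same window is what neutralizes time-of-day effects: both cohorts see the same traffic mix, so a divergence is attributable to the flag.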

Automated Rollback

If a metrics gate breaches its threshold beyond a configurable severity (e.g., error rate exceeds 3x the gate threshold, rather than merely crossing it), automatic rollback triggers immediately, without waiting for the next scheduler cycle. The flag's current_percentage is set to 0 and status is set to ROLLED_BACK. Automated rollback is limited to flags where it is explicitly enabled — some features have side effects (database migrations, email sends) where rollback requires human judgment.
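A sketch of the severity check, assuming an `auto_rollback_enabled` opt-in field on the flag (not part of the schema shown earlier) and "lt" gates, where a severe breach means the value exceeds a multiple of the threshold:

```python
def check_auto_rollback(flag: dict, metric_values: dict,
                        severity_multiplier: float = 3.0) -> bool:
    """Roll back immediately if any gate metric breaches severity_multiplier
    times its threshold. Only flags that explicitly opted in are eligible."""
    if not flag.get("auto_rollback_enabled", False):
        return False
    for gate in flag["metrics_gates"]:
        value = metric_values[gate["metric"]]
        if gate["comparison"] == "lt" and value >= gate["threshold"] * severity_multiplier:
            flag["current_percentage"] = 0
            flag["status"] = "ROLLED_BACK"
            return True  # alert and audit-log emission happen alongside this
    return False
```

A mild breach (above the threshold but below the severity multiple) is left to the scheduler, which will pause rather than roll back — pausing is reversible, rollback is a stronger action.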

Manual Controls

Operators can intervene at any time via API or dashboard:

  • Pause: stop automatic progression; current percentage holds; metrics monitoring continues
  • Resume: restart progression from current percentage
  • Override percentage: manually set current_percentage to any value
  • Rollback: set current_percentage = 0, status = ROLLED_BACK
  • Force complete: set current_percentage = 100, skip remaining gates

Targeting Rules

Before percentage-based evaluation, targeting rules filter which users are eligible for the rollout at all:

  1. Internal employees (by email domain) — always first
  2. Beta user segment (opt-in list)
  3. Geographic region (by IP geolocation or user profile country)
  4. All users (the final stage of a typical rollout sequence)

Targeting rules are evaluated as a priority-ordered list; the first matching rule determines eligibility. Users who are not in the eligible segment are always in the control group.
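A sketch of first-match evaluation over a priority-ordered rule list. The rule shapes here (`email_domain`, `segment`, `region`, `all`) are illustrative, not a prescribed schema:

```python
def eligible(user: dict, rules: list[dict]) -> bool:
    """Walk rules in priority order; the first matching rule decides
    eligibility. A user matching no rule stays in the control group."""
    for rule in rules:
        kind = rule["type"]
        if kind == "email_domain" and user.get("email", "").endswith("@" + rule["domain"]):
            return rule["allow"]
        if kind == "segment" and user.get("id") in rule["members"]:
            return rule["allow"]
        if kind == "region" and user.get("country") == rule["country"]:
            return rule["allow"]
        if kind == "all":
            return rule["allow"]  # typically the last, catch-all rule
    return False
```

Only users deemed eligible here proceed to the percentage bucketing step; everyone else is control regardless of their hash bucket.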

Flag Evaluation SDK

Application code evaluates flags via a client SDK. The SDK downloads the full flag configuration from the flag service on startup and caches it locally. Flag evaluation is local — no network call per flag check. The config is refreshed in the background every 30 seconds via polling or server-sent events. Local evaluation means flag checks add sub-microsecond latency and work even if the flag service is temporarily unavailable (using the last cached config).
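A minimal SDK sketch showing the blocking startup fetch, background polling refresh, and fully local evaluation. `fetch_config` is injected here for illustration; a real SDK would call the flag service's HTTP API and would typically also persist a disk snapshot:

```python
import hashlib
import threading
import time

class FlagClient:
    """Local-evaluation flag client: one config fetch on startup,
    background refresh thereafter, no network call per flag check."""

    def __init__(self, fetch_config, poll_interval_s: float = 30.0):
        self._fetch = fetch_config
        self._config = fetch_config()  # blocking fetch on startup
        self._lock = threading.Lock()
        self._poll_interval = poll_interval_s
        threading.Thread(target=self._poll_loop, daemon=True).start()

    def _poll_loop(self):
        while True:
            time.sleep(self._poll_interval)
            try:
                fresh = self._fetch()
                with self._lock:
                    self._config = fresh
            except Exception:
                pass  # flag service unavailable: keep serving the cached config

    def is_enabled(self, flag_id: str, user_id: str) -> bool:
        with self._lock:
            flag = self._config.get(flag_id)
        if flag is None:
            return False  # unknown flag: fail closed
        bucket = int(hashlib.md5(f"{user_id}:{flag_id}".encode()).hexdigest(), 16) % 100
        return bucket < flag["current_percentage"]
```

Evaluation touches only in-process state, which is what makes per-check latency sub-microsecond and keeps checks working through a flag-service outage.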

Flag Cleanup

A flag at 100% completion with no rollback plan should be removed from the codebase. The system tracks which flags have been at 100% for more than N days and surfaces them in a cleanup dashboard. Long-lived flags accrete as dead code, inflate SDK config size, and create confusion about which code path is the current behavior. Teams should treat flag cleanup as part of the feature delivery process, not an afterthought.
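Detecting cleanup candidates is a simple scan. This sketch assumes a `completed_at` timestamp recorded when the flag hits COMPLETE (not part of the schema shown earlier):

```python
def stale_flags(flags: list[dict], now: float, max_age_days: int = 30) -> list[str]:
    """Return IDs of flags that have sat at COMPLETE (100%) for more than
    max_age_days, to be surfaced in the cleanup dashboard."""
    cutoff = now - max_age_days * 86400
    return [
        f["flag_id"]
        for f in flags
        if f["status"] == "COMPLETE" and f.get("completed_at", now) < cutoff
    ]
```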

{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How do you implement percentage-based rollout so the same user always gets the same flag value?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Hash a stable user identifier (user_id or device_id) concatenated with the flag name using a deterministic hash function (e.g., MurmurHash3 or SHA-256 mod 100). The result is a stable integer in [0, 99]. If that value is less than the configured percentage threshold, the flag is on for that user. Using the flag name in the hash input ensures that a user who falls in the 'on' bucket for one flag is not systematically in the 'on' bucket for all flags, avoiding correlated experiments. Store the rollout percentage in a low-latency config store (Redis or a local cache with short TTL) so evaluation adds under 1ms to request latency."
}
},
{
"@type": "Question",
"name": "Design a metrics-gated progression system that automatically advances a flag rollout.",
"acceptedAnswer": {
"@type": "Answer",
"text": "Define a promotion policy per flag: a set of metric guards (e.g., error rate < 0.5%, p99 latency < 300ms, conversion rate not degraded by more than 2%) and a soak period (e.g., hold at 10% for 30 minutes before advancing to 25%). A background scheduler polls your metrics pipeline at each soak boundary, evaluates all guards against the treatment cohort versus control cohort using statistical significance tests (two-proportion z-test for rates, Mann-Whitney for latencies), and advances the percentage only if all guards pass. Persist the rollout state machine (current percentage, soak start timestamp, guard results) in a durable store so restarts don't reset progress. Emit promotion events to an audit log for post-incident review."
}
},
}
},
{
"@type": "Question",
"name": "What triggers automated rollback in a feature flag system and how is it implemented safely?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Automated rollback triggers when a guard metric breaches its threshold with sufficient statistical confidence during a soak period, or when an on-call alert fires and calls a rollback API. Safe implementation: (1) Roll back by writing the flag's percentage to 0 in the config store — evaluation clients pick this up within their cache TTL (typically 5–30 seconds). (2) Do not delete the flag; preserve its state and annotate it with the rollback reason and timestamp. (3) Gate rollback on a minimum sample size to avoid rolling back on noise during the first few seconds of a soak. (4) Send rollback events to PagerDuty and post to a Slack channel with a deep link to the metrics dashboard. (5) Require manual human promotion to re-enable the flag after a rollback to prevent a rollback loop."
}
},
{
"@type": "Question",
"name": "How would you design the flag evaluation SDK to work reliably when the flag config service is unavailable?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Implement a multi-layer cache in the SDK: an in-process LRU cache (fastest, no I/O) backed by a local disk snapshot written periodically. On startup, the SDK loads the disk snapshot synchronously so flag evaluation works immediately before the first network fetch completes. Background polling refreshes the in-process cache every N seconds (configurable, typically 10–30s). If the network fetch fails, the SDK continues serving from the in-process cache with no change to flag values. Define a per-flag 'default value' used only when neither cache has a value for that flag (e.g., first-ever startup with no snapshot). Expose a health metric: seconds since last successful config fetch, so on-call can alert if the SDK is running stale for too long."
}
}
]
}

