Question 1

How do you implement percentage-based rollout so the same user always gets the same flag value?

Accepted Answer

Hash a stable user identifier (user_id or device_id) concatenated with the flag name using a deterministic hash function (e.g., MurmurHash3, or SHA-256 with the digest reduced mod 100). The result is a stable integer in [0, 99]. If that value is less than the configured percentage threshold, the flag is on for that user. Because a user's bucket never changes, raising the percentage only adds users; nobody who already has the flag loses it. Using the flag name in the hash input ensures that a user who falls in the 'on' bucket for one flag is not systematically in the 'on' bucket for all flags, avoiding correlated experiments. Store the rollout percentage in a low-latency config store (Redis or a local cache with short TTL) so evaluation adds under 1ms to request latency.
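The bucketing step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `bucket` and `is_enabled`, the ":" delimiter, and the choice to read only the first 8 bytes of the SHA-256 digest are all assumptions made for the example.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable integer in [0, 99]."""
    # The ":" delimiter is an assumption; it keeps inputs such as
    # ("ab", "c") and ("a", "bc") from hashing to the same key.
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce mod 100.
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: str, percentage: int) -> bool:
    """The flag is on when the user's bucket is below the rollout percentage."""
    return bucket(flag_name, user_id) < percentage
```

Because the bucket depends only on the flag name and the user ID, a user who is on at 10% stays on at 25%, and the same user lands in unrelated buckets for different flags.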

Question 2

Design a metrics-gated progression system that automatically advances a flag rollout.

Accepted Answer

Define a promotion policy per flag: a set of metric guards (e.g., error rate < 0.5%, p99 latency < 300ms, conversion rate not degraded by more than 2%) and a soak period (e.g., hold at 10% for 30 minutes before advancing to 25%). A background scheduler polls the metrics pipeline at each soak boundary, evaluates all guards against the treatment cohort versus the control cohort using statistical significance tests (two-proportion z-test for rates, Mann-Whitney U test for latencies), and advances the percentage only if all guards pass. Persist the rollout state machine (current percentage, soak start timestamp, guard results) in a durable store so restarts don't reset progress. Emit promotion events to an audit log for post-incident review.
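As a rough sketch of the guard check and the stage advance, the following Python covers only the error-rate guard, using a pooled two-proportion z-test. The stage ladder `STAGES`, the one-sided critical value of 1.645 (roughly a 5% significance level), and all function names are assumptions for illustration; absolute thresholds, latency guards, and persistence are left out.

```python
import math

# Hypothetical rollout ladder; the answer above only mentions 10% -> 25%.
STAGES = [1, 10, 25, 50, 100]


def two_proportion_z(err_t: int, n_t: int, err_c: int, n_c: int) -> float:
    """z statistic for (treatment error rate - control error rate)."""
    p_t, p_c = err_t / n_t, err_c / n_c
    pooled = (err_t + err_c) / (n_t + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    if se == 0:
        # Both cohorts have identical all-or-nothing rates; no measurable difference.
        return 0.0
    return (p_t - p_c) / se


def error_guard_passes(err_t, n_t, err_c, n_c, z_crit=1.645) -> bool:
    """One-sided test: fail only if treatment is significantly worse than control."""
    return two_proportion_z(err_t, n_t, err_c, n_c) < z_crit


def next_percentage(current, guards_passed, soak_elapsed_s, soak_required_s) -> int:
    """Advance one stage only when the soak is complete and every guard passed."""
    if not guards_passed or soak_elapsed_s < soak_required_s:
        return current
    later = [s for s in STAGES if s > current]
    return later[0] if later else current
```

A scheduler would call `next_percentage` at each soak boundary and write the result, together with the guard outcomes, to the durable rollout state.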

Question 3

What triggers automated rollback in a feature flag system and how is it implemented safely?

Accepted Answer

Automated rollback triggers when a guard metric breaches its threshold with sufficient statistical confidence during a soak period, or when an on-call alert fires and calls a rollback API. Safe implementation: (1) Roll back by writing the flag's percentage to 0 in the config store; evaluation clients pick this up within their cache TTL (typically 5–30 seconds). (2) Do not delete the flag; preserve its state and annotate it with the rollback reason and timestamp. (3) Gate rollback on a minimum sample size to avoid rolling back on noise during the first few seconds of a soak. (4) Send rollback events to PagerDuty and post to a Slack channel with a deep link to the metrics dashboard. (5) Require a human to manually re-enable the flag after a rollback, so the automated progression cannot re-promote it and cause a rollback loop.
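Steps (1), (2), (3), and (5) can be sketched with an in-memory flag record. Everything named here (`FlagState`, `MIN_SAMPLES`, `maybe_rollback`, `promote`, the `locked` field) is hypothetical, and the notification step (4) is omitted because it depends on the alerting integration in use.

```python
import time
from dataclasses import dataclass, field

MIN_SAMPLES = 1000  # hypothetical minimum sample size before rollback may fire


@dataclass
class FlagState:
    percentage: int
    locked: bool = False  # set on rollback; only a human approval clears it
    annotations: list = field(default_factory=list)


def maybe_rollback(flag, guard_breached, sample_size, reason, now=None) -> bool:
    """Roll back to 0% if a guard breached on enough data. Returns True if rolled back."""
    if not guard_breached or sample_size < MIN_SAMPLES:
        return False  # too little data: treat the breach as noise for now
    flag.percentage = 0  # the flag is kept, not deleted
    flag.locked = True
    flag.annotations.append({
        "event": "rollback",
        "reason": reason,
        "at": now if now is not None else time.time(),
    })
    return True


def promote(flag, new_percentage, approved_by=None) -> None:
    """Automated promotion is refused on a rolled-back flag unless a human approves."""
    if flag.locked and approved_by is None:
        raise PermissionError("flag was rolled back; manual approval required")
    flag.locked = False
    flag.percentage = new_percentage
```

The lock is what breaks the loop: the progression scheduler calls `promote` without `approved_by`, so it cannot re-enable a flag that was just rolled back.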

Question 4

How would you design the flag evaluation SDK to work reliably when the flag config service is unavailable?

Accepted Answer

Implement a multi-layer cache in the SDK: an in-process LRU cache (fastest, no I/O) backed by a local disk snapshot written periodically. On startup, the SDK loads the disk snapshot synchronously so flag evaluation works immediately before the first network fetch completes. Background polling refreshes the in-process cache every N seconds (configurable, typically 10–30s). If the network fetch fails, the SDK continues serving from the in-process cache with no change to flag values. Define a per-flag 'default value' used only when neither cache has a value for that flag (e.g., first-ever startup with no snapshot). Expose a health metric: seconds since last successful config fetch, so on-call can alert if the SDK is running stale for too long.
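A compact sketch of the fallback chain (in-process cache, then disk snapshot, then per-flag default) might look like the following. The class name `FlagClient`, the `fetch` callable, and the JSON snapshot format are assumptions for illustration. A plain dict stands in for the LRU cache, and the background polling loop is left to the caller, who would invoke `refresh()` on a timer.

```python
import json
import os
import time


class FlagClient:
    def __init__(self, snapshot_path, fetch, defaults):
        self._path = snapshot_path
        self._fetch = fetch        # hypothetical callable: returns {flag: value}, raises on failure
        self._defaults = defaults  # per-flag values used only when no cache has the flag
        self._cache = self._load_snapshot()  # synchronous load so evaluation works at startup
        self._last_success = None

    def _load_snapshot(self):
        try:
            with open(self._path, "r", encoding="utf-8") as f:
                return json.load(f)
        except (OSError, ValueError):
            return {}  # missing or corrupt snapshot: fall through to defaults

    def refresh(self) -> bool:
        """Fetch fresh config; on failure keep serving the current cache unchanged."""
        try:
            fresh = self._fetch()
        except Exception:
            return False
        self._cache = dict(fresh)
        self._last_success = time.monotonic()
        # Write to a temp file and rename so a crash never leaves a half-written snapshot.
        tmp = self._path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self._cache, f)
        os.replace(tmp, self._path)
        return True

    def evaluate(self, flag):
        if flag in self._cache:
            return self._cache[flag]
        return self._defaults.get(flag, False)

    def seconds_since_last_fetch(self):
        """Health metric; None means no fetch has succeeded in this process yet."""
        if self._last_success is None:
            return None
        return time.monotonic() - self._last_success
```

Exporting `seconds_since_last_fetch()` as a gauge lets on-call alert when a process has been serving stale values for longer than an agreed limit.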

Feature Flag Rollout System Low-Level Design: Percentage Rollout, Metrics-Gated Progression, and Automated Rollback

Rollout Schema

Percentage Rollout with Consistent Hashing

Automated Progression

Metrics Monitoring During Rollout

Automated Rollback

Manual Controls

Targeting Rules

Flag Evaluation SDK

Flag Cleanup