Feature Flag Service: Low-Level Design
A feature flag service lets engineering teams ship code dark, run controlled rollouts, and flip kill switches without deploying. The system must evaluate flags in microseconds, support rich targeting rules, guarantee consistent assignment for a given user, and keep an immutable audit trail of every change.
Flag Types
Three flag types cover most use cases:
- Boolean — on/off; the simplest case. Evaluation returns true or false.
- Multivariate — a set of named string variants (e.g., “control”, “variant_a”, “variant_b”). Useful for A/B/n experiments.
- Percentage rollout — a special case of multivariate where variants are assigned by bucket. “5% get variant_a, 95% get control.”
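As a rough illustration of how a bucket could map to a variant in a percentage rollout, the sketch below walks cumulative percentage ranges. The function name `variant_for_bucket` and the `(variant, percent)` allocation list are assumptions made for this example, not part of the service's API; the bucket is taken to be an integer in [0, 100).

```python
def variant_for_bucket(bucket: int, allocations: list[tuple[str, int]]) -> str:
    """Map a bucket in [0, 100) to a variant using cumulative percentage ranges.

    `allocations` is a hypothetical list of (variant, percent) pairs whose
    percentages must sum to 100.
    """
    if not 0 <= bucket < 100:
        raise ValueError("bucket must be in [0, 100)")
    if sum(pct for _, pct in allocations) != 100:
        raise ValueError("allocation percentages must sum to 100")
    upper = 0
    for variant, pct in allocations:
        upper += pct          # exclusive upper bound of this variant's range
        if bucket < upper:
            return variant
    raise AssertionError("unreachable: allocations cover every bucket")


# "5% get variant_a, 95% get control."
rollout = [("variant_a", 5), ("control", 95)]
counts = {"variant_a": 0, "control": 0}
for b in range(100):
    counts[variant_for_bucket(b, rollout)] += 1
print(counts)  # {'variant_a': 5, 'control': 95}
```

Listing the rolled-out variant first has a useful property in this sketch: raising its share from 5 to 10 only extends its range upward, so users in buckets 0 through 4 keep the variant they already had.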
Targeting Rules
Rules are evaluated in priority order. Each rule has one or more conditions (ANDed together) and a variant to return if all conditions match. The first matching rule wins. A default variant is returned if no rule matches.
Example rule set for flag new_checkout:
- Priority 1: user.plan == "enterprise" → variant "enabled"
- Priority 2: user.country == "US" AND bucket < 20 → variant "enabled"
- Default → variant "disabled"
Deterministic Bucketing
Bucket assignment must be stable: the same user must get the same variant on every evaluation, across every SDK instance, without storing the assignment. This is achieved by hashing the concatenation of the flag key and user ID, then taking the result modulo 100.
import hashlib


def compute_bucket(flag_key: str, user_id: str) -> int:
    """Returns a stable integer in [0, 100) for this (flag, user) pair."""
    seed = f"{flag_key}:{user_id}"
    digest = hashlib.sha256(seed.encode()).hexdigest()
    return int(digest[:8], 16) % 100
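SHA-256 output is effectively uniformly distributed, so buckets fill evenly. The first 8 hex characters give a 32-bit integer. Because 2^32 is not a multiple of 100, buckets 0 through 95 each receive one more of the roughly 42.9 million possible values than buckets 96 through 99, a bias far too small to matter in practice. Including the flag key in the hash input also means a user's bucket for one flag does not predict their bucket for another.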
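The stability and distribution claims can be checked empirically. The sketch below repeats the hashing scheme described above so that it runs on its own, then counts bucket occupancy over synthetic user IDs. The flag keys and the `user-<n>` ID format are made up for the test.

```python
import hashlib
from collections import Counter


def compute_bucket(flag_key: str, user_id: str) -> int:
    """Same scheme as described above: SHA-256 of "flag:user", first 32 bits, mod 100."""
    seed = f"{flag_key}:{user_id}"
    digest = hashlib.sha256(seed.encode()).hexdigest()
    return int(digest[:8], 16) % 100


# Stability: repeated calls with the same inputs agree.
assert compute_bucket("new_checkout", "user-42") == compute_bucket("new_checkout", "user-42")

# Distribution: each of the 100 buckets should hold roughly 1,000 of 100,000 users.
counts = Counter(compute_bucket("new_checkout", f"user-{i}") for i in range(100_000))
print(min(counts.values()), max(counts.values()))

# Independence across flags: among users below bucket 20 for flag_a,
# roughly 20% should also fall below bucket 20 for flag_b.
in_a = [i for i in range(100_000) if compute_bucket("flag_a", f"user-{i}") < 20]
overlap = sum(compute_bucket("flag_b", f"user-{i}") < 20 for i in in_a)
print(round(overlap / len(in_a), 3))
```

If the flag key were left out of the hash input, the same users would land in the low buckets for every flag and would see every early rollout at once.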
Flag Evaluation
from dataclasses import dataclass


@dataclass
class UserContext:
    user_id: str
    attributes: dict  # e.g. {"plan": "pro", "country": "US"}


def evaluate_flag(flag_key: str, user_ctx: UserContext, db) -> str:
    """Returns the variant string for this user, or the default variant."""
    # `key` is selected as well because apply_rules needs it for bucketing.
    flag = db.fetchone(
        "SELECT id, key, type, status, default_variant "
        "FROM feature_flag WHERE key = %s",
        (flag_key,)
    )
    if not flag:
        return "off"  # unknown flag: same value as the schema default
    if flag.status == "disabled":
        return flag.default_variant  # kill switch: bypass all rules
    return apply_rules(flag, user_ctx, db)


def apply_rules(flag, user_ctx: UserContext, db) -> str:
    """Evaluate ordered rules; return first matching variant."""
    rules = db.fetchall(
        "SELECT conditions, variant, priority "
        "FROM flag_rule WHERE flag_id = %s ORDER BY priority ASC",
        (flag.id,)
    )
    bucket = compute_bucket(flag.key, user_ctx.user_id)
    for rule in rules:
        if _conditions_match(rule.conditions, user_ctx, bucket):
            _log_evaluation(flag.id, user_ctx.user_id, rule.variant, db)
            return rule.variant
    return flag.default_variant


def _log_evaluation(flag_id: int, user_id: str, variant: str, db) -> None:
    """Record a rule-matched evaluation. Assumes the db wrapper exposes execute()."""
    db.execute(
        "INSERT INTO flag_evaluation (flag_id, user_id, variant) "
        "VALUES (%s, %s, %s)",
        (flag_id, user_id, variant)
    )


def _conditions_match(conditions: list[dict], user_ctx: UserContext, bucket: int) -> bool:
    """Return True only if every condition holds (conditions are ANDed)."""
    for cond in conditions:
        attr = cond["attribute"]
        op = cond["operator"]  # "eq" | "lt" | "gt" | "in"
        val = cond["value"]
        if attr == "bucket":
            actual = bucket
        else:
            actual = user_ctx.attributes.get(attr)
        if actual is None:
            return False  # a missing attribute never matches
        if op == "eq":
            ok = actual == val
        elif op == "lt":
            ok = actual < val
        elif op == "gt":
            ok = actual > val
        elif op == "in":
            ok = actual in val
        else:
            ok = False  # an unknown operator never matches
        if not ok:
            return False
    return True
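To see first-match-wins evaluation without a database, the new_checkout rule set from the Targeting Rules section can be written as data in the {attribute, operator, value} shape. This is a minimal standalone sketch: `matches` and `first_match` are hypothetical helpers that mirror the logic above, attribute names follow the `UserContext.attributes` keys ("plan", "country"), and the bucket is passed in directly rather than computed.

```python
import operator

# The new_checkout rule set as ordered rules with ANDed conditions.
NEW_CHECKOUT_RULES = [
    {"priority": 1, "variant": "enabled",
     "conditions": [{"attribute": "plan", "operator": "eq", "value": "enterprise"}]},
    {"priority": 2, "variant": "enabled",
     "conditions": [
         {"attribute": "country", "operator": "eq", "value": "US"},
         {"attribute": "bucket", "operator": "lt", "value": 20},
     ]},
]
DEFAULT_VARIANT = "disabled"

OPS = {"eq": operator.eq, "lt": operator.lt, "gt": operator.gt}


def matches(conditions, attributes, bucket):
    """True only if every condition holds."""
    for cond in conditions:
        if cond["attribute"] == "bucket":
            actual = bucket
        else:
            actual = attributes.get(cond["attribute"])
        compare = OPS.get(cond["operator"])
        if actual is None or compare is None or not compare(actual, cond["value"]):
            return False
    return True


def first_match(rules, attributes, bucket, default):
    """Return the variant of the first matching rule in ascending priority order."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if matches(rule["conditions"], attributes, bucket):
            return rule["variant"]
    return default


# Enterprise users match priority 1 regardless of country or bucket.
print(first_match(NEW_CHECKOUT_RULES, {"plan": "enterprise", "country": "DE"}, 73, DEFAULT_VARIANT))  # enabled
# US users match priority 2 only while their bucket is below 20.
print(first_match(NEW_CHECKOUT_RULES, {"plan": "pro", "country": "US"}, 12, DEFAULT_VARIANT))  # enabled
print(first_match(NEW_CHECKOUT_RULES, {"plan": "pro", "country": "US"}, 20, DEFAULT_VARIANT))  # disabled
```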
Kill Switch
When a flag's status is set to 'disabled', the evaluation function returns default_variant immediately, bypassing all rules. The SDK-side cache must invalidate within seconds of a kill switch being flipped. This is handled via Server-Sent Events (SSE): the flag service pushes a flag_updated event to all connected SDK instances, which then purge the affected flag from their local cache.
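A sketch of the SDK side of this push path follows. It parses only the `event:` and `data:` fields of an SSE stream and purges cached results for the named flag. The JSON payload shape (`{"flag_key": ...}`), the helper names, and the plain dict standing in for the local cache are all assumptions; the source does not specify the event body.

```python
import json
from typing import Iterable, Iterator


def _field_value(line: str, name: str) -> str:
    """Return the value after "name:", dropping one optional leading space."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs; a blank line terminates each event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = _field_value(line, "event")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data"))


def apply_events(lines: Iterable[str], cache: dict) -> None:
    """Purge every cached result for a flag named in a flag_updated event."""
    for event, data in parse_sse(lines):
        if event != "flag_updated":
            continue
        flag_key = json.loads(data)["flag_key"]  # assumed payload field
        for key in [k for k in cache if k[0] == flag_key]:
            del cache[key]


cache = {
    ("new_checkout", "u1"): "enabled",
    ("new_checkout", "u2"): "disabled",
    ("dark_mode", "u1"): "on",
}
stream = ["event: flag_updated\n", 'data: {"flag_key": "new_checkout"}\n', "\n"]
apply_events(stream, cache)
print(cache)  # {('dark_mode', 'u1'): 'on'}
```

A real client would also need reconnection handling, since any event missed while disconnected leaves stale entries in place until the TTL expires.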
SDK-Side Caching
Each SDK instance keeps an in-process LRU cache of evaluated flag results keyed by (flag_key, user_id), with a configurable TTL (default 60 seconds). On a cache miss the SDK fetches the flag definition from the service and evaluates locally, avoiding a round-trip per flag per request. The SSE channel allows the service to push invalidations for flags changed via kill switch or emergency rule edits.
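One way to combine LRU eviction with a TTL is an `OrderedDict` whose values carry an expiry time. The class below is a minimal sketch under that assumption; the name `TTLLRUCache`, the `max_size` parameter, and the injectable clock are illustrative choices, not the SDK's actual interface.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """In-process LRU cache whose entries also expire after `ttl` seconds."""

    def __init__(self, max_size: int = 10_000, ttl: float = 60.0, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (expires_at, value)
        self._max_size = max_size
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]      # expired: treat as a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        if len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def purge_flag(self, flag_key: str):
        """Drop every cached result for one flag, e.g. after a pushed invalidation."""
        for key in [k for k in self._data if k[0] == flag_key]:
            del self._data[key]


now = [0.0]
cache = TTLLRUCache(max_size=2, ttl=60.0, clock=lambda: now[0])
cache.put(("new_checkout", "u1"), "enabled")
cache.put(("new_checkout", "u2"), "disabled")
cache.get(("new_checkout", "u1"))         # u1 becomes most recently used
cache.put(("dark_mode", "u1"), "on")      # over capacity: evicts ("new_checkout", "u2")
print(cache.get(("new_checkout", "u2")))  # None
now[0] = 61.0
print(cache.get(("new_checkout", "u1")))  # None (expired)
```

Because entries are keyed by (flag_key, user_id), purging a flag scans the whole cache. An SDK with many cached users per flag might keep a secondary index from flag key to cache keys instead.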
Audit Trail
Every mutation to a flag or its rules is written to flag_audit as an immutable append. The record stores the actor's user ID, the timestamp, the operation type, and a JSON diff of the before and after state. This supports compliance queries (“who disabled the payment flag at 03:14 UTC?”) without relying on application logs.
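A before/after diff for an audit record could be computed along these lines. This is a sketch only: `json_diff` compares top-level fields, `build_audit_record` is a hypothetical helper that returns a dict shaped like a flag_audit row instead of writing to a database, and the operation name "flag.disable" is invented for the example.

```python
import json
from datetime import datetime, timezone


def json_diff(before: dict, after: dict) -> dict:
    """Return {field: {"before": x, "after": y}} for top-level fields that changed."""
    diff = {}
    for field in sorted(set(before) | set(after)):
        old, new = before.get(field), after.get(field)
        if old != new:
            diff[field] = {"before": old, "after": new}
    return diff


def build_audit_record(flag_id: int, actor_id: int, operation: str,
                       before: dict, after: dict) -> dict:
    """Assemble one append-only record shaped like a flag_audit row."""
    return {
        "flag_id": flag_id,
        "actor_id": actor_id,
        "operation": operation,
        "diff": json.dumps(json_diff(before, after), sort_keys=True),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


before = {"status": "active", "default_variant": "off"}
after = {"status": "disabled", "default_variant": "off"}
record = build_audit_record(7, 1001, "flag.disable", before, after)
print(record["diff"])  # {"status": {"after": "disabled", "before": "active"}}
```

Storing only the changed fields keeps rows small. Storing full before and after snapshots instead would make point-in-time reconstruction simpler at the cost of space.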
Database Schema
CREATE TABLE feature_flag (
    id              BIGSERIAL PRIMARY KEY,
    key             VARCHAR(255) NOT NULL UNIQUE,
    type            VARCHAR(32) NOT NULL,  -- 'boolean' | 'multivariate' | 'rollout'
    status          VARCHAR(32) NOT NULL DEFAULT 'active',
    default_variant VARCHAR(255) NOT NULL DEFAULT 'off',
    description     TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE flag_rule (
    id         BIGSERIAL PRIMARY KEY,
    flag_id    BIGINT NOT NULL REFERENCES feature_flag(id) ON DELETE CASCADE,
    priority   INT NOT NULL,
    conditions JSONB NOT NULL,  -- array of {attribute, operator, value}
    variant    VARCHAR(255) NOT NULL,
    UNIQUE (flag_id, priority)
);

CREATE TABLE flag_evaluation (
    id           BIGSERIAL PRIMARY KEY,
    flag_id      BIGINT NOT NULL REFERENCES feature_flag(id),
    user_id      VARCHAR(255) NOT NULL,
    variant      VARCHAR(255) NOT NULL,
    evaluated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_fe_flag_user ON flag_evaluation(flag_id, user_id);

CREATE TABLE flag_audit (
    id         BIGSERIAL PRIMARY KEY,
    flag_id    BIGINT NOT NULL REFERENCES feature_flag(id),
    actor_id   BIGINT NOT NULL,
    operation  VARCHAR(64) NOT NULL,
    diff       JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Scaling Considerations
Flag definitions change rarely; reads dominate. Cache the full flag + rules payload in Redis keyed by flag:<key> with a short TTL (5–10 seconds). Writes invalidate the Redis key immediately. For very high-traffic services, embed the flag evaluation logic in the SDK and ship flag configuration as a versioned JSON blob, eliminating all network calls during hot-path evaluation.
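For the embedded-SDK approach, the SDK needs to accept a new configuration blob only when it is newer than the one it holds. The sketch below assumes a blob of the form `{"version": <int>, "flags": {...}}`; that shape and the `FlagSnapshot` class are illustrative assumptions, not a format the source defines.

```python
import json


class FlagSnapshot:
    """Holds the latest flag configuration blob the SDK has accepted."""

    def __init__(self):
        self.version = -1
        self.flags = {}

    def apply(self, blob: str) -> bool:
        """Install a JSON blob if its version is newer; return whether it was applied."""
        payload = json.loads(blob)
        if payload["version"] <= self.version:
            return False  # stale or duplicate delivery: keep the current snapshot
        self.version = payload["version"]
        self.flags = payload["flags"]
        return True

    def definition(self, flag_key: str):
        """Look up a flag definition in memory; no network call is involved."""
        return self.flags.get(flag_key)


snapshot = FlagSnapshot()
v2 = json.dumps({"version": 2, "flags": {
    "new_checkout": {"status": "active", "default_variant": "disabled"}}})
v1 = json.dumps({"version": 1, "flags": {}})
print(snapshot.apply(v2))  # True
print(snapshot.apply(v1))  # False (older than the installed version)
print(snapshot.definition("new_checkout")["default_variant"])  # disabled
```

The version check guards against out-of-order delivery, where an older blob arriving late would otherwise silently roll back a kill switch.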