Feature Flag Service Low-Level Design: Targeting Rules, Gradual Rollout, and Kill Switch

A feature flag service lets engineering teams ship code dark, run controlled rollouts, and flip kill switches without deploying. The system must evaluate flags in microseconds, support rich targeting rules, guarantee consistent assignment for a given user, and keep an immutable audit trail of every change.

Flag Types

Three flag types cover most use cases:

  • Boolean — on/off; the simplest case. Evaluation returns true or false.
  • Multivariate — a set of named string variants (e.g., “control”, “variant_a”, “variant_b”). Useful for A/B/n experiments.
  • Percentage rollout — a special case of multivariate where variants are assigned by bucket. “5% get variant_a, 95% get control.”
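
To make the three types concrete, here is a sketch of how their definitions might look in memory; the field names (variants, rollout_percent) are illustrative assumptions that mirror the feature_flag schema later in this document:

```python
# Illustrative flag definitions; field names mirror the feature_flag schema
# described later (key, type, default_variant) plus an assumed variants list.
flags = {
    "dark_mode": {               # boolean: on/off
        "type": "boolean",
        "variants": ["true", "false"],
        "default_variant": "false",
    },
    "checkout_cta": {            # multivariate: A/B/n experiment
        "type": "multivariate",
        "variants": ["control", "variant_a", "variant_b"],
        "default_variant": "control",
    },
    "new_search": {              # percentage rollout
        "type": "rollout",
        "variants": ["enabled", "control"],
        "rollout_percent": 5,    # 5% get "enabled", 95% get "control"
        "default_variant": "control",
    },
}
```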

Targeting Rules

Rules are evaluated in priority order. Each rule has one or more conditions (ANDed together) and a variant to return if all conditions match. The first matching rule wins. A default variant is returned if no rule matches.

Example rule set for flag new_checkout:

  • Priority 1: user.plan == "enterprise" → variant "enabled"
  • Priority 2: user.country == "US" AND bucket < 20 → variant "enabled"
  • Default: variant "disabled"
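
A sketch of how this rule set could be stored as flag_rule rows; the conditions shape ({attribute, operator, value}) matches the format consumed by the evaluator shown later:

```python
# The new_checkout rule set expressed as flag_rule rows. Each "conditions"
# entry uses the {attribute, operator, value} shape the evaluator expects;
# conditions within one rule are ANDed together.
new_checkout_rules = [
    {
        "priority": 1,
        "conditions": [
            {"attribute": "plan", "operator": "eq", "value": "enterprise"},
        ],
        "variant": "enabled",
    },
    {
        "priority": 2,
        "conditions": [
            {"attribute": "country", "operator": "eq", "value": "US"},
            {"attribute": "bucket", "operator": "lt", "value": 20},
        ],
        "variant": "enabled",
    },
]
default_variant = "disabled"  # returned when no rule matches
```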

Deterministic Bucketing

Bucket assignment must be stable: the same user must get the same variant on every evaluation, across every SDK instance, without storing the assignment. This is achieved by hashing the flag key and user ID joined with a colon (the separator keeps distinct pairs like ("ab", "c") and ("a", "bc") from producing the same seed), then taking the result modulo 100. Because the flag key is part of the seed, a user's bucket differs per flag, so rollouts of different flags do not select the same user cohort.


import hashlib

def compute_bucket(flag_key: str, user_id: str) -> int:
    """Returns a stable integer in [0, 100) for this (flag, user) pair."""
    seed = f"{flag_key}:{user_id}"
    digest = hashlib.sha256(seed.encode()).hexdigest()
    return int(digest[:8], 16) % 100

SHA-256 output is effectively uniform over its range. The first 8 hex characters (32 bits) provide far more entropy than 100 buckets need, so the bias from taking a 32-bit value modulo 100 (2^32 is not an exact multiple of 100) is negligible in practice.
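
As a quick check of the stability property (compute_bucket is repeated here so the snippet is self-contained):

```python
import hashlib

def compute_bucket(flag_key: str, user_id: str) -> int:
    """Returns a stable integer in [0, 100) for this (flag, user) pair."""
    seed = f"{flag_key}:{user_id}"
    digest = hashlib.sha256(seed.encode()).hexdigest()
    return int(digest[:8], 16) % 100

# The same (flag, user) pair always hashes to the same bucket, on any host,
# with nothing persisted...
b1 = compute_bucket("new_checkout", "user-42")
b2 = compute_bucket("new_checkout", "user-42")
assert b1 == b2 and 0 <= b1 < 100
# ...while a different flag key buckets the same user independently, so one
# flag's 5% rollout does not systematically overlap another's.
```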

Flag Evaluation


from dataclasses import dataclass

@dataclass
class UserContext:
    user_id: str
    attributes: dict  # e.g. {"plan": "pro", "country": "US"}

def evaluate_flag(flag_key: str, user_ctx: UserContext, db) -> str:
    """Returns the variant string for this user, or the default variant."""
    flag = db.fetchone(
        "SELECT id, key, type, status, default_variant "
        "FROM feature_flag WHERE key = %s",
        (flag_key,)
    )
    if not flag or flag.status == "disabled":
        return flag.default_variant if flag else "off"

    return apply_rules(flag, user_ctx, db)

def apply_rules(flag, user_ctx: UserContext, db) -> str:
    """Evaluate ordered rules; return first matching variant."""
    rules = db.fetchall(
        "SELECT conditions, variant, priority "
        "FROM flag_rule WHERE flag_id = %s ORDER BY priority ASC",
        (flag.id,)
    )
    bucket = compute_bucket(flag.key, user_ctx.user_id)

    for rule in rules:
        if _conditions_match(rule.conditions, user_ctx, bucket):
            _log_evaluation(flag.id, user_ctx.user_id, rule.variant, db)
            return rule.variant

    return flag.default_variant

def _log_evaluation(flag_id: int, user_id: str, variant: str, db) -> None:
    """Append the evaluation result to flag_evaluation for later analysis."""
    db.execute(
        "INSERT INTO flag_evaluation (flag_id, user_id, variant) "
        "VALUES (%s, %s, %s)",
        (flag_id, user_id, variant)
    )

def _conditions_match(conditions: list[dict], user_ctx: UserContext, bucket: int) -> bool:
    for cond in conditions:
        attr = cond["attribute"]
        op = cond["operator"]  # "eq" | "lt" | "gt" | "in"
        val = cond["value"]

        actual = bucket if attr == "bucket" else user_ctx.attributes.get(attr)
        if actual is None:
            return False  # a missing attribute never matches

        if op == "eq" and actual != val:
            return False
        if op == "lt" and actual >= val:
            return False
        if op == "gt" and actual <= val:
            return False
        if op == "in" and actual not in val:
            return False
        # Note: an unrecognized operator is silently treated as matching;
        # rule validation at write time should reject unknown operators.
    return True

Kill Switch

When a flag's status is set to 'disabled', the evaluation function returns default_variant immediately, bypassing all rules. The SDK-side cache must invalidate within seconds of a kill switch being flipped. This is handled via Server-Sent Events (SSE): the flag service pushes a flag_updated event to all connected SDK instances, which then purge the affected flag from their local cache.
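
A minimal sketch of the SDK-side invalidation handler; the event payload shape (type and flag_key fields) is an assumption, and SSE connection management (reconnects, backoff) is omitted:

```python
import json

class FlagCache:
    """In-process cache of evaluated results, keyed by (flag_key, user_id)."""
    def __init__(self):
        self._cache = {}

    def put(self, flag_key: str, user_id: str, variant: str):
        self._cache[(flag_key, user_id)] = variant

    def purge_flag(self, flag_key: str):
        # Drop every cached result for this flag, across all users.
        self._cache = {k: v for k, v in self._cache.items() if k[0] != flag_key}

def on_sse_event(cache: FlagCache, event_data: str):
    """Handle one pushed event; purge only the flag that changed."""
    event = json.loads(event_data)
    if event.get("type") == "flag_updated":
        cache.purge_flag(event["flag_key"])
```

Purging only the affected flag (rather than the whole cache) keeps an emergency toggle from causing a thundering herd of re-fetches for unrelated flags.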

SDK-Side Caching

Each SDK instance keeps an in-process LRU cache of evaluated flag results keyed by (flag_key, user_id), with a configurable TTL (default 60 seconds). On a cache miss the SDK fetches the flag definition from the service and evaluates locally, avoiding a round-trip per flag per request. The SSE channel allows the service to push invalidations for flags changed via kill switch or emergency rule edits.
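
A sketch of such a cache using only the standard library; the max_size bound and the monotonic-clock TTL check are implementation choices, not requirements from the design above:

```python
import time
from collections import OrderedDict

class TTLCache:
    """In-process LRU cache with per-entry TTL, keyed by (flag_key, user_id)."""
    def __init__(self, max_size: int = 10_000, ttl_seconds: float = 60):
        self._entries = OrderedDict()  # key -> (variant, expires_at)
        self._max_size = max_size
        self._ttl = ttl_seconds

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        variant, expires_at = entry
        if time.monotonic() > expires_at:
            del self._entries[key]      # expired: treat as a miss
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return variant

    def put(self, key, variant):
        self._entries[key] = (variant, time.monotonic() + self._ttl)
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used
```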

Audit Trail

Every mutation to a flag or its rules is written to flag_audit as an immutable append. The record stores the actor's user ID, the timestamp, the operation type, and a JSON diff of the before and after state. This supports compliance queries (“who disabled the payment flag at 03:14 UTC?”) without relying on application logs.
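
A sketch of how the before/after diff might be computed before the append; build_audit_record is a hypothetical helper, not part of the service API described above:

```python
import json

def build_audit_record(actor_id: int, flag_id: int, operation: str,
                       before: dict, after: dict) -> dict:
    """Compute a field-level before/after diff for a flag_audit row."""
    diff = {
        field: {"before": before.get(field), "after": after.get(field)}
        for field in set(before) | set(after)
        if before.get(field) != after.get(field)  # only changed fields
    }
    return {
        "flag_id": flag_id,
        "actor_id": actor_id,
        "operation": operation,
        "diff": json.dumps(diff, sort_keys=True),  # stored in the JSONB column
    }
```

Storing only changed fields keeps the diff compact while still letting a compliance query reconstruct any historical state by replaying diffs in order.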

Database Schema


CREATE TABLE feature_flag (
    id              BIGSERIAL PRIMARY KEY,
    key             VARCHAR(255) NOT NULL UNIQUE,
    type            VARCHAR(32)  NOT NULL,   -- 'boolean' | 'multivariate' | 'rollout'
    status          VARCHAR(32)  NOT NULL DEFAULT 'active',
    default_variant VARCHAR(255) NOT NULL DEFAULT 'off',
    description     TEXT,
    created_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

CREATE TABLE flag_rule (
    id          BIGSERIAL PRIMARY KEY,
    flag_id     BIGINT      NOT NULL REFERENCES feature_flag(id) ON DELETE CASCADE,
    priority    INT         NOT NULL,
    conditions  JSONB       NOT NULL,  -- array of {attribute, operator, value}
    variant     VARCHAR(255) NOT NULL,
    UNIQUE (flag_id, priority)
);

CREATE TABLE flag_evaluation (
    id           BIGSERIAL PRIMARY KEY,
    flag_id      BIGINT       NOT NULL REFERENCES feature_flag(id),
    user_id      VARCHAR(255) NOT NULL,
    variant      VARCHAR(255) NOT NULL,
    evaluated_at TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_fe_flag_user ON flag_evaluation(flag_id, user_id);

CREATE TABLE flag_audit (
    id          BIGSERIAL PRIMARY KEY,
    flag_id     BIGINT       NOT NULL REFERENCES feature_flag(id),
    actor_id    BIGINT       NOT NULL,
    operation   VARCHAR(64)  NOT NULL,
    diff        JSONB        NOT NULL,
    created_at  TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

Scaling Considerations

Flag definitions change rarely; reads dominate. Cache the full flag + rules payload in Redis keyed by flag:<key> with a short TTL (5–10 seconds). Writes invalidate the Redis key immediately. For very high-traffic services, embed the flag evaluation logic in the SDK and ship flag configuration as a versioned JSON blob, eliminating all network calls during hot-path evaluation.
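
A sketch of the read-through pattern; redis_client is assumed to expose redis-py-style get/set/delete, and db.load_flag_with_rules is a hypothetical helper that joins feature_flag with its flag_rule rows:

```python
import json

FLAG_TTL_SECONDS = 10  # short TTL bounds staleness between explicit invalidations

def get_flag_payload(redis_client, db, flag_key: str) -> dict:
    """Read-through cache: try Redis first, fall back to the DB and populate."""
    cache_key = f"flag:{flag_key}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    payload = db.load_flag_with_rules(flag_key)  # assumed DB helper
    redis_client.set(cache_key, json.dumps(payload), ex=FLAG_TTL_SECONDS)
    return payload

def invalidate_flag(redis_client, flag_key: str):
    """Called on every write so the next read re-fetches fresh state."""
    redis_client.delete(f"flag:{flag_key}")
```

Explicit invalidation on write plus a short TTL gives two independent bounds on staleness: normal edits propagate within one cache round-trip, and a missed invalidation is repaired within FLAG_TTL_SECONDS.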
