Feature Rollout System: Low-Level Design
A feature rollout system lets engineering teams deploy code to production while controlling which users see the new behavior. It decouples deployment from release: code ships to all servers, but a feature flag gate determines whether each request activates the new path. This design covers the flag evaluation engine, targeting rules, gradual percentage rollouts, kill switches, and the observability needed to roll back safely.
Core Data Model
CREATE TABLE Feature (
feature_key VARCHAR(100) PRIMARY KEY, -- "checkout_v2", "dark_mode"
description TEXT,
status VARCHAR(20) NOT NULL DEFAULT 'off', -- off, rolling, on
rollout_pct SMALLINT NOT NULL DEFAULT 0, -- 0-100
sticky BOOLEAN NOT NULL DEFAULT TRUE, -- same user always same bucket
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE FeatureOverride (
feature_key VARCHAR(100) NOT NULL REFERENCES Feature(feature_key) ON DELETE CASCADE,
target_type VARCHAR(20) NOT NULL, -- user, org, country, plan
target_id VARCHAR(200) NOT NULL, -- user_id, org_id, "US", "enterprise"
enabled BOOLEAN NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (feature_key, target_type, target_id)
);
CREATE TABLE FeatureAuditLog (
log_id BIGSERIAL PRIMARY KEY,
feature_key VARCHAR(100) NOT NULL,
changed_by BIGINT NOT NULL, -- user_id of engineer
old_value JSONB,
new_value JSONB,
reason TEXT,
changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON FeatureOverride(feature_key, target_type);
CREATE INDEX ON FeatureAuditLog(feature_key, changed_at DESC);
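To make the override resolution concrete, here is a minimal in-memory sketch. A plain dict stands in for the FeatureOverride table, keyed exactly like its primary key; the rows and the resolve_override helper are illustrative, not part of the schema above:

```python
# Hypothetical stand-in for FeatureOverride, keyed like its primary key:
# (feature_key, target_type, target_id) -> enabled
overrides = {
    ("checkout_v2", "user", "42"): False,   # kill switch for one user
    ("checkout_v2", "org", "7"): True,      # canary org
    ("checkout_v2", "country", "US"): True,
}

def resolve_override(feature_key, targets):
    """Return the first matching override, or None if no row matches.
    `targets` is an ordered list of (target_type, target_id) pairs:
    user first, then org, country, plan."""
    for target_type, target_id in targets:
        hit = overrides.get((feature_key, target_type, str(target_id)))
        if hit is not None:
            return hit
    return None

# User 42 is explicitly disabled even though their org is a canary org:
print(resolve_override("checkout_v2", [("user", 42), ("org", 7)]))   # False
# Another user in the canary org is enabled:
print(resolve_override("checkout_v2", [("user", 99), ("org", 7)]))   # True
```

Because the user row is checked first, a per-user disable beats an org-wide enable, which is exactly the kill-switch-before-beta ordering the evaluation algorithm below relies on.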
Flag Evaluation Algorithm
import hashlib
import json

import redis
from dataclasses import dataclass
from typing import Optional

redis_client = redis.Redis(host='localhost', decode_responses=True)
FLAG_CACHE_TTL = 60  # seconds

@dataclass
class EvalContext:
    user_id: int
    org_id: Optional[int] = None
    country: Optional[str] = None
    plan: Optional[str] = None
def is_enabled(feature_key: str, ctx: EvalContext) -> bool:
    """
    Evaluation order (first match wins):
      1. Kill switch override (enabled=False for user/org)
      2. Explicit enable override (user, org, country, plan)
      3. Percentage rollout bucket
      4. Global status (on/off)
    """
    flag = _load_flag(feature_key)
    if flag is None:
        return False  # unknown flag → off
    # 1 & 2: explicit overrides: user first, then org, country, plan
    for target_type, target_id in _override_targets(ctx):
        override = _get_override(feature_key, target_type, str(target_id))
        if override is not None:
            return override  # True or False
    # 3: percentage rollout
    if flag['status'] == 'rolling' and flag['rollout_pct'] > 0:
        return _in_rollout_bucket(feature_key, ctx.user_id, flag['rollout_pct'], flag['sticky'])
    # 4: global status
    return flag['status'] == 'on'

def _override_targets(ctx: EvalContext):
    targets = [('user', ctx.user_id)]
    if ctx.org_id:
        targets.append(('org', ctx.org_id))
    if ctx.country:
        targets.append(('country', ctx.country))
    if ctx.plan:
        targets.append(('plan', ctx.plan))
    return targets
def _in_rollout_bucket(feature_key: str, user_id: int, pct: int, sticky: bool) -> bool:
    """
    Deterministic bucket: hash(feature_key + user_id) mod 100.
    Sticky = same user always gets the same bucket for the same flag.
    Non-sticky (rare) uses hash(feature_key + user_id + date) for daily re-assignment.
    """
    seed = f"{feature_key}:{user_id}" if sticky else f"{feature_key}:{user_id}:{_today()}"
    digest = hashlib.md5(seed.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # 0-99
    return bucket < pct

def _today() -> str:
    import datetime  # local import keeps this helper self-contained
    return datetime.date.today().isoformat()

def _load_flag(feature_key: str) -> Optional[dict]:
    cache_key = f"flag:{feature_key}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    row = db.fetchone(
        "SELECT status, rollout_pct, sticky FROM Feature WHERE feature_key=%s",
        (feature_key,)
    )
    if not row:
        return None
    flag = {'status': row['status'], 'rollout_pct': row['rollout_pct'], 'sticky': row['sticky']}
    redis_client.setex(cache_key, FLAG_CACHE_TTL, json.dumps(flag))
    return flag
def _get_override(feature_key: str, target_type: str, target_id: str) -> Optional[bool]:
    cache_key = f"flag_override:{feature_key}:{target_type}:{target_id}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        # 'null' is a cached miss (no override row), not an explicit disable
        return None if cached == 'null' else cached == 'true'
    row = db.fetchone(
        "SELECT enabled FROM FeatureOverride WHERE feature_key=%s AND target_type=%s AND target_id=%s",
        (feature_key, target_type, target_id)
    )
    if row is None:
        redis_client.setex(cache_key, FLAG_CACHE_TTL, 'null')
        return None
    redis_client.setex(cache_key, FLAG_CACHE_TTL, 'true' if row['enabled'] else 'false')
    return row['enabled']
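The bucketing math can be checked in isolation. A sketch of the same MD5 scheme, showing determinism and the monotonic-growth property (every user included at 10% is still included at 20%):

```python
import hashlib

def bucket_of(feature_key: str, user_id: int) -> int:
    """Same scheme as _in_rollout_bucket: first 8 hex digits of MD5, mod 100."""
    digest = hashlib.md5(f"{feature_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100  # 0-99

# Deterministic: repeated calls always agree
assert bucket_of("checkout_v2", 42) == bucket_of("checkout_v2", 42)

# Monotonic rollout: the 10% cohort is a subset of the 20% cohort
included_at_10 = {u for u in range(1000) if bucket_of("checkout_v2", u) < 10}
included_at_20 = {u for u in range(1000) if bucket_of("checkout_v2", u) < 20}
assert included_at_10 <= included_at_20

# Roughly pct% of users land in the cohort (approximate, not exact)
print(len(included_at_10))  # close to 100 out of 1000
```

This is why advancing rollout_pct never "demotes" anyone: raising the threshold from 10 to 20 only admits new buckets.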
Gradual Rollout API
def set_rollout(feature_key: str, pct: int, changed_by: int, reason: str):
    """Advance or retract a percentage rollout. Audit-logged."""
    assert 0 <= pct <= 100
    old = db.fetchone(
        "SELECT status, rollout_pct FROM Feature WHERE feature_key=%s",
        (feature_key,)
    )
    new_status = 'rolling' if 0 < pct < 100 else ('on' if pct == 100 else 'off')
    db.execute("""
        UPDATE Feature SET rollout_pct=%s, status=%s, updated_at=NOW()
        WHERE feature_key=%s
    """, (pct, new_status, feature_key))
    db.execute("""
        INSERT INTO FeatureAuditLog (feature_key, changed_by, old_value, new_value, reason)
        VALUES (%s, %s, %s, %s, %s)
    """, (
        feature_key, changed_by,
        json.dumps(old), json.dumps({'status': new_status, 'rollout_pct': pct}),
        reason
    ))
    _invalidate_flag_cache(feature_key)

def kill_switch(feature_key: str, changed_by: int, reason: str):
    """Immediately disable the global rollout. Note: explicit enabled=True
    overrides still win in the evaluation order; delete or flip those rows
    as well for a complete shutdown."""
    set_rollout(feature_key, 0, changed_by, reason)

def _invalidate_flag_cache(feature_key: str):
    # Delete the flag cache; override entries expire naturally within FLAG_CACHE_TTL
    redis_client.delete(f"flag:{feature_key}")
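The set_rollout/kill_switch pair composes naturally with automated progressive delivery. A hedged sketch of the decision logic such a background job might use; the thresholds, step size, and function name are illustrative, not part of the API above:

```python
def next_rollout_action(current_pct: int, error_rate: float,
                        baseline_rate: float = 0.0005,
                        advance_margin: float = 2.0,
                        kill_rate: float = 0.01) -> tuple:
    """Return ('kill', 0), ('hold', current_pct), or ('advance', new_pct).
    Illustrative policy: kill above an absolute error ceiling, hold while
    errors exceed advance_margin x baseline, otherwise step up 10 points."""
    if error_rate > kill_rate:
        return ('kill', 0)
    if error_rate > baseline_rate * advance_margin:
        return ('hold', current_pct)
    return ('advance', min(current_pct + 10, 100))

print(next_rollout_action(10, 0.0004))  # ('advance', 20)
print(next_rollout_action(10, 0.002))   # ('hold', 10)
print(next_rollout_action(10, 0.05))    # ('kill', 0)
```

A cron job could evaluate this every 15 minutes against metrics for the flagged cohort, then call set_rollout() or kill_switch() accordingly and page the on-call engineer on any automated kill.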
SDK Usage in Application Code
# In any application handler:
from feature_flags import is_enabled, EvalContext

def checkout_handler(request):
    ctx = EvalContext(
        user_id=request.user.id,
        org_id=request.user.org_id,
        country=request.geo.country_code,
        plan=request.user.plan,
    )
    if is_enabled('checkout_v2', ctx):
        return checkout_v2(request)
    return checkout_v1(request)
Observability: Exposure Logging
import time

# Log every flag evaluation for analytics and guardrail metrics
def is_enabled_logged(feature_key: str, ctx: EvalContext) -> bool:
    result = is_enabled(feature_key, ctx)
    # Fire-and-forget async log (Kafka or in-process queue)
    analytics.track('flag_evaluated', {
        'feature_key': feature_key,
        'user_id': ctx.user_id,
        'enabled': result,
        'ts': time.time(),
    })
    return result
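If no Kafka client is available, a minimal in-process buffer gives the same fire-and-forget shape. A sketch, assuming a background worker periodically calls drain() and ships batches downstream (both function names are illustrative):

```python
import queue
import time

_exposure_q: "queue.Queue[dict]" = queue.Queue()

def track_exposure(feature_key: str, user_id: int, enabled: bool) -> None:
    """Non-blocking: the request handler only enqueues."""
    _exposure_q.put({
        'feature_key': feature_key,
        'user_id': user_id,
        'enabled': enabled,
        'ts': time.time(),
    })

def drain(batch_size: int = 100) -> list:
    """Called by a background worker; returns up to one batch of events."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(_exposure_q.get_nowait())
        except queue.Empty:
            break
    return batch

track_exposure('checkout_v2', 42, True)
track_exposure('checkout_v2', 99, False)
events = drain()
print(len(events))  # 2
```

The handler never blocks on the analytics sink; at-most-once delivery (events in the queue are lost on process crash) is usually acceptable for exposure logs.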
# Downstream: join flag exposures with conversion events in data warehouse.
# SELECT f.feature_key, f.enabled, COUNT(*) AS users, SUM(c.converted) AS conversions
# FROM FlagExposure f JOIN ConversionEvent c USING (user_id, session_id)
# GROUP BY 1, 2 ORDER BY 1, 2;
Key Design Decisions
- Deterministic hash bucketing: MD5(feature_key + user_id) % 100 ensures the same user always gets the same treatment for a given flag, with no flicker across page loads or API calls. Using the feature key in the seed means user 42 can be in the 5% bucket for "checkout_v2" but not for "dark_mode": flags are independent.
- Override priority order: user > org > country > plan > rollout > global. Kill switches use user/org overrides with enabled=False — they fire before the percentage bucket is checked, so beta users get the flag but a banned user does not.
- 60-second Redis cache: flag state reads hit Redis, not Postgres — evaluation adds ~0.5ms. Cache invalidation on update is synchronous (delete on write); stale reads during the 60s window are acceptable for gradual rollouts. For emergency kill switches, explicitly flush the cache after writing.
- Audit log for compliance: every rollout change records who changed it, from what state, and why. Enables post-incident review (“who turned on feature X at 2 AM?”).
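The flag-independence claim in the first bullet is easy to verify: because the feature key is part of the hash seed, a user's buckets for two flags are uncorrelated. A small check (bucket_of mirrors the scheme used by _in_rollout_bucket):

```python
import hashlib

def bucket_of(feature_key: str, user_id: int) -> int:
    digest = hashlib.md5(f"{feature_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100  # 0-99

# Across 1000 users, only ~1% should land in the same bucket for both flags,
# which is exactly the coincidence rate of two independent uniform draws.
same = sum(1 for u in range(1000)
           if bucket_of("checkout_v2", u) == bucket_of("dark_mode", u))
print(same)  # around 10 out of 1000
```

If flags shared buckets, early-rollout users would pile up across every experiment, biasing all metrics toward the same cohort.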
Frequently Asked Questions

Why use MD5 hash bucketing instead of random assignment for percentage rollouts?
Random assignment (e.g. Math.random() < 0.1 for a 10% rollout) is non-deterministic: a user who loads the page twice in the same session may get different results, so the button is green on first load and blue on refresh. This "flickering" is a poor user experience and pollutes experiment metrics, since the user is counted in both groups. MD5(feature_key + user_id) % 100 is deterministic: the same user_id always hashes to the same bucket for the same feature_key, and no database lookup is needed because the hash is computed in memory in microseconds. Sticky assignment is a core property: once a user is in the 10% bucket, they stay there as the rollout advances from 10% to 20%. Bucket-9 users included at 10% are still included at 20% (buckets 0-19), so existing users are never "demoted" out of a rollout.

How do you implement a kill switch that takes effect within seconds, not minutes?
The Redis cache has a 60-second TTL, so setting rollout_pct=0 takes up to 60 seconds to propagate if only the database is updated. For emergency kill switches where a bug is causing production errors: (1) write the change to Postgres (durable); (2) immediately delete the Redis cache key (redis.delete("flag:feature_key")). The next evaluation for any user misses the cache, reads Postgres (status='off'), and re-populates the cache with the new value, so the kill switch is effective within one cache-miss cycle, typically under 100ms. For even faster propagation, use Redis pub/sub to broadcast a "flag_invalidated" message to all application servers; each server subscribes and purges its local in-process flag cache (if any), giving sub-second propagation across the fleet.

How do override rules interact with percentage rollouts for canary deployments?
Override priority: user override > org override > country override > plan override > rollout bucket. This enables the canonical canary pattern: (1) enable for the engineering team via an org override (enabled=True for org_id=engineering_org); (2) enable for beta users via user overrides; (3) once validated, advance to a 1% rollout for general users. Beta users and engineers always see the feature regardless of rollout_pct because they are explicitly overridden; non-beta users get the deterministic bucket treatment. This means you can have rollout_pct=1 but effectively 5% coverage, since beta users and specific orgs are all overridden. Track "actual reach" separately in the exposure log: SELECT COUNT(DISTINCT user_id) FROM FlagExposureLog WHERE feature_key='X' AND enabled=TRUE AND date=TODAY.

How do you run a gradual rollout that automatically advances based on error rate?
Automated progressive delivery: a background job monitors error metrics and advances (or halts) the rollout. Every 15 minutes, the automation queries the error rate in the new bucket vs. the control bucket: SELECT SUM(error_count)/SUM(request_count) AS error_rate FROM RequestMetrics WHERE feature_flag='X' AND enabled=TRUE AND created_at > NOW()-INTERVAL '15m'. If error_rate is below a threshold (e.g. under 0.1% against a 0.05% baseline), advance rollout_pct by 10 points; if it exceeds a kill threshold (e.g. over 1%), call kill_switch() immediately and alert the on-call engineer. This is the canary-analysis pattern that Argo Rollouts and Flagger implement in CI/CD; in a bespoke system, the automation job is a simple cron process that reads flag config, reads metrics, and calls set_rollout().

How do you clean up stale feature flags that were shipped to 100% and never removed?
Flags accumulate: a codebase with 6 months of development can have 200+ flags, most of them shipped (status=on) with the gating code never cleaned up. Dead flags add evaluation overhead and confusion. Cleanup process: (1) in code, replace is_enabled('flag_key', ctx) call sites with the hardcoded branch once the flag is fully shipped; (2) delete the FeatureOverride rows; (3) delete the Feature row; (4) delete the Redis cache entry. Automate detection: any flag with status=on and no evaluation events in the last 30 days is a removal candidate. Report: SELECT feature_key FROM Feature WHERE status='on' AND feature_key NOT IN (SELECT DISTINCT feature_key FROM FlagExposureLog WHERE evaluated_at > NOW()-INTERVAL '30d'), and alert the team that owns the flag.