System Design Interview: Feature Flag System (LaunchDarkly)

What Are Feature Flags?

Feature flags (also called feature toggles or feature gates) decouple code deployment from feature release. A flag in code wraps new functionality — if the flag is off, the old code path runs; if on, the new code runs. This enables: gradual rollouts (enable for 1% of users, then 10%, then 100%), kill switches (instantly disable a broken feature without a code deploy), A/B testing (show variant A to 50% and variant B to 50%), and trunk-based development (merge incomplete features behind a flag). Companies like Facebook, Google, and Airbnb run hundreds of simultaneous experiments using feature flags.

Functional Requirements

  • Create and manage flags: on/off, multivariate (multiple variants), targeting rules
  • Evaluate a flag for a given user in under 1ms (p99)
  • Support targeting rules: user ID, user attributes (country, plan, email), percentage rollout
  • Flag changes propagate to all SDK instances within 5 seconds
  • Audit log of all flag changes with who changed what and when
  • 10,000 flags, 100 million users, 1 billion flag evaluations per day
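These numbers set the evaluation-rate budget. A quick back-of-envelope (the 5x peak multiplier is an illustrative assumption, not from the requirements):

```python
EVALS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

avg_evals_per_sec = EVALS_PER_DAY // SECONDS_PER_DAY
peak_evals_per_sec = avg_evals_per_sec * 5  # assumed 5x diurnal peak

print(avg_evals_per_sec)   # ~11,574 evaluations/sec on average
print(peak_evals_per_sec)  # ~57,870 evaluations/sec at an assumed peak
```

Because evaluation runs in-process inside each SDK (see below), this load never reaches the flag service; only startup fetches and streaming connections do.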

Flag Evaluation Logic

Flag evaluation is a pure function: given a flag definition and a user context, return the flag value. Evaluation order: (1) Check if flag is enabled at all — if not, return the default value. (2) Check individual targeting rules in order — if user ID matches a list, or user attribute matches a rule (country = “US”), return the configured variant for that rule. (3) Percentage rollout — hash the user ID (or a stable anonymous ID) to a value in [0, 100). If the hash falls in the enabled bucket, return the on variant; otherwise return off. The hash must be stable: the same user must always get the same bucket across all SDKs and re-evaluations.


import hashlib

def evaluate_flag(flag, user):
    """Pure function: flag definition + user context -> variant."""
    if not flag.enabled:
        return flag.default_value

    # Individual targeting rules (highest priority, evaluated in order)
    for rule in flag.rules:
        if matches_rule(user, rule):
            return rule.variant

    # Percentage rollout: a stable hash guarantees the same user always
    # lands in the same bucket for this flag, across SDKs and re-evaluations
    hash_key = f"{flag.key}.{user.id}"
    bucket = int(hashlib.md5(hash_key.encode()).hexdigest(), 16) % 100
    if bucket < flag.rollout_percentage:
        return flag.on_variant

    return flag.off_variant

def matches_rule(user, rule):
    attr_value = getattr(user, rule.attribute, None)
    if attr_value is None:
        # A missing attribute never matches (avoids str(None) == "None"
        # accidentally satisfying a STARTS_WITH prefix)
        return False
    if rule.operator == "IN":
        return attr_value in rule.values
    if rule.operator == "STARTS_WITH":
        return str(attr_value).startswith(rule.values[0])
    return False
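A quick sanity check of the rollout logic, using the same MD5 bucketing (the flag key is illustrative): with a 30% rollout over many simulated user IDs, close to 30% should land in the enabled bucket, because MD5 buckets are near-uniform.

```python
import hashlib

def bucket_for(flag_key, user_id):
    # Same bucketing scheme as evaluate_flag above
    return int(hashlib.md5(f"{flag_key}.{user_id}".encode()).hexdigest(), 16) % 100

ROLLOUT = 30  # percent
enabled = sum(1 for i in range(100_000)
              if bucket_for("new-checkout", f"user-{i}") < ROLLOUT)
print(enabled / 100_000)  # close to 0.30
```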

SDK Architecture

The SDK (embedded in application code) evaluates flags locally, not via a network call per evaluation — 1ms p99 would be impossible with a round trip. Architecture: (1) On startup, the SDK fetches all flag definitions from the flag service and caches them in memory. (2) The SDK opens a persistent connection (Server-Sent Events or WebSocket) to a streaming endpoint. When any flag changes, the streaming endpoint pushes the delta to all connected SDK instances. (3) Flag evaluation runs in-process against the local cache — zero network I/O, sub-millisecond. (4) Usage events (which variant was served to which user) are batched and sent asynchronously for analytics and experiment analysis.
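A minimal sketch of the cache at the heart of that SDK (names are illustrative, not LaunchDarkly's actual API): the streaming thread writes deltas, the evaluation path reads purely from memory.

```python
import threading

class FlagStore:
    """In-memory flag cache: written by the streaming thread, read per-evaluation."""

    def __init__(self, initial_flags):
        self._lock = threading.Lock()
        self._flags = dict(initial_flags)  # flag_key -> flag definition

    def apply_update(self, flag_key, new_definition):
        # Called when the streaming connection pushes a delta
        with self._lock:
            if new_definition is None:
                self._flags.pop(flag_key, None)  # flag deleted
            else:
                self._flags[flag_key] = new_definition

    def get(self, flag_key):
        # Evaluation path: pure in-memory lookup, no network I/O
        with self._lock:
            return self._flags.get(flag_key)

store = FlagStore({"new-checkout": {"enabled": True}})
store.apply_update("new-checkout", {"enabled": False})  # kill switch pushed
print(store.get("new-checkout"))  # {'enabled': False}
```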

Streaming Flag Updates

When an operator changes a flag, it must propagate to all SDK instances within seconds. Architecture: (1) Flag change is written to the flag database (PostgreSQL). (2) A change event is published to Kafka. (3) A fanout service consumes from Kafka and pushes the change to all connected SDK instances via Server-Sent Events (SSE). SSE is preferable to WebSocket for one-directional push — simpler to implement and scales well with CDN edge servers. (4) SDKs that are disconnected poll the flag service every 30 seconds as a fallback. The streaming service must handle millions of concurrent SDK connections — use a pub/sub broadcast model (Redis pub/sub or a purpose-built server like LaunchDarkly Relay Proxy) to fan out to SDK clusters by region.
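To make the push path concrete, here is a sketch of the SDK side: parsing SSE frames and applying them to the local cache. The event names (`patch`, `delete`) and JSON payload shape are assumptions for illustration; real wire formats differ.

```python
import json

def parse_sse(stream_text):
    """Parse raw SSE text into (event, data) pairs; frames are blank-line separated."""
    events = []
    for frame in stream_text.strip().split("\n\n"):
        event, data = "message", ""
        for line in frame.split("\n"):
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data += line[len("data:"):].strip()
        events.append((event, data))
    return events

def apply_events(cache, events):
    for event, data in events:
        payload = json.loads(data)
        if event == "patch":      # flag created or changed
            cache[payload["key"]] = payload["flag"]
        elif event == "delete":   # flag removed
            cache.pop(payload["key"], None)

cache = {"old-flag": {"enabled": True}}
raw = ('event: patch\ndata: {"key": "new-checkout", "flag": {"enabled": true}}\n\n'
       'event: delete\ndata: {"key": "old-flag"}\n\n')
apply_events(cache, parse_sse(raw))
print(cache)  # {'new-checkout': {'enabled': True}}
```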

A/B Testing Integration

Feature flags and A/B testing are tightly coupled — a flag with a 50/50 percentage rollout is an A/B test. For statistical validity: (1) Assignment must be stable — the same user always gets the same variant. The hash-based assignment guarantees this. (2) Assignment must be logged — every evaluation event with user ID, flag key, and variant is sent to the analytics pipeline. (3) Statistical analysis determines if the treatment effect is real: measure conversion rate, revenue, latency, or engagement per variant. Run the experiment until statistical significance is reached (p < 0.05, typically 1-2 weeks). (4) Experiment collision: if two experiments target the same users simultaneously, their effects confound each other. Use mutually exclusive experiment groups or a layer-based assignment system (each experiment layer is independent).
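The significance check in step (3) can be sketched as a two-proportion z-test; the conversion numbers below are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: control converts 100/1000, treatment 130/1000
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(round(z, 2), round(p, 3))  # z ≈ 2.10, p ≈ 0.036 -> significant at 0.05
```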

Interview Tips

  • Local SDK evaluation with streaming updates is the architecture — never per-evaluation RPC
  • Hash-based percentage rollout guarantees stable assignment — explain MD5 hash mod 100
  • SSE for flag push is better than polling or WebSocket for this use case
  • Audit log is required — flag changes are high-risk operations (can break production instantly)
  • A/B test integration (stable assignment + event logging + statistical analysis) is a strong differentiator

Frequently Asked Questions

How does a feature flag system evaluate flags without a network call?

Feature flag SDKs use local evaluation: on startup, the SDK fetches all flag definitions from the flag service and stores them in memory. When application code evaluates a flag, the SDK runs the evaluation logic entirely in-process against the local cache — no network I/O, no latency beyond a hash computation. Flag changes propagate via a persistent streaming connection (Server-Sent Events or WebSocket) from the flag service. When a flag changes, the server pushes a delta update to all connected SDKs within seconds, and the SDK updates its local cache. This architecture achieves sub-millisecond evaluation at billions of evaluations per day without any per-evaluation RPC calls.

How does percentage rollout in a feature flag system guarantee stable assignments?

Percentage rollout uses a stable hash function: concatenate the flag key and user ID, compute a hash (MD5 or MurmurHash), and take the result modulo 100 to get a bucket in [0, 100). If the bucket falls below the rollout percentage, return the enabled variant. Because the hash is deterministic, the same user always gets the same bucket for a given flag — a user in bucket 42 always sees the enabled variant when the rollout is 50%. This stability is critical for A/B testing: if users randomly flip between variants on each visit, the experiment is invalid and any measured effect is noise. The flag key is included in the hash so that different flags independently assign users — the same user can be in the enabled bucket for flag A and the disabled bucket for flag B.
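Both properties — determinism and per-flag independence — are easy to demonstrate with the same flag-key + user-ID hashing (the keys below are illustrative):

```python
import hashlib

def bucket(flag_key, user_id):
    return int(hashlib.md5(f"{flag_key}.{user_id}".encode()).hexdigest(), 16) % 100

# Stable: repeated evaluations always give the identical bucket
assert bucket("flag-a", "user-42") == bucket("flag-a", "user-42")

# Independent: including the flag key re-shuffles users per flag, so the
# same user generally lands in different buckets for different flags
print(bucket("flag-a", "user-42"), bucket("flag-b", "user-42"))
```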

What is the difference between a feature flag and an A/B test?

A feature flag is a general mechanism for controlling feature availability — it can be a simple on/off switch, a targeted rollout to specific users, or a percentage-based rollout. An A/B test is a specific use of a feature flag with two additional requirements: (1) the assignment is random and stable (same user always gets the same variant), and (2) the experiment is instrumented to measure an outcome metric (conversion rate, revenue, engagement) per variant, with statistical analysis to determine whether the difference is significant. Every A/B test is a feature flag, but not every feature flag is an A/B test. A kill switch, a gradual rollout, and an employee-only preview are all feature flags without the statistical rigor of an A/B test.

