Feature flag platforms like LaunchDarkly, Unleash, and Split enable teams to decouple deployment from release, run experiments, and manage risk with kill switches. Designing a feature flag platform tests your understanding of real-time flag evaluation, targeting rules, percentage-based rollout, and the infrastructure that delivers flag decisions to millions of application instances with sub-millisecond latency.
Flag Evaluation Engine
The core of the system: given a flag key, a user context, and the flag rules, determine the flag variation (true/false, or a specific variant for multivariate flags). Evaluation logic: (1) Check kill switch — if the flag is globally disabled, return the default variation immediately. (2) Check individual targeting — is this specific user in the flag's targeting list? (e.g., always show the new feature to user_id = "internal-tester-123"). (3) Check targeting rules — evaluate rules in order. Each rule: IF (user.plan == "enterprise" AND user.country IN ["US", "UK"]) THEN return variation "enabled". Rules support string/number comparisons, set membership, regex, semver comparison, and custom attributes. (4) Check percentage rollout — if no rule matches, use a percentage rollout: hash(flag_key + user_key) % 100 determines the bucket (0-99). If bucket < rollout_percentage, return "enabled". The hash ensures: the same user always gets the same bucket (deterministic), different flags distribute independently (different hash inputs), and raising the rollout percentage only adds users to the enabled group (no reshuffling of existing assignments). (5) Default — if nothing matches, return the default variation. This evaluation must be fast (< 1ms) and executed locally (no network call per evaluation). The application SDK maintains a local cache of all flag configurations.
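The five evaluation steps above can be sketched in a few lines of Python. This is a minimal sketch, not any vendor's SDK: the flag and user dict shapes (`enabled`, `target_users`, `rules`, `rollout_pct`, `default`) are assumptions for illustration, and SHA-256 stands in for the MurmurHash-style function real SDKs typically use.

```python
import hashlib

def bucket(flag_key: str, user_key: str) -> int:
    # Deterministic bucket 0-99: same (flag, user) pair always hashes the same.
    digest = hashlib.sha256(f"{flag_key}{user_key}".encode()).hexdigest()
    return int(digest, 16) % 100

def evaluate(flag: dict, user: dict):
    # (1) Kill switch: a globally disabled flag returns the default immediately.
    if not flag.get("enabled", True):
        return flag["default"]
    # (2) Individual targeting: specific user keys pinned to a variation.
    if user["key"] in flag.get("target_users", {}):
        return flag["target_users"][user["key"]]
    # (3) Targeting rules, evaluated in order; each rule is (predicate, variation).
    for predicate, variation in flag.get("rules", []):
        if predicate(user):
            return variation
    # (4) Percentage rollout: deterministic bucket vs. rollout percentage.
    if bucket(flag["key"], user["key"]) < flag.get("rollout_pct", 0):
        return flag.get("rollout_variation", True)
    # (5) Fallback to the default variation.
    return flag["default"]

flag = {
    "key": "new-checkout", "enabled": True, "default": False,
    "rules": [(lambda u: u.get("plan") == "enterprise", True)],
    "rollout_pct": 10,
}
evaluate(flag, {"key": "user-123", "plan": "enterprise"})  # matches the rule -> True
```

Because every step reads only from the in-memory flag dict, the whole evaluation stays well under a millisecond with no network call.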
SDK Architecture and Flag Delivery
The application integrates a client SDK that evaluates flags locally. Two SDK types: (1) Server-side SDK — runs in the backend (Node.js, Java, Python, Go). Connects to the flag service via streaming (SSE or WebSocket) or polling. Receives the complete flag configuration for all flags. Evaluates flags locally in-process (no network call per evaluation). Latency: < 1ms per evaluation. The SDK maintains an in-memory cache of all flag configs. On flag change: the streaming connection pushes the update immediately (sub-second propagation). (2) Client-side SDK — runs in the browser or mobile app. Different security model: the SDK does not receive the complete flag configuration (which would expose all targeting rules and user segments to the client). Instead: the backend evaluates flags for the specific user and sends only the evaluated variations. The client SDK receives: {flag_a: true, flag_b: "variant-2", flag_c: false}. No targeting rules are exposed. Flag changes are delivered via SSE/WebSocket or polling (every 30 seconds). Offline mode: the SDK caches the last known flag values locally (in localStorage for web, SharedPreferences for mobile). If the connection to the flag service is lost, the SDK uses cached values. This ensures the application continues functioning even during flag service outages.
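A server-side SDK's local cache can be sketched as a thread-safe in-memory store refreshed in the background. The `FlagStore` class and its `fetch` callable are hypothetical names; this sketch uses polling as the delivery mechanism, whereas production SDKs prefer a streaming SSE/WebSocket connection and fall back to polling.

```python
import threading
import time

class FlagStore:
    """In-memory flag cache for a server-side SDK (illustrative sketch)."""

    def __init__(self, fetch, poll_interval: float = 30.0):
        self._fetch = fetch            # callable returning {flag_key: config}
        self._interval = poll_interval
        self._flags = {}
        self._lock = threading.Lock()

    def refresh(self) -> None:
        configs = self._fetch()        # the only network call; never per evaluation
        with self._lock:
            self._flags = configs

    def start_polling(self) -> None:
        # Background thread keeps the cache fresh; a streaming connection
        # would instead push updates here as they happen.
        def loop():
            while True:
                self.refresh()
                time.sleep(self._interval)
        threading.Thread(target=loop, daemon=True).start()

    def get(self, flag_key: str):
        with self._lock:               # reads are local and sub-millisecond
            return self._flags.get(flag_key)
```

If `fetch` fails, a real SDK would keep serving the last known configs, which is the same last-known-values behavior the client-side SDK gets from localStorage/SharedPreferences.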
Targeting and Segmentation
Targeting rules determine which users see which variation. User context: the application passes user attributes to the SDK: {key: "user-123", email: "user@company.com", plan: "enterprise", country: "US", created_at: "2024-01-15", custom: {team_size: 50}}. Rules reference these attributes. Segments: reusable groups of users. A segment "beta-testers" contains specific user keys or attribute rules (plan == "enterprise" AND country == "US"). Multiple flags can target the same segment. Segments are managed centrally and cached by the SDK alongside flag configs. Percentage rollout details: the hash function (MurmurHash3 or similar) takes flag_key + user_key as input. This produces a deterministic bucket assignment per user per flag. Gradually increasing the percentage from 10% to 50% to 100% adds new users to the enabled group without changing the assignment of already-enabled users. This is critical: a user who sees the new feature at 10% rollout must continue seeing it at 50% rollout. The hash-based approach guarantees this (the user bucket does not change when the percentage changes). Multivariate flags: instead of true/false, a flag can have multiple variations: "control" (33%), "variant-A" (33%), "variant-B" (34%). The user bucket determines which variation they receive.
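The no-reshuffling guarantee is easy to verify directly: compute buckets for a population and confirm that everyone enabled at 10% remains enabled at 50%. SHA-256 again stands in for MurmurHash3, and the flag key "new-ui" and the weight mapping for multivariate variations are illustrative.

```python
import hashlib

def bucket(flag_key: str, user_key: str) -> int:
    digest = hashlib.sha256(f"{flag_key}{user_key}".encode()).hexdigest()
    return int(digest, 16) % 100

users = [f"user-{i}" for i in range(1000)]
enabled_at_10 = {u for u in users if bucket("new-ui", u) < 10}
enabled_at_50 = {u for u in users if bucket("new-ui", u) < 50}

# Raising the percentage is monotonic: the 10% cohort is a subset of the 50% cohort.
assert enabled_at_10 <= enabled_at_50

def pick_variation(b: int, weights: list) -> str:
    """Map a bucket (0-99) onto weighted variations for a multivariate flag,
    e.g. [("control", 33), ("variant-A", 33), ("variant-B", 34)]."""
    upper = 0
    for name, weight in weights:
        upper += weight
        if b < upper:
            return name
    return weights[-1][0]
```

The subset assertion holds for any population because the bucket depends only on the hash, never on the rollout percentage.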
Experimentation and A/B Testing
Feature flags enable controlled experiments. Setup: create a flag with variations (control and treatment). Set a 50/50 percentage rollout. Instrument the application to track metrics: click-through rate, conversion, revenue, error rate. Analysis: (1) The flag service logs every flag evaluation: {flag_key, user_key, variation, timestamp}. (2) The analytics pipeline joins flag evaluations with business metrics (user_key is the join key). (3) Statistical analysis compares metrics between control and treatment groups: is the difference statistically significant (p < 0.05)? Is the sample size sufficient? What is the confidence interval? Tools: the flag platform may include built-in experimentation (LaunchDarkly Experimentation, Split) or integrate with external analytics (Amplitude, Mixpanel, Segment). Sequential testing: for faster decisions, use sequential analysis (not just fixed-sample t-tests). Monitor the experiment continuously and stop when significance is reached or the experiment is clearly losing. Guardrail metrics: in addition to the primary metric (conversion), monitor guardrails (error rate, latency, crash rate). If a guardrail metric degrades significantly, automatically disable the treatment (kill switch triggered by metric monitoring).
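The significance check in step (3) can be sketched with a standard two-proportion z-test under the normal approximation; this is a simplification (real platforms layer on sequential analysis and multiple-comparison corrections), and the function name and inputs are illustrative.

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates.

    conv_*: number of converted users; n_*: users in each group."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal tail: erfc(|z|/sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, 100/1000 conversions in control vs. 150/1000 in treatment gives a p-value well below 0.05, while 100 vs. 101 does not, matching the intuition that small lifts need much larger samples.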
Flag Lifecycle and Technical Debt
Feature flags are powerful but accumulate as technical debt if not managed. Lifecycle: (1) Created — flag is defined with targeting rules. Code paths are wrapped in flag checks. (2) Active — the flag is being rolled out or used for experimentation. (3) Fully rolled out — 100% of users see the new variation. The flag check is now unnecessary. (4) Stale — the flag has been at 100% for > 30 days and the old code path is dead. The flag should be removed. (5) Removed — the flag check and old code path are deleted from the codebase. The flag configuration is archived. Stale flag detection: track flag age and rollout percentage. Alert when: a flag has been at 100% for > 30 days (should be cleaned up), a flag has not been evaluated in > 7 days (may be unused), or the number of active flags exceeds a team threshold (flag debt accumulating). Best practices: (1) Set a cleanup date when creating the flag. (2) Include the flag key in the code comment (easy to grep and remove). (3) Automate stale flag detection with a CI check that reports feature checks no longer referenced in the codebase. (4) Each sprint, allocate time for flag cleanup (like addressing other tech debt). (5) Keep flags short-lived — a flag that exists for 6 months is not a feature flag, it is a configuration. Move it to a config service. Organizations with thousands of active flags experience: confusion (which flags are active?), code complexity (nested flag checks), and testing burden (must test all combinations). Disciplined flag management prevents this.
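The stale-flag alerts above can be sketched as a simple report over flag metadata. The field names (`rollout_pct`, `fully_rolled_out_at`, `last_evaluated_at`) are assumed for illustration; a real platform would pull these from its flag store and evaluation logs.

```python
from datetime import datetime, timedelta

def stale_flags(flags: list, now: datetime, max_age_days: int = 30,
                unused_days: int = 7) -> list:
    """Return (flag_key, reason) pairs for flags that are cleanup candidates."""
    report = []
    for f in flags:
        # At 100% for too long: the old code path is dead, remove the check.
        if f["rollout_pct"] == 100 and \
                now - f["fully_rolled_out_at"] > timedelta(days=max_age_days):
            report.append((f["key"], "at 100% for >30 days: remove the flag check"))
        # Not evaluated recently: the flag may be unused entirely.
        elif now - f["last_evaluated_at"] > timedelta(days=unused_days):
            report.append((f["key"], "not evaluated in >7 days: possibly unused"))
    return report
```

Running a report like this in CI, and failing the build above a per-team flag-count threshold, turns the cleanup policy into an enforced check rather than a convention.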