A/B Experiment Platform Low-Level Design: Assignment, Holdout Groups, Metric Collection, and Statistical Analysis

Why a Dedicated Experiment Platform?

A/B testing sounds simple — split users, measure a metric, run a t-test — but at scale it involves subtle correctness requirements: preventing the same user from seeing different variants across sessions, avoiding experiment interactions that corrupt results, correcting for the multiple-comparisons problem when running hundreds of simultaneous experiments, and stopping experiments early without inflating false positive rates. A dedicated platform encodes these requirements so product teams get correct results without becoming statisticians.

Experiment Model

An experiment is defined by:

  • key: a stable string identifier (e.g., checkout_button_color_v2).
  • variants: JSONB array of variant objects, each with a name and weight. Weights must sum to 100.
  • traffic_pct: percent of eligible users to enroll (0-100). Set below 100 to reduce blast radius.
  • start_at / end_at: experiment window.
  • exclusion_group_id: foreign key to an ExclusionGroup; ensures mutual exclusivity within the group.
  • targeting_rules: JSONB with predicates that filter eligible users (country, platform, cohort).
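The fields above can be sketched as an in-memory config object with basic validation. This is a minimal sketch, not the platform's actual model: the class names Variant and ExperimentConfig are hypothetical, and the checks for unique variant names and for end_at following start_at are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Variant:
    name: str
    weight: int  # percent of enrolled users; all weights must sum to 100


@dataclass
class ExperimentConfig:
    key: str
    variants: list[Variant]
    traffic_pct: int = 100
    start_at: Optional[datetime] = None
    end_at: Optional[datetime] = None
    exclusion_group_id: Optional[int] = None
    targeting_rules: dict = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if the definition breaks one of the rules listed above."""
        if not 0 <= self.traffic_pct <= 100:
            raise ValueError("traffic_pct must be between 0 and 100")
        if sum(v.weight for v in self.variants) != 100:
            raise ValueError("variant weights must sum to 100")
        # Assumption: duplicate variant names would make results ambiguous.
        if len({v.name for v in self.variants}) != len(self.variants):
            raise ValueError("variant names must be unique")
        # Assumption: an experiment window must have positive length.
        if self.start_at and self.end_at and self.end_at <= self.start_at:
            raise ValueError("end_at must be after start_at")
```

Validating at write time keeps malformed weights out of the assignment path, where every evaluating service would otherwise have to defend against them.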

Deterministic Assignment

The assignment algorithm must be stateless, fast, and consistent across every service that evaluates experiments:

bucket = int(sha256(experiment_key + ":" + user_id).hexdigest(), 16) % 100

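If bucket < traffic_pct, the user is enrolled. The variant is then chosen by comparing a variant bucket in the 0-99 range against the cumulative variant weights. That bucket must span the full 0-99 range regardless of traffic_pct (for example, by deriving it from different bits of the same hash); otherwise variants late in the list are starved whenever traffic_pct is below 100. Because the hash is deterministic, no assignment record needs to be written before first use: the assignment is computed on the fly and stored asynchronously for analysis.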
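The two properties that matter here, stability and an enrollment rate close to traffic_pct, can be checked empirically. The sketch below assumes the same SHA-256 bucketing as the formula above; the helper names bucket_for and enrolled_share and the synthetic user IDs are hypothetical.

```python
import hashlib


def bucket_for(experiment_key: str, user_id: str) -> int:
    """Map (experiment, user) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def enrolled_share(experiment_key: str, traffic_pct: int, n_users: int) -> float:
    """Fraction of synthetic users user-0..user-(n-1) whose bucket is below traffic_pct."""
    hits = sum(
        bucket_for(experiment_key, f"user-{i}") < traffic_pct for i in range(n_users)
    )
    return hits / n_users
```

Calling bucket_for twice with the same arguments always returns the same bucket, and enrolled_share over a large synthetic population should land close to traffic_pct / 100. A check of this kind is cheap to run in CI for every SDK language, since each implementation must produce identical buckets.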

Mutual Exclusion Groups

Without mutual exclusion, two experiments testing different parts of the checkout flow could simultaneously affect the same user, making it impossible to attribute an outcome change to either experiment. Exclusion groups partition the user space:

  • Each experiment belongs to at most one exclusion group.
  • Within a group, the hash namespace shifts: bucket = hash(group_id + ":" + user_id) % 100. Each experiment in the group is allocated a non-overlapping slice of that 0-99 range.
  • Experiments in different groups use independent hash namespaces and can freely overlap.
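The slice allocation described above might look like the following sketch. It assumes slices are contiguous and handed out in request order; the function names and the (experiment_key, width) request format are hypothetical.

```python
import hashlib
from typing import Optional


def group_bucket(group_id: int, user_id: str) -> int:
    """Bucket in 0..99 derived from the group namespace, not the experiment key."""
    digest = hashlib.sha256(f"{group_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def allocate_slices(requests: list[tuple[str, int]]) -> dict[str, range]:
    """Give each experiment a contiguous, non-overlapping slice of 0..99.

    requests: (experiment_key, width_in_buckets) pairs in allocation order.
    """
    slices: dict[str, range] = {}
    start = 0
    for key, width in requests:
        if start + width > 100:
            raise ValueError(f"group is full; cannot fit {key}")
        slices[key] = range(start, start + width)
        start += width
    return slices


def experiment_for_user(
    group_id: int, user_id: str, slices: dict[str, range]
) -> Optional[str]:
    """Return the one experiment in the group this user can enter, if any."""
    bucket = group_bucket(group_id, user_id)
    for key, buckets in slices.items():
        if bucket in buckets:
            return key
    return None  # bucket falls in unallocated space
```

Because every user maps to exactly one bucket per group, a user can enter at most one experiment in that group. Buckets left unallocated remain available for future experiments.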

Holdout Groups

A holdout is a slice of users permanently excluded from all experiments in a given product area. For example, 5% of users never see any experiment in the checkout funnel. After six months, comparing holdout users (pure control) to the rest of the population reveals the cumulative effect of all shipped experiments, including interaction effects that individual experiment measurements miss. One way to implement a holdout is as a special exclusion group whose members have an effective traffic_pct of 0 for every experiment in that product area.
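A minimal sketch of the holdout check and the later comparison follows. It assumes the holdout is decided by its own hash namespace (the "holdout:" prefix is an invented convention) and that the check runs before any experiment assignment; in_holdout and relative_lift are hypothetical helper names.

```python
import hashlib


def in_holdout(product_area: str, user_id: str, holdout_pct: int) -> bool:
    """True if the user falls in the holdout slice for this product area."""
    digest = hashlib.sha256(f"holdout:{product_area}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < holdout_pct


def relative_lift(holdout_mean: float, exposed_mean: float) -> float:
    """Cumulative effect of shipped changes, relative to the holdout baseline."""
    if holdout_mean == 0:
        raise ValueError("holdout mean is zero; relative lift is undefined")
    return (exposed_mean - holdout_mean) / abs(holdout_mean)
```

A caller would return the default experience for any user where in_holdout is true and only then evaluate experiments for everyone else. The holdout percentage must stay fixed for the whole measurement period, since changing it moves users between the two populations.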

Metric Event Logging

Every user action that is a candidate success metric is logged as a MetricEvent row:

  • user_id, experiment_id, variant, metric_name, value, event_at.
  • Binary metrics (converted: yes/no) use value 0 or 1.
  • Continuous metrics (revenue, session duration) use the raw numeric value.
  • Events are written to Kafka and consumed by a streaming aggregator that pre-computes per-variant sums and counts for fast query performance.
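The pre-aggregation step can be illustrated with an in-memory stand-in for the streaming job. This sketch keeps a running count, sum, and sum of squares per (experiment, metric, variant) key, which is enough to recover the mean and variance later; the Aggregate class and aggregate function are hypothetical, and a production job would hold this state in the stream processor rather than a Python dict.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Aggregate:
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.total_sq += value * value

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def variance(self) -> float:
        """Unbiased sample variance from the running sums (needs count >= 2)."""
        return (self.total_sq - self.total ** 2 / self.count) / (self.count - 1)


def aggregate(events) -> dict:
    """events: iterable of dicts with experiment_id, variant, metric_name, value."""
    table = defaultdict(Aggregate)
    for e in events:
        key = (e["experiment_id"], e["metric_name"], e["variant"])
        table[key].add(e["value"])
    return table
```

The sum-of-squares formula is simple to merge across partitions, but it can lose precision when the mean is large relative to the spread. Welford's online algorithm is a common alternative if that becomes a problem.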

Statistical Analysis

Test Selection

  • Binary metrics (conversion rate, click-through rate): two-proportion z-test.
  • Continuous metrics (revenue, latency): Welch t-test (does not assume equal variance).
  • Ratio metrics (revenue per session): delta method to compute variance of a ratio.
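Two of the tests listed above are short enough to sketch with the standard library. The first function is a pooled two-proportion z-test; the second is the first-order delta-method variance for a ratio of two sample means. Function and parameter names are hypothetical, and the ratio formula assumes numerator and denominator are measured on the same n units.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_c: int, n_c: int, conv_t: int, n_t: int):
    """Return (z, two-sided p-value) using the pooled standard error."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def ratio_variance_delta(
    mean_x: float, mean_y: float, var_x: float, var_y: float, cov_xy: float, n: int
) -> float:
    """Delta-method variance of mean_x / mean_y.

    Var(R) ~ (1/n) * (var_x / my^2 - 2 * mx * cov_xy / my^3 + mx^2 * var_y / my^4)
    """
    return (
        var_x / mean_y ** 2
        - 2 * mean_x * cov_xy / mean_y ** 3
        + mean_x ** 2 * var_y / mean_y ** 4
    ) / n
```

The covariance term is what makes the delta method necessary for ratio metrics: treating revenue and sessions as independent would misstate the variance whenever heavier users also spend more.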

Sequential Testing for Early Stopping

Classical tests require a fixed sample size determined before the experiment starts. Sequential testing allows checking results at planned interim looks while controlling the overall false positive rate. The platform uses alpha spending: a budget of false positive probability is allocated across the planned looks using O'Brien-Fleming boundaries. An experiment may be stopped early only if the test statistic exceeds the boundary for that look. O'Brien-Fleming boundaries are very stringent at early looks and relax toward the nominal alpha at the final look, which compensates for the inflated false positive probability that repeated testing would otherwise cause.
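To make the spending budget concrete, the sketch below uses the Lan-DeMets O'Brien-Fleming-type spending function, a standard continuous approximation to O'Brien-Fleming boundaries. Whether the platform uses this exact form is an assumption. The function names are hypothetical, and information_fraction means the share of the planned sample size collected so far.

```python
from statistics import NormalDist


def obf_alpha_spent(information_fraction: float, alpha: float = 0.05) -> float:
    """Cumulative two-sided alpha spent by this point in the experiment.

    Lan-DeMets O'Brien-Fleming-type form: 2 * (1 - Phi(z_{alpha/2} / sqrt(t))).
    """
    if not 0 < information_fraction <= 1:
        raise ValueError("information_fraction must be in (0, 1]")
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    return 2 * (1 - nd.cdf(z / information_fraction ** 0.5))


def spending_schedule(looks: list[float], alpha: float = 0.05):
    """Return (fraction, cumulative_alpha, incremental_alpha) for each look."""
    spent_before = 0.0
    schedule = []
    for t in looks:
        cumulative = obf_alpha_spent(t, alpha)
        schedule.append((t, cumulative, cumulative - spent_before))
        spent_before = cumulative
    return schedule
```

With four equally spaced looks, almost none of the budget is available at the first look and most of it is held for the last, which is why early stops require very large effects. Converting the spent alpha into exact z boundaries for the second and later looks requires numerical integration over the joint distribution of the interim statistics, which this sketch omits.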

Guardrail Metrics

Guardrail metrics protect core business health. They are evaluated on every analysis cycle (typically hourly) for every active experiment. If any guardrail metric degrades beyond its threshold in the treatment variant relative to control, the platform:

  1. Sets the experiment status to stopped.
  2. Routes 100% of traffic to the control variant.
  3. Creates a PagerDuty incident assigned to the experiment owner.
  4. Records the guardrail breach in ExperimentResult with a negative significance flag.

SQL Schema

CREATE TABLE ExclusionGroup (
    id      BIGSERIAL PRIMARY KEY,
    name    VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE Experiment (
    id                  BIGSERIAL PRIMARY KEY,
    key                 VARCHAR(255) NOT NULL UNIQUE,
    status              VARCHAR(50)  NOT NULL DEFAULT 'draft',  -- draft|running|stopped|concluded
    variants            JSONB        NOT NULL,  -- [{"name": "control", "weight": 50}, {"name": "treatment", "weight": 50}]
    traffic_pct         INT          NOT NULL DEFAULT 100 CHECK (traffic_pct BETWEEN 0 AND 100),
    targeting_rules     JSONB,
    start_at            TIMESTAMPTZ,
    end_at              TIMESTAMPTZ,
    exclusion_group_id  BIGINT REFERENCES ExclusionGroup(id),
    created_at          TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

CREATE TABLE ExperimentAssignment (
    user_id         VARCHAR(255) NOT NULL,
    experiment_id   BIGINT       NOT NULL REFERENCES Experiment(id),
    variant         VARCHAR(100) NOT NULL,
    assigned_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, experiment_id)
);

CREATE TABLE MetricEvent (
    id              BIGSERIAL PRIMARY KEY,
    user_id         VARCHAR(255) NOT NULL,
    experiment_id   BIGINT       NOT NULL REFERENCES Experiment(id),
    variant         VARCHAR(100) NOT NULL,
    metric_name     VARCHAR(255) NOT NULL,
    value           FLOAT        NOT NULL,
    event_at        TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);
CREATE INDEX ON MetricEvent (experiment_id, metric_name, variant, event_at DESC);

CREATE TABLE ExperimentResult (
    experiment_id   BIGINT       NOT NULL REFERENCES Experiment(id),
    metric_name     VARCHAR(255) NOT NULL,
    variant         VARCHAR(100) NOT NULL,
    mean            FLOAT        NOT NULL,
    ci_lower        FLOAT        NOT NULL,
    ci_upper        FLOAT        NOT NULL,
    p_value         FLOAT        NOT NULL,
    significant     BOOLEAN      NOT NULL DEFAULT FALSE,
    computed_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (experiment_id, metric_name, variant, computed_at)
);

Python Interface

import hashlib, json, math
from datetime import datetime

def assign_variant(experiment_key: str, user_id: str, experiment: dict) -> str | None:
    """
    Returns assigned variant name or None if user is not enrolled.
    experiment: dict with keys traffic_pct (int) and variants (list of {name, weight}).
    """
    raw_hash = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    hash_int = int(raw_hash, 16)
    bucket = hash_int % 100
    if bucket >= experiment["traffic_pct"]:
        return None  # not enrolled
    # Use the next base-100 digit of the hash so the variant bucket spans 0-99
    # at any traffic_pct and is independent of the enrollment bucket.
    variant_bucket = (hash_int // 100) % 100
    cumulative = 0
    for variant in experiment["variants"]:
        cumulative += variant["weight"]
        if variant_bucket < cumulative:
            return variant["name"]
    return None  # only reached if weights sum to less than 100

def log_metric_event(user_id: str, experiment_key: str, metric_name: str, value: float) -> None:
    # INSERT INTO MetricEvent (user_id, experiment_id, variant, metric_name, value)
    # experiment_id looked up by key; variant computed by assign_variant
    pass

def compute_results(experiment_id: int, metric_name: str) -> dict:
    """
    Compute per-variant stats and run Welch t-test between treatment and control.
    Returns dict with mean, CI, p_value, significant for each variant.
    """
    # Fetch per-variant (count, sum, sum_of_squares) from MetricEvent
    # Welch t-test: t = (mean_t - mean_c) / sqrt(var_t/n_t + var_c/n_c)
    # df via Welch-Satterthwaite equation
    # p_value from scipy.stats.t.sf(abs(t), df) * 2
    return {}  # stub

def _welch_t_test(n1, mean1, var1, n2, mean2, var2, alpha=0.05) -> tuple[float, bool]:
    """Returns (p_value, significant). Requires n1 >= 2 and n2 >= 2."""
    se = math.sqrt(var1 / n1 + var2 / n2)
    if se == 0:
        return 1.0, False
    t_stat = (mean1 - mean2) / se
    # Welch-Satterthwaite degrees of freedom
    df_num = (var1 / n1 + var2 / n2) ** 2
    df_den = (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
    df = df_num / df_den if df_den > 0 else 1
    # Two-sided p-value from the normal approximation, which is adequate for
    # large df; for small samples use scipy.stats.t.sf(abs(t_stat), df) * 2.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return p_value, p_value < alpha

def check_guardrails(experiment_id: int, guardrail_metrics: list[dict]) -> list[str]:
    """
    Returns list of guardrail metric names that have breached their thresholds.
    guardrail_metrics: list of {metric_name, threshold_relative, direction} dicts.
    direction: "increase_bad" or "decrease_bad".
    """
    breached = []
    for gm in guardrail_metrics:
        result = compute_results(experiment_id, gm["metric_name"])
        if not result:
            continue
        control_mean = result.get("control", {}).get("mean", 0)
        treatment_mean = result.get("treatment", {}).get("mean", 0)
        if control_mean == 0:
            continue
        relative_change = (treatment_mean - control_mean) / abs(control_mean)
        if gm["direction"] == "increase_bad" and relative_change > gm["threshold_relative"]:
            breached.append(gm["metric_name"])
        elif gm["direction"] == "decrease_bad" and relative_change < -gm["threshold_relative"]:
            breached.append(gm["metric_name"])
    return breached

Operational Considerations

  • MetricEvent volume: High-traffic products generate billions of metric events per day. Pre-aggregate counts and sums in a streaming job (Flink or Spark Structured Streaming) so the analysis query reads aggregates rather than raw rows.
  • Assignment logging: Write assignments asynchronously via Kafka to avoid adding latency to the assignment hot path. Accept that some assignments may be logged with a small delay.
  • Experiment SDK: Distribute a thin client library (Go, Python, JavaScript) that implements the deterministic hash locally. This eliminates a network call for every experiment evaluation — critical for high-QPS services.
  • Novelty effect: Users behave differently when they first encounter a change. Run experiments long enough (typically at least one full week) to wash out novelty effects before concluding.
  • Network effects: For social or marketplace products, standard A/B assignment violates the stable unit treatment value assumption (SUTVA) because treated and control users interact. Use cluster-based or switchback designs in those contexts.
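For the cluster-based design mentioned in the last bullet, one possible sketch is to hash a cluster identifier instead of the user ID, so that every user in the same cluster sees the same variant. How clusters are defined (city, social community, marketplace region) is product-specific; the assign_by_cluster name and the cluster IDs below are hypothetical.

```python
import hashlib


def assign_by_cluster(experiment_key: str, cluster_id: str, variants: list[dict]) -> str:
    """Assign a whole cluster to one variant by hashing the cluster ID.

    variants: list of {"name": str, "weight": int} with weights summing to 100.
    """
    digest = hashlib.sha256(f"{experiment_key}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    cumulative = 0
    for variant in variants:
        cumulative += variant["weight"]
        if bucket < cumulative:
            return variant["name"]
    return variants[-1]["name"]  # fallback if weights sum to less than 100
```

Each user is first mapped to a cluster, and the cluster ID is what gets hashed. The analysis must then treat the cluster, not the user, as the unit of randomization; otherwise the standard errors will be understated.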
