A/B Experiment Platform Low-Level Design: Assignment, Holdout Groups, Metric Collection, and Statistical Analysis

{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How does deterministic user assignment work in an A/B platform?",
"acceptedAnswer": {
"@type": "Answer",
"text": "The assignment service computes hash(experiment_key + user_id) mod 100 to obtain a bucket in the range 0-99. If the bucket falls within the experiment's traffic percentage, the user is enrolled. The variant is determined by mapping the bucket to variant ranges. Because the hash is deterministic, the same user always receives the same variant for the same experiment — even across sessions, devices, and services — without storing the assignment until the first exposure."
}
},
{
"@type": "Question",
"name": "What is mutual exclusion in experiment platforms?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Mutual exclusion ensures a user cannot be simultaneously enrolled in two experiments that would interact and contaminate results. Experiments are grouped into exclusion groups. Within a group, the user's assignment hash is used to allocate non-overlapping traffic slices, so each user appears in at most one experiment per group. Experiments in different groups can overlap freely."
}
},
{
"@type": "Question",
"name": "How does sequential testing prevent false positives from peeking?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Classical hypothesis tests assume you check the result once after a fixed sample size. Checking results repeatedly (peeking) inflates the false positive rate dramatically — in a two-sided test at alpha=0.05, peeking daily for a week can raise the actual false positive rate above 20%. Sequential testing methods (e.g., always-valid p-values, alpha spending with O'Brien-Fleming boundaries, or mSPRT) adjust the significance threshold for each interim look, maintaining the desired overall false positive rate while allowing early stopping when evidence is strong."
}
},
{
"@type": "Question",
"name": "How do guardrail metrics auto-stop an experiment?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Guardrail metrics are core health indicators — latency p99, error rate, revenue per user — that must not be harmed by any experiment. The analysis pipeline computes these metrics for each active experiment on every evaluation cycle. If a guardrail metric degrades beyond a pre-set threshold (e.g., error rate increases by more than 10% relative), the platform automatically sets the experiment status to stopped and routes all traffic back to the control variant, notifying the owning team immediately."
}
}
]
}

Why a Dedicated Experiment Platform?

A/B testing sounds simple — split users, measure a metric, run a t-test — but at scale it involves subtle correctness requirements: preventing the same user from seeing different variants across sessions, avoiding experiment interactions that corrupt results, correcting for the multiple-comparisons problem when running hundreds of simultaneous experiments, and stopping experiments early without inflating false positive rates. A dedicated platform encodes these requirements so product teams get correct results without becoming statisticians.

Experiment Model

An experiment is defined by:

  • key: a stable string identifier (e.g., checkout_button_color_v2).
  • variants: JSONB array of variant objects, each with a name and weight. Weights must sum to 100.
  • traffic_pct: percent of eligible users to enroll (0-100). Set below 100 to reduce blast radius.
  • start_at / end_at: experiment window.
  • exclusion_group_id: foreign key to an ExclusionGroup; ensures mutual exclusivity within the group.
  • targeting_rules: JSONB with predicates that filter eligible users (country, platform, cohort).

Deterministic Assignment

The assignment algorithm must be stateless, fast, and consistent across every service that evaluates experiments:

bucket = int(sha256(experiment_key + ":" + user_id).hexdigest(), 16) % 100

If bucket < traffic_pct, the user is enrolled. The variant is chosen by mapping the bucket (within the enrolled range) against cumulative variant weights. Because the hash is deterministic, no assignment record needs to be written before first use — the assignment is computed on the fly and stored asynchronously for analysis.
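
The hashing step above can be sketched and sanity-checked in a few lines. The helper name `bucket_for` is illustrative, not part of the platform's API; the check confirms the two properties the design relies on: the same inputs always yield the same bucket, and buckets spread roughly uniformly over 0-99.

```python
import hashlib

def bucket_for(experiment_key: str, user_id: str) -> int:
    """Map (experiment_key, user_id) to a stable bucket in 0-99."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

# Determinism: repeated calls always agree, so no assignment record is
# needed before first use.
assert bucket_for("checkout_button_color_v2", "user-42") == \
       bucket_for("checkout_button_color_v2", "user-42")

# Rough uniformity: 10,000 users spread evenly across the 100 buckets.
buckets = [bucket_for("checkout_button_color_v2", f"user-{i}") for i in range(10_000)]
per_bucket = [buckets.count(b) for b in range(100)]
print(min(per_bucket), max(per_bucket))  # each bucket holds roughly 100 users
```

Because the experiment key is part of the hash input, the same user lands in unrelated buckets for different experiments, so enrollment decisions do not correlate across experiments.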

Mutual Exclusion Groups

Without mutual exclusion, two experiments testing different parts of the checkout flow could simultaneously affect the same user, making it impossible to attribute an outcome change to either experiment. Exclusion groups partition the user space:

  • Each experiment belongs to at most one exclusion group.
  • Within a group, the hash namespace shifts: bucket = hash(group_id + ":" + user_id) % 100. Each experiment in the group is allocated a non-overlapping slice of that 0-99 range.
  • Experiments in different groups use independent hash namespaces and can freely overlap.

Holdout Groups

A holdout is a slice of users permanently excluded from all experiments in a given product area. For example, 5% of users never see any experiment in the checkout funnel. After six months, comparing holdout users (pure control) to the rest of the population reveals the cumulative effect of all shipped experiments, including interaction effects that individual experiment measurements miss. Holdouts are implemented as a reserved slice of an exclusion group's bucket range that no experiment in the group is permitted to claim.
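
One way to sketch the holdout check, assuming the top 5 buckets of the group's range are reserved (the constant and helper name are illustrative, not the platform's actual API):

```python
import hashlib

HOLDOUT_PCT = 5  # reserve buckets 95-99 of each group's 0-99 range

def is_in_holdout(group_id: int, user_id: str) -> bool:
    """True if the user falls in the group's permanent holdout slice."""
    digest = hashlib.sha256(f"{group_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 >= 100 - HOLDOUT_PCT

# Called before any experiment evaluation in the product area; holdout
# users always receive the control experience.
```

Since the group hash is deterministic, the same 5% of users stay in the holdout for the lifetime of the group, which is what makes the long-horizon comparison valid.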

Metric Event Logging

Every user action that is a candidate success metric is logged as a MetricEvent row:

  • user_id, experiment_id, variant, metric_name, value, event_at.
  • Binary metrics (converted: yes/no) use value 0 or 1.
  • Continuous metrics (revenue, session duration) use the raw numeric value.
  • Events are written to Kafka and consumed by a streaming aggregator that pre-computes per-variant sums and counts for fast query performance.
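
A minimal sketch of the state the streaming aggregator maintains (the class and event-dict shape are illustrative): per (experiment_id, metric_name, variant) it keeps count, sum, and sum of squares, which is all the downstream t-test needs, since mean and variance are derived from these three numbers.

```python
from collections import defaultdict

class VariantAggregator:
    """Toy stand-in for the Flink/Spark job's keyed state."""

    def __init__(self) -> None:
        self.state = defaultdict(lambda: [0, 0.0, 0.0])  # [n, sum, sum_sq]

    def consume(self, event: dict) -> None:
        """Fold one MetricEvent into the running aggregates."""
        key = (event["experiment_id"], event["metric_name"], event["variant"])
        acc = self.state[key]
        acc[0] += 1
        acc[1] += event["value"]
        acc[2] += event["value"] ** 2

    def stats(self, experiment_id: int, metric_name: str, variant: str):
        """Return (n, mean, sample variance) for one variant."""
        n, s, ss = self.state[(experiment_id, metric_name, variant)]
        mean = s / n
        var = (ss - n * mean ** 2) / (n - 1) if n > 1 else 0.0
        return n, mean, var
```

The analysis query then reads three numbers per variant instead of billions of raw rows.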

Statistical Analysis

Test Selection

  • Binary metrics (conversion rate, click-through rate): two-proportion z-test.
  • Continuous metrics (revenue, latency): Welch t-test (does not assume equal variance).
  • Ratio metrics (revenue per session): delta method to compute variance of a ratio.
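
The binary-metric path can be sketched without scipy using the normal tail via `math.erfc` (the function name and pooled-SE formulation are a standard textbook construction, not the platform's actual code):

```python
import math

def two_proportion_z_test(conv_c: int, n_c: int,
                          conv_t: int, n_t: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in conversion rates.

    Returns (z_statistic, p_value), using the pooled-proportion
    standard error under the null of equal rates.
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

For example, 500 conversions out of 5,000 control users versus 560 out of 5,000 treatment users gives a z-statistic near 1.95 and a p-value just above 0.05: suggestive but not significant at alpha = 0.05.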

Sequential Testing for Early Stopping

Classical tests require a fixed sample size determined before the experiment starts. Sequential testing allows checking results continuously while controlling the false positive rate. The platform uses alpha spending: a budget of false positive probability is allocated across planned interim looks using O'Brien-Fleming boundaries. An experiment may be stopped early only if the test statistic exceeds the boundary for that look — a far more stringent threshold than the final alpha, compensating for the increased probability of false positives from repeated testing.
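
The boundary shape can be sketched as follows. The constant 2.04 is the O'Brien-Fleming critical value for five equally spaced looks at overall two-sided alpha = 0.05; in practice the constant comes from a group-sequential design library for the chosen number of looks, not a hard-coded value.

```python
import math

def obf_boundaries(n_looks: int, c: float = 2.04) -> list[float]:
    """O'Brien-Fleming z-boundaries for equally spaced interim looks.

    Boundary at look k is c / sqrt(k / n_looks): very stringent early,
    relaxing toward roughly the classical threshold at the final look.
    """
    return [c / math.sqrt(k / n_looks) for k in range(1, n_looks + 1)]

def can_stop_early(z_stat: float, look: int, boundaries: list[float]) -> bool:
    """Stop only if the statistic clears this look's boundary."""
    return abs(z_stat) >= boundaries[look - 1]
```

For five looks the boundaries are approximately [4.56, 3.23, 2.63, 2.28, 2.04]: an experiment stopped at the first weekly look needs overwhelming evidence, which is what keeps the overall false positive rate at the nominal alpha despite repeated checking.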

Guardrail Metrics

Guardrail metrics protect core business health. They are evaluated on every analysis cycle (typically hourly) for every active experiment. If any guardrail metric degrades beyond its threshold in the treatment variant relative to control, the platform:

  1. Sets the experiment status to stopped.
  2. Routes 100% of traffic to the control variant.
  3. Creates a PagerDuty incident assigned to the experiment owner.
  4. Records the guardrail breach in ExperimentResult with a negative significance flag.
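
The four-step auto-stop sequence above can be sketched as one orchestration function. The helpers `route_all_traffic_to_control`, `page_owner`, and `record_breach` are hypothetical stand-ins for the real traffic-routing, PagerDuty, and persistence integrations:

```python
# Hypothetical integration points, stubbed for this sketch:
def route_all_traffic_to_control(experiment: dict) -> None: ...
def page_owner(experiment: dict, breached: list[str]) -> None: ...
def record_breach(experiment_id: int, breached: list[str]) -> None: ...

def enforce_guardrails(experiment: dict, breached: list[str]) -> bool:
    """Apply the auto-stop sequence; returns True if the experiment was stopped.

    `breached` is the list of guardrail metric names that exceeded their
    thresholds this evaluation cycle (e.g., from check_guardrails).
    """
    if not breached:
        return False
    experiment["status"] = "stopped"           # 1. stop the experiment
    route_all_traffic_to_control(experiment)   # 2. revert all traffic to control
    page_owner(experiment, breached)           # 3. open a PagerDuty incident
    record_breach(experiment["id"], breached)  # 4. persist in ExperimentResult
    return True
```

Keeping the sequence in one function makes the ordering explicit: traffic is reverted before humans are paged, so the damage stops even if the notification path is slow.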

SQL Schema

CREATE TABLE ExclusionGroup (
    id      BIGSERIAL PRIMARY KEY,
    name    VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE Experiment (
    id                  BIGSERIAL PRIMARY KEY,
    key                 VARCHAR(255) NOT NULL UNIQUE,
    status              VARCHAR(50)  NOT NULL DEFAULT 'draft',  -- draft|running|stopped|concluded
    variants            JSONB        NOT NULL,  -- [{name: "control", weight: 50}, {name: "treatment", weight: 50}]
    traffic_pct         INT          NOT NULL DEFAULT 100,
    targeting_rules     JSONB,
    start_at            TIMESTAMPTZ,
    end_at              TIMESTAMPTZ,
    exclusion_group_id  BIGINT REFERENCES ExclusionGroup(id),
    created_at          TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

CREATE TABLE ExperimentAssignment (
    user_id         VARCHAR(255) NOT NULL,
    experiment_id   BIGINT       NOT NULL REFERENCES Experiment(id),
    variant         VARCHAR(100) NOT NULL,
    assigned_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, experiment_id)
);

CREATE TABLE MetricEvent (
    id              BIGSERIAL PRIMARY KEY,
    user_id         VARCHAR(255) NOT NULL,
    experiment_id   BIGINT       NOT NULL REFERENCES Experiment(id),
    variant         VARCHAR(100) NOT NULL,
    metric_name     VARCHAR(255) NOT NULL,
    value           FLOAT        NOT NULL,
    event_at        TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);
CREATE INDEX ON MetricEvent (experiment_id, metric_name, variant, event_at DESC);

CREATE TABLE ExperimentResult (
    experiment_id   BIGINT       NOT NULL REFERENCES Experiment(id),
    metric_name     VARCHAR(255) NOT NULL,
    variant         VARCHAR(100) NOT NULL,
    mean            FLOAT        NOT NULL,
    ci_lower        FLOAT        NOT NULL,
    ci_upper        FLOAT        NOT NULL,
    p_value         FLOAT        NOT NULL,
    significant     BOOLEAN      NOT NULL DEFAULT FALSE,
    computed_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (experiment_id, metric_name, variant, computed_at)
);

Python Interface

import hashlib
import math

def assign_variant(experiment_key: str, user_id: str, experiment: dict) -> str | None:
    """
    Returns assigned variant name or None if user is not enrolled.
    experiment: dict with keys traffic_pct (int) and variants (list of {name, weight}).
    """
    raw_hash = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(raw_hash, 16) % 100
    if bucket >= experiment["traffic_pct"]:
        return None  # not enrolled
    # Rescale the enrolled bucket from [0, traffic_pct) to [0, 100) so it can
    # be compared against cumulative variant weights, which sum to 100.
    enrolled_bucket = int(raw_hash, 16) % experiment["traffic_pct"]
    scaled_bucket = enrolled_bucket * 100 // experiment["traffic_pct"]
    cumulative = 0
    for variant in experiment["variants"]:
        cumulative += variant["weight"]
        if scaled_bucket < cumulative:
            return variant["name"]
    return None  # unreachable when weights sum to 100


def log_metric_event(experiment_key: str, user_id: str, metric_name: str, value: float) -> None:
    # INSERT INTO MetricEvent (user_id, experiment_id, variant, metric_name, value)
    # experiment_id looked up by key; variant computed by assign_variant
    pass

def compute_results(experiment_id: int, metric_name: str) -> dict:
    """
    Compute per-variant stats and run Welch t-test between treatment and control.
    Returns dict with mean, CI, p_value, significant for each variant.
    """
    # Fetch per-variant (count, sum, sum_of_squares) from MetricEvent
    # Welch t-test: t = (mean_t - mean_c) / sqrt(var_t/n_t + var_c/n_c)
    # df via Welch-Satterthwaite equation
    # p_value from scipy.stats.t.sf(abs(t), df) * 2
    return {}  # stub

def _welch_t_test(n1, mean1, var1, n2, mean2, var2, alpha=0.05) -> tuple[float, bool]:
    """Returns (p_value, significant)."""
    se = math.sqrt(var1 / n1 + var2 / n2)
    if se == 0:
        return 1.0, False
    t_stat = (mean1 - mean2) / se
    # Welch-Satterthwaite degrees of freedom
    df_num = (var1 / n1 + var2 / n2) ** 2
    df_den = (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
    df = df_num / df_den if df_den > 0 else 1
    # Approximate the p-value with the normal distribution (adequate for large df)
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return p_value, p_value < alpha


def check_guardrails(experiment_id: int, guardrail_metrics: list[dict]) -> list[str]:
    """
    Returns list of guardrail metric names that have breached their thresholds.
    guardrail_metrics: list of {metric_name, threshold_relative, direction} dicts.
    direction: "increase_bad" or "decrease_bad".
    """
    breached = []
    for gm in guardrail_metrics:
        result = compute_results(experiment_id, gm["metric_name"])
        if not result:
            continue
        control_mean = result.get("control", {}).get("mean", 0)
        treatment_mean = result.get("treatment", {}).get("mean", 0)
        if control_mean == 0:
            continue
        relative_change = (treatment_mean - control_mean) / abs(control_mean)
        if gm["direction"] == "increase_bad" and relative_change > gm["threshold_relative"]:
            breached.append(gm["metric_name"])
        elif gm["direction"] == "decrease_bad" and relative_change < -gm["threshold_relative"]:
            breached.append(gm["metric_name"])
    return breached

Operational Considerations

  • MetricEvent volume: High-traffic products generate billions of metric events per day. Pre-aggregate counts and sums in a streaming job (Flink or Spark Structured Streaming) so the analysis query reads aggregates rather than raw rows.
  • Assignment logging: Write assignments asynchronously via Kafka to avoid adding latency to the assignment hot path. Accept that some assignments may be logged with a small delay.
  • Experiment SDK: Distribute a thin client library (Go, Python, JavaScript) that implements the deterministic hash locally. This eliminates a network call for every experiment evaluation — critical for high-QPS services.
  • Novelty effect: Users behave differently when they first encounter a change. Run experiments long enough (typically at least one full week) to wash out novelty effects before concluding.
  • Network effects: For social or marketplace products, standard A/B assignment violates the stable unit treatment value assumption (SUTVA) because treated and control users interact. Use cluster-based or switchback designs in those contexts.

