System Design: A/B Testing Platform
An A/B testing platform lets product teams run controlled experiments – assigning users to variants, collecting metrics, and determining statistical significance to make data-driven decisions. This is a common system design interview topic at Meta, Databricks, and Shopify.
1. Core Entities
Experiment
experiment_id UUID PRIMARY KEY
name VARCHAR NOT NULL
status ENUM('DRAFT','RUNNING','PAUSED','CONCLUDED')
targeting_rules JSONB -- e.g. {"country": "US", "platform": "iOS"}
start_date TIMESTAMPTZ
end_date TIMESTAMPTZ
Variant
variant_id UUID PRIMARY KEY
experiment_id UUID REFERENCES experiments
name VARCHAR -- e.g. "control", "treatment_A"
weight INT -- 0-100, sum across variants = 100
config JSONB -- feature flags / parameters for this variant
Assignment
user_id BIGINT
experiment_id UUID
variant_id UUID
assigned_at TIMESTAMPTZ
assignment_hash INT -- stored for auditability
PRIMARY KEY (user_id, experiment_id)
Metric
metric_id UUID PRIMARY KEY
experiment_id UUID REFERENCES experiments
name VARCHAR -- e.g. "checkout_conversion"
type ENUM('CONVERSION','REVENUE','LATENCY')
aggregation ENUM('MEAN','PROPORTION')
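The targeting_rules column above is a flat attribute map. A minimal evaluator could look like the sketch below; the function and parameter names are illustrative, not part of the schema:

```python
def matches_targeting(rules: dict, user_context: dict) -> bool:
    """Return True when every targeting rule matches the user's attributes.

    Rules are a flat equality map, e.g. {"country": "US", "platform": "iOS"};
    a missing attribute fails the match.
    """
    return all(user_context.get(key) == value for key, value in rules.items())
```

Real platforms typically extend this with operators (in, gte, regex), but flat equality covers the common case and keeps evaluation O(number of rules).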
2. Deterministic User Assignment
Assignment must be stable (same user always gets the same variant) and fast (no DB lookup on every request). Use a hash function:
import hashlib

def assign_variant(user_id: int, experiment_id: str, variants: list) -> str:
    # Hash (user_id, experiment_id) into a bucket in [0, 100).
    key = f"{user_id}:{experiment_id}"
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
    # Walk cumulative weight ranges in a stable (sorted) order.
    ordered = sorted(variants, key=lambda v: v['name'])
    cumulative = 0
    for variant in ordered:
        cumulative += variant['weight']
        if h < cumulative:
            return variant['variant_id']
    return ordered[-1]['variant_id']  # guard against weights summing below 100
Properties of this design:
- Deterministic – no storage needed at assignment time; can recompute on any node.
- Uniform distribution – MD5 output is uniformly distributed mod 100.
- Stable – changing experiment parameters (name, targeting) does not affect the hash; only changing weights reassigns users, and only those whose buckets fall inside the shifted boundaries.
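These properties are easy to sanity-check directly. The sketch below restates the assignment function so it is self-contained, then verifies determinism and a roughly uniform 50/50 split over 100,000 synthetic users:

```python
import hashlib
from collections import Counter

def assign_variant(user_id, experiment_id, variants):
    key = f"{user_id}:{experiment_id}"
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
    ordered = sorted(variants, key=lambda v: v['name'])
    cumulative = 0
    for variant in ordered:
        cumulative += variant['weight']
        if h < cumulative:
            return variant['variant_id']
    return ordered[-1]['variant_id']

variants = [
    {'variant_id': 'c', 'name': 'control', 'weight': 50},
    {'variant_id': 't', 'name': 'treatment', 'weight': 50},
]
# Deterministic: recomputing on any node gives the same answer.
assert assign_variant(42, 'exp-1', variants) == assign_variant(42, 'exp-1', variants)
# Roughly uniform: each variant gets close to its weight share.
counts = Counter(assign_variant(uid, 'exp-1', variants) for uid in range(100_000))
```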
3. Metric Collection Pipeline
Client Events (page_view, purchase, click, latency_sample)
|
v
Kafka Topics (partitioned by user_id for ordering)
|
+---> Flink Real-Time Job
| - Join event stream with assignment store (Kafka topic or Redis)
| - Compute per-variant aggregates (count, sum, sum_of_squares)
| - Emit to metrics_warehouse every 1 minute
|
+---> Batch Job (hourly/daily)
- Recompute from raw event log for accuracy
- Write to metrics_warehouse as authoritative values
Store running aggregates: (count, sum, sum_of_squares) per variant per metric per day. These are sufficient to compute mean, variance, and p-values without storing individual events.
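The "sufficient" claim is just algebra: mean and sample variance fall directly out of (count, sum, sum_of_squares). A minimal sketch (requires count >= 2 for the variance):

```python
def mean_and_variance(count: int, total: float, sum_sq: float) -> tuple:
    """Recover mean and unbiased sample variance from running aggregates.

    Uses Var = (sum_of_squares - n * mean^2) / (n - 1); count must be >= 2.
    """
    mean = total / count
    variance = (sum_sq - count * mean * mean) / (count - 1)
    return mean, variance
```

Because these aggregates are additive, per-minute partial results from the streaming job can be summed into daily values without ever touching raw events.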
4. Statistical Significance Testing
Two-Sample t-test (for MEAN metrics)
For metrics like revenue or latency where you compare means (this is Welch's unequal-variance form):
t = (mean_A - mean_B) / sqrt(var_A/n_A + var_B/n_B)
Under H0 (no difference), t approximately follows a t-distribution, with degrees of freedom given by the Welch–Satterthwaite approximation. Reject H0 if |t| > t_critical for the chosen alpha (typically 0.05).
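The t-statistic can be computed straight from the stored aggregates. A minimal sketch, with the critical value hard-coded for large samples (where the t-distribution approaches N(0,1), so 1.96 corresponds to alpha=0.05):

```python
import math

def welch_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Welch's two-sample t-statistic from per-variant aggregates."""
    se = math.sqrt(var_a / n_a + var_b / n_b)  # standard error of the difference
    return (mean_a - mean_b) / se

def is_significant(t_stat, t_critical=1.96):
    # Large-sample approximation; small experiments should use the exact
    # t-distribution with Welch-Satterthwaite degrees of freedom.
    return abs(t_stat) > t_critical
```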
Chi-Squared Test (for PROPORTION metrics)
For conversion rates (binary outcomes):
Build the full 2x2 table (converted vs. not converted, per variant) and compare observed to expected counts in all four cells:
Expected_conv_A = (conv_A + conv_B) * n_A / (n_A + n_B)
Expected_conv_B = (conv_A + conv_B) * n_B / (n_A + n_B)
Expected_non_A  = n_A - Expected_conv_A
Expected_non_B  = n_B - Expected_conv_B
chi2 = (conv_A - Expected_conv_A)^2 / Expected_conv_A
     + (conv_B - Expected_conv_B)^2 / Expected_conv_B
     + ((n_A - conv_A) - Expected_non_A)^2 / Expected_non_A
     + ((n_B - conv_B) - Expected_non_B)^2 / Expected_non_B
Reject H0 if chi2 > 3.841 (chi-squared critical value at alpha=0.05, df=1).
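The four-cell sum above translates directly into code; a sketch computing the statistic from the per-variant counts:

```python
def chi_squared_2x2(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Chi-squared statistic for a 2x2 conversion table (df=1)."""
    total_conv = conv_a + conv_b
    total = n_a + n_b
    chi2 = 0.0
    for conv, n in ((conv_a, n_a), (conv_b, n_b)):
        exp_conv = total_conv * n / total   # expected conversions under H0
        exp_non = n - exp_conv              # expected non-conversions under H0
        chi2 += (conv - exp_conv) ** 2 / exp_conv
        chi2 += ((n - conv) - exp_non) ** 2 / exp_non
    return chi2

CRITICAL_05_DF1 = 3.841  # chi-squared critical value at alpha=0.05, df=1
```

Production systems typically delegate to a stats library (e.g. SciPy's contingency-table tests), but the aggregate-only computation is what makes it cheap to run on every dashboard refresh.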
Sequential Testing and Alpha Spending
Peeking at results repeatedly inflates the false positive rate (multiple comparisons problem). Use an alpha spending function (e.g. O’Brien-Fleming) that allows early stopping while controlling the overall false positive rate at alpha=0.05. Each interim look consumes part of the alpha budget.
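One way to make this concrete is the classic O'Brien-Fleming boundary shape for equally spaced looks: early peeks demand much stronger evidence, and the final boundary is close to the fixed-sample critical value. This is the textbook approximation, not the exact Lan-DeMets alpha-spending computation a production platform would use:

```python
import math

def obrien_fleming_boundaries(num_looks: int, z_final: float = 1.96) -> list:
    """Approximate O'Brien-Fleming z-boundaries for equally spaced looks.

    Look k of K rejects only if |z| > z_final * sqrt(K / k), so early
    looks spend almost none of the alpha budget.
    """
    return [z_final * math.sqrt(num_looks / k) for k in range(1, num_looks + 1)]
```

With four looks, the boundaries are roughly 3.92, 2.77, 2.26, 1.96 — a weekly-peeked experiment only concludes early on a very large effect.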
5. Experiment Isolation
Orthogonal Experiments
A user can participate in multiple experiments simultaneously if the experiments affect independent features. Orthogonality is declared by product owners; the platform enforces no technical restriction between orthogonal experiments.
Mutex Groups
When two experiments might interact (e.g. both change the checkout flow), assign them to the same mutex group. Within a mutex group, a user is assigned to at most one experiment, partitioning the user population:
mutex_group_id UUID
experiment_id UUID
mutex_slot_range INT4RANGE -- half-open, e.g. [0, 50) for first exp, [50, 100) for second
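Slot routing can reuse the same hashing idea, keyed on the mutex group rather than any individual experiment, so a user lands in at most one member. A sketch (the range layout and names are assumptions, not from the schema above):

```python
import hashlib

def route_mutex_group(user_id: int, mutex_group_id: str, slot_ranges: dict):
    """Return the experiment_id owning the user's slot, or None.

    slot_ranges maps experiment_id -> (lo, hi) half-open ranges over
    [0, 100) that must not overlap.  Hashing on mutex_group_id (not on
    experiment_id) is what partitions the population: the same slot is
    used for every experiment in the group.
    """
    key = f"{user_id}:{mutex_group_id}"
    slot = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
    for experiment_id, (lo, hi) in slot_ranges.items():
        if lo <= slot < hi:
            return experiment_id
    return None  # user is not in any experiment of this group
```

Once routed to an experiment, the user is then assigned to a variant by the usual per-experiment hash.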
6. Rollout Mechanics
Gradual Rollout
Increase variant.weight over time: 0% -> 10% -> 50% -> 100%. Because assignment is hash-based, users who were in the treatment group at 10% are still in it at 50% (the hash threshold simply grows).
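The "threshold simply grows" claim can be checked directly: with control sorted first, the treatment bucket only ever expands, so every user in treatment at 10% is still in treatment at 50%. A self-contained sketch of that check:

```python
import hashlib

def bucket(user_id: int, experiment_id: str) -> int:
    """Hash (user, experiment) into a stable bucket in [0, 100)."""
    key = f"{user_id}:{experiment_id}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

def in_treatment(user_id: int, experiment_id: str, treatment_weight: int) -> bool:
    # Control occupies [0, 100 - weight); treatment owns [100 - weight, 100).
    return bucket(user_id, experiment_id) >= 100 - treatment_weight

at_10 = {u for u in range(10_000) if in_treatment(u, 'exp-1', 10)}
at_50 = {u for u in range(10_000) if in_treatment(u, 'exp-1', 50)}
assert at_10 <= at_50  # treatment membership is monotone in the weight
```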
Feature Flags as Single-Variant Experiments
A feature flag is a degenerate experiment with one “on” variant and weight=X%. This unifies the system – feature flags benefit from the same assignment stability, targeting rules, and audit trail as full experiments.
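Concretely, a flag reduces to a single threshold check against the same hash, with the "off" bucket implicit. A sketch, independent of the assignment function above:

```python
import hashlib

def flag_enabled(user_id: int, flag_id: str, rollout_pct: int) -> bool:
    """Treat a feature flag as a one-variant experiment at rollout_pct%."""
    key = f"{user_id}:{flag_id}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100 < rollout_pct
```

Because the same hash drives both flags and experiments, graduating a flag into a full experiment (or vice versa) does not reshuffle which users see the feature.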
Scale Estimates
- Assignment: stateless hash computation; horizontal scale to millions of RPS.
- Event ingestion: Kafka handles billions of events/day; partition by user_id.
- Metric reads: materialized aggregates serve dashboard queries in milliseconds.
- Experiment metadata: small dataset (thousands of experiments), fits in Redis with DB as source of truth.
Interview Tips
- Always start with the assignment algorithm – interviewers want to see you derive hash-based deterministic assignment, not database lookups.
- Distinguish real-time (Flink) from batch (authoritative) metric pipelines and explain why both are needed.
- The sequential testing / peeking problem is a differentiator – most candidates miss it.
- Mutex groups vs orthogonal experiments is a common follow-up; be ready to explain the tradeoff (isolation vs statistical power).