System Design: A/B Testing Platform
An A/B testing platform lets product teams run controlled experiments – assigning users to variants, collecting metrics, and determining statistical significance to make data-driven decisions. This is a common system design interview topic at Meta, Databricks, and Shopify.
1. Core Entities
Experiment
experiment_id UUID PRIMARY KEY
name VARCHAR NOT NULL
status ENUM('DRAFT','RUNNING','PAUSED','CONCLUDED')
targeting_rules JSONB -- e.g. {"country": "US", "platform": "iOS"}
start_date TIMESTAMPTZ
end_date TIMESTAMPTZ
Variant
variant_id UUID PRIMARY KEY
experiment_id UUID REFERENCES experiments
name VARCHAR -- e.g. "control", "treatment_A"
weight INT -- 0-100, sum across variants = 100
config JSONB -- feature flags / parameters for this variant
Assignment
user_id BIGINT
experiment_id UUID
variant_id UUID
assigned_at TIMESTAMPTZ
assignment_hash INT -- stored for auditability
PRIMARY KEY (user_id, experiment_id)
Metric
metric_id UUID PRIMARY KEY
experiment_id UUID REFERENCES experiments
name VARCHAR -- e.g. "checkout_conversion"
type ENUM('CONVERSION','REVENUE','LATENCY')
aggregation ENUM('MEAN','PROPORTION')
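The targeting_rules column above is a flat attribute map. A minimal evaluator could look like the sketch below; the function and parameter names are illustrative, not part of the schema:

```python
def matches_targeting(rules: dict, user_context: dict) -> bool:
    """Return True when every targeting rule matches the user's attributes.

    Rules are a flat equality map, e.g. {"country": "US", "platform": "iOS"};
    a missing attribute fails the match.
    """
    return all(user_context.get(key) == value for key, value in rules.items())
```

Real platforms typically extend this with operators (in, gte, regex), but flat equality covers the common case and keeps evaluation O(number of rules).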
2. Deterministic User Assignment
Assignment must be stable (same user always gets the same variant) and fast (no DB lookup on every request). Use a hash function:
import hashlib

def assign_variant(user_id: int, experiment_id: str, variants: list) -> str:
    # Hash (user_id, experiment_id) into a bucket in [0, 100).
    key = f"{user_id}:{experiment_id}"
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
    # Walk cumulative weight ranges in a stable (sorted) order.
    ordered = sorted(variants, key=lambda v: v['name'])
    cumulative = 0
    for variant in ordered:
        cumulative += variant['weight']
        if h < cumulative:
            return variant['variant_id']
    return ordered[-1]['variant_id']  # guard against weights summing below 100
Properties of this design:
- Deterministic – no storage needed at assignment time; can recompute on any node.
- Uniform distribution – MD5 output is uniformly distributed mod 100.
- Stable – changing experiment parameters (name, targeting) does not affect the hash; only changing weights reassigns users, and only those whose buckets fall inside the shifted boundaries.
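These properties are easy to sanity-check directly. The sketch below restates the assignment function so it is self-contained, then verifies determinism and a roughly uniform 50/50 split over 100,000 synthetic users:

```python
import hashlib
from collections import Counter

def assign_variant(user_id, experiment_id, variants):
    key = f"{user_id}:{experiment_id}"
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
    ordered = sorted(variants, key=lambda v: v['name'])
    cumulative = 0
    for variant in ordered:
        cumulative += variant['weight']
        if h < cumulative:
            return variant['variant_id']
    return ordered[-1]['variant_id']

variants = [
    {'variant_id': 'c', 'name': 'control', 'weight': 50},
    {'variant_id': 't', 'name': 'treatment', 'weight': 50},
]
# Deterministic: recomputing on any node gives the same answer.
assert assign_variant(42, 'exp-1', variants) == assign_variant(42, 'exp-1', variants)
# Roughly uniform: each variant gets close to its weight share.
counts = Counter(assign_variant(uid, 'exp-1', variants) for uid in range(100_000))
```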
3. Metric Collection Pipeline
Client Events (page_view, purchase, click, latency_sample)
|
v
Kafka Topics (partitioned by user_id for ordering)
|
+---> Flink Real-Time Job
| - Join event stream with assignment store (Kafka topic or Redis)
| - Compute per-variant aggregates (count, sum, sum_of_squares)
| - Emit to metrics_warehouse every 1 minute
|
+---> Batch Job (hourly/daily)
- Recompute from raw event log for accuracy
- Write to metrics_warehouse as authoritative values
Store running aggregates: (count, sum, sum_of_squares) per variant per metric per day. These are sufficient to compute mean, variance, and p-values without storing individual events.
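The "sufficient" claim is just algebra: mean and sample variance fall directly out of (count, sum, sum_of_squares). A minimal sketch (requires count >= 2 for the variance):

```python
def mean_and_variance(count: int, total: float, sum_sq: float) -> tuple:
    """Recover mean and unbiased sample variance from running aggregates.

    Uses Var = (sum_of_squares - n * mean^2) / (n - 1); count must be >= 2.
    """
    mean = total / count
    variance = (sum_sq - count * mean * mean) / (count - 1)
    return mean, variance
```

Because these aggregates are additive, per-minute partial results from the streaming job can be summed into daily values without ever touching raw events.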
4. Statistical Significance Testing
Two-Sample t-test (for MEAN metrics)
For metrics like revenue or latency where you compare means (this is Welch's unequal-variance form):
t = (mean_A - mean_B) / sqrt(var_A/n_A + var_B/n_B)
Under H0 (no difference), t approximately follows a t-distribution, with degrees of freedom given by the Welch–Satterthwaite approximation. Reject H0 if |t| > t_critical for the chosen alpha (typically 0.05).
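The t-statistic can be computed straight from the stored aggregates. A minimal sketch, with the critical value hard-coded for large samples (where the t-distribution approaches N(0,1), so 1.96 corresponds to alpha=0.05):

```python
import math

def welch_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Welch's two-sample t-statistic from per-variant aggregates."""
    se = math.sqrt(var_a / n_a + var_b / n_b)  # standard error of the difference
    return (mean_a - mean_b) / se

def is_significant(t_stat, t_critical=1.96):
    # Large-sample approximation; small experiments should use the exact
    # t-distribution with Welch-Satterthwaite degrees of freedom.
    return abs(t_stat) > t_critical
```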
Chi-Squared Test (for PROPORTION metrics)
For conversion rates (binary outcomes):
Build the full 2x2 table (converted vs. not converted, per variant) and compare observed to expected counts in all four cells:
Expected_conv_A = (conv_A + conv_B) * n_A / (n_A + n_B)
Expected_conv_B = (conv_A + conv_B) * n_B / (n_A + n_B)
Expected_non_A  = n_A - Expected_conv_A
Expected_non_B  = n_B - Expected_conv_B
chi2 = (conv_A - Expected_conv_A)^2 / Expected_conv_A
     + (conv_B - Expected_conv_B)^2 / Expected_conv_B
     + ((n_A - conv_A) - Expected_non_A)^2 / Expected_non_A
     + ((n_B - conv_B) - Expected_non_B)^2 / Expected_non_B
Reject H0 if chi2 > 3.841 (chi-squared critical value at alpha=0.05, df=1).
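The four-cell sum above translates directly into code; a sketch computing the statistic from the per-variant counts:

```python
def chi_squared_2x2(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Chi-squared statistic for a 2x2 conversion table (df=1)."""
    total_conv = conv_a + conv_b
    total = n_a + n_b
    chi2 = 0.0
    for conv, n in ((conv_a, n_a), (conv_b, n_b)):
        exp_conv = total_conv * n / total   # expected conversions under H0
        exp_non = n - exp_conv              # expected non-conversions under H0
        chi2 += (conv - exp_conv) ** 2 / exp_conv
        chi2 += ((n - conv) - exp_non) ** 2 / exp_non
    return chi2

CRITICAL_05_DF1 = 3.841  # chi-squared critical value at alpha=0.05, df=1
```

Production systems typically delegate to a stats library (e.g. SciPy's contingency-table tests), but the aggregate-only computation is what makes it cheap to run on every dashboard refresh.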
Sequential Testing and Alpha Spending
Peeking at results repeatedly inflates the false positive rate (multiple comparisons problem). Use an alpha spending function (e.g. O’Brien-Fleming) that allows early stopping while controlling the overall false positive rate at alpha=0.05. Each interim look consumes part of the alpha budget.
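One way to make this concrete is the classic O'Brien-Fleming boundary shape for equally spaced looks: early peeks demand much stronger evidence, and the final boundary is close to the fixed-sample critical value. This is the textbook approximation, not the exact Lan-DeMets alpha-spending computation a production platform would use:

```python
import math

def obrien_fleming_boundaries(num_looks: int, z_final: float = 1.96) -> list:
    """Approximate O'Brien-Fleming z-boundaries for equally spaced looks.

    Look k of K rejects only if |z| > z_final * sqrt(K / k), so early
    looks spend almost none of the alpha budget.
    """
    return [z_final * math.sqrt(num_looks / k) for k in range(1, num_looks + 1)]
```

With four looks, the boundaries are roughly 3.92, 2.77, 2.26, 1.96 — a weekly-peeked experiment only concludes early on a very large effect.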
5. Experiment Isolation
Orthogonal Experiments
A user can participate in multiple experiments simultaneously if the experiments affect independent features. Orthogonality is declared by product owners; the platform enforces no technical restriction between orthogonal experiments.
Mutex Groups
When two experiments might interact (e.g. both change the checkout flow), assign them to the same mutex group. Within a mutex group, a user is assigned to at most one experiment, partitioning the user population:
mutex_group_id UUID
experiment_id UUID
mutex_slot_range INT4RANGE -- half-open, e.g. [0, 50) for first exp, [50, 100) for second
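Slot routing can reuse the same hashing idea, keyed on the mutex group rather than any individual experiment, so a user lands in at most one member. A sketch (the range layout and names are assumptions, not from the schema above):

```python
import hashlib

def route_mutex_group(user_id: int, mutex_group_id: str, slot_ranges: dict):
    """Return the experiment_id owning the user's slot, or None.

    slot_ranges maps experiment_id -> (lo, hi) half-open ranges over
    [0, 100) that must not overlap.  Hashing on mutex_group_id (not on
    experiment_id) is what partitions the population: the same slot is
    used for every experiment in the group.
    """
    key = f"{user_id}:{mutex_group_id}"
    slot = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
    for experiment_id, (lo, hi) in slot_ranges.items():
        if lo <= slot < hi:
            return experiment_id
    return None  # user is not in any experiment of this group
```

Once routed to an experiment, the user is then assigned to a variant by the usual per-experiment hash.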
6. Rollout Mechanics
Gradual Rollout
Increase variant.weight over time: 0% -> 10% -> 50% -> 100%. Because assignment is hash-based, users who were in the treatment group at 10% are still in it at 50% (the hash threshold simply grows).
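The "threshold simply grows" claim can be checked directly: with control sorted first, the treatment bucket only ever expands, so every user in treatment at 10% is still in treatment at 50%. A self-contained sketch of that check:

```python
import hashlib

def bucket(user_id: int, experiment_id: str) -> int:
    """Hash (user, experiment) into a stable bucket in [0, 100)."""
    key = f"{user_id}:{experiment_id}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

def in_treatment(user_id: int, experiment_id: str, treatment_weight: int) -> bool:
    # Control occupies [0, 100 - weight); treatment owns [100 - weight, 100).
    return bucket(user_id, experiment_id) >= 100 - treatment_weight

at_10 = {u for u in range(10_000) if in_treatment(u, 'exp-1', 10)}
at_50 = {u for u in range(10_000) if in_treatment(u, 'exp-1', 50)}
assert at_10 <= at_50  # treatment membership is monotone in the weight
```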
Feature Flags as Single-Variant Experiments
A feature flag is a degenerate experiment with one “on” variant and weight=X%. This unifies the system – feature flags benefit from the same assignment stability, targeting rules, and audit trail as full experiments.
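Concretely, a flag reduces to a single threshold check against the same hash, with the "off" bucket implicit. A sketch, independent of the assignment function above:

```python
import hashlib

def flag_enabled(user_id: int, flag_id: str, rollout_pct: int) -> bool:
    """Treat a feature flag as a one-variant experiment at rollout_pct%."""
    key = f"{user_id}:{flag_id}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100 < rollout_pct
```

Because the same hash drives both flags and experiments, graduating a flag into a full experiment (or vice versa) does not reshuffle which users see the feature.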
Scale Estimates
- Assignment: stateless hash computation; horizontal scale to millions of RPS.
- Event ingestion: Kafka handles billions of events/day; partition by user_id.
- Metric reads: materialized aggregates serve dashboard queries in milliseconds.
- Experiment metadata: small dataset (thousands of experiments), fits in Redis with DB as source of truth.
Interview Tips
- Always start with the assignment algorithm – interviewers want to see you derive hash-based deterministic assignment, not database lookups.
- Distinguish real-time (Flink) from batch (authoritative) metric pipelines and explain why both are needed.
- The sequential testing / peeking problem is a differentiator – most candidates miss it.
- Mutex groups vs orthogonal experiments is a common follow-up; be ready to explain the tradeoff (isolation vs statistical power).