A/B Experiment System: Low-Level Design
An A/B experiment system assigns users to treatment groups, measures whether a metric changes between groups, and decides whether to ship or revert. It differs from a feature flag system in one key way: assignments must be logged and frozen — a user must stay in the same variant for the entire experiment duration, even if they clear cookies or switch devices. Statistical validity depends on stable assignment.
Core Data Model
CREATE TABLE Experiment (
experiment_id SERIAL PRIMARY KEY,
experiment_key VARCHAR(100) UNIQUE NOT NULL, -- "checkout_button_color"
description TEXT,
status VARCHAR(20) NOT NULL DEFAULT 'draft', -- draft, running, paused, concluded
traffic_pct SMALLINT NOT NULL DEFAULT 100, -- % of eligible users enrolled
started_at TIMESTAMPTZ,
concluded_at TIMESTAMPTZ,
winning_variant VARCHAR(50),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE ExperimentVariant (
variant_id SERIAL PRIMARY KEY,
experiment_id INT NOT NULL REFERENCES Experiment(experiment_id) ON DELETE CASCADE,
variant_key VARCHAR(50) NOT NULL, -- "control", "treatment_a", "treatment_b"
weight SMALLINT NOT NULL DEFAULT 50, -- relative traffic weight
UNIQUE (experiment_id, variant_key)
);
CREATE TABLE ExperimentAssignment (
assignment_id BIGSERIAL PRIMARY KEY,
experiment_id INT NOT NULL REFERENCES Experiment(experiment_id),
user_id BIGINT NOT NULL,
variant_id INT NOT NULL REFERENCES ExperimentVariant(variant_id),
assigned_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (experiment_id, user_id) -- one assignment per user per experiment
);
CREATE TABLE ExperimentEvent (
event_id BIGSERIAL PRIMARY KEY,
experiment_id INT NOT NULL,
user_id BIGINT NOT NULL,
variant_id INT NOT NULL,
event_type VARCHAR(100) NOT NULL, -- "page_view", "checkout_complete", "revenue"
value NUMERIC(12,4), -- for metric events (revenue amount)
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Partition ExperimentEvent by month — events accumulate quickly
CREATE INDEX ON ExperimentAssignment(experiment_id, variant_id);
CREATE INDEX ON ExperimentEvent(experiment_id, event_type, occurred_at);
Assignment Engine
import hashlib, json
from typing import Optional
def get_variant(experiment_key: str, user_id: int) -> Optional[str]:
"""
Returns the variant key for this user, or None if not enrolled.
Assignment is deterministic and sticky: same user always gets the same variant.
"""
exp = _load_experiment(experiment_key)
if not exp or exp['status'] != 'running':
return None
# Step 1: enrollment gate — only traffic_pct% of users enter the experiment
enroll_hash = int(hashlib.md5(f"enroll:{experiment_key}:{user_id}".encode()).hexdigest()[:8], 16) % 100
if enroll_hash >= exp['traffic_pct']:
return None # not enrolled
# Step 2: check for existing assignment (sticky)
existing = db.fetchone(
"""SELECT v.variant_key FROM ExperimentAssignment a
JOIN ExperimentVariant v USING (variant_id)
WHERE a.experiment_id=%s AND a.user_id=%s""",
(exp['experiment_id'], user_id)
)
if existing:
return existing['variant_key']
# Step 3: assign to variant by weighted hash
variant = _assign_variant(experiment_key, user_id, exp['variants'])
# Step 4: persist assignment (INSERT ... ON CONFLICT DO NOTHING handles race)
db.execute("""
INSERT INTO ExperimentAssignment (experiment_id, user_id, variant_id)
VALUES (%s, %s, %s) ON CONFLICT DO NOTHING
""", (exp['experiment_id'], user_id, variant['variant_id']))
# Re-read in case another request raced us
row = db.fetchone(
"""SELECT v.variant_key FROM ExperimentAssignment a
JOIN ExperimentVariant v USING (variant_id)
WHERE a.experiment_id=%s AND a.user_id=%s""",
(exp['experiment_id'], user_id)
)
return row['variant_key'] if row else variant['variant_key']
def _assign_variant(experiment_key: str, user_id: int, variants: list) -> dict:
"""
Weighted assignment: variants have weights [50, 50] → even split.
Hash(experiment_key + user_id) % total_weight maps to a variant bucket.
"""
total_weight = sum(v['weight'] for v in variants)
bucket = int(hashlib.md5(f"assign:{experiment_key}:{user_id}".encode()).hexdigest()[:8], 16) % total_weight
cumulative = 0
for v in sorted(variants, key=lambda x: x['variant_id']):
cumulative += v['weight']
if bucket < cumulative:
return v
return variants[-1] # fallback
Event Tracking
def track_event(experiment_key: str, user_id: int, event_type: str, value: float = None):
"""
Track a metric event for a user. Only records if the user is in the experiment.
Called from application code on meaningful actions.
"""
exp = _load_experiment(experiment_key)
if not exp or exp['status'] != 'running':
return
assignment = db.fetchone(
"SELECT variant_id FROM ExperimentAssignment WHERE experiment_id=%s AND user_id=%s",
(exp['experiment_id'], user_id)
)
if not assignment:
return # user not enrolled — don't record
db.execute("""
INSERT INTO ExperimentEvent (experiment_id, user_id, variant_id, event_type, value, occurred_at)
VALUES (%s, %s, %s, %s, %s, NOW())
""", (exp['experiment_id'], user_id, assignment['variant_id'], event_type, value))
# Application usage:
# variant = get_variant('checkout_button_color', user.id)
# if variant == 'treatment_green':
# render_green_button()
# else:
# render_blue_button() # control
# ...
# On conversion:
# track_event('checkout_button_color', user.id, 'checkout_complete')
# track_event('checkout_button_color', user.id, 'revenue', value=order.total)
Results Aggregation Query
-- Per-variant metrics for an experiment
SELECT
v.variant_key,
COUNT(DISTINCT a.user_id) AS users,
COUNT(DISTINCT CASE WHEN e.event_type='checkout_complete'
THEN e.user_id END) AS conversions,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN e.event_type='checkout_complete'
THEN e.user_id END)
/ NULLIF(COUNT(DISTINCT a.user_id), 0), 2) AS conversion_rate_pct,
ROUND(SUM(CASE WHEN e.event_type='revenue' THEN e.value ELSE 0 END)
/ NULLIF(COUNT(DISTINCT a.user_id), 0), 4) AS revenue_per_user
FROM ExperimentVariant v
JOIN ExperimentAssignment a USING (variant_id)
LEFT JOIN ExperimentEvent e ON e.experiment_id=v.experiment_id
AND e.user_id=a.user_id AND e.event_type IN ('checkout_complete','revenue')
WHERE v.experiment_id = $experiment_id
GROUP BY v.variant_key
ORDER BY v.variant_key;
Statistical Significance Check
from scipy.stats import chi2_contingency, ttest_ind
import numpy as np
def check_significance(control_n, control_conv, treatment_n, treatment_conv,
alpha=0.05) -> dict:
"""
Two-proportion z-test via chi-squared for conversion rate.
Returns: significant (bool), p_value, relative_lift_pct.
"""
contingency = [
[control_conv, control_n - control_conv],
[treatment_conv, treatment_n - treatment_conv],
]
chi2, p_value, dof, expected = chi2_contingency(contingency, correction=False)
control_rate = control_conv / control_n if control_n else 0
treatment_rate = treatment_conv / treatment_n if treatment_n else 0
relative_lift = (treatment_rate - control_rate) / control_rate * 100 if control_rate else 0
return {
'significant': p_value int:
"""
mde_pct: minimum relative lift you want to detect (e.g. 5 for 5%).
Uses standard formula: n ≈ 16 * p*(1-p) / (mde_absolute)^2 for alpha=0.05, power=0.8.
"""
mde_absolute = baseline_rate * mde_pct / 100
p = baseline_rate
n = 16 * p * (1 - p) / (mde_absolute ** 2)
return int(np.ceil(n))
# required_sample_size(0.05, 10) → ~3040 per variant to detect a 10% relative lift at 5% baseline
Key Design Decisions
- Two-hash enrollment: separate hashes for enrollment gate and variant assignment. Without this, a user near the 50% enrollment boundary might get inconsistent treatment as traffic_pct is adjusted mid-experiment. The enroll hash and assign hash are independent seeds.
- INSERT ON CONFLICT DO NOTHING + re-read: two concurrent requests for the same user race to assign — both attempt INSERT, one wins, the other’s DO NOTHING causes re-read to return the already-written row. Ensures exactly-one assignment without distributed locks.
- No assignment in event track: if a user has no assignment record, track_event silently drops the event. This prevents polluting results with users who joined mid-experiment or hit a different server.
- Sample size before launch: run required_sample_size() before starting. At 5% baseline conversion and wanting to detect a 10% relative lift, need ~3,040 users per variant. With 10K daily active users and 100% traffic allocation, the experiment concludes in <1 day.
A/B experiment and feature testing system design is discussed in Netflix system design interview questions.
A/B experiment and conversion optimization design is covered in Amazon system design interview preparation.
A/B experiment and product experimentation design is discussed in Twitter system design interview guide.