A/B Experiment System: Low-Level Design
An A/B experiment system assigns users to treatment groups, measures whether a metric changes between groups, and decides whether to ship or revert. It differs from a feature flag system in one key way: assignments must be logged and frozen — a user must stay in the same variant for the entire experiment duration, even if they clear cookies or switch devices. Statistical validity depends on stable assignment.
Core Data Model
CREATE TABLE Experiment (
experiment_id SERIAL PRIMARY KEY,
experiment_key VARCHAR(100) UNIQUE NOT NULL, -- "checkout_button_color"
description TEXT,
status VARCHAR(20) NOT NULL DEFAULT 'draft', -- draft, running, paused, concluded
traffic_pct SMALLINT NOT NULL DEFAULT 100, -- % of eligible users enrolled
started_at TIMESTAMPTZ,
concluded_at TIMESTAMPTZ,
winning_variant VARCHAR(50),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE ExperimentVariant (
variant_id SERIAL PRIMARY KEY,
experiment_id INT NOT NULL REFERENCES Experiment(experiment_id) ON DELETE CASCADE,
variant_key VARCHAR(50) NOT NULL, -- "control", "treatment_a", "treatment_b"
weight SMALLINT NOT NULL DEFAULT 50, -- relative traffic weight
UNIQUE (experiment_id, variant_key)
);
CREATE TABLE ExperimentAssignment (
assignment_id BIGSERIAL PRIMARY KEY,
experiment_id INT NOT NULL REFERENCES Experiment(experiment_id),
user_id BIGINT NOT NULL,
variant_id INT NOT NULL REFERENCES ExperimentVariant(variant_id),
assigned_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (experiment_id, user_id) -- one assignment per user per experiment
);
CREATE TABLE ExperimentEvent (
event_id BIGSERIAL PRIMARY KEY,
experiment_id INT NOT NULL,
user_id BIGINT NOT NULL,
variant_id INT NOT NULL,
event_type VARCHAR(100) NOT NULL, -- "page_view", "checkout_complete", "revenue"
value NUMERIC(12,4), -- for metric events (revenue amount)
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Partition ExperimentEvent by month — events accumulate quickly
CREATE INDEX ON ExperimentAssignment(experiment_id, variant_id);
CREATE INDEX ON ExperimentEvent(experiment_id, event_type, occurred_at);
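The partitioning note above can be made concrete with PostgreSQL declarative range partitioning (a sketch, assuming PostgreSQL 11+; partition names and months are illustrative — the partition key must be included in the primary key):

```sql
-- Sketch: monthly range partitioning for ExperimentEvent (PostgreSQL 11+).
CREATE TABLE ExperimentEvent (
    event_id BIGSERIAL,
    experiment_id INT NOT NULL,
    user_id BIGINT NOT NULL,
    variant_id INT NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    value NUMERIC(12,4),
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (event_id, occurred_at)  -- partition key must be part of the PK
) PARTITION BY RANGE (occurred_at);

CREATE TABLE ExperimentEvent_2024_01 PARTITION OF ExperimentEvent
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE ExperimentEvent_2024_02 PARTITION OF ExperimentEvent
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
```

New partitions are typically created by a scheduled job just ahead of each month; old partitions can be detached and archived once the experiment concludes.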
Assignment Engine
import hashlib
from typing import Optional

def get_variant(experiment_key: str, user_id: int) -> Optional[str]:
    """
    Returns the variant key for this user, or None if not enrolled.
    Assignment is deterministic and sticky: the same user always gets the same variant.
    """
    exp = _load_experiment(experiment_key)
    if not exp or exp['status'] != 'running':
        return None

    # Step 1: enrollment gate — only traffic_pct% of users enter the experiment
    enroll_hash = int(hashlib.md5(f"enroll:{experiment_key}:{user_id}".encode()).hexdigest()[:8], 16) % 100
    if enroll_hash >= exp['traffic_pct']:
        return None  # not enrolled

    # Step 2: check for an existing assignment (sticky)
    existing = db.fetchone(
        """SELECT v.variant_key FROM ExperimentAssignment a
           JOIN ExperimentVariant v USING (variant_id)
           WHERE a.experiment_id=%s AND a.user_id=%s""",
        (exp['experiment_id'], user_id)
    )
    if existing:
        return existing['variant_key']

    # Step 3: assign to a variant by weighted hash
    variant = _assign_variant(experiment_key, user_id, exp['variants'])

    # Step 4: persist the assignment (INSERT ... ON CONFLICT DO NOTHING handles the race)
    db.execute("""
        INSERT INTO ExperimentAssignment (experiment_id, user_id, variant_id)
        VALUES (%s, %s, %s) ON CONFLICT DO NOTHING
    """, (exp['experiment_id'], user_id, variant['variant_id']))

    # Re-read in case another request raced us
    row = db.fetchone(
        """SELECT v.variant_key FROM ExperimentAssignment a
           JOIN ExperimentVariant v USING (variant_id)
           WHERE a.experiment_id=%s AND a.user_id=%s""",
        (exp['experiment_id'], user_id)
    )
    return row['variant_key'] if row else variant['variant_key']

def _assign_variant(experiment_key: str, user_id: int, variants: list) -> dict:
    """
    Weighted assignment: variants with weights [50, 50] → even split.
    hash(experiment_key + user_id) % total_weight maps to a variant bucket.
    """
    total_weight = sum(v['weight'] for v in variants)
    bucket = int(hashlib.md5(f"assign:{experiment_key}:{user_id}".encode()).hexdigest()[:8], 16) % total_weight
    cumulative = 0
    for v in sorted(variants, key=lambda x: x['variant_id']):
        cumulative += v['weight']
        if bucket < cumulative:
            return v
    return variants[-1]  # fallback (unreachable when all weights are positive)
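The bucketing logic above can be exercised on its own, independent of the database layer. A minimal sketch (function names, the experiment key, and the weight dict are made up for illustration) that checks the two properties the engine relies on, determinism and a roughly even split:

```python
import hashlib

def bucket(prefix: str, experiment_key: str, user_id: int, modulo: int) -> int:
    """First 8 hex digits of MD5, reduced mod `modulo` — same scheme as the engine above."""
    h = hashlib.md5(f"{prefix}:{experiment_key}:{user_id}".encode()).hexdigest()
    return int(h[:8], 16) % modulo

def pick_variant(experiment_key: str, user_id: int, weights: dict) -> str:
    """Weighted cumulative bucketing, e.g. weights={'control': 50, 'treatment': 50}."""
    total = sum(weights.values())
    b = bucket("assign", experiment_key, user_id, total)
    cumulative = 0
    for key in sorted(weights):  # stable iteration order stands in for ORDER BY variant_id
        cumulative += weights[key]
        if b < cumulative:
            return key
    return sorted(weights)[-1]  # fallback, unreachable for positive weights

# Determinism: the same user always lands in the same variant.
v1 = pick_variant("checkout_button_color", 42, {"control": 50, "treatment": 50})
v2 = pick_variant("checkout_button_color", 42, {"control": 50, "treatment": 50})
assert v1 == v2

# Uniformity: across many users, a 50/50 weighting splits roughly evenly.
counts = {"control": 0, "treatment": 0}
for uid in range(10_000):
    counts[pick_variant("checkout_button_color", uid, {"control": 50, "treatment": 50})] += 1
```

MD5 is used here for its uniform output, not for security; any stable hash with good avalanche behavior works.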
Event Tracking
def track_event(experiment_key: str, user_id: int, event_type: str,
                value: Optional[float] = None) -> None:
    """
    Track a metric event for a user. Only records if the user is in the experiment.
    Called from application code on meaningful actions.
    """
    exp = _load_experiment(experiment_key)
    if not exp or exp['status'] != 'running':
        return
    assignment = db.fetchone(
        "SELECT variant_id FROM ExperimentAssignment WHERE experiment_id=%s AND user_id=%s",
        (exp['experiment_id'], user_id)
    )
    if not assignment:
        return  # user not enrolled — don't record
    db.execute("""
        INSERT INTO ExperimentEvent (experiment_id, user_id, variant_id, event_type, value, occurred_at)
        VALUES (%s, %s, %s, %s, %s, NOW())
    """, (exp['experiment_id'], user_id, assignment['variant_id'], event_type, value))
# Application usage:
# variant = get_variant('checkout_button_color', user.id)
# if variant == 'treatment_green':
# render_green_button()
# else:
# render_blue_button() # control
# ...
# On conversion:
# track_event('checkout_button_color', user.id, 'checkout_complete')
# track_event('checkout_button_color', user.id, 'revenue', value=order.total)
Results Aggregation Query
-- Per-variant metrics for an experiment
SELECT
v.variant_key,
COUNT(DISTINCT a.user_id) AS users,
COUNT(DISTINCT CASE WHEN e.event_type='checkout_complete'
THEN e.user_id END) AS conversions,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN e.event_type='checkout_complete'
THEN e.user_id END)
/ NULLIF(COUNT(DISTINCT a.user_id), 0), 2) AS conversion_rate_pct,
ROUND(SUM(CASE WHEN e.event_type='revenue' THEN e.value ELSE 0 END)
/ NULLIF(COUNT(DISTINCT a.user_id), 0), 4) AS revenue_per_user
FROM ExperimentVariant v
JOIN ExperimentAssignment a USING (variant_id)
LEFT JOIN ExperimentEvent e ON e.experiment_id=v.experiment_id
AND e.user_id=a.user_id AND e.event_type IN ('checkout_complete','revenue')
WHERE v.experiment_id = $experiment_id
GROUP BY v.variant_key
ORDER BY v.variant_key;
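The per-variant counts from the query above feed directly into a confidence interval on each conversion rate. A pure-stdlib sketch of the Wilson score interval (the function name and the 95% z value are my choices, not part of the schema):

```python
from math import sqrt

def wilson_interval(conversions: int, users: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a conversion rate; behaves well at small n and extreme rates."""
    if users == 0:
        return (0.0, 0.0)
    p = conversions / users
    denom = 1 + z * z / users
    center = (p + z * z / (2 * users)) / denom
    half = z * sqrt(p * (1 - p) / users + z * z / (4 * users * users)) / denom
    return (center - half, center + half)

# e.g. 52 conversions out of 1,000 assigned users
lo, hi = wilson_interval(52, 1000)
```

Reporting the interval alongside conversion_rate_pct makes it obvious when two variants' rates are statistically indistinguishable.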
Statistical Significance Check
from scipy.stats import chi2_contingency
import numpy as np

def check_significance(control_n, control_conv, treatment_n, treatment_conv,
                       alpha=0.05) -> dict:
    """
    Two-proportion test via chi-squared for conversion rate.
    Returns: significant (bool), p_value, relative_lift_pct.
    """
    contingency = [
        [control_conv, control_n - control_conv],
        [treatment_conv, treatment_n - treatment_conv],
    ]
    chi2, p_value, dof, expected = chi2_contingency(contingency, correction=False)
    control_rate = control_conv / control_n if control_n else 0
    treatment_rate = treatment_conv / treatment_n if treatment_n else 0
    relative_lift = (treatment_rate - control_rate) / control_rate * 100 if control_rate else 0
    return {
        'significant': p_value < alpha,
        'p_value': p_value,
        'relative_lift_pct': relative_lift,
    }

def required_sample_size(baseline_rate: float, mde_pct: float) -> int:
    """
    mde_pct: minimum relative lift you want to detect (e.g. 5 for 5%).
    Uses the standard rule of thumb: n ≈ 16 * p*(1-p) / mde_absolute^2
    for alpha=0.05, power=0.8.
    """
    mde_absolute = baseline_rate * mde_pct / 100
    p = baseline_rate
    n = 16 * p * (1 - p) / (mde_absolute ** 2)
    return int(np.ceil(n))
# required_sample_size(0.05, 10) → ~30,400 per variant to detect a 10% relative lift at 5% baseline
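If scipy is unavailable, the same comparison can be done with a pooled two-proportion z-test using only the standard library (a cross-check sketch; for a 2x2 table the squared z statistic equals the chi-squared statistic without continuity correction, and erfc yields the two-sided normal p-value):

```python
from math import sqrt, erfc

def two_proportion_p_value(control_n: int, control_conv: int,
                           treatment_n: int, treatment_conv: int) -> float:
    """Two-sided pooled z-test for a difference in conversion rates."""
    p_pool = (control_conv + treatment_conv) / (control_n + treatment_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))
    if se == 0:
        return 1.0  # degenerate table: no variation to test
    z = (treatment_conv / treatment_n - control_conv / control_n) / se
    return erfc(abs(z) / sqrt(2))  # two-sided p-value under the normal approximation

# 5.0% control vs 6.0% treatment at 10,000 users each — comfortably significant
p = two_proportion_p_value(10_000, 500, 10_000, 600)
```

The normal approximation is adequate when each cell of the contingency table is reasonably large (a common rule of thumb is at least 5 expected events per cell).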
Key Design Decisions
- Two-hash enrollment: separate hashes for enrollment gate and variant assignment. Without this, a user near the 50% enrollment boundary might get inconsistent treatment as traffic_pct is adjusted mid-experiment. The enroll hash and assign hash are independent seeds.
- INSERT ON CONFLICT DO NOTHING + re-read: two concurrent requests for the same user race to assign — both attempt INSERT, one wins, the other’s DO NOTHING causes re-read to return the already-written row. Ensures exactly-one assignment without distributed locks.
- No assignment in event track: if a user has no assignment record, track_event silently drops the event. This prevents polluting results with users who joined mid-experiment or hit a different server.
- Sample size before launch: run required_sample_size() before starting. At 5% baseline conversion, detecting a 10% relative lift requires ~30,400 users per variant. With 10K daily active users and 100% traffic allocation split across two variants, enrolling ~60,800 users takes roughly a week of mostly unique traffic, so budget the runtime up front rather than stopping early.
FAQ
- Why use two separate hashes for enrollment and variant assignment? Reusing one hash for both creates a correlation: the enrolled set is exactly the users whose hash falls below traffic_pct, and the variant split then carves up that same hash range. Example: traffic_pct=10 with two variants weighted 50/50. With a single hash, users with hash 0-9 are enrolled; users 0-4 get variant A and 5-9 get variant B. Raise traffic_pct to 20 and the variant boundary moves from 5 to 10: a user with hash 7 flips from B to A mid-experiment, and every newly enrolled user (hash 10-19) lands in variant B. Two independent hashes avoid this: enroll_hash = MD5("enroll:key:user") % 100 gates enrollment, and variant_hash = MD5("assign:key:user") % total_weight picks the variant. The enrolled population is hash-uniform, and the variant assignment within it is uniform and independent of the gate.
- How do you prevent novelty-effect bias from inflating treatment metrics? Users interact more with a new UI simply because it is new: engagement metrics spike in week 1 and decay back toward baseline by week 3. Concluding an experiment after 3 days of a 20% lift measures the novelty effect, not the true effect. Mitigations: (1) run the experiment for at least two full business cycles (two weeks minimum for weekly-cycle products); (2) analyze the time trend within the treatment group, since a lift that decays week-over-week signals novelty; (3) exclude users in their first N days of product use, because new users have their own novelty effects unrelated to the feature; (4) keep a hold-out cohort: track the treatment group for 30 days post-assignment even after the feature ships, and compare week-1 vs. week-4 metrics to measure novelty decay.
- How do you handle a user who clears cookies and gets a new session mid-experiment? For authenticated users, nothing changes: assignment is keyed by user_id, not session, and the UNIQUE(experiment_id, user_id) record persists across sessions. The problem is unauthenticated experiments, where a session-based identifier means a new session gets a new bucket. Use a long-lived cookie (1-year expiry) as the stable identifier rather than a session cookie. On cookie clear the user gets a new assignment, which is acceptable for most web experiments because few users clear cookies. If stable assignment for unauthenticated users is critical (e.g. a pricing experiment), require authentication before enrollment.
- What is the minimum sample size needed before concluding an experiment? Sample size depends on three inputs: (1) the baseline conversion rate p; (2) the minimum detectable effect (MDE), the smallest relative lift worth detecting; (3) statistical power (typically 80%). Formula: n ≈ 16 * p * (1-p) / MDE_absolute^2 per variant for α=0.05, power=0.80. Examples: p=5% baseline with a 10% relative MDE (detect a 0.5pp lift) → n ≈ 30,400 per variant; p=5% with a 5% relative MDE (0.25pp lift) → n ≈ 121,600 per variant. Underpowered experiments (stopping at 200 users) produce false negatives, and repeatedly checking the p-value produces false positives. This is the "peeking problem": check significance 10 times during the experiment and your true false-positive rate is several times the nominal 5%. Fix: use sequential testing methods (e.g. mSPRT) that permit valid early stopping, or pre-commit to a sample size and check only once.
- How do you measure the long-term impact of an experiment beyond the test period? Short-run metrics (7-day conversion) may not capture long-run effects (30-day retention, 6-month LTV); a treatment that improves checkout conversion by 5% might reduce retention by 2%, a net-negative change. Three approaches: (1) holdback group: keep 5% of users in the control permanently after shipping, even at 100% rollout, and compare holdback vs. treatment users at 30, 60, and 90 days. (2) Cohort analysis: track the assignment cohorts in the data warehouse, e.g. SELECT variant_id, AVG(ltv_90d) FROM ExperimentAssignment JOIN UserLTV USING (user_id) WHERE experiment_id=$id GROUP BY variant_id. (3) Observational analysis: the assignment record outlives the experiment, so run a regression six months later comparing treatment vs. control users while controlling for confounders; cheaper than a live holdback, but less rigorous.
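The peeking problem can be demonstrated with a small seeded Monte Carlo (a sketch with made-up parameters: both arms share an identical 10% conversion rate, checked at 10 interim looks of 200 users per arm, so every "significant" result is a false positive):

```python
import random
from math import sqrt, erfc

def p_value(n1: int, c1: int, n2: int, c2: int) -> float:
    """Two-sided pooled two-proportion z-test (normal approximation)."""
    pool = (c1 + c2) / (n1 + n2)
    se = sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    return erfc(abs(c2 / n2 - c1 / n1) / se / sqrt(2))

random.seed(0)
SIMS, LOOKS, BATCH, RATE, ALPHA = 1000, 10, 200, 0.10, 0.05
peeking_fp = final_fp = 0
for _ in range(SIMS):
    c1 = c2 = 0
    peeked = False
    for look in range(1, LOOKS + 1):
        # both arms draw from the SAME true rate — there is no real effect to find
        c1 += sum(random.random() < RATE for _ in range(BATCH))
        c2 += sum(random.random() < RATE for _ in range(BATCH))
        if p_value(look * BATCH, c1, look * BATCH, c2) < ALPHA:
            peeked = True  # a peeker would have stopped here and declared a winner
    peeking_fp += peeked                                               # stop-at-first-significance
    final_fp += p_value(LOOKS * BATCH, c1, LOOKS * BATCH, c2) < ALPHA  # single pre-committed check
```

The single pre-committed check holds the false-positive rate near the nominal 5%, while stopping at the first significant peek inflates it well above that, which is exactly why sequential methods or a fixed sample size are needed.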