Experimentation Platform: Low Level Design
An experimentation platform enables A/B and multivariate tests with rigorous statistical guarantees. This design covers experiment assignment, exposure and metric collection, statistical analysis, sequential testing, and guardrail enforcement.
Data Model
Experiment Table
Experiment
----------
id BIGINT PRIMARY KEY
name VARCHAR NOT NULL
hypothesis TEXT
status ENUM('draft','running','paused','concluded')
start_date DATE
end_date DATE
traffic_pct INT CHECK (traffic_pct BETWEEN 0 AND 100)
created_at TIMESTAMP
Variant Table
Variant
-------
id BIGINT PRIMARY KEY
experiment_id BIGINT REFERENCES Experiment(id)
name VARCHAR NOT NULL
weight INT NOT NULL -- relative weight, e.g. 50/50 or 33/33/34
ExposureEvent Table
ExposureEvent
-------------
id BIGINT PRIMARY KEY
experiment_id BIGINT
variant_id BIGINT
user_id VARCHAR
exposed_at TIMESTAMP
MetricEvent Table
MetricEvent
-----------
id BIGINT PRIMARY KEY
experiment_id BIGINT
user_id VARCHAR
metric_name VARCHAR
value DOUBLE PRECISION
occurred_at TIMESTAMP
Assignment Algorithm
assign(experiment_id, user_id):
1. bucket = hash(experiment_id + user_id) mod 100
2. if bucket >= experiment.traffic_pct: return null (not in experiment)
3. inner = hash(experiment_id + 'variant' + user_id) mod total_weight
4. cumulative = 0
5. for variant in variants ordered by id:
       cumulative += variant.weight
       if inner < cumulative: return variant
Assignment is deterministic: the same user always gets the same variant for a given experiment, as long as traffic_pct and the variant weights are unchanged. Assignments are not stored unless exposure logging is required for audit.
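The steps above can be sketched in Python as follows. This is a minimal illustration, not the platform's actual code: the Variant dataclass and the stable_hash helper are hypothetical names, and SHA-256 is an assumed choice of hash. A cryptographic digest is used because Python's built-in hash() is salted per process for strings, which would break determinism. The parts are joined with a ":" delimiter so that, for example, experiment 1 with user "23" and experiment 12 with user "3" do not produce the same key.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    id: int
    name: str
    weight: int  # relative weight, as in the Variant table


def stable_hash(*parts: object) -> int:
    """Map the parts to a non-negative integer that is stable across processes."""
    key = ":".join(str(p) for p in parts).encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def assign(experiment_id: int, traffic_pct: int,
           variants: list[Variant], user_id: str) -> Optional[Variant]:
    # Steps 1-2: traffic gate. Users whose bucket falls outside traffic_pct
    # are not in the experiment.
    if stable_hash(experiment_id, user_id) % 100 >= traffic_pct:
        return None

    ordered = sorted(variants, key=lambda v: v.id)
    total_weight = sum(v.weight for v in ordered)
    if total_weight <= 0:
        return None  # assumption: treat a misconfigured experiment as "not assigned"

    # Step 3: a second, differently keyed hash picks the variant, so the
    # variant choice is not correlated with the traffic gate.
    inner = stable_hash(experiment_id, "variant", user_id) % total_weight

    # Steps 4-5: walk the cumulative weights in id order.
    cumulative = 0
    for variant in ordered:
        cumulative += variant.weight
        if inner < cumulative:
            return variant
    return None  # not reached when total_weight > 0
```

Calling assign(7, 100, variants, "user-1") twice returns the same variant, and with traffic_pct set to 0 every user gets None.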
Exposure and Metric Collection
On assignment, the SDK fires an ExposureEvent. Downstream metric events are joined to exposures by (experiment_id, user_id). Only users with a recorded exposure are included in analysis to prevent dilution bias.
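A small in-memory sketch of that join is shown below. The tuple layouts mirror the ExposureEvent and MetricEvent columns, but the function names are hypothetical. Dropping metric events that occurred before the user's first exposure is an added assumption, not something the design states. In production this would more likely be a query in the columnar store.

```python
def first_exposures(exposures):
    """Return {(experiment_id, user_id): (variant_id, first_exposed_at)}.

    exposures: iterable of (experiment_id, user_id, variant_id, exposed_at).
    """
    first = {}
    for experiment_id, user_id, variant_id, exposed_at in exposures:
        key = (experiment_id, user_id)
        if key not in first or exposed_at < first[key][1]:
            first[key] = (variant_id, exposed_at)
    return first


def attribute_metrics(exposures, metrics):
    """Join metric events to exposures by (experiment_id, user_id).

    metrics: iterable of (experiment_id, user_id, metric_name, value, occurred_at).
    Returns a list of (variant_id, user_id, metric_name, value).
    """
    first = first_exposures(exposures)
    attributed = []
    for experiment_id, user_id, metric_name, value, occurred_at in metrics:
        hit = first.get((experiment_id, user_id))
        if hit is None:
            continue  # user was never exposed: excluded to avoid dilution
        variant_id, exposed_at = hit
        if occurred_at < exposed_at:
            continue  # assumption: events before first exposure are not attributed
        attributed.append((variant_id, user_id, metric_name, value))
    return attributed
```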
Statistical Analysis
Continuous Metrics (t-test)
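delta = mean_treatment - mean_control
SE = sqrt(var_t/n_t + var_c/n_c)
t = delta / SE
p_value = two_tailed_p(t, df=welch_df) -- unpooled SE is Welch's test, so use the Welch-Satterthwaite df rather than n_t+n_c-2
CI_95 = (delta - 1.96*SE, delta + 1.96*SE) -- 1.96 is the large-sample normal critical value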
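As a rough illustration, the computation can be written with only the Python standard library. The function name welch_test is hypothetical. Because the standard library has no Student-t CDF, this sketch uses the normal approximation for the p-value, which is an assumption that is reasonable only at the large sample sizes typical of A/B tests. It also assumes each group has at least two observations and non-zero variance.

```python
import math
from statistics import NormalDist, mean, variance


def welch_test(control, treatment, confidence=0.95):
    """Unequal-variance comparison of two samples of a continuous metric."""
    n_c, n_t = len(control), len(treatment)
    var_c, var_t = variance(control), variance(treatment)  # sample variances

    delta = mean(treatment) - mean(control)
    se = math.sqrt(var_t / n_t + var_c / n_c)
    t = delta / se

    # Welch-Satterthwaite degrees of freedom (reported for reference).
    df = (var_t / n_t + var_c / n_c) ** 2 / (
        (var_t / n_t) ** 2 / (n_t - 1) + (var_c / n_c) ** 2 / (n_c - 1)
    )

    # Assumption: large-sample normal approximation instead of the t CDF.
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return {
        "delta": delta,
        "se": se,
        "t": t,
        "df": df,
        "p_value": p_value,
        "ci": (delta - z * se, delta + z * se),
    }
```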
Binary Metrics (chi-squared)
observed = [[conversions_c, n_c - conversions_c],
[conversions_t, n_t - conversions_t]]
chi2, p_value = chi2_contingency(observed)
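For a 2x2 table the test can also be computed directly, which makes the arithmetic visible. The sketch below is a standard-library illustration with a hypothetical function name; it computes Pearson's statistic without a continuity correction, so its result can differ slightly from scipy.stats.chi2_contingency, which applies Yates' correction to 2x2 tables by default. It assumes every expected cell count is non-zero.

```python
import math


def chi2_2x2(conversions_c, n_c, conversions_t, n_t):
    """Pearson chi-squared test of independence on a 2x2 conversion table."""
    observed = [
        [conversions_c, n_c - conversions_c],
        [conversions_t, n_t - conversions_t],
    ]
    total = n_c + n_t
    row_totals = [n_c, n_t]
    col_totals = [
        conversions_c + conversions_t,
        total - conversions_c - conversions_t,
    ]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```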
Minimum Detectable Effect (MDE) and Sample Size
n = (z_alpha/2 + z_beta)^2 * 2 * sigma^2 / MDE^2 -- n is per variant; sigma is the metric's standard deviation; MDE is in absolute units
z_alpha/2 = 1.96 (95% confidence)
z_beta = 0.84 (80% power)
Pre-experiment power calculations ensure the experiment runs long enough to detect the target effect size. Results before reaching required sample size are flagged as underpowered.
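The formula translates directly into a few lines of Python. The function name required_sample_size is hypothetical, and the result is the sample size per variant for a two-sided test on a difference of means.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, mde, alpha=0.05, power=0.80):
    """Per-variant sample size to detect an absolute difference of `mde`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / mde ** 2)
```

For example, with a 10% baseline conversion rate (sigma^2 = 0.1 * 0.9 = 0.09) and an absolute MDE of one percentage point, required_sample_size(math.sqrt(0.09), 0.01) comes out at roughly 14,100 users per variant. Halving the MDE quadruples the requirement.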
Sequential Testing
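Classical fixed-horizon tests inflate the false positive rate when results are checked repeatedly and the experiment is stopped at the first significant reading. The platform uses always-valid inference (mixture sequential probability ratio test, mSPRT) to allow continuous monitoring without inflating Type I error. Experiments can be stopped early for efficacy or futility while keeping that guarantee; the trade-off is that a sequential test typically needs more samples than a fixed-horizon test to reach the same power.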
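To make the idea concrete, here is a minimal sketch of an mSPRT for a difference in means under a normal approximation. It is one common formulation, not necessarily the one the platform uses. The function names are hypothetical, and tau2 (the variance of the normal mixing prior over the true effect) is a tuning parameter that the design does not specify. At each look, var_delta is the estimated variance of the observed difference, for example var_t/n_t + var_c/n_c.

```python
import math


def msprt_likelihood_ratio(delta_hat, var_delta, tau2, theta0=0.0):
    """Mixture likelihood ratio against H0: true effect = theta0.

    Under H0, delta_hat ~ N(theta0, var_delta). Mixing the alternative over
    N(theta0, tau2) gives a marginal of N(theta0, var_delta + tau2); the
    ratio of those two normal densities is returned.
    """
    shrink = var_delta / (var_delta + tau2)
    exponent = (delta_hat - theta0) ** 2 * tau2 / (
        2 * var_delta * (var_delta + tau2)
    )
    return math.sqrt(shrink) * math.exp(exponent)


def always_valid_p_values(looks, tau2):
    """Running always-valid p-values over successive looks.

    looks: iterable of (delta_hat, var_delta), one pair per analysis time.
    The p-value is the running minimum of 1 / likelihood ratio, capped at 1,
    so it never increases from one look to the next.
    """
    p = 1.0
    history = []
    for delta_hat, var_delta in looks:
        ratio = msprt_likelihood_ratio(delta_hat, var_delta, tau2)
        p = min(p, 1.0 / ratio)
        history.append(p)
    return history
```

The experiment can be stopped for efficacy at the first look where the p-value drops below alpha, which is equivalent to the likelihood ratio crossing 1/alpha.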
Guardrail Metrics
GuardrailMetric
---------------
experiment_id BIGINT
metric_name VARCHAR
direction ENUM('higher_better','lower_better')
max_degradation FLOAT DEFAULT 0.05 -- 5%
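If a treatment variant degrades a guardrail metric by more than max_degradation (5% by default) relative to control, measured in the direction that counts as worse for that metric, and the difference is statistically significant (p < 0.05), the experiment is automatically paused and the team is alerted. Common guardrail metrics: page load time, error rate, checkout completion rate.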
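The pause rule can be sketched as below. The function names are hypothetical; the direction values and the 0.05 defaults come from the GuardrailMetric table and the rule above. The sketch assumes the control value is non-zero and that the p-value is supplied by the statistical analysis step.

```python
def relative_degradation(control_value, treatment_value, direction):
    """Relative change versus control, signed so that positive means worse."""
    change = (treatment_value - control_value) / control_value
    if direction == "higher_better":
        return -change  # a drop is a degradation
    if direction == "lower_better":
        return change   # a rise is a degradation
    raise ValueError(f"unknown direction: {direction!r}")


def should_pause(control_value, treatment_value, direction, p_value,
                 max_degradation=0.05, alpha=0.05):
    """True when the guardrail is breached and the breach is significant."""
    degradation = relative_degradation(control_value, treatment_value, direction)
    return degradation > max_degradation and p_value < alpha
```

For instance, a checkout completion rate that falls from 0.40 to 0.37 is a 7.5% relative degradation for a higher_better metric, so should_pause(0.40, 0.37, "higher_better", 0.01) returns True.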
Novelty Effect Detection
New features often see inflated engagement in the first few days. The platform flags experiments where the treatment effect is significantly higher in the first 20% of the experiment window compared to the remaining period. These results are marked with a novelty warning on the dashboard.
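One way to implement this check is a one-sided z-test comparing the treatment effect estimated in the early window with the effect in the remaining period. This is a sketch under assumptions the design does not state: the function names are hypothetical, the two estimates are treated as independent (which is only approximately true when the same users appear in both periods), and alpha = 0.05 is an assumed threshold.

```python
import math
from statistics import NormalDist


def early_window_end(start, end, early_fraction=0.2):
    """Timestamp that separates the first 20% of the window from the rest."""
    return start + (end - start) * early_fraction


def novelty_warning(early_effect, early_se, late_effect, late_se, alpha=0.05):
    """True when the early effect is significantly larger than the late one.

    Each effect is a treatment-minus-control difference with its standard
    error, computed separately on events before and after early_window_end.
    """
    z = (early_effect - late_effect) / math.sqrt(early_se ** 2 + late_se ** 2)
    p_value = 1 - NormalDist().cdf(z)  # one-sided: early > late
    return p_value < alpha
```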
Results Dashboard
Per variant display:
- Sample size (exposed users)
- Metric value (mean or rate)
- Absolute lift vs control
- Relative lift % with 95% CI
- p-value
- Statistical significance badge (significant / not significant / underpowered)
- Sequential test boundary chart
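The lift and badge fields could be derived roughly as follows. The function names are hypothetical; the rule that an experiment below its required sample size is shown as underpowered regardless of its p-value follows the sample size section, and alpha = 0.05 is an assumed default.

```python
def lifts(control_value, treatment_value):
    """Return (absolute lift, relative lift in percent) versus control."""
    absolute = treatment_value - control_value
    relative_pct = 100.0 * absolute / control_value  # assumes control_value != 0
    return absolute, relative_pct


def significance_badge(n_exposed, required_n, p_value, alpha=0.05):
    """Pick the dashboard badge for one variant."""
    if n_exposed < required_n:
        return "underpowered"
    return "significant" if p_value < alpha else "not significant"
```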
Scale Considerations
- Assignment is stateless and CPU-only — no DB read per assignment after caching experiment configs in Redis.
- Exposure and metric events go to Kafka; aggregated asynchronously into a columnar store (e.g., ClickHouse) for fast slice-and-dice.
- Statistical computations run as scheduled jobs (hourly); results cached for dashboard reads.
- Index ExposureEvent(experiment_id, user_id) for efficient joins to metric events.