Experimentation Platform: Low Level Design
An experimentation platform enables A/B and multivariate tests with rigorous statistical guarantees. This design covers experiment assignment, exposure and metric collection, statistical analysis, sequential testing, and guardrail enforcement.
Data Model
Experiment Table
Experiment
----------
id BIGINT PRIMARY KEY
name VARCHAR NOT NULL
hypothesis TEXT
status ENUM('draft','running','paused','concluded')
start_date DATE
end_date DATE
traffic_pct INT CHECK (traffic_pct BETWEEN 0 AND 100)
created_at TIMESTAMP
Variant Table
Variant
-------
id BIGINT PRIMARY KEY
experiment_id BIGINT REFERENCES Experiment(id)
name VARCHAR NOT NULL
weight INT NOT NULL -- relative weight, e.g. 50/50 or 33/33/34
ExposureEvent Table
ExposureEvent
-------------
id BIGINT PRIMARY KEY
experiment_id BIGINT
variant_id BIGINT
user_id VARCHAR
exposed_at TIMESTAMP
MetricEvent Table
MetricEvent
-----------
id BIGINT PRIMARY KEY
experiment_id BIGINT
user_id VARCHAR
metric_name VARCHAR
value DOUBLE PRECISION
occurred_at TIMESTAMP
Assignment Algorithm
assign(experiment_id, user_id):
1. bucket = hash(experiment_id + user_id) mod 100
2. if bucket >= experiment.traffic_pct: return null (not in experiment)
3. inner = hash(experiment_id + 'variant' + user_id) mod total_weight
4. cumulative = 0
5. for variant in variants ordered by id:
       cumulative += variant.weight
       if inner < cumulative: return variant
Assignment is deterministic: the same user always gets the same variant, so assignments need not be persisted. The ExposureEvent stream serves as the audit record when one is required.
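The steps above can be sketched in Python. This is a minimal illustration, not the platform's SDK: experiments and variants are plain dicts mirroring the tables, and SHA-256 stands in for whatever stable hash the platform actually uses (Python's built-in `hash()` is not stable across processes).

```python
import hashlib

def _bucket(key: str, mod: int) -> int:
    # Stable across processes and runs, unlike Python's built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % mod

def assign(experiment: dict, user_id: str):
    """Return the variant dict for user_id, or None if not enrolled.

    experiment = {"id": ..., "traffic_pct": ...,
                  "variants": [{"id": ..., "weight": ...}, ...]}
    """
    # Step 1-2: traffic gate on the outer bucket.
    if _bucket(f"{experiment['id']}:{user_id}", 100) >= experiment["traffic_pct"]:
        return None  # user falls outside the experiment's traffic slice
    # Step 3: independent inner hash over the total variant weight.
    total_weight = sum(v["weight"] for v in experiment["variants"])
    inner = _bucket(f"{experiment['id']}:variant:{user_id}", total_weight)
    # Steps 4-5: walk variants in id order, accumulating weights.
    cumulative = 0
    for variant in sorted(experiment["variants"], key=lambda v: v["id"]):
        cumulative += variant["weight"]
        if inner < cumulative:
            return variant
```

Using two independent hashes (one for the traffic gate, one for variant selection) avoids correlating enrollment with variant choice.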
Exposure and Metric Collection
On assignment, the SDK fires an ExposureEvent. Downstream metric events are joined to exposures by (experiment_id, user_id). Only users with a recorded exposure are included in analysis to prevent dilution bias.
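The exposure join can be sketched as a dictionary lookup keyed by (experiment_id, user_id). This sketch ignores event-time ordering (a real pipeline would also require occurred_at >= exposed_at) and treats events as plain dicts mirroring the tables above.

```python
def exposed_metric_values(exposures, metric_events, metric_name):
    """Group metric values by variant, keeping only exposed users.

    Metric events from users with no recorded ExposureEvent are dropped,
    which is what prevents dilution bias.
    """
    variant_of = {(e["experiment_id"], e["user_id"]): e["variant_id"]
                  for e in exposures}
    by_variant = {}
    for m in metric_events:
        if m["metric_name"] != metric_name:
            continue
        variant = variant_of.get((m["experiment_id"], m["user_id"]))
        if variant is None:
            continue  # no exposure recorded -> excluded from analysis
        by_variant.setdefault(variant, []).append(m["value"])
    return by_variant
```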
Statistical Analysis
Continuous Metrics (t-test)
t = (mean_treatment - mean_control) / sqrt(var_t/n_t + var_c/n_c)
df = Welch–Satterthwaite approximation (≈ n_t + n_c - 2 when variances are equal)
p_value = two_tailed_p(t, df)
CI_95 = (delta - 1.96*SE, delta + 1.96*SE)
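A stdlib-only sketch of the continuous-metric test. Note the hedge: for the large samples typical of online experiments the t distribution is close to normal, so the two-tailed p-value here uses the normal approximation via `erfc` rather than an exact t CDF (which would need scipy).

```python
from math import sqrt, erfc
from statistics import mean, variance

def welch_t_test(treatment, control):
    """Two-sample t statistic with unpooled variances (Welch form).

    Returns (t, two-tailed p under the normal approximation, 95% CI
    for the difference in means).
    """
    n_t, n_c = len(treatment), len(control)
    delta = mean(treatment) - mean(control)
    se = sqrt(variance(treatment) / n_t + variance(control) / n_c)
    t = delta / se
    p = erfc(abs(t) / sqrt(2))  # = 2 * (1 - Phi(|t|))
    ci_95 = (delta - 1.96 * se, delta + 1.96 * se)
    return t, p, ci_95
```

`statistics.variance` is the sample (n-1) variance, which is what the t statistic calls for.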
Binary Metrics (chi-squared)
observed = [[conversions_c, n_c - conversions_c],
[conversions_t, n_t - conversions_t]]
chi2, p_value = chi2_contingency(observed)
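For the binary case, `chi2_contingency` above is scipy; the same test can be sketched with the stdlib using the 2x2 closed form. For 1 degree of freedom the chi-squared tail probability reduces to `erfc(sqrt(chi2/2))`, so no special-function library is needed. No Yates continuity correction is applied in this sketch.

```python
from math import erfc, sqrt

def chi2_2x2(conversions_c, n_c, conversions_t, n_t):
    """Pearson chi-squared test on a 2x2 contingency table.

    Uses the closed form n*(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).
    Returns (chi2 statistic, p-value for 1 degree of freedom).
    """
    a, b = conversions_c, n_c - conversions_c
    c, d = conversions_t, n_t - conversions_t
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(chi2 / 2))  # chi2(1) tail via the normal distribution
    return chi2, p
```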
Minimum Detectable Effect (MDE) and Sample Size
n (per arm) = (z_alpha/2 + z_beta)^2 * 2 * sigma^2 / MDE^2
z_alpha/2 = 1.96 (95% confidence)
z_beta = 0.84 (80% power)
Pre-experiment power calculations ensure the experiment runs long enough to detect the target effect size. Results before reaching required sample size are flagged as underpowered.
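The sample-size formula above translates directly to code. With sigma = 1 and MDE = 0.1, it yields roughly 1,568 users per arm at 95% confidence and 80% power.

```python
from math import ceil

def required_sample_size(sigma, mde, z_alpha=1.96, z_beta=0.84):
    """Per-arm sample size for a two-sample test of means.

    n = (z_alpha/2 + z_beta)^2 * 2 * sigma^2 / MDE^2, with the
    defaults giving 95% confidence and 80% power.
    """
    return ceil((z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / mde ** 2)
```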
Sequential Testing
Classical fixed-horizon tests inflate false positive rates when peeked at early. The platform uses always-valid inference (mixture sequential probability ratio test) to allow continuous monitoring without inflating Type I error. Experiments can be stopped early for efficacy or futility without penalty.
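One step of the always-valid p-value can be sketched under the normal-mixture mSPRT construction. This is a sketch of the textbook form, not the platform's implementation: `tau2` is the mixing-prior variance, a tuning parameter assumed here, and the update `p = min(p_prev, 1/LR)` is what makes the p-value valid at every peek.

```python
from math import sqrt, exp

def msprt_p_value(delta_hat, var_delta, tau2=1.0, prev_p=1.0):
    """One update of an always-valid p-value (normal-mixture mSPRT).

    delta_hat: current difference-in-means estimate
    var_delta: its variance, i.e. var_t/n_t + var_c/n_c
    """
    # Mixture likelihood ratio against the null delta = 0.
    lr = sqrt(var_delta / (var_delta + tau2)) * exp(
        tau2 * delta_hat ** 2 / (2 * var_delta * (var_delta + tau2))
    )
    # The running minimum keeps the p-value monotone non-increasing.
    return min(prev_p, 1.0 / lr)
```

Because the p-value only ever decreases, the dashboard can be refreshed on every aggregation run without inflating Type I error.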
Guardrail Metrics
GuardrailMetric
---------------
experiment_id BIGINT
metric_name VARCHAR
direction ENUM('higher_better','lower_better')
max_degradation FLOAT DEFAULT 0.05 -- 5%
If a treatment variant degrades a guardrail metric by more than its configured max_degradation (default 5%) relative to control, with p < 0.05, the experiment is automatically paused and the team is alerted. Common guardrail metrics: page load time, error rate, checkout completion rate.
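The pause rule can be sketched as a pure predicate; the pause and alerting side effects would live elsewhere. The `direction` values match the enum in the table above.

```python
def guardrail_violated(control_value, treatment_value, direction,
                       p_value, max_degradation=0.05, alpha=0.05):
    """True if treatment degrades the guardrail beyond the threshold
    with statistical significance.

    direction: 'higher_better' or 'lower_better'.
    """
    if control_value == 0:
        return False  # relative change undefined; handled elsewhere
    rel_change = (treatment_value - control_value) / abs(control_value)
    if direction == "higher_better":
        degraded = rel_change < -max_degradation  # a drop is a degradation
    else:
        degraded = rel_change > max_degradation   # an increase is a degradation
    return degraded and p_value < alpha
```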
Novelty Effect Detection
New features often see inflated engagement in the first few days. The platform flags experiments where the treatment effect is significantly higher in the first 20% of the experiment window compared to the remaining period. These results are marked with a novelty warning on the dashboard.
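One way to implement the flag, sketched here under the assumption that the pipeline already produces separate effect estimates (with standard errors) for the first 20% of the window and for the remainder: compare the two estimates with a two-sample z-test and warn when the early effect is significantly larger.

```python
from math import sqrt, erfc

def novelty_warning(effect_early, se_early, effect_late, se_late,
                    alpha=0.05):
    """Flag a likely novelty effect.

    The two windows are disjoint, so their estimates are treated as
    independent and the difference is tested with a z statistic.
    """
    z = (effect_early - effect_late) / sqrt(se_early ** 2 + se_late ** 2)
    p = erfc(abs(z) / sqrt(2))  # two-tailed
    return effect_early > effect_late and p < alpha
```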
Results Dashboard
Per variant display:
- Sample size (exposed users)
- Metric value (mean or rate)
- Absolute lift vs control
- Relative lift % with 95% CI
- p-value
- Statistical significance badge (significant / not significant / underpowered)
- Sequential test boundary chart
Scale Considerations
- Assignment is stateless and CPU-only — no DB read per assignment after caching experiment configs in Redis.
- Exposure and metric events go to Kafka; aggregated asynchronously into a columnar store (e.g., ClickHouse) for fast slice-and-dice.
- Statistical computations run as scheduled jobs (hourly); results cached for dashboard reads.
- Index ExposureEvent(experiment_id, user_id) for efficient joins to metric events.
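The config-caching point can be illustrated with a minimal in-process TTL cache; Redis plays this role in the actual design, and `loader` here stands in for whatever database read populates it on a miss.

```python
import time

class ConfigCache:
    """In-process TTL cache for experiment configs.

    Assignment reads hit this cache; only a miss (or an expired entry)
    invokes `loader`, a callable mapping experiment_id -> config.
    """
    def __init__(self, loader, ttl_seconds=60):
        self._loader = loader
        self._ttl = ttl_seconds
        self._entries = {}  # experiment_id -> (expires_at, config)

    def get(self, experiment_id):
        now = time.monotonic()
        entry = self._entries.get(experiment_id)
        if entry and entry[0] > now:
            return entry[1]  # fresh hit: no DB read
        config = self._loader(experiment_id)
        self._entries[experiment_id] = (now + self._ttl, config)
        return config
```

A short TTL bounds how stale a paused experiment's config can be at assignment time.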