Low Level Design: Experimentation Platform

An experimentation platform enables A/B and multivariate tests with rigorous statistical guarantees. This design covers experiment assignment, exposure and metric collection, statistical analysis, sequential testing, and guardrail enforcement.

Data Model

Experiment Table

Experiment
----------
id          BIGINT PRIMARY KEY
name        VARCHAR NOT NULL
hypothesis  TEXT
status      ENUM('draft','running','paused','concluded')
start_date  DATE
end_date    DATE
traffic_pct INT CHECK (traffic_pct BETWEEN 0 AND 100)
created_at  TIMESTAMP

Variant Table

Variant
-------
id            BIGINT PRIMARY KEY
experiment_id BIGINT REFERENCES Experiment(id)
name          VARCHAR NOT NULL
weight        INT NOT NULL   -- relative weight, e.g. 50/50 or 33/33/34

ExposureEvent Table

ExposureEvent
-------------
id            BIGINT PRIMARY KEY
experiment_id BIGINT
variant_id    BIGINT
user_id       VARCHAR
exposed_at    TIMESTAMP

MetricEvent Table

MetricEvent
-----------
id            BIGINT PRIMARY KEY
experiment_id BIGINT
user_id       VARCHAR
metric_name   VARCHAR
value         DOUBLE PRECISION
occurred_at   TIMESTAMP

Assignment Algorithm

assign(experiment_id, user_id):
  1. bucket = hash(experiment_id + user_id) mod 100
  2. if bucket >= experiment.traffic_pct: return null  (user not in experiment)
  3. total_weight = sum of variant.weight over all variants
  4. inner = hash(experiment_id + 'variant' + user_id) mod total_weight
  5. cumulative = 0
  6. for variant in variants ordered by id:
       cumulative += variant.weight
       if inner < cumulative: return variant

Assignment is deterministic — same user always gets the same variant. Assignments are not stored unless exposure logging is required for audit.
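The steps above can be sketched in Python. The experiment and variant objects, field names, and the colon-delimited hash keys are illustrative assumptions; the essential points are a hash that is stable across processes (unlike Python's built-in hash()) and weight-proportional variant selection.

```python
import hashlib

def _bucket(key: str, buckets: int) -> int:
    # SHA-256 gives a deterministic, well-distributed bucket that is
    # stable across processes, machines, and interpreter restarts.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def assign(experiment, user_id: str):
    """Return the variant assigned to user_id, or None if not enrolled.

    `experiment` is assumed to carry .id, .traffic_pct, and .variants
    (each with .id, .name, .weight), mirroring the tables above.
    """
    # Step 1-2: traffic gating
    if _bucket(f"{experiment.id}:{user_id}", 100) >= experiment.traffic_pct:
        return None  # user falls outside the experiment's traffic slice
    # Step 3-6: weight-proportional variant selection
    total_weight = sum(v.weight for v in experiment.variants)
    inner = _bucket(f"{experiment.id}:variant:{user_id}", total_weight)
    cumulative = 0
    for variant in sorted(experiment.variants, key=lambda v: v.id):
        cumulative += variant.weight
        if inner < cumulative:
            return variant
```

Because both hashes depend only on the experiment ID and user ID, repeated calls always return the same variant with no stored state.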

Exposure and Metric Collection

On assignment, the SDK fires an ExposureEvent. Downstream metric events are joined to exposures by (experiment_id, user_id). Only users with a recorded exposure are included in analysis to prevent dilution bias.
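A minimal sketch of the exposure-to-metric join, using in-memory dicts shaped like the event rows above (field names are illustrative). It also drops metric events that predate the user's exposure, an assumption here rather than something the schema enforces:

```python
def exposed_metric_values(exposures, metrics, experiment_id, metric_name):
    """Group metric values by variant, restricted to exposed users."""
    # user_id -> (variant_id, exposed_at) for this experiment
    exposed = {
        e["user_id"]: (e["variant_id"], e["exposed_at"])
        for e in exposures
        if e["experiment_id"] == experiment_id
    }
    by_variant = {}
    for m in metrics:
        if m["experiment_id"] != experiment_id or m["metric_name"] != metric_name:
            continue
        entry = exposed.get(m["user_id"])
        if entry is None:
            continue  # unexposed user: excluded to prevent dilution bias
        variant_id, exposed_at = entry
        if m["occurred_at"] < exposed_at:
            continue  # metric predates exposure; not attributable
        by_variant.setdefault(variant_id, []).append(m["value"])
    return by_variant
```

In production this join runs as a query over the columnar store rather than in application memory, but the grouping logic is the same.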

Statistical Analysis

Continuous Metrics (t-test)

SE      = sqrt(var_t/n_t + var_c/n_c)
t       = (mean_treatment - mean_control) / SE
p_value = two_tailed_p(t, df)   -- df via the Welch–Satterthwaite approximation
CI_95   = (delta - 1.96*SE, delta + 1.96*SE)   where delta = mean_treatment - mean_control
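A self-contained sketch of the Welch two-sample test. For the p-value it uses a normal approximation to the t distribution (via erfc), which is an assumption that holds well at the sample sizes typical of online experiments:

```python
import math

def welch_t_test(treatment, control):
    """Welch t statistic, two-tailed p-value (large-sample normal
    approximation), and 95% CI for the difference in means."""
    n_t, n_c = len(treatment), len(control)
    mean_t = sum(treatment) / n_t
    mean_c = sum(control) / n_c
    # Sample variances (Bessel-corrected)
    var_t = sum((x - mean_t) ** 2 for x in treatment) / (n_t - 1)
    var_c = sum((x - mean_c) ** 2 for x in control) / (n_c - 1)
    se = math.sqrt(var_t / n_t + var_c / n_c)
    delta = mean_t - mean_c
    t = delta / se
    # Two-tailed tail probability of a standard normal
    p = math.erfc(abs(t) / math.sqrt(2))
    ci_95 = (delta - 1.96 * se, delta + 1.96 * se)
    return t, p, ci_95
```

With small samples, substitute the exact t distribution (e.g. scipy.stats) for the erfc approximation.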

Binary Metrics (chi-squared)

observed = [[conversions_c, n_c - conversions_c],
            [conversions_t, n_t - conversions_t]]
chi2, p_value = chi2_contingency(observed)
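For a 2x2 table the Pearson chi-squared statistic has a closed form, and the chi-squared(1) tail probability reduces to erfc(sqrt(x/2)), so the test needs no external library. A sketch without continuity correction:

```python
import math

def chi2_2x2(conversions_c, n_c, conversions_t, n_t):
    """Pearson chi-squared test on a 2x2 contingency table.

    Returns (chi2, p_value). Uses the closed form
    chi2 = n*(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)) and the
    chi-squared(1) survival function erfc(sqrt(chi2/2))."""
    a, b = conversions_c, n_c - conversions_c
    c, d = conversions_t, n_t - conversions_t
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

scipy.stats.chi2_contingency gives the same result (with `correction=False`) plus expected frequencies for sanity checks.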

Minimum Detectable Effect (MDE) and Sample Size

n_per_group = (z_alpha/2 + z_beta)^2 * 2 * sigma^2 / MDE^2

z_alpha/2 = 1.96  (95% confidence, two-sided)
z_beta    = 0.84  (80% power)

Pre-experiment power calculations ensure the experiment runs long enough to detect the target effect size. Results before reaching required sample size are flagged as underpowered.
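The sample-size formula above translates directly to code. Defaults correspond to a two-sided test at 95% confidence and 80% power; sigma is the assumed common standard deviation and the MDE is expressed on the metric's absolute scale:

```python
import math

def sample_size_per_group(sigma, mde, z_alpha=1.96, z_beta=0.84):
    """Required users per group to detect an absolute effect of `mde`
    on a metric with standard deviation `sigma`."""
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / mde ** 2)
```

For example, with sigma = 1.0 and an MDE of 0.1, each group needs roughly 1,570 users; halving the MDE quadruples the requirement.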

Sequential Testing

Classical fixed-horizon tests inflate false positive rates when peeked at early. The platform uses always-valid inference (mixture sequential probability ratio test) to allow continuous monitoring without inflating Type I error. Experiments can be stopped early for efficacy or futility without penalty.
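A sketch of the always-valid p-value update under the mSPRT with a normal mixing density (variance tau^2), following the form popularized by Johari et al.; the choice of tau is a tuning assumption, and delta/se here are the current observed difference in means and its standard error:

```python
import math

def msprt_p_value(delta, se, tau=1.0, p_prev=1.0):
    """One update step of the mSPRT always-valid p-value.

    delta:  current observed difference in means (treatment - control)
    se:     its standard error, sqrt(var_t/n_t + var_c/n_c)
    tau:    mixing standard deviation hyperparameter (tuning choice)
    p_prev: previous always-valid p-value (the sequence never increases)
    """
    v = se ** 2
    # Mixture likelihood ratio against the null of zero difference
    likelihood_ratio = math.sqrt(v / (v + tau ** 2)) * math.exp(
        delta ** 2 * tau ** 2 / (2 * v * (v + tau ** 2))
    )
    return min(p_prev, 1.0 / likelihood_ratio)
```

The dashboard can recompute this on every refresh: because the p-value remains valid at every observation time, analysts may peek freely and stop as soon as it crosses alpha.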

Guardrail Metrics

GuardrailMetric
---------------
experiment_id   BIGINT
metric_name     VARCHAR
direction       ENUM('higher_better','lower_better')
max_degradation FLOAT DEFAULT 0.05  -- 5%

If a treatment variant degrades a guardrail metric by more than max_degradation (5% by default) relative to control, with p < 0.05, the experiment is automatically paused and the team is alerted. Common guardrail metrics: page load time, error rate, checkout completion rate.
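A minimal sketch of the pause decision, assuming the significance test has already been run upstream and its p-value is passed in:

```python
def guardrail_breached(control_value, treatment_value, direction,
                       p_value, max_degradation=0.05, alpha=0.05):
    """True if the treatment significantly degrades a guardrail metric
    by more than max_degradation relative to control."""
    if p_value >= alpha:
        return False  # observed degradation is not statistically significant
    rel_change = (treatment_value - control_value) / control_value
    if direction == "higher_better":
        return rel_change < -max_degradation  # e.g. checkout completion fell
    else:  # 'lower_better', e.g. error rate or page load time
        return rel_change > max_degradation   # metric rose past the threshold
```

The experiment runner would evaluate this for every GuardrailMetric row on each analysis cycle and flip the experiment status to 'paused' on the first breach.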

Novelty Effect Detection

New features often see inflated engagement in the first few days. The platform flags experiments where the treatment effect is significantly higher in the first 20% of the experiment window compared to the remaining period. These results are marked with a novelty warning on the dashboard.
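One simple way to implement this check, given a series of per-day treatment effects; the 20% split and the ratio threshold are illustrative tuning choices, and a production check would additionally test the early-vs-late difference for significance:

```python
def novelty_warning(daily_effects, early_frac=0.2, ratio_threshold=1.5):
    """Flag a possible novelty effect: the mean daily treatment effect in
    the first early_frac of the window is ratio_threshold times larger
    than in the remainder."""
    cut = max(1, int(len(daily_effects) * early_frac))
    early = sum(daily_effects[:cut]) / cut
    rest = sum(daily_effects[cut:]) / (len(daily_effects) - cut)
    # Only meaningful when the steady-state effect is positive
    return rest > 0 and early / rest > ratio_threshold
```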

Results Dashboard

Per variant display:
- Sample size (exposed users)
- Metric value (mean or rate)
- Absolute lift vs control
- Relative lift % with 95% CI
- p-value
- Statistical significance badge (significant / not significant / underpowered)
- Sequential test boundary chart

Scale Considerations

  • Assignment is stateless and CPU-only — no DB read per assignment after caching experiment configs in Redis.
  • Exposure and metric events go to Kafka; aggregated asynchronously into a columnar store (e.g., ClickHouse) for fast slice-and-dice.
  • Statistical computations run as scheduled jobs (hourly); results cached for dashboard reads.
  • Index ExposureEvent(experiment_id, user_id) for efficient join to metric events.

