How is statistical significance calculated for a conversion rate experiment?

For a binary conversion metric, a two-proportion z-test computes the z-statistic as (p_t - p_c) / sqrt(p_pooled * (1 - p_pooled) * (1/n_t + 1/n_c)), where p_pooled is the combined conversion rate across both groups. The resulting p-value is compared against a pre-specified alpha (typically 0.05) to determine whether to reject the null hypothesis of no difference.

What is sequential testing and how does it prevent false positives from peeking?

Sequential testing lets results be checked repeatedly while keeping the overall type I error rate at or below alpha. Two families are common. Group sequential designs spend the alpha budget across a planned set of interim looks, so each look uses a stricter threshold than a single fixed-horizon test would. Always-valid methods such as the mSPRT produce p-values that remain valid at any stopping time, without fixing the number of looks in advance. In both cases, early stopping is allowed without the false positive inflation that peeking causes in fixed-horizon tests.

How is minimum detectable effect used in sample size calculation?

The minimum detectable effect (MDE) is the smallest relative or absolute change in the metric that the experiment is powered to detect at a given alpha and power (typically 80%). For a proportion metric, the required sample size per variant scales as n ≈ 2 * (z_alpha + z_beta)^2 * p*(1-p) / MDE^2, where MDE is expressed as an absolute difference in the rate and z_alpha is the critical value for the chosen alpha (the normal quantile at 1 - alpha/2 for a two-tailed test). Because n grows with 1 / MDE^2, halving the MDE roughly quadruples the required sample, so MDE is chosen based on the business's minimum meaningful improvement rather than the smallest possible change.

How does segmented analysis reveal heterogeneous treatment effects?

Segmented analysis splits the experiment population by pre-specified dimensions (e.g., platform, country, user tenure) and computes treatment effect estimates independently within each segment to detect subgroups where the treatment helps or harms more than average. Because running many segment tests inflates the false discovery rate, corrections such as Benjamini-Hochberg or pre-registration of segments of interest are applied before drawing conclusions.
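The Benjamini-Hochberg step-up procedure mentioned above can be sketched in a few lines. This is a minimal illustration, not the platform's actual implementation; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is <= (k / m) * fdr, then reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical per-segment p-values: mobile, desktop, new users, returning users.
segment_p_values = [0.003, 0.04, 0.02, 0.6]
print(benjamini_hochberg(segment_p_values))
```

Note that the rule is "step-up": a p-value that misses its own threshold is still rejected if a larger p-value further down the sorted list meets its threshold.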

Experiment Logging Service Low-Level Design: Metric Collection, Statistical Analysis, and Results Dashboard

Experiment Logging Service: Low-Level Design

The experiment logging service collects exposure and conversion events, computes metric values per variant, runs statistical tests to determine significance, and surfaces results in a dashboard. It is the analytical backbone of an experimentation platform.

Event Types

Exposure event — logged when a user is assigned to a variant and encounters the treatment. Fields: user_id, experiment_id, variant, timestamp
Conversion event — logged when a user completes a goal metric (purchase, signup, click, etc.). Fields: user_id, event_type, value (e.g., revenue amount), timestamp

Attribution: a conversion is attributed to an experiment if the user had an exposure event for that experiment within the attribution window (typically 7–30 days).
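A minimal sketch of the attribution rule, using the event fields listed above. It assumes the window starts at the user's first exposure to the experiment and that events are plain dicts; both are illustrative choices, and `attribute_conversions` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def attribute_conversions(exposures, conversions, window_days=7):
    """Attach each conversion to experiments the user was exposed to in the window."""
    window = timedelta(days=window_days)
    # Assumption: the attribution window opens at the first exposure
    # per (user, experiment) pair.
    first_exposure = {}
    for e in exposures:
        key = (e["user_id"], e["experiment_id"])
        if key not in first_exposure or e["timestamp"] < first_exposure[key]["timestamp"]:
            first_exposure[key] = e

    attributed = []
    for c in conversions:
        for (user_id, experiment_id), e in first_exposure.items():
            if user_id != c["user_id"]:
                continue
            elapsed = c["timestamp"] - e["timestamp"]
            # Conversions before exposure or after the window are ignored.
            if timedelta(0) <= elapsed <= window:
                attributed.append({
                    "experiment_id": experiment_id,
                    "variant": e["variant"],
                    "user_id": user_id,
                    "event_type": c["event_type"],
                    "value": c["value"],
                })
    return attributed
```

In a warehouse this would more likely be a join between the exposure and conversion tables with a timestamp range condition; the loop above only shows the logic.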

Metric Types

Binary (proportion) — conversion rate: did the user convert? (0 or 1 per user)
Continuous — revenue per user, session duration, page load time; can take any numeric value
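Ratio — click-through rate (clicks / impressions); requires special handling because the denominator is itself random and the unit of analysis (impression) differs from the unit of randomization (user), so the per-user numerator and denominator are correlated and the variance is usually estimated with the delta method or a bootstrap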
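For ratio metrics, one common way to get a variance that respects user-level randomization is the delta method. The sketch below is illustrative only; the function name `ratio_metric_variance` and its inputs (one clicks value and one impressions value per user) are assumptions.

```python
from statistics import mean


def ratio_metric_variance(clicks, impressions):
    """Delta-method estimate for the ratio sum(clicks) / sum(impressions).

    clicks[i] and impressions[i] are the totals for user i.
    Returns (ratio, variance_of_ratio).
    """
    n = len(clicks)
    mx, my = mean(clicks), mean(impressions)
    r = mx / my
    # Sample variances and covariance across users (n - 1 denominator).
    var_x = sum((x - mx) ** 2 for x in clicks) / (n - 1)
    var_y = sum((y - my) ** 2 for y in impressions) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(clicks, impressions)) / (n - 1)
    # First-order Taylor expansion of x_bar / y_bar around (mx, my).
    var_r = (var_x - 2 * r * cov_xy + r * r * var_y) / (n * my * my)
    return r, var_r
```

When every user has the same number of impressions, the covariance and denominator variance vanish and the result reduces to the ordinary variance of a mean.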

Data Collection Pipeline

Events flow through: client SDK → Kafka topic → batch consumer → experiment metrics table in the warehouse. The pipeline runs hourly for most experiments, with a near-real-time mode (5-minute lag) for critical launches. The metrics table stores one row per user per experiment per day: (experiment_id, variant, user_id, date, conversions, revenue, exposures).

Metrics Computation

For each experiment and variant, the pipeline computes:

Sample size (n) — unique users exposed
Mean (μ) — average metric value per user
Variance (σ²) — spread of metric values across users
Standard error — σ / √n

These are stored in a results table and fed into the statistical test layer.
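As a rough sketch of this step, the daily rows from the metrics table can be rolled up to one value per user and then summarized per variant. The row layout follows the columns listed earlier; the function name `variant_summaries` and the dict-based rows are assumptions for illustration.

```python
import math
from collections import defaultdict
from statistics import mean, variance


def variant_summaries(rows, metric="revenue"):
    """Compute n, mean, variance, and standard error per variant.

    rows: dicts with at least variant, user_id, and the metric column,
    one row per user per day.
    """
    # Roll daily rows up to one total per (variant, user).
    per_user = defaultdict(float)
    for row in rows:
        per_user[(row["variant"], row["user_id"])] += row[metric]

    by_variant = defaultdict(list)
    for (variant, _user_id), total in per_user.items():
        by_variant[variant].append(total)

    summaries = {}
    for variant, values in by_variant.items():
        n = len(values)
        # Sample variance needs at least two users; fall back to 0.0 otherwise.
        var = variance(values) if n > 1 else 0.0
        summaries[variant] = {
            "n": n,
            "mean": mean(values),
            "variance": var,
            "std_error": math.sqrt(var / n),
        }
    return summaries
```

Aggregating to the user level first matters because the user is the unit of randomization; treating each daily row as an independent sample would understate the variance.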

Statistical Tests

Z-test for binary metrics (conversion rates):

p_pool = (conversions_1 + conversions_2) / (n_1 + n_2)
z = (p_1 - p_2) / sqrt(p_pool * (1 - p_pool) * (1/n_1 + 1/n_2))

Reject the null hypothesis of no difference when |z| exceeds the critical value (1.96 for alpha = 0.05, two-tailed).
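The pooled z-test above translates directly into code. This is a minimal sketch using only the standard library; the function name `two_proportion_z_test` and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conversions_1, n_1, conversions_2, n_2):
    """Return (z, two-tailed p-value) for the pooled two-proportion z-test."""
    p_1 = conversions_1 / n_1
    p_2 = conversions_2 / n_2
    p_pool = (conversions_1 + conversions_2) / (n_1 + n_2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    # Two-tailed: probability of a statistic at least this extreme in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 120 of 1000 users convert in variant 1, 100 of 1000 in variant 2.
z, p = two_proportion_z_test(120, 1000, 100, 1000)
print(f"z = {z:.3f}, p = {p:.3f}")
```

The normal approximation behind this test is reasonable when each group has a fair number of both conversions and non-conversions; for very small counts an exact test is a safer choice.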

T-test for continuous metrics:

t = (μ_1 - μ_2) / sqrt(σ_1²/n_1 + σ_2²/n_2)

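This is Welch's t-test, which does not assume equal variances in the two groups. Degrees of freedom are computed via the Welch-Satterthwaite approximation. Used for revenue, latency, session duration.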
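A small sketch of the statistic and its degrees of freedom, computed from the per-variant summary values. The function name `welch_t` is hypothetical. The p-value is left out because the standard library has no Student's t CDF; in practice it would come from a statistics package.

```python
import math


def welch_t(mean_1, var_1, n_1, mean_2, var_2, n_2):
    """Return (t, degrees_of_freedom) for a two-sample test with unequal variances."""
    a = var_1 / n_1  # squared standard error of group 1
    b = var_2 / n_2  # squared standard error of group 2
    t = (mean_1 - mean_2) / math.sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (a + b) ** 2 / (a ** 2 / (n_1 - 1) + b ** 2 / (n_2 - 1))
    return t, df
```

With equal variances and equal group sizes the degrees of freedom come out to n_1 + n_2 - 2; when one group is much noisier, they drop toward that group's n - 1.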

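Mann-Whitney U test — non-parametric alternative when metric distribution is heavily skewed (e.g., revenue with large outliers). Does not assume normality. It works on the ranks of the observations rather than their values, so a significant result indicates that one group tends to produce larger values, not necessarily that the means differ.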
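To make the rank-based idea concrete, here is a simplified sketch of the U statistic with a normal approximation for the p-value. It assigns average ranks to ties but omits the tie correction to the variance and the continuity correction, so treat it as illustrative; `mann_whitney_u` is a hypothetical name.

```python
import math
from statistics import NormalDist


def mann_whitney_u(sample_1, sample_2):
    """Return (U for sample_1, approximate two-tailed p-value)."""
    n1, n2 = len(sample_1), len(sample_2)
    combined = sorted([(v, 0) for v in sample_1] + [(v, 1) for v in sample_2])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sum_1 = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u1 = rank_sum_1 - n1 * (n1 + 1) / 2
    # Normal approximation (no tie or continuity correction in this sketch).
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p_value
```

U ranges from 0 (every value in sample_1 is below every value in sample_2) to n1 * n2 (the reverse); values near n1 * n2 / 2 indicate no tendency either way.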

P-Value Thresholds

Standard experiments: p < 0.05 (5% false positive rate)
Business-critical decisions (pricing, core checkout): p < 0.01
Two-tailed tests used by default (detects both positive and negative effects)

Power Analysis and Sample Size

Before launching, analysts specify:

Minimum detectable effect (MDE) — smallest improvement worth detecting (e.g., 2% conversion rate lift), stated explicitly as either a relative or an absolute change
Baseline conversion rate — from historical data
Statistical power — 80% standard (20% false negative rate)
Significance level — 0.05

The system computes the required sample size per variant and the estimated time to reach that sample size at current traffic volume.
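A minimal sketch of that calculation for a conversion-rate metric, using the standard normal-approximation formula n ≈ 2 * (z_alpha + z_beta)^2 * p*(1-p) / MDE^2 with an absolute MDE. The function names, the example baseline, and the traffic figure are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde_absolute, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-tailed test on a proportion."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = baseline_rate * (1 - baseline_rate)
    n = 2 * (z_alpha + z_beta) ** 2 * variance / mde_absolute ** 2
    return math.ceil(n)


def days_to_reach(n_per_variant, daily_users, n_variants=2):
    """Days of traffic needed, assuming users are split evenly across variants."""
    return math.ceil(n_per_variant * n_variants / daily_users)


# Hypothetical inputs: 10% baseline, 5% relative lift (0.5 percentage points absolute).
n = sample_size_per_variant(baseline_rate=0.10, mde_absolute=0.10 * 0.05)
print(n, days_to_reach(n, daily_users=10_000))
```

A relative MDE has to be converted to an absolute difference first, as in the example call; mixing the two up changes the answer by orders of magnitude.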

Sequential Testing / SPRT

Fixed-horizon tests such as the z-test and t-test assume the data are analyzed once, at the planned sample size. Checking them repeatedly mid-experiment and stopping at the first significant result inflates the false positive rate. The Sequential Probability Ratio Test (SPRT) is a valid early-stopping method:

Computes a likelihood ratio after each batch of data
Stops when the ratio crosses the upper boundary (reject the null hypothesis and declare an effect) or the lower boundary (accept the null hypothesis of no effect)
Maintains the overall false positive rate regardless of how many times results are checked
Enables faster decisions when effects are large, without inflating error rates
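The mechanics can be shown with Wald's classic SPRT on a single stream of binary outcomes, testing a baseline rate p0 against an alternative rate p1. This is a deliberately simplified, one-sample sketch that updates per observation rather than per batch; a production A/B system would more likely use a two-sample or mixture variant. The function name `sprt_bernoulli` and the default error rates are assumptions.

```python
import math


def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.20):
    """Run Wald's SPRT for H0: p = p0 against H1: p = p1.

    outcomes: iterable of 0/1 conversions in arrival order.
    Returns (decision, number_of_observations_used).
    """
    # Wald's approximate boundaries on the log-likelihood ratio.
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    llr = 0.0
    count = 0
    for converted in outcomes:
        count += 1
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject_null", count
        if llr <= lower:
            return "accept_null", count
    # Neither boundary crossed yet: keep collecting data.
    return "continue", count
```

The boundaries depend only on the chosen alpha and beta, not on how often the ratio is inspected, which is why checking after every observation does not inflate the error rate.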

Novelty Effect Detection

New features often see inflated engagement in week 1 due to novelty. The system tracks metrics by week-since-exposure. If week 1 shows a significant lift but weeks 2–4 do not, the result is flagged as a probable novelty effect rather than a real improvement.
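The flagging rule described here could be expressed roughly as follows. The inputs are per-week lift estimates and p-values, indexed by week since exposure; the function name `flag_novelty_effect` and the requirement that the week 1 lift be positive are illustrative assumptions.

```python
def flag_novelty_effect(weekly_lifts, weekly_p_values, alpha=0.05):
    """Return True when week 1 shows a significant lift but no later week does.

    weekly_lifts[0] and weekly_p_values[0] describe week 1 since exposure.
    """
    if len(weekly_p_values) < 2:
        # Not enough follow-up weeks to judge whether the lift persists.
        return False
    week_1_significant = weekly_lifts[0] > 0 and weekly_p_values[0] < alpha
    later_significant = any(p < alpha for p in weekly_p_values[1:])
    return week_1_significant and not later_significant
```

Later weeks usually contain fewer users than week 1, so a non-significant later result may reflect lower power rather than a vanished effect; comparing the lift estimates themselves alongside the flag helps avoid over-reading it.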

Segmented Analysis

After the top-level result, the pipeline automatically breaks down metrics by:

Device type (mobile vs desktop)
Country or region
User segment (new vs returning, power vs casual)
App version

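Segmented results help detect heterogeneous treatment effects: cases where the feature helps one segment but harms another. Because each segment is a separate test, segment-level p-values need a multiple-comparison correction (e.g., Benjamini-Hochberg) or a pre-registered list of segments before conclusions are drawn.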

Automated Alerts

Alert when statistical significance is reached (at the planned sample size, or when a sequential test boundary is crossed) — notify experiment owner to review results
Alert when experiment duration exceeds planned end date without a decision
Alert when a key guardrail metric (e.g., error rate, p99 latency) degrades in the treatment group — automatic pause recommendation
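The three alert rules can be sketched as one evaluation function. The experiment record layout, the alert names, and the relative-increase threshold for guardrails are all hypothetical; the guardrail check also assumes that a higher value is worse, as with error rate or latency.

```python
from datetime import date


def evaluate_alerts(experiment, today):
    """Return the list of alert names that fire for one experiment."""
    alerts = []
    # Rule 1: the primary metric has reached significance.
    if experiment["p_value"] < experiment["alpha"]:
        alerts.append("significance_reached")
    # Rule 2: the experiment is past its planned end date without a decision.
    if today > experiment["planned_end"] and not experiment["decided"]:
        alerts.append("overdue")
    # Rule 3: a guardrail metric is worse in treatment beyond its tolerance.
    for name, g in experiment["guardrails"].items():
        if g["treatment"] > g["control"] * (1 + g["max_relative_increase"]):
            alerts.append(f"guardrail_degraded:{name}")
    return alerts
```

A fuller version would run a statistical test on each guardrail rather than compare point estimates, so that noise alone does not trigger a pause recommendation.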

Summary

The experiment logging service combines event collection, metric computation, z-test/t-test significance testing, SPRT for valid early stopping, novelty effect detection, and segmented analysis — giving product teams the statistical confidence to ship or roll back features based on evidence rather than intuition.