Experiment Logging Service: Low-Level Design
The experiment logging service collects exposure and conversion events, computes metric values per variant, runs statistical tests to determine significance, and surfaces results in a dashboard. It is the analytical backbone of an experimentation platform.
Event Types
- Exposure event — logged when a user is assigned to a variant and encounters the treatment. Fields: user_id, experiment_id, variant, timestamp
- Conversion event — logged when a user completes a goal metric (purchase, signup, click, etc.). Fields: user_id, event_type, value (e.g., revenue amount), timestamp
Attribution: a conversion is attributed to an experiment if the user had an exposure event for that experiment before the conversion and the conversion falls within the attribution window (typically 7–30 days).
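The attribution rule can be sketched as a simple timestamp check. This is a minimal illustration, not the service's actual code: the function name `is_attributed` and the `window_days` parameter are hypothetical, and it assumes the window is measured forward from the exposure and that a conversion before the exposure is never attributed.

```python
from datetime import datetime, timedelta

def is_attributed(exposure_ts, conversion_ts, window_days=7):
    """Return True if the conversion happened at or after the exposure
    and no later than window_days after it (assumed semantics)."""
    delta = conversion_ts - exposure_ts
    return timedelta(0) <= delta <= timedelta(days=window_days)

exposure = datetime(2024, 1, 1, 12, 0)
print(is_attributed(exposure, datetime(2024, 1, 5, 9, 0)))    # True: 4 days after exposure
print(is_attributed(exposure, datetime(2024, 1, 10, 9, 0)))   # False: outside the 7-day window
print(is_attributed(exposure, datetime(2023, 12, 31, 9, 0)))  # False: conversion precedes exposure
```

In a warehouse setting the same check would typically be expressed as a join condition between the exposure and conversion tables.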
Metric Types
- Binary (proportion) — conversion rate: did the user convert? (0 or 1 per user)
- Continuous — revenue per user, session duration, page load time; can take any numeric value
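- Ratio — click-through rate (clicks / impressions); requires special handling because both the numerator and the denominator vary from user to user and are usually correlated, so the variance of the ratio cannot be computed as if it were a simple per-user average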
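One common way to handle ratio metrics is the delta method, which approximates the variance of a ratio of two per-user means. The source does not say which technique the service uses, so the sketch below is an assumption; the function name `ratio_metric_variance` is hypothetical.

```python
from statistics import mean, variance

def ratio_metric_variance(numerators, denominators):
    """Delta-method estimate of the ratio of means and its variance.
    Inputs are per-user totals (e.g., clicks and impressions per user)."""
    n = len(numerators)
    mu_n, mu_d = mean(numerators), mean(denominators)
    var_n, var_d = variance(numerators), variance(denominators)
    # Sample covariance between per-user numerator and denominator
    cov = sum((a - mu_n) * (b - mu_d)
              for a, b in zip(numerators, denominators)) / (n - 1)
    ratio = mu_n / mu_d
    var_ratio = (var_n - 2 * ratio * cov + ratio ** 2 * var_d) / (n * mu_d ** 2)
    return ratio, var_ratio

clicks = [1, 0, 3, 2]
impressions = [10, 20, 30, 40]
ratio, var_ratio = ratio_metric_variance(clicks, impressions)
print(ratio, var_ratio)
```

The covariance term is what a naive per-user calculation misses: when clicks and impressions move together, ignoring it overstates the variance of the ratio.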
Data Collection Pipeline
Events flow through: client SDK → Kafka topic → batch consumer → experiment metrics table in the warehouse. The pipeline runs hourly for most experiments, with a near-real-time mode (5-minute lag) for critical launches. The metrics table stores one row per user per experiment per day: (experiment_id, variant, user_id, date, conversions, revenue, exposures).
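The batch consumer's aggregation step could look roughly like the following. This is a sketch under assumptions: events are plain dictionaries, conversions have already been joined to an experiment and variant through attribution, and the `kind` field and the function name `aggregate_daily` are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime

def aggregate_daily(events):
    """Roll events up to one row per (experiment_id, variant, user_id, date)."""
    rows = defaultdict(lambda: {"conversions": 0, "revenue": 0.0, "exposures": 0})
    for e in events:
        key = (e["experiment_id"], e["variant"], e["user_id"], e["timestamp"].date())
        if e["kind"] == "exposure":
            rows[key]["exposures"] += 1
        elif e["kind"] == "conversion":
            rows[key]["conversions"] += 1
            rows[key]["revenue"] += e.get("value", 0.0)
    return dict(rows)

events = [
    {"kind": "exposure", "experiment_id": "exp1", "variant": "A",
     "user_id": "u1", "timestamp": datetime(2024, 1, 1, 9)},
    {"kind": "conversion", "experiment_id": "exp1", "variant": "A",
     "user_id": "u1", "timestamp": datetime(2024, 1, 1, 10), "value": 20.0},
    {"kind": "exposure", "experiment_id": "exp1", "variant": "A",
     "user_id": "u1", "timestamp": datetime(2024, 1, 2, 9)},
]
rows = aggregate_daily(events)
for key, row in rows.items():
    print(key, row)
```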
Metrics Computation
For each experiment and variant, the pipeline computes:
- Sample size (n) — unique users exposed
- Mean (μ) — average metric value per user
- Variance (σ²) — spread of metric values across users
- Standard error — σ / √n
These are stored in a results table and fed into the statistical test layer.
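The four summary statistics can be computed from a list of per-user metric values as in this minimal sketch. The function name `summarize` is hypothetical, and the sample variance (n - 1 denominator) is an assumption, since the source does not say which estimator is used.

```python
import math
from statistics import mean, variance

def summarize(values):
    """Per-variant summary: sample size, mean, sample variance, standard error."""
    n = len(values)
    mu = mean(values)
    var = variance(values)      # sample variance, divides by n - 1
    se = math.sqrt(var / n)     # equivalent to sigma / sqrt(n)
    return {"n": n, "mean": mu, "variance": var, "std_error": se}

# One value per exposed user: 1 if converted, 0 otherwise
stats = summarize([0, 1, 0, 1, 1, 0, 0, 1])
print(stats)
```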
Statistical Tests
Z-test for binary metrics (conversion rates):
p_pool = (conversions_1 + conversions_2) / (n_1 + n_2)
z = (p_1 - p_2) / sqrt(p_pool * (1 - p_pool) * (1/n_1 + 1/n_2))
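Here p_1 = conversions_1 / n_1 and p_2 = conversions_2 / n_2. The result is significant when |z| exceeds the critical value (1.96 for a significance level of 0.05, two-tailed).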
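The pooled two-proportion z-test above translates directly into code. The sketch below uses only the Python standard library; the function name `two_proportion_z_test` and the example counts are illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_1, n_1, conv_2, n_2):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_1 = conv_1 / n_1
    p_2 = conv_2 / n_2
    p_pool = (conv_1 + conv_2) / (n_1 + n_2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 12.0% vs 10.0% conversion with 1,000 users per variant
z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

With these counts z is about 1.43, which is below 1.96, so a two-point difference at this sample size would not be declared significant at the 0.05 level.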
T-test for continuous metrics:
t = (μ_1 - μ_2) / sqrt(σ_1²/n_1 + σ_2²/n_2)
Degrees of freedom computed via Welch's approximation. Used for revenue, latency, session duration.
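The t statistic and Welch's degrees of freedom can be computed from the stored summary statistics alone. This is a minimal sketch with a hypothetical function name; converting t to a p-value needs the Student's t distribution, which the Python standard library does not provide, so that step is left to a statistics library.

```python
import math

def welch_t(mean_1, var_1, n_1, mean_2, var_2, n_2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a = var_1 / n_1
    b = var_2 / n_2
    t = (mean_1 - mean_2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n_1 - 1) + b ** 2 / (n_2 - 1))
    return t, df

# Revenue per user: mean 10.5 (variance 4.0) vs mean 10.0 (variance 9.0), 100 users each
t, df = welch_t(10.5, 4.0, 100, 10.0, 9.0, 100)
print(f"t = {t:.3f}, df = {df:.1f}")
```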
Mann-Whitney U test — non-parametric alternative when metric distribution is heavily skewed (e.g., revenue with large outliers). Does not assume normality.
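For illustration, the U statistic can be computed from ranks with a few lines of standard-library Python. The sketch below assigns average ranks to ties and uses the normal approximation without a tie or continuity correction, which is a simplification that is only reasonable for larger samples; the function name `mann_whitney_u` is hypothetical.

```python
import math
from statistics import NormalDist

def mann_whitney_u(xs, ys):
    """Return (U for xs, approximate two-tailed p-value)."""
    combined = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over the run of values tied with combined[i]
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_1, n_2 = len(xs), len(ys)
    rank_sum_1 = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_1 = rank_sum_1 - n_1 * (n_1 + 1) / 2
    mu = n_1 * n_2 / 2
    sigma = math.sqrt(n_1 * n_2 * (n_1 + n_2 + 1) / 12)
    z = (u_1 - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_1, p_value

# The outlier (100) affects only its rank, not the size of the statistic
u, p_value = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 100])
print(u, p_value)
```

Because the test works on ranks, a single extreme revenue value counts the same as any other largest observation, which is why it is more robust to outliers than a t-test.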
P-Value Thresholds
- Standard experiments: p < 0.05 (5% false positive rate)
- Business-critical decisions (pricing, core checkout): p < 0.01
- Two-tailed tests used by default (detects both positive and negative effects)
Power Analysis and Sample Size
Before launching, analysts specify:
- Minimum detectable effect (MDE) — smallest improvement worth detecting (e.g., 2% conversion rate lift)
- Baseline conversion rate — from historical data
- Statistical power — 80% standard (20% false negative rate)
- Significance level — 0.05
The system computes required sample size per variant and estimated time to reach significance at current traffic volume.
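A standard approximation for the required sample size when comparing two proportions is sketched below. It assumes the MDE is an absolute difference in conversion rate (the source does not say whether the 2% example is absolute or relative), a two-tailed test, and equal traffic per variant; the function names and the `daily_users` parameter are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-proportion test."""
    p_1 = baseline
    p_2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p_1 * (1 - p_1) + p_2 * (1 - p_2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / mde_abs ** 2
    return math.ceil(n)

def days_to_reach(n_per_variant, daily_users, variants=2):
    """Days of traffic needed if daily_users are split evenly across variants."""
    return math.ceil(n_per_variant * variants / daily_users)

n = sample_size_per_variant(baseline=0.10, mde_abs=0.02)
days = days_to_reach(n, daily_users=1000)
print(n, days)
```

For a 10% baseline and a 2-point absolute MDE this gives a little over 3,800 users per variant, or about 8 days at 1,000 exposed users per day.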
Sequential Testing / SPRT
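Fixed-sample frequentist tests are not valid when results are checked repeatedly mid-experiment: every additional look is another chance to cross the significance threshold by luck, so the overall false positive rate rises above the nominal level. The Sequential Probability Ratio Test (SPRT) is a valid early-stopping method: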
- Computes a likelihood ratio after each batch of data
- Stops when the ratio crosses the upper boundary (declare a winner) or the lower boundary (declare no effect); otherwise data collection continues
- Maintains the overall false positive rate regardless of how many times results are checked
- Enables faster decisions when effects are large, without inflating error rates
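The stopping rule can be illustrated with Wald's classic SPRT for a single conversion rate, testing H0: p = p_0 against H1: p = p_1. This is a simplified sketch, not the service's method: a real A/B comparison needs a two-sample variant, and the function name, hypothesized rates, and return labels here are all hypothetical.

```python
import math

def sprt_bernoulli(observations, p_0, p_1, alpha=0.05, beta=0.20):
    """Wald's SPRT on a stream of 0/1 outcomes.
    Returns (decision, number of observations consumed)."""
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    llr = 0.0                              # running log-likelihood ratio
    for i, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p_1 / p_0)
        else:
            llr += math.log((1 - p_1) / (1 - p_0))
        if llr >= upper:
            return "accept_h1", i
        if llr <= lower:
            return "accept_h0", i
    return "continue", len(observations)

# A run of non-conversions pushes the ratio toward the lower boundary
print(sprt_bernoulli([0] * 50, p_0=0.10, p_1=0.20))  # ('accept_h0', 14)
print(sprt_bernoulli([1, 0], p_0=0.10, p_1=0.20))    # ('continue', 2)
```

The boundaries depend only on the chosen error rates alpha and beta, which is why the decision can be evaluated after every batch without inflating the false positive rate.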
Novelty Effect Detection
New features often see inflated engagement in week 1 due to novelty. The system tracks metrics by week-since-exposure. If week 1 shows a significant lift but weeks 2–4 do not, the result is flagged as a probable novelty effect rather than a real improvement.
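The flagging rule described above can be sketched as a check over per-week results. The input shape, a list of (lift, p_value) pairs ordered by week since exposure, and the function name `flag_novelty` are assumptions for illustration.

```python
def flag_novelty(weekly_results, alpha=0.05):
    """weekly_results[0] is week 1; each entry is (lift, p_value).
    Flag when week 1 shows a significant positive lift and no later week does."""
    if len(weekly_results) < 2:
        return False  # not enough history to judge

    def significant_lift(result):
        lift, p_value = result
        return lift > 0 and p_value < alpha

    first = significant_lift(weekly_results[0])
    later = any(significant_lift(r) for r in weekly_results[1:])
    return first and not later

# Week 1 lift is significant, weeks 2 to 4 are not
print(flag_novelty([(0.08, 0.01), (0.01, 0.40), (0.00, 0.70), (0.01, 0.50)]))  # True
# Lift persists into week 2
print(flag_novelty([(0.08, 0.01), (0.07, 0.02)]))                              # False
```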
Segmented Analysis
After the top-level result, the pipeline automatically breaks down metrics by:
- Device type (mobile vs desktop)
- Country or region
- User segment (new vs returning, power vs casual)
- App version
Segmented results help detect heterogeneous treatment effects — cases where the feature helps one segment but harms another.
Automated Alerts
- Alert when statistical significance is reached — notify experiment owner to review results
- Alert when experiment duration exceeds planned end date without a decision
- Alert when a key guardrail metric (e.g., error rate, p99 latency) degrades in the treatment group — automatic pause recommendation
Summary
The experiment logging service combines event collection, metric computation, z-test/t-test significance testing, SPRT for valid early stopping, novelty effect detection, and segmented analysis — giving product teams the statistical confidence to ship or roll back features based on evidence rather than intuition.