Experiment Logging Service Low-Level Design: Metric Collection, Statistical Analysis, and Results Dashboard


The experiment logging service collects exposure and conversion events, computes metric values per variant, runs statistical tests to determine significance, and surfaces results in a dashboard. It is the analytical backbone of an experimentation platform.

Event Types

  • Exposure event — logged when a user is assigned to a variant and encounters the treatment. Fields: user_id, experiment_id, variant, timestamp
  • Conversion event — logged when a user completes a goal metric (purchase, signup, click, etc.). Fields: user_id, event_type, value (e.g., revenue amount), timestamp

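Attribution: conversion events carry no experiment_id, so they are joined to exposure events on user_id. A conversion is attributed to an experiment if the user was exposed to that experiment and the conversion falls within the attribution window (typically 7–30 days) after the exposure.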
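The attribution rule can be sketched as a join on user_id followed by a window check. This is a minimal illustration, not the service's actual implementation: the function name, the choice to measure the window from the user's first exposure, and the default of 7 days are assumptions.

```python
from datetime import datetime, timedelta


def attribute_conversions(exposures, conversions, experiment_id, window_days=7):
    """Return the conversions attributable to one experiment.

    exposures:   dicts with user_id, experiment_id, variant, timestamp
    conversions: dicts with user_id, event_type, value, timestamp
    Assumption: the window starts at the user's FIRST exposure.
    """
    window = timedelta(days=window_days)

    # Keep the earliest exposure per user for this experiment.
    first_exposure = {}
    for e in exposures:
        if e["experiment_id"] != experiment_id:
            continue
        prev = first_exposure.get(e["user_id"])
        if prev is None or e["timestamp"] < prev["timestamp"]:
            first_exposure[e["user_id"]] = e

    attributed = []
    for c in conversions:
        exp = first_exposure.get(c["user_id"])
        if exp is None:
            continue  # user was never exposed to this experiment
        elapsed = c["timestamp"] - exp["timestamp"]
        # Only conversions at or after exposure, and inside the window, count.
        if timedelta(0) <= elapsed <= window:
            attributed.append(
                {**c, "experiment_id": experiment_id, "variant": exp["variant"]}
            )
    return attributed
```

Copying the variant onto each attributed conversion lets downstream aggregation group by variant without repeating the join.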

Metric Types

  • Binary (proportion) — conversion rate: did the user convert? (0 or 1 per user)
  • Continuous — revenue per user, session duration, page load time; can take any numeric value
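  • Ratio — click-through rate (clicks / impressions); requires special handling because both the numerator and the denominator vary per user and are usually correlated, and because the analysis unit (an impression) differs from the randomization unit (a user), so a naive per-impression variance understates the true uncertainty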
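The text does not say which special handling is applied to ratio metrics. One common choice is the delta method, which approximates the variance of a ratio of per-user means from the per-user variances and covariance. The sketch below assumes that approach; the function name and inputs are illustrative.

```python
import math
from statistics import mean


def ratio_metric_delta(numerators, denominators):
    """Ratio of means and its delta-method standard error.

    numerators, denominators: per-user totals (e.g. clicks and impressions),
    aligned by index so element i of each list belongs to the same user.
    """
    n = len(numerators)
    mu_n, mu_d = mean(numerators), mean(denominators)
    ratio = mu_n / mu_d

    # Sample variances and covariance across users.
    var_n = sum((x - mu_n) ** 2 for x in numerators) / (n - 1)
    var_d = sum((y - mu_d) ** 2 for y in denominators) / (n - 1)
    cov = sum(
        (x - mu_n) * (y - mu_d) for x, y in zip(numerators, denominators)
    ) / (n - 1)

    # First-order Taylor (delta method) approximation of Var(mu_n / mu_d).
    var_ratio = (var_n - 2 * ratio * cov + ratio ** 2 * var_d) / (n * mu_d ** 2)
    return ratio, math.sqrt(max(var_ratio, 0.0))
```

The covariance term is what the naive approach misses: when users with more impressions also click more, ignoring it misstates the standard error.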

Data Collection Pipeline

Events flow through: client SDK → Kafka topic → batch consumer → experiment metrics table in the warehouse. The pipeline runs hourly for most experiments, with a near-real-time mode (5-minute lag) for critical launches. The metrics table stores one row per user per experiment per day: (experiment_id, variant, user_id, date, conversions, revenue, exposures).
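The batch consumer's roll-up into one row per user per experiment per day can be sketched as below. This is a simplified in-memory version: it assumes conversions have already been attributed (so they carry experiment_id and variant), and the "kind" field used to distinguish event types is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_daily(events):
    """Roll events up to (experiment_id, variant, user_id, date) rows."""
    rows = defaultdict(lambda: {"conversions": 0, "revenue": 0.0, "exposures": 0})
    for ev in events:
        key = (
            ev["experiment_id"],
            ev["variant"],
            ev["user_id"],
            ev["timestamp"].date(),
        )
        row = rows[key]
        if ev["kind"] == "exposure":
            row["exposures"] += 1
        elif ev["kind"] == "conversion":
            row["conversions"] += 1
            row["revenue"] += ev.get("value", 0.0)
    return dict(rows)


# Example batch: one exposure and two conversions on different days.
batch = [
    {"kind": "exposure", "experiment_id": "e", "variant": "A",
     "user_id": "u", "timestamp": datetime(2024, 1, 1, 10)},
    {"kind": "conversion", "experiment_id": "e", "variant": "A",
     "user_id": "u", "timestamp": datetime(2024, 1, 1, 11), "value": 9.5},
    {"kind": "conversion", "experiment_id": "e", "variant": "A",
     "user_id": "u", "timestamp": datetime(2024, 1, 2, 1), "value": 1.0},
]
daily_rows = aggregate_daily(batch)
```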

Metrics Computation

For each experiment and variant, the pipeline computes:

  • Sample size (n) — unique users exposed
  • Mean (μ) — average metric value per user
  • Variance (σ²) — spread of metric values across users
  • Standard error — σ / √n

These are stored in a results table and fed into the statistical test layer.
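As a rough illustration, the four summary statistics for one variant can be computed from a list of per-user metric values. The function name and return shape are assumptions; the variance here is the sample variance (n - 1 denominator).

```python
import math
from statistics import mean, variance


def summarize(values):
    """Per-variant summary: n, mean, sample variance, standard error."""
    n = len(values)
    mu = mean(values)
    var = variance(values) if n > 1 else 0.0  # sample variance needs n >= 2
    return {
        "n": n,
        "mean": mu,
        "variance": var,
        "std_error": math.sqrt(var / n),  # sigma / sqrt(n)
    }
```

For a binary metric the values are 0 or 1 per user, so the mean is the conversion rate.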

Statistical Tests

Z-test for binary metrics (conversion rates):

p_pool = (conversions_1 + conversions_2) / (n_1 + n_2)
z = (p_1 - p_2) / sqrt(p_pool * (1 - p_pool) * (1/n_1 + 1/n_2))

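Here p_i = conversions_i / n_i is the observed conversion rate in variant i. Reject the null hypothesis when |z| exceeds the critical value (1.96 for a two-tailed test at α = 0.05).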
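The pooled two-proportion z-test translates directly into code. This sketch uses only the standard library; the function name is illustrative, and the p-value is two-tailed.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conversions_1, n_1, conversions_2, n_2):
    """Pooled z-test for a difference in conversion rates."""
    p_1 = conversions_1 / n_1
    p_2 = conversions_2 / n_2
    p_pool = (conversions_1 + conversions_2) / (n_1 + n_2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    # Two-tailed p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
significant = p_value < 0.05
```

With 120 versus 100 conversions out of 1,000 users each, |z| stays below 1.96, so the difference is not significant at the 0.05 level.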

T-test for continuous metrics:

t = (μ_1 - μ_2) / sqrt(σ_1²/n_1 + σ_2²/n_2)

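This is Welch's t-test, which does not assume equal variances across variants; the degrees of freedom are computed with the Welch–Satterthwaite approximation. Used for revenue, latency, and session duration.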
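A minimal sketch of the t statistic and Welch–Satterthwaite degrees of freedom follows. It stops short of the p-value because the Python standard library has no Student's t distribution; the function name is illustrative.

```python
import math
from statistics import mean, variance


def welch_t(sample_1, sample_2):
    """Welch's t statistic and approximate degrees of freedom."""
    n_1, n_2 = len(sample_1), len(sample_2)
    # Squared standard errors of each sample mean.
    v_1 = variance(sample_1) / n_1
    v_2 = variance(sample_2) / n_2
    t = (mean(sample_1) - mean(sample_2)) / math.sqrt(v_1 + v_2)
    # Welch–Satterthwaite approximation.
    df = (v_1 + v_2) ** 2 / (v_1 ** 2 / (n_1 - 1) + v_2 ** 2 / (n_2 - 1))
    return t, df
```

In practice a library routine such as scipy.stats.ttest_ind with equal_var=False performs the same test and also returns the p-value.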

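Mann-Whitney U test: a non-parametric alternative for heavily skewed metrics (e.g., revenue with large outliers). It does not assume normality, but it compares the rank ordering of the two samples rather than their means, so a significant result does not by itself imply a difference in mean revenue.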
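To make the rank-based idea concrete, the sketch below computes U by counting pairwise wins and uses a normal approximation for the p-value. It is deliberately simplified: it is O(n_1 * n_2), applies no tie correction to the variance, and applies no continuity correction, all of which a production implementation would handle.

```python
import math
from statistics import NormalDist


def mann_whitney_u(sample_1, sample_2):
    """U statistic for sample_1 and a two-tailed normal-approximation p-value."""
    n_1, n_2 = len(sample_1), len(sample_2)
    # Each pair contributes 1 if x > y, 0.5 if tied, 0 otherwise.
    u_1 = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in sample_1
        for y in sample_2
    )
    mean_u = n_1 * n_2 / 2
    sd_u = math.sqrt(n_1 * n_2 * (n_1 + n_2 + 1) / 12)  # no tie correction
    z = (u_1 - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_1, p_value
```

Revenue data typically contains many ties at zero, so for real analyses a library function with tie handling, such as scipy.stats.mannwhitneyu, is the safer choice.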

P-Value Thresholds

  • Standard experiments: p < 0.05 (5% false positive rate)
  • Business-critical decisions (pricing, core checkout): p < 0.01
  • Two-tailed tests used by default (detects both positive and negative effects)

Power Analysis and Sample Size

Before launching, analysts specify:

  • Minimum detectable effect (MDE) — smallest improvement worth detecting (e.g., 2% conversion rate lift)
  • Baseline conversion rate — from historical data
  • Statistical power — 80% standard (20% false negative rate)
  • Significance level — 0.05

The system computes the required sample size per variant and the estimated time to reach that sample size at current traffic volume.
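For a binary metric, the calculation can be approximated with the standard normal-approximation formula for comparing two proportions. The sketch below assumes the MDE is a relative lift over the baseline and that traffic is split evenly across variants; both function names are illustrative.

```python
import math
from statistics import NormalDist


def required_sample_size(baseline, relative_mde, alpha=0.05, power=0.80):
    """Users needed per variant to detect a relative lift (two-tailed)."""
    p_1 = baseline
    p_2 = baseline * (1 + relative_mde)  # assumption: MDE is relative
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    var_sum = p_1 * (1 - p_1) + p_2 * (1 - p_2)
    return math.ceil((z_alpha + z_beta) ** 2 * var_sum / (p_2 - p_1) ** 2)


def estimated_days(n_per_variant, daily_exposed_users, num_variants=2):
    """Days needed to fill every variant, assuming an even traffic split."""
    return math.ceil(n_per_variant * num_variants / daily_exposed_users)


n_needed = required_sample_size(baseline=0.10, relative_mde=0.10)
days_needed = estimated_days(n_needed, daily_exposed_users=5000)
```

Halving the MDE roughly quadruples the required sample size, because the effect size enters the formula squared.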

Sequential Testing / SPRT

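Fixed-horizon frequentist tests are not valid when results are checked repeatedly mid-experiment: each additional look is another chance to cross the significance threshold by chance, so repeated testing inflates the false positive rate. The Sequential Probability Ratio Test (SPRT) is an early-stopping method designed to remain valid under continuous monitoring: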

  • Computes a likelihood ratio after each batch of data
  • Stops when the ratio crosses upper boundary (declare winner) or lower boundary (declare no effect)
  • Maintains the overall false positive rate regardless of how many times results are checked
  • Enables faster decisions when effects are large, without inflating error rates
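The stopping rule can be illustrated with Wald's classic SPRT for a single Bernoulli stream. This is a simplification of the two-variant case: it tests whether the treatment conversion rate is p0 (baseline) or p1 (baseline plus the effect of interest), both of which are hypothetical inputs, and it updates after every observation rather than per batch.

```python
import math


def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: p = p0 versus H1: p = p1 on 0/1 outcomes.

    Returns (decision, observations_used).
    """
    upper = math.log((1 - beta) / alpha)  # cross it: reject H0
    lower = math.log(beta / (1 - alpha))  # cross it: accept H0
    llr = 0.0  # cumulative log-likelihood ratio
    for i, x in enumerate(outcomes, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject_null", i
        if llr <= lower:
            return "accept_null", i
    return "continue", len(outcomes)
```

The boundaries depend only on alpha and beta, which is why checking after every observation does not inflate the error rates.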

Novelty Effect Detection

New features often see inflated engagement in week 1 due to novelty. The system tracks metrics by week-since-exposure. If week 1 shows a significant lift but weeks 2–4 do not, the result is flagged as a probable novelty effect rather than a real improvement.
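The flagging rule can be expressed as a small check over per-week results. The input shape (week number mapped to a lift and a p-value) and the function name are assumptions made for illustration.

```python
def is_probable_novelty(weekly_results, alpha=0.05):
    """Flag a result as a probable novelty effect.

    weekly_results: dict mapping week-since-exposure (1-based) to a
    (lift, p_value) tuple computed for users in that week.
    """

    def significant_lift(result):
        lift, p_value = result
        return lift > 0 and p_value < alpha

    week_1 = weekly_results.get(1)
    later = [r for week, r in weekly_results.items() if 2 <= week <= 4]
    if week_1 is None or not later:
        return False  # not enough data to judge
    # Significant in week 1, but in none of weeks 2 to 4.
    return significant_lift(week_1) and not any(significant_lift(r) for r in later)
```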

Segmented Analysis

After the top-level result, the pipeline automatically breaks down metrics by:

  • Device type (mobile vs desktop)
  • Country or region
  • User segment (new vs returning, power vs casual)
  • App version

Segmented results help detect heterogeneous treatment effects — cases where the feature helps one segment but harms another.
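A segment breakdown for a binary metric might look like the sketch below, which returns the absolute difference in conversion rate per segment. The row shape and the variant labels "control" and "treatment" are hypothetical.

```python
from collections import defaultdict


def lift_by_segment(rows, segment_key):
    """Treatment minus control conversion rate for each segment value.

    rows: one dict per user with "variant", "converted" (0 or 1),
    and the segment attribute named by segment_key.
    """
    # segment -> variant -> [users, conversions]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in rows:
        cell = counts[r[segment_key]][r["variant"]]
        cell[0] += 1
        cell[1] += r["converted"]

    lifts = {}
    for segment, by_variant in counts.items():
        c_users, c_conv = by_variant["control"]
        t_users, t_conv = by_variant["treatment"]
        if c_users == 0 or t_users == 0:
            continue  # segment missing one arm; no comparison possible
        lifts[segment] = t_conv / t_users - c_conv / c_users
    return lifts
```

Segments whose lifts have opposite signs are candidates for a heterogeneous treatment effect, though each one still needs its own significance test before drawing conclusions.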

Automated Alerts

  • Alert when statistical significance is reached — notify experiment owner to review results
  • Alert when experiment duration exceeds planned end date without a decision
  • Alert when a key guardrail metric (e.g., error rate, p99 latency) degrades in the treatment group — automatic pause recommendation
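The three alert rules can be sketched as one evaluation function. The alert names, the argument list, and the relative-increase threshold for guardrails are all assumptions; the sketch also assumes that a higher guardrail value is worse, as with error rate or latency.

```python
from datetime import date


def evaluate_alerts(p_value, alpha, today, planned_end, decision_made, guardrails):
    """Return the list of alerts that should fire for one experiment.

    guardrails: dict mapping metric name to
    (control_value, treatment_value, max_relative_increase).
    """
    alerts = []
    if p_value < alpha:
        alerts.append("significance_reached")
    if today > planned_end and not decision_made:
        alerts.append("overdue_without_decision")
    for name, (control, treatment, max_rel) in guardrails.items():
        # Relative degradation of the treatment group versus control.
        if control > 0 and (treatment - control) / control > max_rel:
            alerts.append(f"guardrail_degraded:{name}")
    return alerts


fired = evaluate_alerts(
    p_value=0.03,
    alpha=0.05,
    today=date(2024, 2, 1),
    planned_end=date(2024, 1, 31),
    decision_made=False,
    guardrails={"p99_latency_ms": (200.0, 230.0, 0.10)},
)
```

A fixed threshold keeps the example short; a real guardrail check would more likely run a significance test on the degradation before recommending a pause.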

Summary

The experiment logging service combines event collection, metric computation, z-test/t-test significance testing, SPRT for valid early stopping, novelty effect detection, and segmented analysis — giving product teams the statistical confidence to ship or roll back features based on evidence rather than intuition.
