Experiment Logging Service Low-Level Design: Metric Collection, Statistical Analysis, and Results Dashboard

The experiment logging service collects exposure and conversion events, computes metric values per variant, runs statistical tests to determine significance, and surfaces results in a dashboard. It is the analytical backbone of an experimentation platform.

Event Types

  • Exposure event — logged when a user is assigned to a variant and encounters the treatment. Fields: user_id, experiment_id, variant, timestamp
  • Conversion event — logged when a user completes a goal metric (purchase, signup, click, etc.). Fields: user_id, event_type, value (e.g., revenue amount), timestamp

Attribution: a conversion is attributed to an experiment if the user had an exposure event for that experiment within the attribution window (typically 7–30 days).

Metric Types

  • Binary (proportion) — conversion rate: did the user convert? (0 or 1 per user)
  • Continuous — revenue per user, session duration, page load time; can take any numeric value
  • Ratio — click-through rate (clicks / impressions); requires special handling because both the numerator and denominator vary per user, so per-user variance formulas do not apply directly (the delta method is commonly used to estimate variance)

Data Collection Pipeline

Events flow through: client SDK → Kafka topic → batch consumer → experiment metrics table in the warehouse. The pipeline runs hourly for most experiments, with a near-real-time mode (5-minute lag) for critical launches. The metrics table stores one row per user per experiment per day: (experiment_id, variant, user_id, date, conversions, revenue, exposures).

Metrics Computation

For each experiment and variant, the pipeline computes:

  • Sample size (n) — unique users exposed
  • Mean (μ) — average metric value per user
  • Variance (σ²) — spread of metric values across users
  • Standard error — σ / √n

These are stored in a results table and fed into the statistical test layer.
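The per-variant computation can be sketched in pure Python (a production pipeline would compute this in SQL or a dataframe engine over the metrics table; the function name is illustrative):

```python
import math

def summarize(values):
    """Per-variant summary: sample size, mean, variance, standard error.

    `values` holds one metric value per exposed user (0/1 for binary
    metrics). Uses the sample variance (n - 1 denominator), which is
    what the downstream t-test expects.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    se = math.sqrt(var / n)  # standard error = sigma / sqrt(n)
    return {"n": n, "mean": mean, "variance": var, "std_error": se}
```
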

Statistical Tests

Z-test for binary metrics (conversion rates):

p_pool = (conversions_1 + conversions_2) / (n_1 + n_2)
z = (p_1 - p_2) / sqrt(p_pool * (1 - p_pool) * (1/n_1 + 1/n_2))

Compare |z| to the critical value (1.96 for α = 0.05, two-tailed).
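A stdlib-only sketch of the pooled two-proportion z-test described above (the two-tailed p-value is derived from the standard normal CDF via `math.erf`):

```python
import math

def two_proportion_z(conversions_1, n_1, conversions_2, n_2):
    """Pooled two-proportion z-test for conversion rates.

    Returns (z, two-tailed p-value). Mirrors the formulas in the text:
    p_pool is the combined rate, and the standard error uses the pooled
    variance under the null hypothesis of no difference.
    """
    p_1, p_2 = conversions_1 / n_1, conversions_2 / n_2
    p_pool = (conversions_1 + conversions_2) / (n_1 + n_2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    # Standard normal CDF via erf; two-tailed p-value.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```
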

T-test for continuous metrics:

t = (μ_1 - μ_2) / sqrt(σ_1²/n_1 + σ_2²/n_2)

Degrees of freedom computed via Welch's approximation. Used for revenue, latency, session duration.
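Welch's statistic and degrees of freedom can be sketched as follows (in practice `scipy.stats.ttest_ind(..., equal_var=False)` is the usual choice; this function name is illustrative):

```python
import math

def welch_t(mean_1, var_1, n_1, mean_2, var_2, n_2):
    """Welch's t-statistic and degrees of freedom.

    Makes no equal-variance assumption, which is why it suits revenue
    and latency metrics whose spreads differ between variants.
    """
    se_sq_1, se_sq_2 = var_1 / n_1, var_2 / n_2
    t = (mean_1 - mean_2) / math.sqrt(se_sq_1 + se_sq_2)
    # Welch-Satterthwaite approximation for degrees of freedom.
    df = (se_sq_1 + se_sq_2) ** 2 / (
        se_sq_1 ** 2 / (n_1 - 1) + se_sq_2 ** 2 / (n_2 - 1)
    )
    return t, df
```
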

Mann-Whitney U test — non-parametric alternative when metric distribution is heavily skewed (e.g., revenue with large outliers). Does not assume normality.
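The U statistic itself is simple: it counts pairwise "wins" between the two samples, so a single huge revenue outlier counts the same as a modest win. A naive O(n·m) sketch (a real pipeline would use `scipy.stats.mannwhitneyu`, which also returns a p-value):

```python
def mann_whitney_u(a, b):
    """U statistic for sample `a` vs sample `b` (ties count 0.5).

    Rank-based, so it does not assume normality and is robust to the
    heavy-tailed distributions common in revenue metrics.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```
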

P-Value Thresholds

  • Standard experiments: p < 0.05 (5% false positive rate)
  • Business-critical decisions (pricing, core checkout): p < 0.01
  • Two-tailed tests used by default (detects both positive and negative effects)

Power Analysis and Sample Size

Before launching, analysts specify:

  • Minimum detectable effect (MDE) — smallest improvement worth detecting (e.g., 2% conversion rate lift)
  • Baseline conversion rate — from historical data
  • Statistical power — 80% standard (20% false negative rate)
  • Significance level — 0.05

The system computes required sample size per variant and estimated time to reach significance at current traffic volume.
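A sketch of the sample-size computation for a proportion metric, using the common approximation n ≈ 2·(z_α/2 + z_β)²·p(1−p) / MDE². The inverse-normal helper is a stdlib-only stand-in for `scipy.stats.norm.ppf`; function names are illustrative:

```python
import math

def required_sample_size(baseline, mde, alpha=0.05, power=0.8):
    """Required users per variant for a two-proportion test.

    `baseline` is the control conversion rate and `mde` the absolute
    lift to detect; variance p(1-p) is evaluated at the baseline rate.
    """
    def z_quantile(q):
        # Invert the standard normal CDF by bisection on erf.
        lo, hi = -10.0, 10.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < q:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    z_a = z_quantile(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = z_quantile(power)           # ~0.84 for 80% power
    p = baseline
    return math.ceil(2 * (z_a + z_b) ** 2 * p * (1 - p) / mde ** 2)
```

Note the MDE appears squared in the denominator: halving the detectable effect quadruples the required sample size.
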

Sequential Testing / SPRT

Standard frequentist tests are not valid for peeking at results mid-experiment — repeated testing inflates the false positive rate. The Sequential Probability Ratio Test (SPRT) is a valid early-stopping method:

  • Computes a likelihood ratio after each batch of data
  • Stops when the ratio crosses upper boundary (declare winner) or lower boundary (declare no effect)
  • Maintains the overall false positive rate regardless of how many times results are checked
  • Enables faster decisions when effects are large, without inflating error rates
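The boundary-crossing loop above can be sketched with Wald's classic SPRT for a Bernoulli metric. This version tests a fixed null rate p0 against an alternative p1; a real A/B system would compare treatment against control, so treat this as a simplified illustration:

```python
import math

def sprt(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT: H0 rate p0 vs H1 rate p1 over a 0/1 stream.

    Accumulates the log-likelihood ratio per observation and stops as
    soon as it crosses a boundary. Returns "accept_h1", "accept_h0",
    or "continue" if neither boundary is reached.
    """
    upper = math.log((1 - beta) / alpha)   # cross -> declare effect
    lower = math.log(beta / (1 - alpha))   # cross -> declare no effect
    llr = 0.0
    for x in observations:                 # x is 0 or 1
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1"
        if llr <= lower:
            return "accept_h0"
    return "continue"
```

Because each observation moves the log-likelihood ratio by a fixed increment, large true effects cross a boundary quickly, which is exactly the "faster decisions when effects are large" property noted above.
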

Novelty Effect Detection

New features often see inflated engagement in week 1 due to novelty. The system tracks metrics by week-since-exposure. If week 1 shows a significant lift but weeks 2–4 do not, the result is flagged as a probable novelty effect rather than a real improvement.
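The flagging rule reduces to a one-liner over per-week significance results. A hypothetical helper mirroring the rule above (name and input shape are assumptions):

```python
def probable_novelty_effect(significant_by_week):
    """True if week 1 shows a significant lift but weeks 2-4 do not.

    `significant_by_week` holds one boolean per week-since-exposure,
    week 1 first.
    """
    return bool(significant_by_week[0]) and not any(significant_by_week[1:4])
```
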

Segmented Analysis

After the top-level result, the pipeline automatically breaks down metrics by:

  • Device type (mobile vs desktop)
  • Country or region
  • User segment (new vs returning, power vs casual)
  • App version

Segmented results help detect heterogeneous treatment effects — cases where the feature helps one segment but harms another.

Automated Alerts

  • Alert when statistical significance is reached — notify experiment owner to review results
  • Alert when experiment duration exceeds planned end date without a decision
  • Alert when a key guardrail metric (e.g., error rate, p99 latency) degrades in the treatment group — automatic pause recommendation

Summary

The experiment logging service combines event collection, metric computation, z-test/t-test significance testing, SPRT for valid early stopping, novelty effect detection, and segmented analysis — giving product teams the statistical confidence to ship or roll back features based on evidence rather than intuition.
