Low Level Design: Experiment Logging and Analysis Service

An experiment logging and analysis service captures assignment and metric events from running A/B tests, aggregates them into per-variant statistics, and surfaces results through a reporting interface. The key design challenge is handling high-volume event ingestion without blocking on analysis, while keeping results fresh enough to be actionable.

Core Event Schemas

Two event types drive the system. Assignment events record when a user enters an experiment:

assignment_event {
  user_id       BIGINT,
  experiment_id VARCHAR(64),
  variant       VARCHAR(32),   -- e.g. "control", "treatment_a"
  assigned_at   TIMESTAMP,
  context       JSONB          -- device, platform, geo, session_id
}

Metric events record measurable user actions after assignment:

metric_event {
  user_id       BIGINT,
  experiment_id VARCHAR(64),
  event_type    VARCHAR(64),   -- e.g. "purchase", "click", "session_start"
  value         DECIMAL(18,4), -- revenue, duration, or 1.0 for binary events
  timestamp     TIMESTAMP
}

Including experiment_id on metric events allows the pipeline to join them with assignment records without requiring a lookup at query time.

Event Ingestion Pipeline

Both event types are published to Kafka by client-side and server-side producers. Topic partitioning by experiment_id ensures that all events for a single experiment land on the same partition and are therefore processed in order by a single consumer instance, which simplifies stateful aggregation.

Kafka topics:
  experiment.assignments  (partitioned by experiment_id)
  experiment.metrics      (partitioned by experiment_id)

Consumers:
  AssignmentConsumer  -> writes to assignments table in data warehouse
  MetricConsumer      -> writes to metrics table in data warehouse

The data warehouse (Redshift, BigQuery, or ClickHouse depending on scale) is the source of truth. Raw events are append-only and never updated.
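The partitioning scheme above can be sketched as follows. Kafka's default partitioner hashes the message key with murmur2; sha256 is used here as a stdlib stand-in with the property that matters: the same experiment_id always maps to the same partition.

```python
import hashlib

def partition_for(experiment_id: str, num_partitions: int) -> int:
    """Map an experiment_id to a stable partition index.

    Kafka's default partitioner uses murmur2 on the message key;
    sha256 is a stdlib stand-in that preserves the key property:
    the same key always lands on the same partition.
    """
    digest = hashlib.sha256(experiment_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# All events for one experiment map to a single partition.
p = partition_for("checkout_button_color", 12)
assert p == partition_for("checkout_button_color", 12)
```

Because the mapping is a pure function of the key, adding producers never changes which partition an experiment's events land on; only changing num_partitions would reshuffle keys.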

Pre-Computed Metrics

Hourly aggregation jobs compute per-experiment, per-variant metrics and write results to a variant_metrics table:

variant_metrics (
  experiment_id VARCHAR(64),
  variant       VARCHAR(32),
  metric_name   VARCHAR(64),
  computed_at   TIMESTAMP,
  user_count    BIGINT,
  event_count   BIGINT,
  total_value   DECIMAL(18,4),
  conversion_rate DECIMAL(8,6),
  mean_value    DECIMAL(18,4),
  variance      DECIMAL(18,4),
  PRIMARY KEY (experiment_id, variant, metric_name, computed_at)
)

Storing variance alongside mean enables online statistical tests without re-scanning raw events.
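As a sketch of why stored aggregates suffice: Welch's t-statistic and its degrees of freedom depend only on each variant's mean, variance, and user count, all of which live in variant_metrics.

```python
import math

def welch_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Welch's t-statistic and approximate degrees of freedom computed
    from stored per-variant aggregates -- no raw-event scan required."""
    se2_a = var_a / n_a  # squared standard error, variant A
    se2_b = var_b / n_b  # squared standard error, variant B
    t = (mean_b - mean_a) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df
```

With equal variances and sample sizes this reduces to the classic two-sample t-test; the Welch form simply avoids assuming the variances match.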

Funnel Analysis

Funnel analysis measures sequential event completion rates per variant. A funnel is defined as an ordered list of event_type values. The analysis job computes, for each variant, how many users completed each step given that they completed the previous step:

Funnel: [session_start, product_view, add_to_cart, purchase]

Control:
  step 1: 10,000 users
  step 2:  6,200 users (62% from step 1)
  step 3:  2,800 users (45% from step 2)
  step 4:    840 users (30% from step 3)

Treatment A:
  step 1: 10,100 users
  step 2:  6,700 users (66% from step 1)  <-- lift here
  step 3:  3,100 users (46% from step 2)
  step 4:    930 users (30% from step 3)

Events must occur in order and within a configurable attribution window (e.g., 7 days from assignment).
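A minimal sketch of the step-counting logic, assuming per-user event lists have already been joined to assignment times (the data shapes here are illustrative, not the pipeline's actual types):

```python
from datetime import datetime, timedelta

def funnel_counts(steps, user_events, window=timedelta(days=7)):
    """Count users completing each funnel step, in order, within the
    attribution window measured from their assignment time.

    user_events: {user_id: (assigned_at, [(timestamp, event_type), ...])}
    Returns one user count per funnel step.
    """
    counts = [0] * len(steps)
    for assigned_at, events in user_events.values():
        deadline = assigned_at + window
        step = 0  # next funnel step this user must complete
        for ts, etype in sorted(events):
            if ts > deadline:
                break  # events past the attribution window don't count
            if step < len(steps) and etype == steps[step]:
                counts[step] += 1
                step += 1
    return counts
```

Events that arrive out of funnel order are simply skipped rather than advancing the user, which enforces the "must occur in order" rule above.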

Cohort Comparison Over Time

The cohort comparison view slices variant_metrics by computed_at to show how conversion rate and mean value evolve over the experiment’s lifetime. This surfaces novelty effects (early spike that decays) and helps determine when results have stabilized.

Statistical Significance Reporting

For each metric and variant pair, the reporting service computes:

  • Two-sample t-test using stored mean and variance to produce a t-statistic and p-value
  • 95% confidence interval on the difference in means
  • Effect size (Cohen’s d) to distinguish statistical significance from practical significance
  • Minimum detectable effect reminder based on pre-experiment power analysis

Results are flagged as significant, trending (p < 0.10), or not significant. The dashboard warns when an experiment is stopped too early based on sample size targets set at launch.
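The flagging rule can be sketched as below, using a two-sided p-value from the normal approximation (reasonable at experiment-scale sample sizes, where the t-distribution is close to normal):

```python
import math

def classify(t_stat: float) -> str:
    """Flag a result from its test statistic using a two-sided p-value
    under the normal approximation: p = erfc(|t| / sqrt(2))."""
    p = math.erfc(abs(t_stat) / math.sqrt(2.0))
    if p < 0.05:
        return "significant"
    if p < 0.10:
        return "trending"
    return "not significant"
```

For small samples the exact t-distribution (with the Welch degrees of freedom) should be used instead; the thresholds are the ones stated above.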

Data Pipeline Schedule

Aggregation jobs run hourly via a job scheduler (Airflow or similar). Each job processes only events since the last successful run using a high-water-mark stored in a pipeline_state table. Failed jobs retry with exponential backoff; downstream metrics are not written until the job succeeds, preventing partial results.
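The high-water-mark pattern can be sketched in a few lines; pipeline_state stands in for the pipeline_state table, and aggregate for the job body:

```python
def run_aggregation(pipeline_state, events, job_name, aggregate):
    """Process only events newer than the stored high-water mark,
    advancing the mark only after the aggregation succeeds."""
    hwm = pipeline_state.get(job_name)  # last processed timestamp, or None
    batch = [e for e in events if hwm is None or e["timestamp"] > hwm]
    if not batch:
        return []
    # If aggregate() raises, the mark is NOT advanced, so a retry
    # reprocesses the same batch -- no partial results are written.
    results = aggregate(batch)
    pipeline_state[job_name] = max(e["timestamp"] for e in batch)
    return results
```

Advancing the mark only after success is what makes retries safe: a failed run leaves the state untouched and the next attempt sees the same batch.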

Experiment Dashboard

The dashboard API serves pre-computed results from variant_metrics and runs statistical tests at query time (cheap, since variance is stored). Drill-down views allow filtering by user segment (from the assignment context field), time range, and metric type. Experiment metadata (hypothesis, owner, start/end date, target sample size) is stored in an experiments table and joined at render time.

Frequently Asked Questions

What is an experiment logging and analysis service?

An experiment logging and analysis service records which variant of an A/B test (or multivariate test) each user was assigned to, along with downstream metric events (clicks, conversions, revenue) tied to those assignments. It provides the data pipeline and statistical tooling needed to determine whether observed metric differences between variants are real or due to chance. Core components include an assignment service (deterministic bucketing of users into variants), an event logging pipeline (high-throughput, low-latency ingestion of assignment and conversion events), an analysis engine (aggregation, significance testing, confidence interval computation), and a results dashboard. The service must handle thousands of concurrent experiments across a large user base without impacting application performance.

How do you log experiment assignments without impacting application performance?

Assignment logging must be asynchronous and non-blocking. The assignment decision itself is made in-process using a deterministic hash function (e.g., murmur3(user_id + experiment_id) % 100) with no external call, so it adds sub-millisecond latency. The assignment event is written to an in-process buffer and flushed asynchronously to a Kafka topic by a background thread. Kafka provides durable, high-throughput ingestion with decoupling from the application; consumers (Flink, Spark Streaming, or batch jobs) process events into an analytical store (ClickHouse, BigQuery, Redshift) without touching the application hot path. Event batching, local buffering with bounded queues, and graceful degradation (drop logs if the buffer is full rather than blocking the request) are essential to ensure zero application impact.
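The deterministic bucketing described above can be sketched as follows. The text names murmur3; sha256 is used here purely as a stdlib stand-in, and the variant names and 50/50 split are illustrative:

```python
import hashlib

def assign_variant(user_id: int, experiment_id: str,
                   variants=(("control", 50), ("treatment_a", 50))) -> str:
    """Deterministic in-process bucketing: the same user always gets the
    same variant for a given experiment, with no external call."""
    key = f"{user_id}:{experiment_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    threshold = 0
    for name, weight in variants:
        threshold += weight
        if bucket < threshold:
            return name
    return variants[-1][0]  # guard against rounding in the weights
```

Hashing user_id together with experiment_id (rather than user_id alone) ensures a user's bucket in one experiment is independent of their bucket in every other experiment.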

How do you compute statistical significance from experiment event logs?

For binary metrics (conversion rate), use a two-proportion z-test or chi-squared test: compute the conversion rate for control and treatment, calculate the pooled standard error, and derive a p-value and confidence interval. For continuous metrics (revenue per user, latency), use Welch’s t-test which does not assume equal variances. Key considerations: (1) Sample ratio mismatch (SRM) detection — verify that actual assignment counts match expected proportions before trusting results (a chi-squared test on counts); (2) Multiple comparisons correction — apply Bonferroni or Benjamini-Hochberg FDR correction when testing many metrics simultaneously; (3) Sequential testing / always-valid inference — if analysts peek at results before the planned end date, use methods like sequential probability ratio tests (SPRT) or mSPRT to control Type I error under continuous monitoring; (4) Pre-register primary metrics before launch to prevent p-hacking.
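The SRM check in point (1) can be sketched for a two-variant experiment. With one degree of freedom, the chi-squared survival function has the closed form chi2_sf(x, 1) = erfc(sqrt(x / 2)), so no stats library is needed; the alpha of 0.001 is a commonly used strict threshold, shown here as an assumption:

```python
import math

def srm_check(observed, expected_ratios, alpha=0.001):
    """Sample ratio mismatch check for a two-variant experiment:
    chi-squared test on assignment counts (df = 1)."""
    total = sum(observed)
    chi2 = sum(
        (o - total * r) ** 2 / (total * r)
        for o, r in zip(observed, expected_ratios)
    )
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return p, p < alpha  # True means SRM detected: do not trust results
```

A 5200/4800 split on an intended 50/50 experiment already fails this check, which is the point: even small assignment skews invalidate downstream comparisons.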

How do you detect experiment interference between concurrent A/B tests?

Experiment interference (interaction effects) occurs when two concurrent experiments affect the same users and their effects are not independent — e.g., a UI experiment and a pricing experiment both influence purchase rate. Detection and mitigation strategies: (1) Mutual exclusion layers — partition the user space so users in experiment A are excluded from experiment B using disjoint hash buckets (orthogonal namespacing); (2) Interaction effect testing — run a 2×2 factorial ANOVA on users exposed to both experiments and test the interaction term for significance; (3) Holdout groups — reserve a percentage of users from all experiments as a clean holdout to measure aggregate system-level effects; (4) Experiment dependency graph — maintain metadata about which product surfaces, metrics, and user segments each experiment touches, and flag potential conflicts at experiment creation time; (5) Traffic isolation — for experiments on shared infrastructure (ranking, recommendations), use interleaving or shadow traffic to reduce cross-contamination.
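The mutual exclusion layers in point (1) can be sketched as below. Salting the hash with a layer name gives independent bucketings per layer, so experiments in the same layer claim disjoint bucket ranges while experiments in different layers overlap orthogonally (names and ranges here are illustrative):

```python
import hashlib

def layer_bucket(user_id: int, layer: str, num_buckets: int = 100) -> int:
    """Hash a user into a bucket within a named mutual-exclusion layer.
    Different layer names yield independent bucketings."""
    key = f"{layer}:{user_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % num_buckets

def in_experiment(user_id: int, layer: str, bucket_range) -> bool:
    """Membership test: the experiment owns [lo, hi) within its layer."""
    lo, hi = bucket_range
    return lo <= layer_bucket(user_id, layer) < hi
```

Two experiments given (0, 50) and (50, 100) in the same layer can never share a user; the same two ranges in different layers intersect on roughly a quarter of users, which is what makes factorial interaction tests possible there.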

