System Design: A/B Testing Platform — Experiment Assignment, Metric Collection, and Statistical Analysis

Requirements

An A/B testing platform enables product teams to run controlled experiments: show variant A to 50% of users and variant B to the other 50%, collect outcome metrics, and determine which variant wins with statistical confidence. Core requirements: consistent assignment (a user always sees the same variant), bucketing at scale (millions of assignments per second), metric collection (clicks, conversions, revenue), statistical analysis (p-values, confidence intervals), and experiment management (define, launch, stop, archive). A/B testing is how companies like Netflix, Airbnb, and LinkedIn make data-driven product decisions. Netflix runs 250+ concurrent experiments. LinkedIn runs thousands per year. This is a common system design question at data-driven product companies.

Experiment Assignment

Assignment must be: consistent (same user always gets the same variant), random (no systematic bias), and fast (< 1 ms, since it runs on every page load). Hashing-based assignment: hash(user_id + experiment_id) mod 100 gives a stable bucket number 0-99. Assign users in buckets 0-49 to control and 50-99 to treatment. The hash is deterministic — same inputs always produce the same output — so no state and no database lookup are required. Algorithm: SHA-256, or MurmurHash3 (faster, non-cryptographic). Why include experiment_id in the hash: hashing user_id alone would put each user in the same bucket for every experiment, so the same set of users would always land in treatment together, correlating assignments across experiments. Combining user_id and experiment_id makes assignments independent across experiments. Traffic allocation is flexible: 50/50, 80/20 (ramp test), 33/33/33 (three-way test). Bucket ranges are computed from the allocation percentages. Holdout groups: reserve a permanent holdout (e.g., 5% of users who never see any experiment) to measure the cumulative effect of all experiments over time. Mutual exclusion: experiments on the same surface should be mutually exclusive, because a user in both would confound their effects. Layer-based architecture: group mutually exclusive experiments into the same layer; experiments within a layer own non-overlapping bucket ranges, so a user is in at most one experiment per layer.
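A minimal sketch of deterministic bucketing and variant assignment, assuming Python. MD5 stands in for MurmurHash3 here to stay in the standard library (a production system would use a faster non-cryptographic hash); the function names and the holdout fallback are illustrative.

```python
import hashlib

def bucket(user_id: str, experiment_id: str) -> int:
    """Stable bucket 0-99 for a (user, experiment) pair. Deterministic:
    same inputs always hash to the same bucket, so no state is needed."""
    key = f"{user_id}:{experiment_id}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % 100

def assign(user_id: str, experiment_id: str,
           allocation: list[tuple[str, int]]) -> str:
    """allocation: [(variant_name, percent), ...] summing to <= 100.
    Bucket ranges are computed from the allocation percentages."""
    b = bucket(user_id, experiment_id)
    cumulative = 0
    for variant, percent in allocation:
        cumulative += percent
        if b < cumulative:
            return variant
    return "holdout"  # buckets beyond the allocated ranges see no variant

# Same inputs always produce the same variant: no database lookup.
split = [("control", 50), ("treatment", 50)]
assert assign("user-42", "checkout-color", split) == \
       assign("user-42", "checkout-color", split)
```

Because the hash also covers experiment_id, the same user generally lands in different buckets for different experiments, which is what keeps assignments independent.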

Event Collection and Metric Pipeline

Every user action that might be a metric is logged as an event: page view, click, add to cart, purchase, session duration, error. Event schema: {user_id, experiment_id, variant, event_type, timestamp, properties (JSON)}. Collection path: client SDK → event ingestion API → Kafka → stream processor → metrics store. SDK: JavaScript/mobile SDK intercepts clicks and logs events. Intercepts checkout completion for revenue metrics. Batches events (send every 5 seconds or on page unload) to reduce HTTP requests. Assignment event: logged when a user is first assigned to an experiment. All subsequent events join on user_id + experiment_id. Metric aggregation: for each experiment + variant: count of users exposed, count of conversions, sum of revenue. Pre-aggregate per hour in the stream processor (Flink). Store in ClickHouse: (experiment_id, variant, metric_name, bucket_hour, value, user_count). Dashboard queries aggregate over the desired time window.
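The hourly pre-aggregation step can be sketched in plain Python rather than Flink; the event fields follow the schema above, and the output row shape matches the ClickHouse columns (experiment_id, variant, metric_name, bucket_hour, value, user_count). The `amount` property for revenue events is an assumption for the example.

```python
from collections import defaultdict

def aggregate_hourly(events):
    """Fold raw events into per-hour rows: sum of values plus a
    distinct-user count per (experiment, variant, metric, hour)."""
    rows = defaultdict(lambda: {"value": 0.0, "users": set()})
    for e in events:
        hour = e["timestamp"] - e["timestamp"] % 3600  # truncate to the hour
        key = (e["experiment_id"], e["variant"], e["event_type"], hour)
        # Revenue events carry an amount; count-type events contribute 1.
        rows[key]["value"] += e.get("properties", {}).get("amount", 1)
        rows[key]["users"].add(e["user_id"])
    return [
        {"experiment_id": exp, "variant": var, "metric_name": metric,
         "bucket_hour": hour, "value": r["value"], "user_count": len(r["users"])}
        for (exp, var, metric, hour), r in rows.items()
    ]

events = [
    {"user_id": "u1", "experiment_id": "exp1", "variant": "treatment",
     "event_type": "purchase", "timestamp": 7200, "properties": {"amount": 30.0}},
    {"user_id": "u2", "experiment_id": "exp1", "variant": "treatment",
     "event_type": "purchase", "timestamp": 7300, "properties": {"amount": 10.0}},
]
rows = aggregate_hourly(events)
assert rows[0]["value"] == 40.0 and rows[0]["user_count"] == 2
```

Pre-aggregating per hour keeps dashboard queries cheap: they sum a handful of hourly rows instead of scanning billions of raw events.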

Statistical Analysis

Goal: determine whether the observed difference between variants is statistically significant or could be due to chance. Two-sample t-test for continuous metrics (revenue per user): compute t-statistic = (mean_A − mean_B) / sqrt(var_A/n_A + var_B/n_B), then convert to a p-value. If p < 0.05, reject the null hypothesis: the difference is significant at the 5% level. Chi-squared test for binary metrics (conversion rate): compare observed vs. expected conversion counts. Multiple testing problem: testing 10 metrics per experiment at p < 0.05 gives roughly a 40% chance of at least one false positive (1 − 0.95^10 ≈ 0.40) even when no variant has any real effect. Fix: Bonferroni correction (divide alpha by the number of metrics: 0.05/10 = 0.005 threshold per metric) or control the False Discovery Rate (Benjamini-Hochberg). Sequential testing: in traditional A/B testing you must pre-commit to a sample size and not peek at results early, because peeking inflates the false positive rate. Sequential methods (e.g., always-valid p-values, mSPRT) allow continuous monitoring without inflating error rates. Power analysis: before launching an experiment, compute the sample size required to detect a minimum detectable effect (MDE) with the desired power (typically 80%) at the chosen significance level (typically 5%). This ensures the experiment collects enough data before any conclusion is drawn. Minimum runtime: typically 1-2 full business cycles (1-2 weeks) to account for day-of-week effects.
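The two-sample (Welch's) t-test above can be sketched with the standard library alone. The p-value here uses the normal approximation via `erfc`, which is reasonable at A/B-test sample sizes; a production analysis would use `scipy.stats.ttest_ind(equal_var=False)` instead.

```python
import math
from statistics import mean, variance

def welch_t_test(a, b):
    """Two-sample t-test with unequal variances:
    t = (mean_A - mean_B) / sqrt(var_A/n_A + var_B/n_B)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Two-sided p-value, normal approximation (valid for large n).
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

def significant(p: float, num_metrics: int, alpha: float = 0.05) -> bool:
    """Bonferroni correction: compare p against alpha / number of metrics."""
    return p < alpha / num_metrics

# Control vs. a treatment shifted up by 3 units: clearly significant.
control = [i % 10 for i in range(200)]
treatment = [x + 3 for x in control]
t, p = welch_t_test(control, treatment)
assert p < 0.05 and significant(p, num_metrics=10)
```

Note how the Bonferroni threshold (0.005 for 10 metrics) is stricter than the single-metric 0.05: a borderline result on one of many metrics stops counting as a win.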

Experiment Management and Guardrails

Experiment lifecycle: DRAFT → REVIEW → RUNNING → STOPPED → ARCHIVED. Review: product, engineering, and data science sign off before launch. Guardrail metrics: in addition to the primary metric (e.g., conversion rate), monitor guardrail metrics that should not regress: latency, error rate, revenue per user. If a guardrail regresses significantly, automatically stop the experiment and alert the team. This prevents a "winning" experiment from harming the business in unmeasured ways. Ramp-up: start with 1% of traffic, verify there are no errors or guardrail regressions, then ramp to 10%, 50%, and 100% over hours or days. This reduces the blast radius of bugs. Interaction effects: two experiments running on the same page may interact. The layer system rules out interactions within a layer (a user is in at most one experiment per layer), but experiments in different layers can still interact. CUPED (Controlled-experiment Using Pre-Experiment Data): use each user's pre-experiment behavior as a covariate to reduce variance, allowing smaller effects to be detected with the same sample size. Standard practice at mature experimentation platforms.
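The CUPED adjustment can be sketched directly from its definition: theta = Cov(metric, covariate) / Var(covariate), and each user's adjusted value is metric − theta * (covariate − mean(covariate)). The example data below is illustrative.

```python
from statistics import mean, variance

def cuped_adjust(metric, covariate):
    """Variance-reduced metric values. `covariate` is each user's
    pre-experiment value of the same metric; because it is measured
    before assignment, subtracting its contribution does not bias
    the treatment-effect estimate."""
    my, mx = mean(metric), mean(covariate)
    n = len(metric)
    cov = sum((y - my) * (x - mx) for y, x in zip(metric, covariate)) / (n - 1)
    theta = cov / variance(covariate)
    return [y - theta * (x - mx) for y, x in zip(metric, covariate)]

# Pre-experiment spend strongly predicts in-experiment spend, so the
# adjustment strips out most of the between-user variance.
pre = [float(x) for x in range(100)]
during = [2.0 * x + (x % 7) for x in pre]
adjusted = cuped_adjust(during, pre)
assert variance(adjusted) < variance(during)   # variance reduced
assert abs(mean(adjusted) - mean(during)) < 1e-9  # mean unchanged
```

Because the mean is preserved while the variance shrinks, the same t-test applied to the adjusted values reaches significance with fewer users.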


Frequently Asked Questions

How do you ensure consistent experiment assignment for the same user across sessions?

Use deterministic hash-based assignment: hash(user_id + experiment_id) mod 100 gives a stable bucket (0-99) for each user-experiment pair. The bucket never changes for the same user and experiment, so the user always sees the same variant whether they reload, log out, or return days later. This is computed locally with no database lookup. The hash must be consistent across all services and platforms: use the same algorithm (e.g., MurmurHash3) everywhere.

What is the mutual exclusion problem in A/B testing and how do experiment layers solve it?

Mutual exclusion: if users can be in multiple experiments simultaneously, the experiments' effects confound each other (you cannot tell which variant caused an observed change). Layers solve this: each layer is an independent traffic partition. An experiment lives in exactly one layer and owns a slice of that layer's traffic. A user is assigned to at most one experiment per layer, but can be in experiments across different layers simultaneously, since different layers test orthogonal features. Google's Overlapping Experiment Infrastructure popularized this approach.

How does CUPED reduce variance in A/B test metric analysis?

CUPED (Controlled-experiment Using Pre-Experiment Data) uses a pre-experiment covariate (e.g., the same metric measured in the week before the experiment) to reduce variance in the treatment metric. Adjusted metric = metric − theta * (covariate − mean(covariate)), where theta = Cov(metric, covariate) / Var(covariate). Since the covariate is uncorrelated with treatment assignment, this adjustment does not bias the estimate. It typically reduces variance by 40-70%, giving the same statistical power with fewer users or a shorter experiment runtime.

How do you handle novelty effects and ramp-up strategy for a new feature experiment?

Novelty effect: users engage more with any new feature simply because it is new, inflating treatment metrics. Mitigation: run the experiment for at least 2 full weeks; novelty effects typically decay within a week. Also analyze by user tenure in treatment: if long-exposed users show lower lift than recent entrants, novelty is a factor. Ramp-up: start with 1% traffic to catch crashes and data pipeline issues before full rollout. Use canary metrics (error rate, latency, crash rate) as kill-switch triggers. Progressively increase to 5%, 10%, and 50% before analyzing.

What is the minimum detectable effect (MDE) and how does it affect experiment design?

MDE is the smallest true effect size the experiment is designed to detect with the specified statistical power (typically 80%). A smaller MDE requires more users (longer runtime). Formula: n = (z_alpha/2 + z_beta)^2 * 2 * sigma^2 / delta^2, where delta is the MDE, sigma^2 is the metric variance, z_alpha/2 is the critical value for the false positive rate, and z_beta for the false negative rate. In practice: use a sample size calculator with your baseline metric value, expected relative lift (MDE), significance level (0.05), and power (0.80) to get the required users per variant.
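The sample-size formula above can be sketched in a few lines. The z-values 1.96 and 0.84 are the standard normal quantiles for a two-sided alpha of 0.05 and 80% power; the binary-metric variance p(1 − p) used in the example is an assumption for illustration.

```python
import math

def required_n(sigma: float, mde: float,
               z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Users needed per variant: n = (z_alpha/2 + z_beta)^2 * 2*sigma^2 / mde^2."""
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / mde ** 2)

# Binary metric: 5% baseline conversion, detect a 0.5 pp absolute lift.
sigma = math.sqrt(0.05 * 0.95)  # sqrt(p * (1 - p))
n = required_n(sigma, mde=0.005)  # roughly 30k users per variant
```

The quadratic dependence on the MDE is the key design lever: halving the effect you want to detect quadruples the required sample size, which is why tiny lifts need weeks of traffic.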

See also: Databricks Interview Prep

See also: Meta Interview Prep

See also: Netflix Interview Prep
