Q: How do you calculate the required sample size for an A/B test?

Required inputs: (1) baseline conversion rate (current conversion rate, e.g., 5%), (2) minimum detectable effect (MDE) — the smallest improvement worth detecting (e.g., 10% relative = 5.5% absolute), (3) desired power (typically 80% — 80% chance of detecting the effect if real), (4) significance level (α=0.05). Formula uses the normal approximation. Python: use scipy.stats or an online sample size calculator. Rule of thumb: to detect a 10% relative lift from a 5% baseline at 80% power, you need ~3000 users per variant. Smaller effects need more users: detecting 1% relative lift needs ~300,000 per variant. Estimate experiment duration: sample_size / daily_traffic_eligible_users = days to run.

Q: How do you prevent the peeking problem in A/B testing?

The peeking problem: if you check results every day and stop when p < 0.05, your actual false positive rate is much higher than 5% (can be 20-30%). This is because you get multiple chances to observe an extreme result by chance. Solutions: (1) Pre-register sample size and only analyze once at the end — the simplest approach. (2) Sequential testing (SPRT, mSPRT): mathematically corrects for continuous monitoring by using a different significance threshold. Allows valid early stopping with controlled false positive rate. Used by Booking.com and Netflix. (3) Bayesian A/B testing: update posterior beliefs as data arrives; no concept of "peeking." Report probability that treatment is better by X%. Does not control Type I error in the frequentist sense but is more intuitive.

Question 1

How do you assign users to A/B test variants deterministically?

Accepted Answer

Hash the user_id combined with the experiment_id: bucket = hash(user_id + ":" + experiment_id) mod 100. This gives a bucket 0-99. Assign to variant based on allocation ranges (e.g., 0-49=control, 50-99=treatment for a 50/50 split). Deterministic: the same user always gets the same bucket for the same experiment. Including experiment_id in the hash ensures the same user gets different (uncorrelated) assignments across different experiments. Use a consistent hash (MD5, FNV) not Python's built-in hash (which is salted and non-deterministic). Cache the assignment in Redis (key=assign:{user_id}:{experiment_id}, TTL=24h) after the first lookup to avoid recomputation on every request.

Question 2

What is the difference between assignment and exposure in A/B testing?

Accepted Answer

Assignment: the user is in the experiment population (eligible for a variant). Happens when the user meets the targeting criteria. Exposure: the user actually encountered the variant. Happens when the variant is rendered/served. A user can be assigned to an experiment but never exposed (if they never visit the page where the experiment is active). Analysis should only include exposed users — counting assigned-but-not-exposed users dilutes the treatment effect and causes false negatives. Log an exposure event (ExposureEvent) when the variant is actually served to the user. The analysis pipeline joins conversion events to exposure events, not just assignment records. This is the correct denominator for conversion rate calculations.

Question 3

What is statistical significance and how do you interpret A/B test results?

Accepted Answer

Statistical significance (p-value < 0.05) means: if the variants were truly identical, there is less than 5% probability of observing a difference this large or larger by chance. It does NOT mean the effect is large or practically important. A statistically significant but tiny effect (e.g., 0.01% conversion rate increase) may not be worth shipping. Practical significance: the effect size must be large enough to matter for the business. Confidence interval: a 95% CI of [+0.5%, +2.0%] means the true effect is likely between +0.5% and +2.0% improvement. Never peek at results mid-experiment and stop early when significant — this inflates the false positive rate (sequential testing problem). Run experiments to the pre-calculated sample size.

Question 4

How do you calculate the required sample size for an A/B test?

Accepted Answer

Required inputs: (1) baseline conversion rate (current conversion rate, e.g., 5%), (2) minimum detectable effect (MDE) — the smallest improvement worth detecting (e.g., 10% relative = 5.5% absolute), (3) desired power (typically 80% — 80% chance of detecting the effect if real), (4) significance level (α=0.05). Formula uses the normal approximation. Python: use scipy.stats or an online sample size calculator. Rule of thumb: to detect a 10% relative lift from a 5% baseline at 80% power, you need ~3000 users per variant. Smaller effects need more users: detecting 1% relative lift needs ~300,000 per variant. Estimate experiment duration: sample_size / daily_traffic_eligible_users = days to run.

Question 5

How do you prevent the peeking problem in A/B testing?

Accepted Answer

The peeking problem: if you check results every day and stop when p < 0.05, your actual false positive rate is much higher than 5% (can be 20-30%). This is because you get multiple chances to observe an extreme result by chance. Solutions: (1) Pre-register sample size and only analyze once at the end — the simplest approach. (2) Sequential testing (SPRT, mSPRT): mathematically corrects for continuous monitoring by using a different significance threshold. Allows valid early stopping with controlled false positive rate. Used by Booking.com and Netflix. (3) Bayesian A/B testing: update posterior beliefs as data arrives; no concept of "peeking." Report probability that treatment is better by X%. Does not control Type I error in the frequentist sense but is more intuitive.

A/B Testing Platform Low-Level Design

What is A/B Testing?

Data Model

User Assignment

Exposure Logging

Statistical Analysis

Common Pitfalls

Experiment Configuration Service

Key Design Decisions