A/B testing is the gold standard for measuring the impact of product changes and ML model deployments. Understanding experiment design, statistical significance, and common pitfalls is essential for data science and ML engineering interviews. This guide covers the statistical foundations and practical considerations — from experiment setup to results interpretation.
Experiment Design
A well-designed A/B test specifies:

1. Hypothesis — clearly state what you expect. “Adding a product recommendation widget to the checkout page will increase average order value by 5%.”
2. Primary metric — the ONE metric that determines success (average order value). Secondary metrics provide context but do not determine the decision.
3. Guardrail metrics — metrics that must NOT degrade (page load time, conversion rate, error rate). If a guardrail degrades significantly, the test is stopped regardless of the primary metric.
4. Randomization unit — what is randomized: users (most common), sessions, devices, or geographic regions. With user-level randomization, each user consistently sees the same variant across sessions.
5. Sample size — calculate before starting: how many users are needed to detect the expected effect with sufficient statistical power (typically 80%).
6. Duration — run long enough to reach the required sample size, cover a full business cycle (weekday and weekend behavior differ), and account for novelty effects (users may engage differently with new features at first). Minimum: 1 week. Typical: 2-4 weeks.

Common mistake: stopping the test early because the result “looks significant” — this inflates the false positive rate (the peeking problem).
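User-level randomization (item 4) is commonly implemented by hashing a user ID together with an experiment name, so assignment is deterministic, sticky across sessions, and independent across experiments. A minimal sketch, with the function name, experiment key, and 50/50 split chosen for illustration:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_frac: float = 0.5) -> str:
    """Deterministically assign a user to a variant.

    Hashing user_id together with the experiment name means the same user
    always gets the same variant for this experiment, while assignments
    stay uncorrelated across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # map the first 8 hex digits to a bucket in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_frac else "control"
```

Because the hash is uniform, roughly `treatment_frac` of users land in treatment, and re-calling the function never flips anyone's assignment.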
Statistical Significance and P-Values
After collecting data, the question is: is the observed difference between control and treatment real, or due to random chance? Null hypothesis (H0): there is no difference (the treatment has no effect). Alternative hypothesis (H1): there is a difference. P-value: the probability of observing a difference at least as extreme as the one measured, assuming H0 is true. If p < alpha (commonly 0.05): reject H0; the result is “statistically significant.” If p >= alpha: fail to reject H0; the result is “not statistically significant.” What a p-value is NOT: it is NOT the probability that H0 is true, and it is NOT the probability that the result is a fluke. It is the probability of the DATA given H0. This distinction matters in interviews. Confidence interval: a 95% CI for the treatment effect means: if we repeated the experiment 100 times, about 95 of the resulting intervals would contain the true effect. The CI provides both: whether the effect is significant (does the CI exclude zero?) and the magnitude (how large is the effect?). A CI of [0.5%, 3.0%] for conversion lift means: statistically significant (excludes 0) with an estimated effect between 0.5% and 3.0%. This is more informative than just “p < 0.05.”
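The two-proportion z-test and the confidence interval for the lift can be computed directly. A short sketch, with invented conversion counts for illustration:

```python
import numpy as np
from scipy.stats import norm

# hypothetical results: control vs treatment conversions
conv_c, n_c = 5000, 100_000   # control: 5.0% conversion
conv_t, n_t = 5600, 100_000   # treatment: 5.6% conversion

p_c, p_t = conv_c / n_c, conv_t / n_t
diff = p_t - p_c

# z-test: the standard error under H0 uses the pooled rate (no difference assumed)
p_pool = (conv_c + conv_t) / (n_c + n_t)
se_h0 = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = diff / se_h0
p_value = 2 * norm.sf(abs(z))  # two-sided

# 95% Wald confidence interval for the difference (unpooled standard error)
se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(f"p = {p_value:.2g}, 95% CI for the lift: [{ci[0]:.2%}, {ci[1]:.2%}]")
```

Here the CI excludes zero, so the result is significant, and the interval also reports the plausible size of the lift, which a bare p-value does not.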
Sample Size Calculation
Before running a test, calculate the required sample size per group. Inputs:

1. Baseline metric — the current value (e.g., a 5% conversion rate).
2. Minimum Detectable Effect (MDE) — the smallest effect worth detecting. A smaller MDE requires more samples. If a 0.1% lift is not actionable, set MDE = 0.5%.
3. Significance level (alpha = 0.05) — the probability of a false positive (declaring a winner when there is no real difference).
4. Power (1 - beta = 0.80) — the probability of detecting a real effect. 80% power means a 20% chance of missing a true effect.

Formula (for proportions): n = (Z_alpha/2 + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)^2. For a 5% baseline with a 0.5% MDE at alpha = 0.05 and power = 0.80: n ≈ 30,000 per group. Total: 60,000 users. At 10,000 users/day: 6 days just to reach the sample size (still run at least a full week to cover a business cycle). In practice: use an online calculator (Evan Miller, Optimizely) or a Python package (statsmodels). Always calculate sample size BEFORE starting. Running underpowered tests (too few samples) wastes time — you will not detect the effect even if it exists.
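The calculation above can be reproduced with statsmodels, which uses Cohen's h (an arcsine-transformed effect size) rather than the raw-difference formula, so the answer differs slightly from the hand calculation. A sketch for the 5% baseline, 0.5% MDE scenario:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h effect size for moving conversion from 5.0% to 5.5% (baseline + MDE)
effect = proportion_effectsize(0.055, 0.05)

# solve for the per-group sample size at alpha = 0.05, power = 0.80
n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,
    power=0.80,
    ratio=1.0,              # equal-sized control and treatment groups
    alternative="two-sided",
)
print(round(n_per_group))   # on the order of 30,000 per group
```

Doubling `n_per_group` gives the total users needed; dividing by daily traffic gives the minimum duration.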
Common Pitfalls
Pitfalls that lead to wrong conclusions:

1. Peeking — checking results daily and stopping as soon as p < 0.05. With daily checks over 2 weeks, the false positive rate inflates from 5% to 25%+. Solution: pre-commit to the sample size/duration, or use sequential testing (which adjusts for multiple looks).
2. Multiple testing — testing 10 metrics and declaring victory when one is significant (p < 0.05). With 10 independent tests, the probability of at least one false positive is about 40%. Solution: Bonferroni correction (divide alpha by the number of tests) or designate ONE primary metric.
3. Simpson’s paradox — the treatment wins overall but loses in every subgroup (or vice versa) due to imbalanced group sizes. Solution: check results in key segments (mobile vs desktop, new vs returning users).
4. Novelty/primacy effects — users engage more with anything new (novelty) or resist change (primacy); the initial effect fades after 1-2 weeks. Solution: run tests for at least 2 weeks and check for time trends.
5. Network effects — if treated users interact with control users (as on social networks), the treatment effect “leaks” into the control group. Solution: cluster randomization (randomize at the group/community level, not the individual level).
6. Survivorship bias — analyzing only users who completed a flow, ignoring those who dropped out. The treatment may cause more dropouts yet look better among the survivors.
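The peeking problem is easy to demonstrate by simulation: run A/A experiments (both arms share the same true conversion rate, so every “significant” result is a false positive), test daily, and count how often any daily check crosses p < 0.05. The numbers below (14 days, 1,000 users per arm per day, 5% rate) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, days, users_per_day, rate = 2000, 14, 1000, 0.05

fp_final = 0    # false positives when testing once, at the end
fp_peeking = 0  # false positives when testing daily, stopping at first p < 0.05

for _ in range(n_sims):
    # cumulative conversions in two arms with IDENTICAL true rates (A/A test)
    a = rng.binomial(users_per_day, rate, size=days).cumsum()
    b = rng.binomial(users_per_day, rate, size=days).cumsum()
    n = users_per_day * np.arange(1, days + 1)

    # two-proportion z statistic at each daily look
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = np.abs(a / n - b / n) / se
    significant = z > 1.96  # two-sided p < 0.05

    fp_final += significant[-1]
    fp_peeking += significant.any()

print(fp_final / n_sims)    # close to the nominal 5%
print(fp_peeking / n_sims)  # substantially inflated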
Bayesian A/B Testing
Bayesian testing provides an alternative to frequentist hypothesis testing. Instead of p-values, compute the posterior probability that the treatment is better than control: start with a prior belief about the metric (e.g., Beta(1,1), a uniform prior, for a conversion rate), update it with the observed data to get the posterior distribution, then compute P(treatment > control) — the probability that the treatment conversion rate exceeds the control rate. If P > 95%, the treatment is likely better.

Advantages:

1. Interpretable — “there is a 97% probability that the treatment is better” is more intuitive than “p = 0.03 under H0.”
2. No peeking problem — Bayesian updating is valid at any sample size, so you can check results daily without inflating error rates.
3. Decision-focused — directly answers “which variant is better?” rather than “can we reject the null?”
4. Handles small samples better — the prior regularizes estimates when data is limited.

Disadvantages:

1. Requires choosing a prior (though weakly informative priors are usually sufficient).
2. Less familiar to many stakeholders.
3. No direct equivalent of “power” for sample size planning (use simulation instead).

In practice: Google, Netflix, and many other companies use Bayesian A/B testing. The Bayesian framework naturally handles sequential testing, multiple metrics, and continuous monitoring without the peeking adjustments required by frequentist methods.
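The Beta-Binomial update described above fits in a few lines: with a Beta(1,1) prior and binomial data, the posterior is again a Beta distribution, and P(treatment > control) can be estimated by Monte Carlo sampling from the two posteriors. The conversion counts below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical observed data
conv_c, n_c = 500, 10_000   # control: 5.0% conversion
conv_t, n_t = 560, 10_000   # treatment: 5.6% conversion

# Beta(1, 1) uniform prior + binomial likelihood -> Beta posterior
samples_c = rng.beta(1 + conv_c, 1 + n_c - conv_c, size=200_000)
samples_t = rng.beta(1 + conv_t, 1 + n_t - conv_t, size=200_000)

p_better = (samples_t > samples_c).mean()      # P(treatment > control)
lift = (samples_t / samples_c - 1).mean()      # expected relative lift
print(f"P(treatment > control) = {p_better:.3f}")
```

The same posterior samples also answer decision-oriented questions directly, e.g. the expected relative lift or the probability the lift exceeds some business-relevant threshold.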