What is A/B Testing?
A/B testing (controlled experimentation) randomly assigns users to experimental variants and measures which variant produces better outcomes. Core components: experiment configuration (which users, which variants, what percentage), assignment service (deterministic user-to-variant mapping), event tracking (log which users saw which variant and what they did), and statistical analysis (is the difference significant?).
Data Model
Experiment(experiment_id, name, status ENUM(DRAFT,RUNNING,PAUSED,CONCLUDED),
start_date, end_date, owner, description)
Variant(variant_id, experiment_id, name, allocation_percent, config JSON)
Assignment(user_id, experiment_id, variant_id, assigned_at)
ExperimentMetric(experiment_id, metric_name, primary BOOL)
ExposureEvent(user_id, experiment_id, variant_id, timestamp, properties JSON)
ConversionEvent(user_id, event_type, timestamp, value)
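As a rough illustration only, two of the entities above could be mirrored in application code like this. The field types, the defaults, and the allocations_are_valid helper are assumptions made for the sketch; the schema above does not specify them.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class Status(Enum):
    DRAFT = "DRAFT"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    CONCLUDED = "CONCLUDED"


@dataclass
class Variant:
    variant_id: str
    experiment_id: str
    name: str
    allocation_percent: int  # assumed to be a whole-number percentage
    config: dict = field(default_factory=dict)


@dataclass
class Experiment:
    experiment_id: str
    name: str
    status: Status = Status.DRAFT
    start_date: Optional[datetime] = None
    end_date: Optional[datetime] = None
    owner: str = ""
    description: str = ""


def allocations_are_valid(variants):
    """Hypothetical check: percentages are non-negative and sum to exactly 100."""
    percents = [v.allocation_percent for v in variants]
    return all(p >= 0 for p in percents) and sum(percents) == 100
```

A check like this would typically run when an experiment moves from DRAFT to RUNNING, so that every hash bucket maps to some variant.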
User Assignment
Requirements: deterministic (same user always gets same variant), random (no systematic bias), consistent (variant doesn’t change mid-experiment). Implementation: hash(user_id + experiment_id) mod 100 → assign to variant based on allocation ranges.
import hashlib

def assign_variant(user_id, experiment_id, variants):
    # Deterministic hash: same user_id + experiment_id always gives same bucket
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    bucket = hash_value % 100  # 0-99

    # Walk the cumulative allocation ranges until the bucket falls inside one
    cumulative = 0
    for variant in variants:
        cumulative += variant.allocation_percent
        if bucket < cumulative:
            return variant
    return variants[-1]  # fallback if allocations sum to less than 100
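A quick way to sanity-check the "no systematic bias" requirement is to bucket a batch of synthetic users and look at the resulting split. The sketch below assumes a two-arm experiment and made-up user IDs of the form user-N; bucket_for and observed_split are hypothetical helper names.

```python
import hashlib
from collections import Counter


def bucket_for(user_id, experiment_id):
    """Same scheme as above: MD5 of "user_id:experiment_id", reduced to 0-99."""
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def observed_split(experiment_id, num_users, control_percent=50):
    """Share of synthetic users landing in each arm of a two-arm experiment."""
    counts = Counter(
        "control" if bucket_for(f"user-{i}", experiment_id) < control_percent else "treatment"
        for i in range(num_users)
    )
    return {arm: n / num_users for arm, n in counts.items()}


if __name__ == "__main__":
    # With enough users, each share should sit close to its configured allocation
    print(observed_split("exp-1", 20000))
```

Because experiment_id is part of the hash input, the same user lands in unrelated buckets across experiments, which keeps assignments in one experiment from correlating with assignments in another.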
Store the assignment in the Assignment table on first encounter. Cache in Redis (key=assign:{user_id}:{experiment_id}, TTL=24h) for fast lookups on subsequent requests.
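The lookup order described here (cache, then table, then compute and store) might look roughly like the following. This is a sketch under simplifying assumptions: plain dicts stand in for both Redis and the Assignment table, so there is no TTL, and compute_variant is a hypothetical callable wrapping the hash-based assignment.

```python
def get_assignment(user_id, experiment_id, compute_variant, cache, table):
    """Return the variant id for a user, preferring cache, then the durable table."""
    key = f"assign:{user_id}:{experiment_id}"

    cached = cache.get(key)
    if cached is not None:
        return cached

    # Cache miss: fall back to the durable record
    variant_id = table.get((user_id, experiment_id))
    if variant_id is None:
        # First encounter: compute the assignment and persist it
        variant_id = compute_variant(user_id, experiment_id)
        table[(user_id, experiment_id)] = variant_id

    # Repopulate the cache; with Redis the 24h TTL would be set on this write
    cache[key] = variant_id
    return variant_id
```

Reading the stored record before recomputing means a user keeps their original variant even if the cache entry expires after allocation percentages have changed.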
Exposure Logging
Log an exposure event when the user actually sees the variant, not merely when they are assigned. A user may be assigned to experiment E but is only exposed if they visit the page where E is active. When the variant is rendered, publish an ExposureEvent to Kafka. Only users with logged exposures are included in the analysis. This prevents dilution: unexposed users behave the same regardless of which group they were assigned to, so including them in either group shrinks the measured effect toward zero.
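One possible shape for exposure logging and exposure-filtered analysis is sketched below. The publish callable stands in for a Kafka producer, the topic name exposure-events is made up, and the rule that conversions before first exposure are ignored is an assumption of this sketch rather than something the design above specifies.

```python
import json
import time


def log_exposure(publish, user_id, experiment_id, variant_id, properties=None):
    """Build an ExposureEvent and hand it to `publish` (stand-in for a Kafka producer)."""
    event = {
        "user_id": user_id,
        "experiment_id": experiment_id,
        "variant_id": variant_id,
        "timestamp": time.time(),
        "properties": properties or {},
    }
    publish("exposure-events", json.dumps(event))  # hypothetical topic name
    return event


def first_exposures(events, experiment_id):
    """Map user_id -> (variant_id, timestamp) of the earliest exposure."""
    first = {}
    for e in events:
        if e["experiment_id"] != experiment_id:
            continue
        seen = first.get(e["user_id"])
        if seen is None or e["timestamp"] < seen[1]:
            first[e["user_id"]] = (e["variant_id"], e["timestamp"])
    return first


def conversions_by_variant(exposures, conversions):
    """Count distinct converting users per variant, exposed users only."""
    converted = {}
    for c in conversions:
        exposure = exposures.get(c["user_id"])
        # Skip users never exposed, and conversions that predate exposure
        if exposure is None or c["timestamp"] < exposure[1]:
            continue
        converted.setdefault(exposure[0], set()).add(c["user_id"])
    return {variant: len(users) for variant, users in converted.items()}
```

The denominator for each variant's conversion rate would then be the number of users in first_exposures for that variant, not the number of assigned users.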
Statistical Analysis
For conversion metrics: two-proportion z-test. For continuous metrics (revenue, session duration): t-test or Mann-Whitney U test. Key concepts:
- p-value: probability of seeing a difference at least this large by chance if the null hypothesis is true. Threshold p < 0.05 (accepts a 5% false positive rate when there is no real effect).
- Statistical significance: reject the null hypothesis (no effect). Does NOT mean the effect is practically significant.
- Power: probability of detecting a real effect. Power = 80% means 20% chance of missing a real effect. Need sufficient sample size for adequate power.
- Sample size calculation: based on baseline conversion rate, minimum detectable effect, desired power, and significance level. Use before starting the experiment to know how long to run.
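To make the concepts above concrete, here is a minimal sketch of a two-proportion z-test and a per-variant sample size estimate, using only the Python standard library. The function names are illustrative, the sample size formula is the common normal-approximation version for a two-sided test, and the minimum detectable effect is assumed to be an absolute lift (for example 0.02 for 10% to 12%).

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect an absolute lift `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


if __name__ == "__main__":
    print(two_proportion_z_test(100, 1000, 130, 1000))
    print(sample_size_per_variant(baseline=0.10, mde=0.02))
```

With a 10% baseline and a 2-point absolute lift, the estimate comes out at a few thousand users per variant; dividing by expected daily exposed traffic gives a rough run length.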
Common Pitfalls
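- Peeking problem: checking results repeatedly and stopping the experiment as soon as they look significant inflates the false positive rate. Fix: pre-register the sample size and run until it is reached.
- Multiple testing: testing 20 metrics at a p < 0.05 threshold yields about 1 false positive on average even when nothing changed, and the chance of at least one is roughly 64% (1 - 0.95^20), assuming independent metrics. Fix: Bonferroni correction or designate a single primary metric.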
- Network effects: variant B leaks to control group via social interactions (e.g., a new feature one user sees affects their friends in the control group). Fix: cluster randomization (assign all users in a social cluster to the same variant).
- Novelty effect: users engage more with new features simply because they’re new. Run experiments long enough (minimum 2 weeks) to see past the novelty.
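The multiple-testing arithmetic and the Bonferroni fix can be sketched in a few lines. This assumes the metrics are independent, which real metrics rarely are, so treat the family-wise figure as an approximation; the function names and metric names are illustrative.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag metrics whose p-value clears the Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


if __name__ == "__main__":
    # 20 metrics at alpha = 0.05: roughly a 64% chance of a spurious "win"
    print(round(family_wise_error_rate(20), 2))
    # Two metrics: the adjusted threshold is 0.025
    print(bonferroni_significant({"click_rate": 0.001, "revenue": 0.03}))
```

Bonferroni is conservative: it controls false positives at the cost of power, which is one reason designating a single primary metric is often the simpler choice.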
Experiment Configuration Service
Experiments are configured in a web UI and stored in a database. The Assignment Service fetches experiment configs at startup and caches them. Config updates propagate via a pub/sub notification (Redis Pub/Sub or Kafka), so all Assignment Service instances reload within seconds. Feature flags (kill switches): if an experiment variant is causing errors, set its allocation to 0%; once the update propagates, all users fall back to control without a deployment.
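A minimal sketch of the reload-on-notification pattern follows. The class name, the load_all callable, and the config shape (experiment id mapped to variant percentages) are all assumptions for illustration; in a real service, on_config_changed would be wired to a Redis Pub/Sub or Kafka subscriber.

```python
import threading


class ExperimentConfigCache:
    """In-process cache of experiment configs, refreshed on notification."""

    def __init__(self, load_all):
        # load_all: hypothetical callable returning {experiment_id: {variant: percent}}
        self._load_all = load_all
        self._lock = threading.Lock()
        self._configs = load_all()  # initial fetch at startup

    def on_config_changed(self, _message=None):
        """Pub/sub callback: reload everything, then swap under the lock."""
        fresh = self._load_all()
        with self._lock:
            self._configs = fresh

    def allocations(self, experiment_id):
        """Return a copy so callers cannot mutate the cached config."""
        with self._lock:
            return dict(self._configs.get(experiment_id, {}))
```

Under this sketch, a kill switch is just a config write that sets the faulty variant to 0% and control to 100%, followed by the same notification every other config change uses.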
Key Design Decisions
- Deterministic hash (user_id + experiment_id) ensures consistent assignment across all services without shared state
- Cache assignments in Redis — avoids DB lookup on every request
- Log exposures separately from assignments — only include actually exposed users in analysis
- Pre-register sample size and primary metric before running — prevents p-value fishing