A/B Testing Platform Low-Level Design

What is A/B Testing?

A/B testing (controlled experimentation) randomly assigns users to experimental variants and measures which variant produces better outcomes. Core components: experiment configuration (which users, which variants, what percentage), assignment service (deterministic user-to-variant mapping), event tracking (log which users saw which variant and what they did), and statistical analysis (is the difference significant?).

Data Model

Experiment(experiment_id, name, status ENUM(DRAFT,RUNNING,PAUSED,CONCLUDED),
           start_date, end_date, owner, description)
Variant(variant_id, experiment_id, name, allocation_percent, config JSON)
Assignment(user_id, experiment_id, variant_id, assigned_at)
ExperimentMetric(experiment_id, metric_name, primary BOOL)
ExposureEvent(user_id, experiment_id, variant_id, timestamp, properties JSON)
ConversionEvent(user_id, event_type, timestamp, value)
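For reference, the core tables map naturally onto Python dataclasses. This is an illustrative in-memory model (field names follow the schema above; the `ExperimentStatus` enum mirrors the status ENUM), not an ORM mapping:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class ExperimentStatus(Enum):
    DRAFT = "DRAFT"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    CONCLUDED = "CONCLUDED"

@dataclass
class Experiment:
    experiment_id: str
    name: str
    status: ExperimentStatus
    start_date: datetime
    end_date: Optional[datetime]
    owner: str
    description: str = ""

@dataclass
class Variant:
    variant_id: str
    experiment_id: str
    name: str
    allocation_percent: int                      # 0-100; an experiment's variants sum to 100
    config: dict = field(default_factory=dict)   # variant-specific parameters

@dataclass
class Assignment:
    user_id: str
    experiment_id: str
    variant_id: str
    assigned_at: datetime
```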

User Assignment

Requirements: deterministic (same user always gets same variant), random (no systematic bias), consistent (variant doesn’t change mid-experiment). Implementation: hash(user_id + experiment_id) mod 100 → assign to variant based on allocation ranges.

import hashlib

def assign_variant(user_id, experiment_id, variants):
    # Deterministic hash: same user_id + experiment_id always gives same bucket
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)  # MD5 for uniform, deterministic bucketing (not a security use)
    bucket = hash_value % 100  # 0-99

    cumulative = 0
    for variant in variants:
        cumulative += variant.allocation_percent
        if bucket < cumulative:
            return variant
    return variants[-1]  # fallback if allocations don't sum to exactly 100

Store the assignment in the Assignment table on first encounter. Cache in Redis (key=assign:{user_id}:{experiment_id}, TTL=24h) for fast lookups on subsequent requests.
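The read path is a cache-aside lookup. In this sketch, `cache` and `db` are stand-ins (assuming a redis-py-style `get`/`set` client and an Assignment-table DAO in production), and `assign_fn` is any function that computes a variant id, e.g. the hash-based `assign_variant` above:

```python
ASSIGNMENT_TTL_SECONDS = 24 * 3600  # 24h TTL, matching the Redis key scheme above

def get_assignment(user_id, experiment_id, variants, cache, db, assign_fn):
    """Cache-aside lookup: Redis first, then the Assignment table,
    else compute a fresh hash-based assignment and persist it."""
    cache_key = f"assign:{user_id}:{experiment_id}"
    variant_id = cache.get(cache_key)
    if variant_id is not None:
        return variant_id

    variant_id = db.load_assignment(user_id, experiment_id)
    if variant_id is None:
        # First encounter: compute deterministically and store
        variant_id = assign_fn(user_id, experiment_id, variants)
        db.save_assignment(user_id, experiment_id, variant_id)

    cache.set(cache_key, variant_id, ttl=ASSIGNMENT_TTL_SECONDS)
    return variant_id
```

Because the assignment is deterministic, a cache miss after TTL expiry recomputes the same variant; the DB row exists mainly for analysis joins and auditing.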

Exposure Logging

Log an exposure event when the user actually sees the variant (not just when they’re assigned). Distinction: a user may be assigned to experiment E but only exposed if they visit the page where E is active. Log exposure: when the variant is rendered, publish ExposureEvent to Kafka. Only users with logged exposures are included in analysis. This prevents dilution (including unexposed users in the control group).
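Publishing the exposure at render time might look like the following sketch. The `producer` is assumed to expose a kafka-python-style `send(topic, value)` method, and the topic name `exposure-events` is illustrative:

```python
import json
import time

EXPOSURE_TOPIC = "exposure-events"  # illustrative topic name

def log_exposure(producer, user_id, experiment_id, variant_id, properties=None):
    """Emit an ExposureEvent when the variant is actually rendered,
    not when the user is merely assigned."""
    event = {
        "user_id": user_id,
        "experiment_id": experiment_id,
        "variant_id": variant_id,
        "timestamp": time.time(),
        "properties": properties or {},
    }
    producer.send(EXPOSURE_TOPIC, json.dumps(event).encode("utf-8"))
    return event
```

The analysis pipeline then joins conversion events against these exposure events, so only exposed users enter the denominator.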

Statistical Analysis

For conversion metrics: two-proportion z-test. For continuous metrics (revenue, session duration): t-test or Mann-Whitney U test. Key concepts:

  • p-value: the probability of observing a difference this large by chance if the null hypothesis is true. Common threshold: p < 0.05 (accepting a 5% false positive rate).
  • Statistical significance: reject the null hypothesis (no effect). Does NOT mean the effect is practically significant.
  • Power: probability of detecting a real effect. Power = 80% means 20% chance of missing a real effect. Need sufficient sample size for adequate power.
  • Sample size calculation: based on baseline conversion rate, minimum detectable effect, desired power, and significance level. Use before starting the experiment to know how long to run.
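Both the z-test and the sample size calculation fit in a few lines using the standard library's `statistics.NormalDist`. This is a sketch of the standard normal-approximation formulas, not a full stats library:

```python
from math import ceil, sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference of two conversion rates.
    Returns (z, p_value) using the pooled-variance normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def sample_size_per_variant(baseline, mde_relative, alpha=0.05, power=0.80):
    """Users needed per variant to detect a relative lift of mde_relative
    over the baseline rate (two-sided test, normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

For example, detecting a 10% relative lift from a 5% baseline at 80% power and α = 0.05 comes out to roughly 31,000 users per variant under this approximation; dividing by daily eligible traffic gives the experiment duration.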

Common Pitfalls

  • Peeking problem: stopping the experiment early the moment you see significance inflates the false positive rate. Fix: pre-register the sample size and run until it is reached.
  • Multiple testing: test 20 metrics at a p < 0.05 threshold and on average one will appear significant by pure chance. Fix: Bonferroni correction, or designate a single primary metric up front.
  • Network effects: variant B leaks to control group via social interactions (e.g., a new feature one user sees affects their friends in the control group). Fix: cluster randomization (assign all users in a social cluster to the same variant).
  • Novelty effect: users engage more with new features simply because they’re new. Run experiments long enough (minimum 2 weeks) to see past the novelty.
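The Bonferroni fix from the multiple-testing bullet is nearly a one-liner: compare each metric's p-value against the threshold divided by the number of tests. A minimal sketch:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return which metrics remain significant after Bonferroni correction:
    each p-value is compared against alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}
```

With 20 metrics the per-metric threshold drops to 0.0025, which is why designating a single primary metric is often the more practical fix.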

Experiment Configuration Service

Experiments are configured in a web UI and stored in DB. The Assignment Service fetches experiment configs at startup and caches them. Config updates propagate via a pub/sub notification (Redis Pub/Sub or Kafka) — all Assignment Service instances reload within seconds. Feature flags (kill switches): if an experiment variant is causing errors, flip the flag to 0% allocation — all users instantly roll back to control without a deployment.
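A minimal sketch of the in-process config cache with a kill switch check. The pub/sub wiring is reduced to a plain callback here (`on_config_update` would be invoked by a Redis Pub/Sub or Kafka listener in production), and the `treatment_percent` config key is an assumption for illustration:

```python
class ExperimentConfigCache:
    """Holds experiment configs in memory; on_config_update is called
    by the pub/sub listener whenever a config changes."""

    def __init__(self, fetch_all):
        self._fetch_all = fetch_all       # callable: () -> {experiment_id: config dict}
        self._configs = fetch_all()       # load everything at startup

    def on_config_update(self, experiment_id, new_config):
        # Invoked on a pub/sub notification; takes effect without a restart.
        self._configs[experiment_id] = new_config

    def is_killed(self, experiment_id):
        """Kill switch: treatment allocation flipped to 0% means every
        user falls through to control without a deployment."""
        config = self._configs.get(experiment_id, {})
        return config.get("treatment_percent", 0) == 0
```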

Key Design Decisions

  • Deterministic hash (user_id + experiment_id) ensures consistent assignment across all services without shared state
  • Cache assignments in Redis — avoids DB lookup on every request
  • Log exposures separately from assignments — only include actually exposed users in analysis
  • Pre-register sample size and primary metric before running — prevents p-value fishing

