A/B Testing Platform Low-Level Design

What is an A/B Testing Platform?

An A/B testing platform (also called an experimentation platform) enables product teams to run controlled experiments: split traffic between variants (A = control, B = treatment), collect metrics, and determine which variant wins with statistical significance. Companies such as Google, Meta, Netflix, and Airbnb use these platforms to make data-driven product decisions at scale.

Requirements

  • Define experiments: name, variants (A/B or multi-variant), traffic allocation (%)
  • Assign users to variants consistently (same user always sees same variant)
  • Log exposure events (user assigned to variant) and conversion events
  • Compute statistical significance (p-value, confidence intervals) for experiment results
  • Support feature flags (ship to 10% of users without a formal experiment)
  • 100M daily active users, assignment decision <5ms

Data Model

Experiment(experiment_id UUID, name VARCHAR, status ENUM(DRAFT,RUNNING,PAUSED,CONCLUDED),
           start_date, end_date, owner, description,
           primary_metric VARCHAR,   -- e.g. 'checkout_conversion_rate'
           traffic_percent FLOAT)    -- fraction of eligible users enrolled (0.0-1.0)

Variant(variant_id UUID, experiment_id UUID, name VARCHAR,
        allocation FLOAT,   -- fraction of enrolled users (must sum to 1.0)
        is_control BOOL)

ExposureEvent(event_id UUID, experiment_id UUID, variant_id UUID, user_id UUID,
              timestamp TIMESTAMP, platform VARCHAR, app_version VARCHAR)

ConversionEvent(event_id UUID, experiment_id UUID, user_id UUID,
                metric_name VARCHAR, value FLOAT, timestamp TIMESTAMP)

Variant Assignment

Deterministic assignment: given (user_id, experiment_id), always return the same variant. No DB lookup required — use hash-based bucketing:

import mmh3  # MurmurHash3 bindings (pip install mmh3)

def assign_variant(user_id: str, experiment: Experiment) -> Variant | None:
    # Step 1: determine if user is in experiment traffic
    enrollment_hash = mmh3.hash(f'{user_id}:{experiment.id}:enrollment') % 10000
    if enrollment_hash >= experiment.traffic_percent * 10000:
        return None  # user not in experiment

    # Step 2: assign to variant by walking cumulative allocation buckets
    variant_hash = mmh3.hash(f'{user_id}:{experiment.id}:variant') % 10000
    cumulative = 0.0
    for variant in experiment.variants:
        cumulative += variant.allocation * 10000
        if variant_hash < cumulative:
            return variant
    return experiment.variants[-1]  # guard against float rounding in allocations

MurmurHash3 is fast (sub-microsecond) and distributes uniformly. Two separate hash calls (enrollment + variant) ensure independence between “is in experiment” and “which variant” decisions. Cache experiment configs in local memory (updated every 30s from config store) — no DB or Redis lookup on the hot path.

Event Logging

On each page/API request, the SDK calls assign_variant for all running experiments the user is eligible for. Exposure events are logged asynchronously:

# SDK side (non-blocking)
variant = assign_variant(user_id, experiment)
if variant:
    event_queue.put({
        'type': 'exposure',
        'experiment_id': experiment.id,
        'variant_id': variant.id,
        'user_id': user_id,
        'timestamp': now()
    })

# Background thread drains the queue into batches and flushes to Kafka
# every 500ms or every 1000 events, whichever comes first:
kafka.produce('experiment-events', batch)

Kafka consumers write events to a data warehouse (BigQuery, Snowflake) for analysis. Avoid writing exposures to the OLTP DB — the volume (100M users × N experiments) requires columnar storage for aggregation queries.
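The 500ms-or-1000-events flush policy can be sketched with a plain queue and a worker thread. A minimal, broker-agnostic version where `send_batch` stands in for the Kafka producer call; all names are illustrative:

```python
import queue
import threading
import time

class EventBuffer:
    """Buffers events and flushes by size or age, whichever comes first."""

    def __init__(self, send_batch, max_batch=1000, max_delay=0.5):
        self._q = queue.Queue()
        self._send = send_batch          # e.g. wraps kafka.produce(...)
        self._max_batch = max_batch
        self._max_delay = max_delay      # seconds
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def put(self, event: dict) -> None:
        self._q.put(event)               # non-blocking from the request path

    def _run(self):
        batch, deadline = [], None
        while not (self._stop.is_set() and self._q.empty() and not batch):
            timeout = (self._max_delay if deadline is None
                       else max(0.0, deadline - time.monotonic()))
            try:
                event = self._q.get(timeout=timeout)
                if not batch:
                    # Start the age clock when the first event enters a batch
                    deadline = time.monotonic() + self._max_delay
                batch.append(event)
            except queue.Empty:
                pass
            if batch and (len(batch) >= self._max_batch
                          or time.monotonic() >= deadline):
                self._send(batch)
                batch, deadline = [], None

    def close(self):
        # Flush anything pending, then stop the worker
        self._stop.set()
        self._worker.join()
```

A production producer would also need retries and backpressure handling; the point here is the batching shape, not delivery guarantees.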

Statistical Analysis

After collecting sufficient data, compute whether the difference between variants is statistically significant:

from statistics import mean
import scipy.stats

def compute_results(experiment_id, metric):
    # ConversionEvent has no variant_id: join through ExposureEvent to
    # recover each user's assigned variant, then split by is_control.
    query = """
        SELECT c.value FROM ConversionEvent c
        JOIN ExposureEvent e ON e.user_id = c.user_id
                            AND e.experiment_id = c.experiment_id
        JOIN Variant v ON v.variant_id = e.variant_id
        WHERE c.experiment_id = ? AND c.metric_name = ? AND v.is_control = ?"""
    control_data = warehouse.query(query, (experiment_id, metric, True))
    treatment_data = warehouse.query(query, (experiment_id, metric, False))

    # Two-sample t-test for continuous metrics (Welch's variant is safer
    # when variances differ across variants)
    t_stat, p_value = scipy.stats.ttest_ind(control_data, treatment_data,
                                            equal_var=False)

    control_mean = mean(control_data)
    treatment_mean = mean(treatment_data)
    relative_lift = (treatment_mean - control_mean) / control_mean

    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'relative_lift': relative_lift,   # e.g. +0.032 = +3.2%
        'p_value': p_value,
        'significant': p_value < 0.05
    }
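The t-test suits continuous metrics (revenue, session duration). For binary conversion metrics, counts suffice and a two-proportion z-test is the usual choice; it also yields the confidence interval the requirements call for. A stdlib-only sketch (the normal-approximation formulas assume reasonably large samples; the function name is illustrative):

```python
from math import erf, sqrt

def proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int) -> dict:
    """Two-proportion z-test plus a 95% CI on the absolute lift.
    Normal approximation: assumes n*p and n*(1-p) are large for both arms."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under H0: both variants share one conversion rate
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    # Unpooled standard error for the CI on the difference
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci_95 = ((p_b - p_a) - 1.96 * se, (p_b - p_a) + 1.96 * se)
    return {'lift': p_b - p_a, 'p_value': p_value, 'ci_95': ci_95}
```

For example, 200/1000 conversions in control vs 240/1000 in treatment gives a +4pp lift with p below 0.05 and a CI that excludes zero. scipy.stats.chi2_contingency on the 2x2 count table gives an equivalent answer.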

Feature Flags

Feature flags are a simplified variant of experiments: ship a feature to X% of users without statistical analysis. The implementation reuses the same hash-based bucketing (user_id + flag_name) but skips ConversionEvent tracking.

Config: FeatureFlag(flag_id, name, enabled_percent, enabled_user_ids[], enabled_regions[])

Used for:
  • Gradual rollouts: ramp enabled_percent up over time
  • Kill switches: set enabled_percent=0 to disable instantly
  • Targeted access: specific user IDs or regions for beta testing
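A flag check composes these pieces in a few lines. A minimal sketch, again substituting stdlib SHA-256 for MurmurHash3 so it runs without extra dependencies; field names mirror the config above, and `is_enabled` is an illustrative name:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class FeatureFlag:
    name: str
    enabled_percent: float                          # 0.0-1.0
    enabled_user_ids: set[str] = field(default_factory=set)
    enabled_regions: set[str] = field(default_factory=set)

def is_enabled(flag: FeatureFlag, user_id: str, region: str = '') -> bool:
    # Targeted access: explicit allowlists short-circuit the rollout check
    if user_id in flag.enabled_user_ids or region in flag.enabled_regions:
        return True
    # Kill switch: enabled_percent=0 disables everyone not allowlisted
    if flag.enabled_percent <= 0:
        return False
    # Same deterministic hash bucketing as experiments, salted per flag
    digest = hashlib.sha256(f'{user_id}:{flag.name}'.encode()).digest()
    bucket = int.from_bytes(digest[:8], 'big') % 10000
    return bucket < flag.enabled_percent * 10000
```

Because the bucket depends only on (user_id, flag name), ramping enabled_percent from 10% to 20% keeps the original 10% enabled and adds new users, rather than reshuffling everyone.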

Key Design Decisions

  • Hash-based deterministic assignment — no DB lookup on the hot path, sub-millisecond decision
  • In-memory experiment config cache — refreshed every 30s, eliminates per-request network call
  • Async exposure logging via Kafka — decouples experiment decision from event persistence
  • Data warehouse for analysis — columnar storage handles 100M+ events efficiently; OLTP DB cannot
  • Two-sample t-test for significance — well understood, supported by scipy/statsmodels


