What is an A/B Testing Platform?
An A/B testing platform (also called an experimentation platform) enables product teams to run controlled experiments: split traffic between variants (A = control, B = treatment), collect metrics, and determine which variant wins with statistical significance. Companies such as Google, Meta, Netflix, and Airbnb use these platforms to make data-driven product decisions at scale.
Requirements
- Define experiments: name, variants (A/B or multi-variant), traffic allocation (%)
- Assign users to variants consistently (same user always sees same variant)
- Log exposure events (user assigned to variant) and conversion events
- Compute statistical significance (p-value, confidence intervals) for experiment results
- Support feature flags (ship to 10% of users without a formal experiment)
- 100M daily active users, assignment decision <5ms
Data Model
Experiment(experiment_id UUID, name VARCHAR, status ENUM(DRAFT, RUNNING, PAUSED, CONCLUDED),
           start_date, end_date, owner, description,
           primary_metric VARCHAR,   -- e.g. 'checkout_conversion_rate'
           traffic_percent FLOAT)    -- fraction of eligible users enrolled (0.0-1.0)
Variant(variant_id UUID, experiment_id UUID, name VARCHAR,
        allocation FLOAT,            -- fraction of enrolled users (allocations must sum to 1.0)
        is_control BOOL)
ExposureEvent(event_id UUID, experiment_id UUID, variant_id UUID, user_id UUID,
              timestamp TIMESTAMP, platform VARCHAR, app_version VARCHAR)
ConversionEvent(event_id UUID, experiment_id UUID, user_id UUID,
                metric_name VARCHAR, value FLOAT, timestamp TIMESTAMP)
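The schema above can be mirrored as in-process types for the SDK. A minimal sketch (the field names and the ExperimentStatus enum values follow the schema; the validate helper is an illustrative addition, not part of a fixed API):

```python
from dataclasses import dataclass, field
from enum import Enum

class ExperimentStatus(Enum):
    DRAFT = "draft"
    RUNNING = "running"
    PAUSED = "paused"
    CONCLUDED = "concluded"

@dataclass
class Variant:
    id: str
    name: str
    allocation: float        # fraction of enrolled users; allocations sum to 1.0
    is_control: bool = False

@dataclass
class Experiment:
    id: str
    name: str
    status: ExperimentStatus
    traffic_percent: float   # fraction of eligible users enrolled (0.0-1.0)
    variants: list[Variant] = field(default_factory=list)

    def validate(self) -> None:
        # Allocations must sum to 1.0 so cumulative bucketing covers all users
        total = sum(v.allocation for v in self.variants)
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"variant allocations sum to {total}, expected 1.0")
```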
Variant Assignment
Deterministic assignment: given (user_id, experiment_id), always return the same variant. No DB lookup required — use hash-based bucketing:
import mmh3  # MurmurHash3 bindings (pip install mmh3)

def assign_variant(user_id: str, experiment: Experiment) -> Variant | None:
    # Step 1: determine if the user falls in the experiment's traffic slice.
    # mmh3.hash returns a signed 32-bit int; Python's % always yields a
    # non-negative result, so the bucket is in [0, 9999].
    enrollment_hash = mmh3.hash(f'{user_id}:{experiment.id}:enrollment') % 10000
    if enrollment_hash >= experiment.traffic_percent * 10000:
        return None  # user not enrolled in this experiment
    # Step 2: assign the enrolled user to a variant by cumulative allocation
    variant_hash = mmh3.hash(f'{user_id}:{experiment.id}:variant') % 10000
    cumulative = 0.0
    for variant in experiment.variants:
        cumulative += variant.allocation * 10000
        if variant_hash < cumulative:
            return variant
    return experiment.variants[-1]  # guard against float rounding at the top boundary
MurmurHash3 is fast (sub-microsecond) and distributes uniformly. Two separate hash calls (enrollment + variant) ensure independence between “is in experiment” and “which variant” decisions. Cache experiment configs in local memory (updated every 30s from config store) — no DB or Redis lookup on the hot path.
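The local config cache can be sketched as a small class with a background refresh thread. The fetch_experiments callable here is a stand-in for whatever config-store client the platform uses; the 30-second default mirrors the refresh interval above:

```python
import threading

class ExperimentConfigCache:
    """Holds experiment configs in process memory; a background thread
    refreshes them periodically so the assignment hot path never
    touches the network."""

    def __init__(self, fetch_experiments, refresh_interval_s: float = 30.0):
        self._fetch = fetch_experiments      # callable returning {experiment_id: config}
        self._interval = refresh_interval_s
        self._lock = threading.Lock()
        self._experiments = self._fetch()    # initial synchronous load
        self._stop = threading.Event()
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        # wait() returns False on timeout (refresh) and True once close() is called
        while not self._stop.wait(self._interval):
            try:
                fresh = self._fetch()
            except Exception:
                continue                     # keep serving stale config on fetch errors
            with self._lock:
                self._experiments = fresh

    def running_experiments(self):
        with self._lock:
            return list(self._experiments.values())

    def close(self):
        self._stop.set()
```

Serving slightly stale config on a fetch failure is deliberate: a brief config outage should not stop variant assignment.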
Event Logging
On each page/API request, the SDK calls assign_variant for all running experiments the user is eligible for. Exposure events are logged asynchronously:
# SDK side (non-blocking): assignment and logging never block the request
variant = assign_variant(user_id, experiment)
if variant:
    event_queue.put({
        'type': 'exposure',
        'experiment_id': experiment.id,
        'variant_id': variant.id,
        'user_id': user_id,
        'timestamp': now(),
    })

# Background thread flushes the queue to Kafka every 500ms or every 1000 events
kafka.produce('experiment-events', batch)
Kafka consumers write events to a data warehouse (BigQuery, Snowflake) for analysis. Avoid writing exposures to the OLTP DB — the volume (100M users × N experiments) requires columnar storage for aggregation queries.
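The 500ms-or-1000-events flush policy above can be sketched as a small batching helper; the background thread calls it in a loop and hands each non-empty batch to the Kafka producer. This is a sketch of the policy, not a specific client library's API:

```python
import time
from queue import Queue, Empty

def drain_batch(event_queue: Queue, max_batch: int = 1000, max_wait_s: float = 0.5):
    """Collect up to max_batch events, or as many as arrive within
    max_wait_s, whichever limit is hit first."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break                       # time window exhausted
        try:
            batch.append(event_queue.get(timeout=timeout))
        except Empty:
            break                       # queue stayed empty for the rest of the window
    return batch

# Background thread (sketch):
#   while running:
#       batch = drain_batch(event_queue)
#       if batch:
#           kafka.produce('experiment-events', batch)
```

Batching trades a bounded amount of latency (at most 500ms) for far fewer produce calls, which is what keeps exposure logging cheap at 100M-user scale.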
Statistical Analysis
After collecting sufficient data, compute whether the difference between variants is statistically significant:
import statistics
import scipy.stats

def compute_results(experiment_id, metric):
    # Variant membership lives in ExposureEvent, so join it to label conversions
    control_data = warehouse.query(
        """SELECT c.value
           FROM ConversionEvent c
           JOIN ExposureEvent e
             ON e.user_id = c.user_id AND e.experiment_id = c.experiment_id
           JOIN Variant v ON v.variant_id = e.variant_id
           WHERE c.experiment_id = ? AND c.metric_name = ? AND v.is_control""",
        (experiment_id, metric))
    treatment_data = warehouse.query(
        """...same query with NOT v.is_control...""",
        (experiment_id, metric))
    # Welch's two-sample t-test for continuous metrics (no equal-variance assumption)
    t_stat, p_value = scipy.stats.ttest_ind(control_data, treatment_data, equal_var=False)
    control_mean = statistics.mean(control_data)
    treatment_mean = statistics.mean(treatment_data)
    relative_lift = (treatment_mean - control_mean) / control_mean
    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'relative_lift': relative_lift,   # e.g. +0.032 = +3.2%
        'p_value': p_value,
        'significant': p_value < 0.05,
    }
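For binary metrics such as checkout conversion, a two-proportion z-test is a better fit than a t-test on raw values. A stdlib-only sketch (production code would more likely reach for statsmodels' proportions_ztest):

```python
import math
from statistics import NormalDist

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for the difference between two conversion rates.
    conv_* are conversion counts, n_* are sample sizes per arm."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p_value

# Example: 10% vs 15% conversion with 1000 users per arm
z, p = two_proportion_ztest(100, 1000, 150, 1000)
```

With these inputs the lift is large relative to the standard error, so p falls well below 0.05; with near-identical rates (e.g. 100 vs 101 conversions) it does not.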
Feature Flags
Feature flags are a simplified variant of experiments: ship a feature to X% of users without statistical analysis. Implementation: same hash-based bucketing (user_id + flag_name), but no ConversionEvent tracking. Config: FeatureFlag(flag_id, name, enabled_percent, enabled_user_ids[], enabled_regions[]). Used for: gradual rollouts, kill switches (set enabled_percent=0 to disable), targeted access (specific user IDs for beta testing).
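A minimal flag-evaluation sketch. It uses hashlib's MD5 here purely for a dependency-free illustration; production would use the same MurmurHash3 bucketing as experiment assignment. The fields mirror the FeatureFlag config above (enabled_regions omitted for brevity):

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class FeatureFlag:
    name: str
    enabled_percent: float = 0.0              # 0.0-1.0; 0.0 acts as a kill switch
    enabled_user_ids: set[str] = field(default_factory=set)

def is_enabled(flag: FeatureFlag, user_id: str) -> bool:
    # Targeted access: the explicit allow-list wins regardless of percentage
    if user_id in flag.enabled_user_ids:
        return True
    # Deterministic percentage bucket, same idea as experiment enrollment
    digest = hashlib.md5(f'{user_id}:{flag.name}'.encode()).hexdigest()
    bucket = int(digest, 16) % 10000
    return bucket < flag.enabled_percent * 10000
```

Because the bucket is derived from (user_id, flag_name), raising enabled_percent from 10% to 20% keeps the original 10% enabled and adds new users, which is exactly the behavior a gradual rollout needs.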
Key Design Decisions
- Hash-based deterministic assignment — no DB lookup on the hot path, sub-millisecond decision
- In-memory experiment config cache — refreshed every 30s, eliminates per-request network call
- Async exposure logging via Kafka — decouples experiment decision from event persistence
- Data warehouse for analysis — columnar storage handles 100M+ events efficiently; OLTP DB cannot
- Two-sample t-test for significance — well understood, supported by scipy/statsmodels