What Is an Experiment Framework?
An Experiment Framework is the foundational infrastructure layer that enables product teams to run controlled experiments (A/B tests, multivariate tests, holdouts, and feature rollouts) at scale across an organization. Unlike a single A/B testing tool, a framework is a platform that many teams share simultaneously, enforcing consistency in randomization, metric ownership, and statistical rigor. It is the backbone of data-driven product development at companies like Google, Meta, and Uber.
Data Model
-- Namespaces prevent experiment collisions
CREATE TABLE namespaces (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  `key` VARCHAR(128) UNIQUE, -- backticks: KEY is a reserved word in MySQL
  description TEXT,
  total_slots INT DEFAULT 1000 -- slots available for experiments
);

-- Experiments claim slots within a namespace
CREATE TABLE experiments (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  namespace_id BIGINT REFERENCES namespaces(id),
  `key` VARCHAR(128) UNIQUE,
  type ENUM('ab','multivariate','holdout','rollout'),
  slot_start INT,
  slot_end INT,
  owner_team VARCHAR(128),
  status ENUM('draft','running','stopped','archived'),
  created_at TIMESTAMP
);

-- Variants within experiments
CREATE TABLE variants (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  experiment_id BIGINT REFERENCES experiments(id),
  `key` VARCHAR(64),
  allocation FLOAT, -- fraction of experiment traffic
  config JSON
);

-- Metrics registered by teams
CREATE TABLE metrics (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  `key` VARCHAR(128) UNIQUE,
  name VARCHAR(256),
  type ENUM('conversion','mean','ratio','percentile'),
  event_type VARCHAR(128),
  aggregation VARCHAR(32) -- sum, count, avg, p95
);

-- Experiment-metric bindings (primary + guardrail metrics)
CREATE TABLE experiment_metrics (
  experiment_id BIGINT,
  metric_id BIGINT,
  role ENUM('primary','guardrail','informational'),
  PRIMARY KEY (experiment_id, metric_id)
);
Core Algorithm: Namespace-Based Randomization
The framework must support many simultaneous experiments without their assignments interfering with each other. The namespace model solves this:
- Namespace slot assignment: hash the entity ID within a namespace to one of N slots (e.g., 1000). Each experiment claims a contiguous range of slots.
- Experiment lookup: find which experiment owns the entity’s slot. If none, the entity is in the holdout or unallocated pool.
- Variant assignment: within the matched experiment, apply a second hash (using the experiment key as a salt) to assign the entity to a variant by allocation weights.
- Layer stacking: different namespaces represent independent layers (e.g., UI layer, ranking layer, pricing layer). An entity participates in one experiment per layer simultaneously, and cross-layer independence is guaranteed because salts differ.
Pseudocode for a single layer lookup:
slot = hash(namespace_key + entity_id) mod total_slots
experiment = find_experiment_by_slot(namespace_id, slot)
if experiment is None: return default_config
variant_bucket = hash(experiment.key + entity_id) mod 100
variant = assign_by_cumulative_weight(experiment.variants, variant_bucket)
return variant.config
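The pseudocode above can be made concrete in Python. This is a minimal sketch, assuming SHA-256 as the deterministic hash (Python's built-in `hash()` is salted per process, so it is unsuitable for bucketing) and a plain list of experiment dicts; the `stable_hash` and `assign` names are illustrative, not a real SDK API:

```python
import hashlib

def stable_hash(s: str) -> int:
    """Deterministic 64-bit hash, stable across processes and machines."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def assign(namespace_key, entity_id, experiments, total_slots=1000, default_config=None):
    """experiments: list of dicts with 'key', 'slot_start', 'slot_end', 'variants'.
    Each variant is {'key', 'allocation', 'config'}; allocations sum to 1.0."""
    # First hash: map the entity to a slot within the namespace.
    slot = stable_hash(f"{namespace_key}:{entity_id}") % total_slots
    exp = next((e for e in experiments
                if e["slot_start"] <= slot <= e["slot_end"]), None)
    if exp is None:
        return default_config  # unallocated or holdout pool
    # Second hash, salted with the experiment key, for variant assignment.
    bucket = stable_hash(f"{exp['key']}:{entity_id}") % 100
    cumulative = 0.0
    for v in exp["variants"]:
        cumulative += v["allocation"] * 100
        if bucket < cumulative:
            return v["config"]
    return default_config  # guard against floating-point rounding gaps
```

Because both hashes are deterministic, an entity always lands in the same slot and variant, which is what makes SDK-local evaluation possible.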
Metric Collection and Analysis Pipeline
The framework ingests raw events from application services and joins them with assignment data to produce per-variant metric aggregates:
- Event ingestion: application services emit events (clicks, conversions, latency samples) to a Kafka topic tagged with entity ID and timestamp.
- Assignment join: a streaming job (Flink or Spark Structured Streaming) looks up the entity’s experiment assignment at event time and enriches each event record.
- Aggregation: enriched events land in a columnar store (ClickHouse, BigQuery, Druid). Scheduled jobs compute per-variant metric values, standard errors, and p-values or Bayesian posteriors.
- Guardrail alerts: if a guardrail metric (e.g., error rate, p99 latency) degrades beyond a threshold, the framework triggers an alert and can optionally auto-stop the experiment.
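For the aggregation step, a minimal illustration of how a scheduled job might turn per-variant conversion counts into a p-value, using a pooled two-proportion z-test (one common choice among several; the function name is illustrative):

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between two variants.
    conv_*: conversion counts; n_*: exposed-entity counts per variant."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

A real pipeline would also apply variance-reduction techniques (e.g., CUPED) and multiple-comparison corrections, but the core computation per metric is this small.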
Failure Handling and Performance
- SDK-local evaluation: all assignment logic runs in-process using a cached config snapshot. No network call is on the critical path.
- Config propagation SLA: target < 30 seconds from a flag/experiment change to all SDK instances seeing the update. Use SSE push + local polling fallback.
- Idempotent event delivery: event producers include a UUID; the ingestion layer deduplicates on write to prevent inflated metric counts during retries.
- Experiment collision detection: the framework UI warns if two experiments targeting the same population would compete for slots, preventing unintentional under-allocation.
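The collision check in the last bullet reduces to interval-overlap detection over the slot ranges claimed in a namespace. A minimal sketch, assuming experiments are `(key, slot_start, slot_end)` tuples (the function name is illustrative):

```python
def find_slot_collisions(experiments):
    """experiments: list of (key, slot_start, slot_end) tuples in one namespace.
    Returns (key, key) pairs whose inclusive slot ranges overlap."""
    ordered = sorted(experiments, key=lambda e: e[1])
    collisions = []
    active_key, active_end = None, -1  # range with the furthest end seen so far
    for key, start, end in ordered:
        if start <= active_end:
            collisions.append((active_key, key))
        if end > active_end:
            active_key, active_end = key, end
    return collisions
```

Tracking the furthest end seen (rather than only the previous range) also catches a small range nested inside a larger one.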
Scalability Considerations
- Config serving: publish experiment configs as versioned JSON bundles to a CDN. SDKs download diffs rather than full snapshots to minimize bandwidth.
- Metric store partitioning: partition by experiment ID and date. Retention policies archive raw events after 90 days while preserving aggregates indefinitely.
- Self-service and governance: at scale, hundreds of experiments run simultaneously. The framework must provide a UI for experiment creation, a review workflow for statistical setup, and automated checks (minimum detectable effect, required sample size) before an experiment goes live.
- Holdout groups: reserve a global holdout (e.g., 1–2% of users who see no experiments) to measure the cumulative effect of all shipped features over time.
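The automated pre-launch checks mentioned above (minimum detectable effect, required sample size) can be sketched with the standard two-proportion sample-size approximation; the function name and default parameters are illustrative:

```python
import math
from statistics import NormalDist

def required_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """Approximate per-variant sample size to detect an absolute lift of `mde`
    over `baseline_rate` with a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.8
    p1, p2 = baseline_rate, baseline_rate + mde
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(variance * ((z_alpha + z_beta) / mde) ** 2)
```

A governance check would compare this number against the traffic an experiment's claimed slots can deliver in its planned duration, and block launch if the experiment is underpowered.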
Summary
An Experiment Framework is a multi-tenant experimentation infrastructure that enforces statistical correctness, prevents cross-experiment contamination, and scales to support an entire engineering organization. The key design decisions are namespace-based slot allocation for isolation, SDK-local evaluation for performance, and a streaming pipeline for metric enrichment. In interviews, emphasize the difference between a one-off A/B test and a shared framework: the latter requires governance, collision prevention, guardrail automation, and a robust config delivery system.
Frequently Asked Questions

How does an experiment framework differ from a basic A/B testing platform?
An experiment framework is a generalized infrastructure layer that supports multiple experiment types (A/B tests, multivariate tests, interleaving experiments, and quasi-experiments) under a unified assignment and analysis API. It provides extensible hooks for custom metrics, allocation strategies, and statistical methods rather than hardcoding a single paradigm. Companies like Google and Netflix build experiment frameworks so that every product team can run rigorous experiments without rebuilding the scaffolding from scratch.

How would you design the assignment and allocation subsystem of an experiment framework?
The allocation subsystem maps an entity (user, session, request) to a variant by hashing the entity ID and experiment ID together, then mapping the result into the defined traffic split. It must support heterogeneous allocation units: user-level for personalization experiments, cookie-level for logged-out traffic, and request-level for latency experiments. The subsystem also manages the experiment namespace to prevent over-allocation when the sum of all active experiment traffic exceeds 100% of users.

What logging and observability requirements does an experiment framework impose?
Every assignment must be logged with the entity ID, experiment ID, variant, timestamp, and any context attributes used in targeting so that the analysis layer can reconstruct the exposed population exactly. The framework should emit structured assignment events to a durable stream (Kafka or Pub/Sub) and guarantee at-least-once delivery to avoid data loss that would bias results. Observability dashboards should show assignment counts, traffic percentages, and data-pipeline lag so experiment owners can detect instrumentation issues early.

How do you design an experiment framework to scale to thousands of concurrent experiments?
At scale, the experiment configuration is compiled into a compact, versioned binary blob (e.g., a serialized proto) that is pushed to edge caches and SDK clients, making assignment evaluation a pure in-process computation with no network calls. The control plane uses a layered namespace model where experiments are grouped into layers with mutual exclusion within a layer but independence across layers, allowing thousands of concurrent experiments to coexist without interference. The analysis pipeline is partitioned by experiment ID and runs on a distributed query engine so that growth in experiment count does not increase per-experiment analysis latency.