Experiment Platform Low-Level Design: Multi-Armed Bandit, Holdout Groups, and Guardrail Metrics

Requirements and Constraints

A full-featured experiment platform extends simple A/B testing to support adaptive allocation (multi-armed bandits), holdout groups, guardrail metrics that can auto-pause experiments, and collision detection to prevent confounded results when the same user is exposed to conflicting experiments simultaneously. This is the system that sits behind large-scale product teams running hundreds of experiments per quarter.

Functional requirements include multi-armed bandit (MAB) allocation that adaptively routes more traffic to better-performing variants, persistent holdout groups that are excluded from all experiments for long-term baseline measurement, guardrail metric monitoring with automated experiment pausing, and an experiment collision graph that flags when two experiments modify overlapping product surfaces.

Scale Assumptions

  • 500 active experiments at peak; 50 of them MAB-based
  • Assignment service: 20,000 RPS
  • Holdout group: 5% of users, globally excluded
  • Bandit update cycle: every 10 minutes

Core Data Model

CREATE TABLE experiments (
  id                UUID PRIMARY KEY,
  name              VARCHAR(256) UNIQUE NOT NULL,
  experiment_type   VARCHAR(32) NOT NULL,  -- ab, bandit, holdout
  status            VARCHAR(32) NOT NULL,
  allocation_salt   VARCHAR(128) NOT NULL,
  layers            TEXT[],               -- layer names for collision detection
  traffic_pct       SMALLINT NOT NULL,
  bandit_config     JSONB,                -- epsilon, decay, prior params for MAB
  created_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Current bandit arm weights, updated by bandit job
CREATE TABLE bandit_arm_weights (
  experiment_id     UUID NOT NULL REFERENCES experiments(id),
  variant_id        UUID NOT NULL,
  weight            DOUBLE PRECISION NOT NULL,
  updated_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
  sample_size       BIGINT NOT NULL,
  reward_sum        DOUBLE PRECISION NOT NULL,
  PRIMARY KEY (experiment_id, variant_id)
);

-- Holdout group membership (stable, computed once at user creation or first assignment)
CREATE TABLE holdout_members (
  user_id           VARCHAR(256) PRIMARY KEY,
  assigned_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Guardrail monitoring results
CREATE TABLE guardrail_alerts (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  experiment_id     UUID NOT NULL,
  metric_name       VARCHAR(128) NOT NULL,
  variant_id        UUID NOT NULL,
  detected_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
  p_value           DOUBLE PRECISION,
  relative_change   DOUBLE PRECISION,
  action_taken      VARCHAR(64) NOT NULL  -- paused, alerted, suppressed
);

-- Experiment collision graph: which layers each experiment occupies
CREATE TABLE experiment_layers (
  experiment_id     UUID NOT NULL,
  layer_name        VARCHAR(128) NOT NULL,
  surface           VARCHAR(256) NOT NULL,  -- e.g. checkout_page, email_subject
  PRIMARY KEY (experiment_id, layer_name, surface)
);

Multi-Armed Bandit Allocation

MAB experiments use Thompson Sampling with a Beta prior for binary reward metrics (conversion) or a Normal-Gamma prior for continuous rewards (revenue). The bandit job runs every 10 minutes:

  1. For each MAB experiment, fetch cumulative reward_sum and sample_size per variant from bandit_arm_weights.
  2. Update Beta distribution parameters: alpha = 1 + conversions, beta = 1 + (sample_size - conversions).
  3. Sample 10,000 random draws from each arm's posterior. The new weight for arm i is the fraction of draws where arm i had the highest sampled value.
  4. Write updated weights back to bandit_arm_weights atomically using a single UPDATE with optimistic concurrency check on updated_at.
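Steps 2 and 3 above can be sketched in a few lines using only the standard library. This is a minimal illustration of the Beta-Bernoulli Thompson Sampling weight computation; the function name and signature are illustrative, not part of the platform's actual code:

```python
import random

def thompson_weights(arms, n_draws=10_000, seed=None):
    """Compute bandit arm weights via Thompson Sampling with a
    Beta-Bernoulli model.

    arms: list of (conversions, sample_size) tuples, one per variant.
    Returns the fraction of posterior draws in which each arm wins.
    """
    rng = random.Random(seed)
    wins = [0] * len(arms)
    for _ in range(n_draws):
        # Draw one sample from each arm's Beta(1 + successes, 1 + failures)
        # posterior and credit the arm with the highest draw.
        draws = [
            rng.betavariate(1 + conv, 1 + (n - conv))
            for conv, n in arms
        ]
        wins[max(range(len(arms)), key=draws.__getitem__)] += 1
    return [w / n_draws for w in wins]
```

With two arms at 120/1000 and 150/1000 conversions, the second arm's posterior dominates, so nearly all of the weight shifts to it.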

The assignment service refreshes bandit weights from the config cache every 60 seconds (slightly lagged relative to the update cycle). Assignment uses the current weight vector: hash the user into [0, 1) and walk the cumulative weight distribution to select a variant. For a fixed weight vector this is deterministic per user, but because weights change between refresh windows, a user's variant can shift between sessions. Platforms that need session-level stickiness pin the assignment for the duration of the session.
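The hash-and-walk selection described above might look like the following. SHA-256 and the `salt:user_id` key format are illustrative choices, assuming any stable hash with a uniform distribution:

```python
import hashlib

def assign_variant(user_id, salt, variants, weights):
    """Deterministically map a user into [0, 1) and walk the
    cumulative weight distribution to pick a variant.

    variants and weights are parallel lists; weights sum to 1.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    point = int(digest[:15], 16) / 16**15  # uniform in [0, 1)
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if point < cumulative:
            return variant
    return variants[-1]  # guard against floating-point rounding
```

Because the hash depends only on the salt and user ID, the same user always lands on the same point; only the weight vector moves under them.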

Holdout Groups

Holdout groups are computed by hashing user IDs against a global holdout salt and keeping a fixed percentage (5%) in the holdout. Holdout membership is checked before any experiment assignment. Holdout users are tracked in holdout_members for explicit querying, but the hash check is the authoritative source (membership can be recomputed from user ID alone).
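A minimal sketch of the authoritative hash check (the salt value and the modulus-100 bucketing are assumptions for illustration):

```python
import hashlib

HOLDOUT_SALT = "global-holdout-v1"  # hypothetical global salt
HOLDOUT_PCT = 5                     # 5% of users held out

def in_holdout(user_id: str) -> bool:
    """Recompute holdout membership from the user ID alone;
    no database lookup is needed on the serving path."""
    digest = hashlib.sha256(f"{HOLDOUT_SALT}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < HOLDOUT_PCT
```

Over a large user population, roughly 5% of IDs fall below the bucket threshold, and any given user's membership never changes as long as the salt is stable.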

Holdout groups serve two purposes. First, they provide a long-term holdback baseline that is unaffected by any experiments, enabling measurement of the cumulative effect of shipped features over months. Second, because holdout users are excluded before any assignment, experiment analyses and power calculations run only over the eligible population, rather than being diluted by users who can never receive a treatment.

Guardrail Metrics and Auto-Pausing

Each experiment can designate metrics as guardrails (e.g., checkout error rate, p99 latency, app crash rate). The guardrail monitor runs every 5 minutes and applies a sequential test (mSPRT) that controls the Type I error rate under continuous monitoring. If a guardrail metric shows a statistically significant degradation in any treatment variant:

  • The experiment status is set to paused atomically in the config store.
  • The assignment service's next cache refresh picks up the paused status and routes all traffic to control.
  • A guardrail_alerts record is inserted and an on-call alert is fired via PagerDuty.
  • The pause is logged in the experiment audit trail with the triggering metric and p-value.

Auto-pause is opt-in per experiment and per guardrail metric, with configurable minimum sample size thresholds to prevent false positives in the first hours of a launch.
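One common formulation of the mSPRT uses a normal approximation for the metric difference with a normal mixing prior over the effect size; the experiment pauses when the statistic crosses 1/alpha. The sketch below follows that formulation under those assumptions; `tau_sq` is a tuning parameter (the mixing-prior variance), and the function names, defaults, and thresholds are illustrative:

```python
import math

def msprt_statistic(mean_diff, n, sigma_sq, tau_sq=1e-4):
    """Mixture sequential probability ratio test statistic for a
    difference in means (normal approximation, normal mixing prior).

    mean_diff: observed treatment-minus-control difference.
    n: sample size; sigma_sq: variance of the per-observation difference.
    """
    denom = sigma_sq + n * tau_sq
    return math.sqrt(sigma_sq / denom) * math.exp(
        (n * n * tau_sq * mean_diff ** 2) / (2 * sigma_sq * denom)
    )

def guardrail_breached(mean_diff, n, sigma_sq, alpha=0.01, min_n=1000):
    """Pause when the mSPRT statistic crosses 1/alpha, but never
    before the minimum sample size threshold is met."""
    if n < min_n:
        return False
    return msprt_statistic(mean_diff, n, sigma_sq) >= 1.0 / alpha
```

The minimum-sample-size gate mirrors the opt-in threshold above: even a large early swing cannot pause an experiment in its first hours.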

Experiment Collision Detection

Two experiments collide if they both modify the same product surface and a user can be assigned to both simultaneously, potentially creating an interaction effect that confounds both results. The collision detector runs at experiment creation and on any layer assignment change:

  • Fetch all running experiments whose layers array intersects with the new experiment's layers.
  • For each overlapping experiment, compute the expected fraction of users exposed to both: P(A) * P(B) where P is the traffic percentage divided by 100. If this fraction exceeds a threshold (e.g., 1%), flag a collision warning.
  • Hard conflicts (same layer, same surface, same user segment) block experiment launch and require explicit override with reviewer approval.

Mutual exclusion layers (disjoint hash buckets) can be configured for experiments that must not overlap, ensuring no user is in more than one experiment within a layer by partitioning the hash space.
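Partitioning the hash space within a layer can be sketched as follows. Each experiment in the layer reserves a disjoint bucket range; because every user hashes to exactly one bucket per layer, no user can be eligible for two experiments in the same layer. The bucket count and salt format are illustrative:

```python
import hashlib

def layer_bucket(user_id, layer_salt, n_buckets=1000):
    """Hash a user into one of n_buckets buckets within a layer.
    Each layer uses its own salt, so layers randomize independently."""
    digest = hashlib.sha256(f"{layer_salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def eligible(user_id, layer_salt, bucket_range):
    """A user is eligible for an experiment only if their layer bucket
    falls inside the experiment's reserved [lo, hi) range."""
    lo, hi = bucket_range
    return lo <= layer_bucket(user_id, layer_salt) < hi
```

With ranges (0, 500) and (500, 1000) assigned to two experiments, every user qualifies for exactly one of them, which is the mutual-exclusion guarantee.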

API Design

POST /v1/experiments                    -- create (type: ab | bandit | holdout)
GET  /v1/experiments/{id}/collisions    -- check for collisions before launch
POST /v1/experiments/{id}/launch        -- validate and set status to running
GET  /v1/bandit/{id}/weights            -- current arm weights and posterior params
GET  /v1/holdout/check?user_id=         -- is this user in the holdout?
GET  /v1/guardrails/{experiment_id}     -- current guardrail status and alerts

Scalability Considerations

  • Bandit update throughput: The bandit job processes 50 MAB experiments in parallel, each requiring one read and one write to the weights table. At 10-minute intervals the database load is trivial.
  • Holdout at scale: Hash-based holdout requires zero database lookups on the assignment path. The holdout_members table exists for analytics, not for serving.
  • Config store: Experiment configurations (status, weights, variants) are stored in a Redis cluster that the assignment service treats as the source of truth for serving. PostgreSQL is the durable store; changes are written to both with a write-through policy.
  • Collision graph scalability: With 500 active experiments, the collision check scans at most 500 rows indexed by layer name — a trivially fast query. The graph is recomputed only on experiment state changes, not on every assignment.

