How does champion-challenger traffic splitting work in model serving?

The routing proxy reads the ServingPolicy record for the requested model. It routes a percentage of traffic (challenger_weight, e.g. 10%) to the challenger version and the remainder to the champion. Routing is typically deterministic per entity_id using a hash bucket so the same user consistently hits the same model version during an experiment, reducing noise in outcome metrics.

What is shadow scoring and when should you use it?

In shadow mode the routing proxy sends every request to the champion model (whose prediction is returned to the caller) and simultaneously to a shadow model (whose prediction is logged but discarded). No user is affected by the shadow model's output. Use shadow scoring to validate a new model's prediction distribution, latency profile, and error rate under real production traffic before exposing any users to it.

What TTL should be used for inference caching?

For fully deterministic models (same input always produces same output), a TTL of minutes to hours is safe — bounded by how quickly the underlying model is updated. For models with time-sensitive features (e.g., real-time user context), TTL should be seconds or caching should be avoided entirely. Always key the cache on (model_version_id, hash(input)) so a model swap automatically invalidates stale entries.

How is prediction drift detected in a serving system?

Prediction drift detection compares the distribution of recent predictions (from PredictionLog) against the distribution observed at training evaluation time. Metrics include mean prediction shift, positive-rate change for classifiers, and score histogram divergence. Automated alerts fire when drift exceeds a threshold, signaling concept drift, upstream feature degradation, or a bug in the serving pipeline.

How does the champion-challenger traffic split work?

ServingPolicy stores champion_version and challenger_version with a challenger_weight percentage; the routing proxy hashes the entity_id and routes requests below the weight threshold to the challenger, ensuring the same user always hits the same model.

How does shadow scoring work without affecting users?

The champion model scores the request and returns the result to the user; simultaneously, the shadow model scores the same input asynchronously (fire-and-forget), and its output is never returned to the caller; both predictions are logged to PredictionLog for offline comparison.

How does inference caching reduce model server load?

The cache key is (model_version_id + sha256(serialized_input)); a cache hit returns the stored prediction immediately; only deterministic models (no randomness) are eligible for caching.

How is prediction drift detected in production?

PredictionMetric records rolling distribution statistics (mean prediction, class distribution) computed from recent PredictionLog entries; deviations beyond a threshold from the training baseline trigger alerts.

ML Model Serving System Low-Level Design: Versioned Deployment, A/B Testing, Shadow Mode, and Monitoring


Why Model Serving Is Not Trivial

Deploying a model to production is an engineering discipline separate from training one. A serving system must handle versioned artifacts, route traffic across model variants, validate inputs and outputs, cache predictions efficiently, and detect when a deployed model starts behaving differently from its training baseline — all while keeping latency in the low tens of milliseconds. This post designs those components at the low level.

Model Registry

The registry is the catalog of all trained model artifacts. Each ModelVersion record stores:

artifact_path: S3 URI to the serialized model (TorchScript, SavedModel, ONNX, or joblib-serialized sklearn pipeline).
framework: pytorch, tensorflow, sklearn, xgboost.
input_schema: JSONB describing expected feature names, dtypes, and shapes — used for input validation at serving time.
metrics: JSONB with evaluation metrics (AUC, RMSE, precision@k) recorded at training time — the baseline for drift comparison.
status: staging | champion | challenger | shadow | retired.
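
A minimal sketch of how input_schema might drive validation at serving time. The schema layout used here (a flat mapping from feature name to a dtype string) and the function name validate_input are assumptions for illustration; the post only states that the schema describes feature names, dtypes, and shapes, and shape checks are left out.

```python
# Hypothetical schema layout: {"feature_name": "dtype"}; the real JSONB structure may differ.
_DTYPES = {"float": (int, float), "int": (int,), "str": (str,), "bool": (bool,)}

def validate_input(input_data: dict, input_schema: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the input is valid."""
    errors = []
    for name, dtype in input_schema.items():
        if name not in input_data:
            errors.append(f"missing feature: {name}")
            continue
        value = input_data[name]
        # bool is a subclass of int in Python, so reject it explicitly for non-bool dtypes.
        if isinstance(value, bool) and dtype != "bool":
            errors.append(f"{name}: expected {dtype}, got bool")
        elif not isinstance(value, _DTYPES[dtype]):
            errors.append(f"{name}: expected {dtype}, got {type(value).__name__}")
    for name in input_data:
        if name not in input_schema:
            errors.append(f"unexpected feature: {name}")
    return errors
```

Returning a list of errors rather than raising on the first one lets the proxy log every problem with a rejected request in a single pass.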

Serving Architecture

The serving stack has three layers:

Model server: TorchServe or TF Serving instances that load model artifacts from S3 on startup and expose a gRPC or HTTP inference endpoint. Each instance serves one model version.
Routing proxy: A lightweight service (Go or Python FastAPI) that reads the ServingPolicy table, decides which model version(s) receive the request, fans out calls, and returns the champion result to the caller.
Feature fetch layer: Before calling the model server, the proxy fetches online features from the feature store using the entity_id, assembles the input tensor, and validates it against the input schema.

Versioned Deployment Strategies

Blue-Green Swap

The current champion is blue. A new version is deployed as green (status=staging). Integration tests run against green. When tests pass, the ServingPolicy record is updated atomically: champion_version = green_id. All new requests go to green. Blue is kept alive for 30 minutes to handle in-flight requests, then retired.

Canary Deployment

Set challenger_version = new_id and challenger_weight = 5 (5% of traffic). Monitor latency, error rate, and prediction distribution. Ramp challenger_weight to 50, then 100. If metrics degrade at any step, set challenger_weight back to 0 to roll back.
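
The ramp can be sketched as a small control loop. This is an illustration under assumptions: set_weight and is_healthy are hypothetical callbacks (one persists challenger_weight to ServingPolicy, the other checks latency, error rate, and prediction distribution), and the soak time a real ramp would wait between steps is omitted.

```python
from typing import Callable

RAMP_STEPS = [5, 50, 100]  # challenger_weight percentages from the canary description

def run_canary(set_weight: Callable[[int], None],
               is_healthy: Callable[[int], bool]) -> bool:
    """Ramp challenger_weight step by step; reset it to 0 on the first unhealthy step.

    Returns True if every step passed the health check, False otherwise.
    """
    for weight in RAMP_STEPS:
        set_weight(weight)
        if not is_healthy(weight):
            set_weight(0)  # stop sending traffic to the challenger
            return False
    return True
```

For example, with a health check that fails at 50%, the recorded weights would be 5, 50, then 0.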

Rollback

Keep the previous champion record. On metric degradation — latency p99 spike, error rate increase, or prediction drift alert — the routing proxy flips champion_version back to the previous ID. Rollback is a database write; no artifact redeployment needed.
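
A minimal sketch of promotion and rollback as pure functions over a policy record. The ServingPolicy schema in this post has no column for the previous champion, so the previous_champion field below is a hypothetical addition; in a real system each function would correspond to a single UPDATE on the ServingPolicy row.

```python
def promote(policy: dict, new_version_id: int) -> dict:
    """Return a new policy with new_version_id as champion, remembering the old one."""
    return {
        **policy,
        "previous_champion": policy["champion_version"],  # hypothetical field
        "champion_version": new_version_id,
    }

def rollback(policy: dict) -> dict:
    """Return a new policy with the previous champion restored."""
    previous = policy.get("previous_champion")
    if previous is None:
        raise ValueError("no previous champion recorded; cannot roll back")
    return {**policy, "champion_version": previous, "previous_champion": None}
```

Because both functions only rewrite identifiers, rollback stays a metadata change as long as the previous champion's model server instances are still running.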

A/B Testing

Champion-challenger routing supports controlled experiments. The routing proxy hashes entity_id + model_name to assign users deterministically to a bucket 0-99. Users in buckets below challenger_weight receive predictions from the challenger; others receive champion predictions. Both paths write to PredictionLog with their model_version_id. Outcome events (conversions, ratings, downstream labels) are joined on entity_id to measure model impact.
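
The outcome join can be sketched as follows. This is an illustration only: the function name, the shape of the input rows, and the use of a set of converted entity IDs are assumptions, and a real analysis would read these from PredictionLog and an outcome event table.

```python
from collections import defaultdict

def conversion_rate_by_version(prediction_rows, converted_entities: set) -> dict:
    """Compute the fraction of distinct entities per model version that converted.

    prediction_rows: iterable of (model_version_id, entity_id) pairs from PredictionLog.
    converted_entities: set of entity_ids that have a positive outcome event.
    """
    entities = defaultdict(set)
    for version_id, entity_id in prediction_rows:
        entities[version_id].add(entity_id)  # de-duplicate repeat requests per entity
    return {
        version_id: len(ids & converted_entities) / len(ids)
        for version_id, ids in entities.items()
    }
```

Counting distinct entities rather than requests matters here because assignment is per entity_id; a heavy user would otherwise dominate the rate for their arm. A difference in rates still needs a significance test before it drives a promotion decision.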

Shadow Mode

Shadow mode evaluates a new model against real traffic without affecting users:

Routing proxy receives a request.
Sends request to champion synchronously — awaits response.
Sends the same request to the shadow model asynchronously (fire and forget, with a short timeout).
Returns champion prediction to caller.
Logs both champion and shadow predictions to PredictionLog with respective model_version_ids.

Offline analysis compares champion vs. shadow prediction distributions, latency distributions, and error rates before any live traffic is shifted.
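
One way the offline comparison might look for scalar scores. This sketch assumes champion and shadow rows for the same request can be paired by a shared key such as input_hash, and that predictions reduce to a single float; compare_shadow and the default tolerance are hypothetical.

```python
def compare_shadow(champion_scores: dict, shadow_scores: dict,
                   tolerance: float = 0.05) -> dict:
    """Compare scalar predictions paired by a shared request key.

    Both arguments map a request key (for example input_hash) to a float score.
    Returns the number of paired requests, the mean absolute difference, and the
    fraction of pairs whose scores agree within tolerance.
    """
    keys = champion_scores.keys() & shadow_scores.keys()
    if not keys:
        return {"pairs": 0, "mean_abs_diff": None, "agreement": None}
    diffs = [abs(champion_scores[k] - shadow_scores[k]) for k in keys]
    return {
        "pairs": len(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "agreement": sum(d <= tolerance for d in diffs) / len(diffs),
    }
```

Requests that appear on only one side (for example, because the shadow call timed out) are ignored here; the size of that gap is itself a useful signal about shadow error rate.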

Inference Caching

For deterministic models and inputs that repeat frequently (e.g., pricing a fixed catalog of items), Redis caching lets repeated requests skip the model server call entirely:

Cache key: infer:{model_version_id}:{sha256(input_json)}
TTL: set based on model update frequency and feature freshness requirements.
On model version change, the new model_version_id in the key automatically bypasses stale cache entries without explicit invalidation.
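
The key scheme above can be demonstrated without a Redis server. The TTLCache class below is a hypothetical in-memory stand-in for Redis get/set with expiry, used only so the behavior can be exercised locally; cache_key follows the infer:{model_version_id}:{sha256(input_json)} format from the list.

```python
import hashlib
import json
import time

class TTLCache:
    """Tiny in-memory stand-in for Redis SET with an expiry, plus GET."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._store[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

def cache_key(model_version_id: int, input_data: dict) -> str:
    """Build infer:{model_version_id}:{sha256(input_json)} with stable key ordering."""
    payload = json.dumps(input_data, sort_keys=True).encode()
    return f"infer:{model_version_id}:{hashlib.sha256(payload).hexdigest()}"
```

Serializing with sort_keys=True means two requests with the same features in a different order map to the same key, and a lookup under a new model_version_id misses even while the old entry is still within its TTL.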

SQL Schema

CREATE TABLE ModelVersion (
    id              BIGSERIAL PRIMARY KEY,
    model_name      VARCHAR(255) NOT NULL,
    version         VARCHAR(100) NOT NULL,
    artifact_path   TEXT         NOT NULL,  -- s3://bucket/path/model.pt
    framework       VARCHAR(50)  NOT NULL,
    input_schema    JSONB        NOT NULL,
    metrics         JSONB,
    status          VARCHAR(50)  NOT NULL DEFAULT 'staging',
    deployed_at     TIMESTAMPTZ,
    created_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    UNIQUE (model_name, version)
);

CREATE TABLE ServingPolicy (
    model_name          VARCHAR(255) PRIMARY KEY,
    champion_version    BIGINT NOT NULL REFERENCES ModelVersion(id),
    challenger_version  BIGINT REFERENCES ModelVersion(id),
    challenger_weight   INT    NOT NULL DEFAULT 0,  -- percent 0-100
    shadow_version      BIGINT REFERENCES ModelVersion(id)
);

CREATE TABLE PredictionLog (
    id               BIGSERIAL PRIMARY KEY,
    model_version_id BIGINT       NOT NULL REFERENCES ModelVersion(id),
    entity_id        VARCHAR(255) NOT NULL,
    input_hash       CHAR(64)     NOT NULL,  -- sha256 of input JSON
    prediction       JSONB        NOT NULL,
    latency_ms       INT          NOT NULL,
    predicted_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);
CREATE INDEX ON PredictionLog (model_version_id, predicted_at DESC);

CREATE TABLE PredictionMetric (
    model_version_id BIGINT      NOT NULL REFERENCES ModelVersion(id),
    metric_name      VARCHAR(100) NOT NULL,
    value            FLOAT        NOT NULL,
    computed_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (model_version_id, metric_name, computed_at)
);

Python Interface

import hashlib, json, time, threading
import redis
import requests

r = redis.Redis(host="redis-serving", port=6379, decode_responses=True)

def predict(model_name: str, entity_id: str, input_data: dict) -> dict:
    """Main entry point: route request, check cache, call model server."""
    version_id, endpoint = route_request(model_name, entity_id)
    cache_key = f"infer:{version_id}:{hashlib.sha256(json.dumps(input_data, sort_keys=True).encode()).hexdigest()}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)
    t0 = time.time()
    resp = requests.post(endpoint + "/predictions", json=input_data, timeout=0.5)
    resp.raise_for_status()
    prediction = resp.json()
    latency_ms = int((time.time() - t0) * 1000)
    r.set(cache_key, json.dumps(prediction), ex=300)
    log_prediction(version_id, entity_id, cache_key.split(":")[-1], prediction, latency_ms)
    return prediction

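def route_request(model_name: str, entity_id: str) -> tuple[int, str]:
    """Determine which model version serves this entity_id."""
    # _fetch_policy and _endpoint are helpers not shown here: the first is assumed to
    # load the ServingPolicy row as a dict, the second to map a version id to a URL.
    policy = _fetch_policy(model_name)
    if policy.get("challenger_version") and policy.get("challenger_weight", 0) > 0:
        # Stable bucket in 0-99: the same entity_id always lands in the same bucket.
        bucket = int(hashlib.md5(f"{model_name}:{entity_id}".encode()).hexdigest(), 16) % 100
        if bucket < policy["challenger_weight"]:
            return policy["challenger_version"], _endpoint(policy["challenger_version"])
    return policy["champion_version"], _endpoint(policy["champion_version"])

def log_prediction(model_version_id: int, entity_id: str, input_hash: str, prediction: dict, latency_ms: int) -> None:
    # INSERT INTO PredictionLog (model_version_id, entity_id, input_hash, prediction, latency_ms)
    pass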

def detect_prediction_drift(model_version_id: int, recent_predictions: list[float], baseline_mean: float, baseline_std: float) -> bool:
    """Flag drift if recent prediction mean deviates more than 3 sigma from training baseline."""
    if not recent_predictions:
        return False
    import statistics
    sample_mean = statistics.mean(recent_predictions)
    if baseline_std == 0:
        return sample_mean != baseline_mean
    return abs(sample_mean - baseline_mean) / baseline_std > 3.0

def shadow_score(model_name: str, entity_id: str, input_data: dict) -> None:
    """Fire-and-forget shadow scoring; called from routing proxy."""
    policy = _fetch_policy(model_name)
    if not policy.get("shadow_version"):
        return
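    def _score():
        try:
            endpoint = _endpoint(policy["shadow_version"])
            # Same hash as the champion path, so rows can be paired offline
            # and the CHAR(64) NOT NULL column gets a real value.
            input_hash = hashlib.sha256(json.dumps(input_data, sort_keys=True).encode()).hexdigest()
            t0 = time.time()
            resp = requests.post(endpoint + "/predictions", json=input_data, timeout=0.5)
            resp.raise_for_status()
            prediction = resp.json()
            latency_ms = int((time.time() - t0) * 1000)
            log_prediction(policy["shadow_version"], entity_id, input_hash, prediction, latency_ms)
        except Exception:
            # Shadow failures must never affect the caller, so they are swallowed here.
            pass
    threading.Thread(target=_score, daemon=True).start()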

Prediction Monitoring Pipeline

A scheduled job reads recent rows from PredictionLog for each active model version, computes the prediction distribution (mean, p50, p95, positive rate for classifiers), writes to PredictionMetric, and compares against the baseline stored in ModelVersion.metrics. Alerts route to PagerDuty on threshold breach. Dashboards show prediction distribution over time alongside upstream feature drift signals from the feature store monitoring system.
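
A minimal sketch of the statistics step for scalar scores, using only the standard library. The function name summarize_predictions and the 0.5 classification threshold are assumptions; each returned entry would map to one PredictionMetric row (metric_name, value).

```python
import statistics

def summarize_predictions(scores: list[float], threshold: float = 0.5) -> dict:
    """Compute summary statistics a monitoring job might write to PredictionMetric."""
    if len(scores) < 2:
        raise ValueError("need at least two predictions to compute percentiles")
    # quantiles(n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(scores),
        "p50": cuts[49],
        "p95": cuts[94],
        "positive_rate": sum(s >= threshold for s in scores) / len(scores),
    }
```

The inclusive method treats the sample as covering the full observed range, which keeps p95 within the minimum and maximum of the logged scores.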

Key Design Decisions

Routing policy is stored in the database, not in config files, so changes take effect without redeploying the proxy.
Shadow scoring uses fire-and-forget threads to avoid adding latency to the critical path. Shadow timeouts are strictly bounded.
Cache keys include model_version_id so version swaps automatically bypass stale cached predictions without explicit cache invalidation logic.
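PredictionLog is write-heavy; partition by predicted_at monthly and archive to S3 after 30 days to control storage costs. With PostgreSQL declarative partitioning, the primary key must include the partition column, so the key becomes (id, predicted_at) rather than id alone.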