Dark Launch System Low-Level Design: Shadow Traffic, Comparison Testing, and Confidence Scoring


A dark launch (also called shadow testing or dark traffic) deploys a new service version alongside the production service, mirrors real traffic to it asynchronously, and compares responses — without affecting users. The goal is to validate correctness, performance, and edge-case behavior under production load before promoting the new version. This guide covers the full low-level design: request mirroring, response comparison, divergence categorization, confidence scoring, and promotion criteria.

Shadow Traffic Architecture

Real requests are handled by the control (production) service as normal. A copy of each request is sent asynchronously to the shadow (new) service. The shadow response is discarded from the user's perspective — the user always receives the control response. Shadow failures do not affect production availability.

Two implementation approaches:

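  • Application-level mirroring: the calling service copies the request once it has the control response and sends the copy to the shadow via a fire-and-forget thread or an async task queue.
  • Sidecar/proxy mirroring: a service mesh sidecar (e.g., Envoy's request mirroring, configured through a route's request_mirror_policies) duplicates the request at the network layer without application code changes. This is usually preferred because the application stays unaware of the experiment and the shadow sees the request essentially as it arrived on the wire, headers included. Note that Envoy by default appends a -shadow suffix to the Host/authority header of the mirrored request.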

Both approaches must ensure that mirrored requests do not produce observable side effects: shadow calls to payment, email, or SMS services must be intercepted or stubbed. The shadow environment needs a separate stub layer for any side-effectful downstream dependency.
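
One way to structure the stub layer is to give the shadow deployment a recording client that exposes the same interface as the real one. The sketch below is illustrative only: `EmailClient`, `RecordingEmailStub`, `build_email_client`, and the `is_shadow` flag are hypothetical names chosen for this example, not part of any library or of the design above.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


class EmailClient(Protocol):
    """Hypothetical interface shared by the real client and the stub."""

    def send(self, to: str, subject: str, body: str) -> bool: ...


@dataclass
class RecordingEmailStub:
    """Records outbound emails instead of sending them (shadow environment only)."""

    sent: List[Tuple[str, str]] = field(default_factory=list)

    def send(self, to: str, subject: str, body: str) -> bool:
        # No network call: only remember what would have been sent.
        self.sent.append((to, subject))
        return True


def build_email_client(is_shadow: bool, real_client: EmailClient) -> EmailClient:
    """Return the recording stub for shadow deployments, the real client otherwise."""
    return RecordingEmailStub() if is_shadow else real_client


# Usage: the second argument is a placeholder standing in for a real client.
stub = build_email_client(is_shadow=True, real_client=RecordingEmailStub())
stub.send("user@example.com", "Receipt", "Thanks for your order")
```

Because the stub keeps a list of what it would have sent, the recorded calls can also be compared against the control's real side effects if that is useful for the experiment.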

Request Mirroring Implementation

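At the application level, the cloned request is sent from a background thread. The example below starts one daemon thread per sampled request for simplicity; a production implementation would more likely use a bounded thread pool or task queue. Mirroring is best-effort: if the shadow service is unavailable or slow, the mirror is dropped and never blocks the critical path.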

import threading
import requests
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class DarkLaunchExperimentConfig:
    experiment_id: int
    name: str
    control_service_url: str
    shadow_service_url: str
    sample_rate: float          # 0.0 to 1.0
    status: str                 # ACTIVE, PAUSED, PROMOTED
    target_confidence: float    # e.g. 0.99

def mirror_request(
    experiment: DarkLaunchExperimentConfig,
    method: str,
    path: str,
    headers: dict,
    body: Optional[bytes],
    control_response: requests.Response,
) -> None:
    """
    Fire-and-forget mirror of a request to the shadow service.
    Comparison result is stored asynchronously.
    """
    import random
    # random.random() is in [0.0, 1.0), so >= guarantees sample_rate=0.0 never samples
    if random.random() >= experiment.sample_rate:
        return  # Not sampled

    def _send_and_compare():
        try:
            start = time.monotonic()
            shadow_resp = requests.request(
                method=method,
                url=experiment.shadow_service_url + path,
                headers={k: v for k, v in headers.items() if k.lower() != "host"},
                data=body,
                timeout=5.0,
            )
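            # Shadow round-trip time only. To obtain the value stored in latency_delta_ms,
            # subtract the control latency (e.g. control_response.elapsed).
            latency_ms = int((time.monotonic() - start) * 1000)
            # _store_comparison is not shown here; it is assumed to persist a dark_launch_sample row.
            _store_comparison(experiment, method, path, body, control_response, shadow_resp, latency_ms)
        except Exception:
            # Shadow failure never propagates. It is also not recorded here,
            # so failed shadow calls do not count against the match rate.
            pass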

    thread = threading.Thread(target=_send_and_compare, daemon=True)
    thread.start()


def compare_responses(
    control_resp: requests.Response,
    shadow_resp: Optional[requests.Response],
) -> str:
    """
    Returns one of: EXACT_MATCH, SEMANTIC_MATCH, DIVERGENT, ERROR
    """
    if shadow_resp is None:
        return "ERROR"

    # Status code must match
    if control_resp.status_code != shadow_resp.status_code:
        return "DIVERGENT"

    # Try JSON comparison (field-level, ignoring insignificant differences)
    try:
        ctrl_json = control_resp.json()
        shad_json = shadow_resp.json()
        if ctrl_json == shad_json:
            return "EXACT_MATCH"
        # Semantic match: same keys, same values after normalizing timestamps and UUIDs
        ctrl_normalized = _normalize(ctrl_json)
        shad_normalized = _normalize(shad_json)
        if ctrl_normalized == shad_normalized:
            return "SEMANTIC_MATCH"
        return "DIVERGENT"
    except ValueError:
        # Body is not valid JSON; fall through to the raw byte comparison
        pass

    # Fallback: raw body comparison
    if control_resp.content == shadow_resp.content:
        return "EXACT_MATCH"
    return "DIVERGENT"


import re

# Field names treated as volatile and excluded from semantic comparison
_VOLATILE_KEYS = {"timestamp", "request_id", "trace_id", "created_at", "updated_at"}
# Compiled once at import time rather than on every call
_UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)


def _normalize(obj):
    """Strip volatile fields (timestamps, request/trace IDs) and mask UUIDs for semantic comparison."""
    if isinstance(obj, dict):
        return {k: _normalize(v) for k, v in obj.items() if k not in _VOLATILE_KEYS}
    if isinstance(obj, list):
        return [_normalize(i) for i in obj]
    if isinstance(obj, str):
        return _UUID_RE.sub("UUID", obj)
    return obj
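
The best-effort behavior described at the start of this section (drop the mirror rather than queue it when the shadow is slow) can be sketched with a fixed-size pool guarded by a semaphore. This is a minimal illustration under assumptions, not the design's actual implementation: `BestEffortMirrorPool`, `max_in_flight`, and the `dropped` counter are hypothetical names introduced here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class BestEffortMirrorPool:
    """Runs mirror tasks on a fixed pool and drops work when too much is in flight."""

    def __init__(self, max_workers: int = 8, max_in_flight: int = 64) -> None:
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self.dropped = 0  # mirrors skipped because the pool was saturated

    def submit(self, task: Callable[[], object]) -> bool:
        """Schedule task if a slot is free; otherwise drop it and return False."""
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.dropped += 1
            return False

        def _run() -> None:
            try:
                task()
            except Exception:
                pass  # shadow failures never propagate
            finally:
                self._slots.release()

        try:
            self._executor.submit(_run)
        except RuntimeError:
            # Executor already shut down: give the slot back and drop the mirror.
            self._slots.release()
            return False
        return True

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)


# Usage: with a single slot, a second mirror is dropped while the first is running.
pool = BestEffortMirrorPool(max_workers=1, max_in_flight=1)
gate = threading.Event()
accepted = pool.submit(gate.wait)       # occupies the only slot until gate is set
rejected = pool.submit(lambda: None)    # no free slot, so this one is dropped
gate.set()
pool.shutdown()
```

The non-blocking `acquire` is what keeps the critical path safe: the caller never waits on the shadow, and the `dropped` counter gives a rough signal of how much traffic the experiment is missing.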

SQL Schema

CREATE TABLE dark_launch_experiment (
    id                  SERIAL      PRIMARY KEY,
    name                VARCHAR(128) NOT NULL UNIQUE,
    control_service     VARCHAR(256) NOT NULL,
    shadow_service      VARCHAR(256) NOT NULL,
    sample_rate         NUMERIC(4,3) NOT NULL DEFAULT 0.01,
    status              VARCHAR(32)  NOT NULL DEFAULT 'ACTIVE',
    target_confidence   NUMERIC(5,4) NOT NULL DEFAULT 0.99,
    created_at          TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    updated_at          TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

CREATE TABLE dark_launch_sample (
    id                  BIGSERIAL   PRIMARY KEY,
    experiment_id       INT         NOT NULL REFERENCES dark_launch_experiment(id),
    request_hash        CHAR(64)    NOT NULL,
    control_status      SMALLINT    NOT NULL,
    shadow_status       SMALLINT,
    match_type          VARCHAR(32)  NOT NULL,
    latency_delta_ms    INT,
    sampled_at          TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_dls_experiment_sampled ON dark_launch_sample (experiment_id, sampled_at DESC);
CREATE INDEX idx_dls_match ON dark_launch_sample (experiment_id, match_type);
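
The mirroring code calls a `_store_comparison` helper that is not shown in this guide. As a rough sketch of what it might assemble for the `dark_launch_sample` table, the example below builds one row. Everything here is an assumption made for illustration: the `request_hash` inputs (method, path, body), SHA-256 as the hash (chosen because its hex digest is 64 characters, matching `CHAR(64)`), the definition of the latency delta as shadow minus control, and the names `build_sample_row` and `INSERT_SAMPLE_SQL`.

```python
import hashlib
from typing import Optional, Tuple


def request_hash(method: str, path: str, body: Optional[bytes]) -> str:
    """SHA-256 hex digest (64 characters) over method, path, and body."""
    h = hashlib.sha256()
    h.update(method.upper().encode("utf-8"))
    h.update(b"\n")
    h.update(path.encode("utf-8"))
    h.update(b"\n")
    h.update(body or b"")
    return h.hexdigest()


def build_sample_row(
    experiment_id: int,
    method: str,
    path: str,
    body: Optional[bytes],
    control_status: int,
    shadow_status: Optional[int],
    match_type: str,
    control_latency_ms: Optional[int],
    shadow_latency_ms: Optional[int],
) -> Tuple:
    """Return values in the column order used by INSERT_SAMPLE_SQL."""
    delta = None
    if control_latency_ms is not None and shadow_latency_ms is not None:
        # Assumed convention: positive means the shadow was slower.
        delta = shadow_latency_ms - control_latency_ms
    return (experiment_id, request_hash(method, path, body),
            control_status, shadow_status, match_type, delta)


# Parameterized insert matching the schema above (psycopg2-style placeholders).
INSERT_SAMPLE_SQL = """
    INSERT INTO dark_launch_sample
        (experiment_id, request_hash, control_status, shadow_status,
         match_type, latency_delta_ms)
    VALUES (%s, %s, %s, %s, %s, %s)
"""

row = build_sample_row(1, "get", "/v1/items", None, 200, 200, "EXACT_MATCH", 40, 55)
error_row = build_sample_row(1, "get", "/v1/items", None, 200, None, "ERROR", 40, None)
```

Recording an `ERROR` row with a null `shadow_status` when the shadow call fails lets failed calls count against the match rate instead of disappearing from the sample set.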

Confidence Scoring

import psycopg2

def compute_confidence(experiment_id: int, window_minutes: int = 60) -> dict:
    """
    Compute match rate and p99 latency delta for the experiment over the last window_minutes.
    Returns a dict with confidence score and promotion readiness.
    """
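    # The DSN is a placeholder; load real credentials from configuration.
    # psycopg2's connection context manager ends the transaction but does not close the connection.
    with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn: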
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT
                    COUNT(*)                                            AS total,
                    COUNT(*) FILTER (WHERE match_type IN ('EXACT_MATCH','SEMANTIC_MATCH'))
                                                                        AS matched,
                    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY ABS(latency_delta_ms))
                                                                        AS p99_latency_delta
                FROM dark_launch_sample
                WHERE experiment_id = %s
                  AND sampled_at > NOW() - INTERVAL '1 minute' * %s
                """,
                (experiment_id, window_minutes),
            )
            row = cur.fetchone()

        with conn.cursor() as cur:
            cur.execute(
                "SELECT target_confidence FROM dark_launch_experiment WHERE id = %s",
                (experiment_id,),
            )
            target_row = cur.fetchone()

    if not row or row[0] == 0:
        return {"confidence": 0.0, "total": 0, "ready_to_promote": False}

    total, matched, p99_delta = row
    confidence = matched / total if total > 0 else 0.0
    target = float(target_row[0]) if target_row else 0.99

    # Promotion criteria: confidence >= target AND p99 latency delta within 10%
    # (latency delta threshold assumed to be 50ms absolute for this example)
    latency_ok = p99_delta is None or p99_delta <= 50
    return {
        "confidence": confidence,
        "total": total,
        "ready_to_promote": confidence >= target and latency_ok,
    }

Promotion Criteria

A shadow service is ready for promotion when:

  • Confidence score (match rate) is at or above the target threshold (typically 99%).
  • Minimum sample count has been reached (e.g., 10,000 requests) so that the measured match rate is not dominated by sampling noise.
  • Shadow p99 response latency is within 10% of the control p99.
  • Shadow error rate is not higher than the control error rate.
  • The experiment has been running for a minimum wall-clock duration (e.g., 24 hours) to cover traffic patterns across different times of day.
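
The checklist above can be expressed as a single predicate. The sketch below is one possible encoding, not a definitive implementation: `ExperimentStats` and its fields, the default thresholds, and the optional Wilson score lower bound (a stricter alternative to comparing the raw match rate against the target) are all assumptions introduced for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class ExperimentStats:
    total: int
    matched: int
    control_p99_ms: float
    shadow_p99_ms: float
    control_error_rate: float
    shadow_error_rate: float
    hours_running: float


def wilson_lower_bound(matched: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for the match rate (z=1.96 is about 95%)."""
    if total == 0:
        return 0.0
    p = matched / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def ready_to_promote(
    s: ExperimentStats,
    target: float = 0.99,
    min_samples: int = 10_000,
    min_hours: float = 24.0,
    use_lower_bound: bool = False,
) -> bool:
    """Apply every promotion criterion; all must hold."""
    if s.total < min_samples or s.total == 0:
        return False
    rate = wilson_lower_bound(s.matched, s.total) if use_lower_bound else s.matched / s.total
    return (
        rate >= target
        and s.shadow_p99_ms <= 1.10 * s.control_p99_ms     # within 10% of control p99
        and s.shadow_error_rate <= s.control_error_rate    # no higher than control
        and s.hours_running >= min_hours
    )


stats = ExperimentStats(10_000, 9_950, 100.0, 105.0, 0.001, 0.001, 30.0)
decision = ready_to_promote(stats)
strict_decision = ready_to_promote(stats, use_lower_bound=True)
```

Using the lower bound rather than the raw rate makes the decision more conservative at small sample sizes, which is the same concern the minimum sample count addresses.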

When all criteria are met, update dark_launch_experiment.status = 'PROMOTED' and cut over traffic using the existing deployment mechanism (feature flag, weighted routing, or a full canary deploy).
