Dark Launch System Low-Level Design
A dark launch (also called shadow testing or dark traffic) deploys a new service version alongside the production service, mirrors real traffic to it asynchronously, and compares responses — without affecting users. The goal is to validate correctness, performance, and edge-case behavior under production load before promoting the new version. This guide covers the full low-level design: request mirroring, response comparison, divergence categorization, confidence scoring, and promotion criteria.
Shadow Traffic Architecture
Real requests are handled by the control (production) service as normal. A copy of each request is sent asynchronously to the shadow (new) service. The shadow response is discarded from the user's perspective — the user always receives the control response. Shadow failures do not affect production availability.
Two implementation approaches:
- Application-level mirroring: the calling service forks the request after receiving the control response and sends the copy to the shadow via a fire-and-forget thread or an async task queue.
- Sidecar/proxy mirroring: a service mesh sidecar (e.g., Envoy mirror filter) duplicates the request at the network layer without application code changes. This is preferred for transparency and correctness, as it mirrors the exact bytes including headers.
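As a sketch of the sidecar approach, an Envoy route that serves users from the control cluster while mirroring 10% of traffic to a shadow cluster might look like the following (the cluster names control and shadow, and the 10% fraction, are assumptions for illustration):

```yaml
route_config:
  name: dark_launch
  virtual_hosts:
    - name: app
      domains: ["*"]
      routes:
        - match: { prefix: "/" }
          route:
            cluster: control            # user-facing responses come from here
            request_mirror_policies:
              - cluster: shadow         # mirrored copy; its response is discarded
                runtime_fraction:
                  default_value: { numerator: 10, denominator: HUNDRED }
```

Because the fraction lives in runtime configuration, the mirror rate can be tuned without redeploying the proxy.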
Both approaches must ensure that mirrored requests do not produce observable side effects: shadow calls to payment, email, or SMS services must be intercepted or stubbed. The shadow environment needs a separate stub layer for any side-effectful downstream dependency.
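One way to build that stub layer is to select side-effectful clients at startup based on an environment flag set only on shadow pods. The sketch below is illustrative; the class names, the DARK_LAUNCH_SHADOW variable, and the charge() API are assumptions, not part of any real payment SDK:

```python
import os


class PaymentClient:
    """Real payment client (illustrative; charge() would call the provider)."""

    def charge(self, account_id: str, amount_cents: int) -> dict:
        raise NotImplementedError("real provider call goes here")


class StubPaymentClient(PaymentClient):
    """Shadow-environment stub: records the call but performs no side effect."""

    def __init__(self) -> None:
        self.calls = []

    def charge(self, account_id: str, amount_cents: int) -> dict:
        # Record the attempted charge so divergence analysis can inspect it.
        self.calls.append((account_id, amount_cents))
        return {"status": "charged", "stubbed": True}


def make_payment_client() -> PaymentClient:
    # Shadow pods set DARK_LAUNCH_SHADOW=1 so side-effectful clients are stubbed.
    if os.environ.get("DARK_LAUNCH_SHADOW") == "1":
        return StubPaymentClient()
    return PaymentClient()
```

The same factory pattern applies to email, SMS, and any other downstream with observable side effects.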
Request Mirroring Implementation
At the application level, mirroring uses a background thread pool to send the cloned request. The mirror is best-effort: if the shadow service is unavailable or slow, the mirror is dropped — never blocking the critical path.
import threading
import requests
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Optional
@dataclass
class DarkLaunchExperimentConfig:
experiment_id: int
name: str
control_service_url: str
shadow_service_url: str
sample_rate: float # 0.0 to 1.0
status: str # ACTIVE, PAUSED, PROMOTED
target_confidence: float # e.g. 0.99
def mirror_request(
experiment: DarkLaunchExperimentConfig,
method: str,
path: str,
headers: dict,
body: Optional[bytes],
control_response: requests.Response,
) -> None:
"""
Fire-and-forget mirror of a request to the shadow service.
Comparison result is stored asynchronously.
"""
import random
if random.random() > experiment.sample_rate:
return # Not sampled
def _send_and_compare():
try:
start = time.monotonic()
shadow_resp = requests.request(
method=method,
url=experiment.shadow_service_url + path,
headers={k: v for k, v in headers.items() if k.lower() != "host"},
data=body,
timeout=5.0,
)
latency_ms = int((time.monotonic() - start) * 1000)
_store_comparison(experiment, method, path, body, control_response, shadow_resp, latency_ms)
except Exception:
pass # Shadow failure never propagates
thread = threading.Thread(target=_send_and_compare, daemon=True)
thread.start()
def compare_responses(
control_resp: requests.Response,
shadow_resp: requests.Response,
) -> str:
"""
Returns one of: EXACT_MATCH, SEMANTIC_MATCH, DIVERGENT, ERROR
"""
if shadow_resp is None:
return "ERROR"
# Status code must match
if control_resp.status_code != shadow_resp.status_code:
return "DIVERGENT"
# Try JSON comparison (field-level, ignoring insignificant differences)
try:
ctrl_json = control_resp.json()
shad_json = shadow_resp.json()
if ctrl_json == shad_json:
return "EXACT_MATCH"
# Semantic match: same keys, same values after normalizing timestamps and UUIDs
ctrl_normalized = _normalize(ctrl_json)
shad_normalized = _normalize(shad_json)
if ctrl_normalized == shad_normalized:
return "SEMANTIC_MATCH"
return "DIVERGENT"
except Exception:
pass
# Fallback: raw body comparison
if control_resp.content == shadow_resp.content:
return "EXACT_MATCH"
return "DIVERGENT"
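compare_responses can be exercised in unit tests without live services by fabricating requests.Response objects. A small helper for that (illustrative; not part of the production path) might be:

```python
import requests


def fake_response(status: int, body: bytes) -> requests.Response:
    # Build an in-memory Response: set the status code and pre-load the
    # body that .content / .json() will read.
    resp = requests.Response()
    resp.status_code = status
    resp._content = body
    resp.encoding = "utf-8"
    return resp
```

With this, comparing fake_response(200, b'{"a": 1}') against an identical fake should report EXACT_MATCH, while differing status codes should report DIVERGENT.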
def _normalize(obj):
"""Strip volatile fields (timestamps, UUIDs, request IDs) for semantic comparison."""
import re
if isinstance(obj, dict):
return {k: _normalize(v) for k, v in obj.items()
if k not in {"timestamp", "request_id", "trace_id", "created_at", "updated_at"}}
if isinstance(obj, list):
return [_normalize(i) for i in obj]
if isinstance(obj, str):
# Mask UUIDs
uuid_pattern = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
return re.sub(uuid_pattern, "UUID", obj, flags=re.IGNORECASE)
return obj
SQL Schema
CREATE TABLE dark_launch_experiment (
id SERIAL PRIMARY KEY,
name VARCHAR(128) NOT NULL UNIQUE,
control_service VARCHAR(256) NOT NULL,
shadow_service VARCHAR(256) NOT NULL,
sample_rate NUMERIC(4,3) NOT NULL DEFAULT 0.01,
status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE',
target_confidence NUMERIC(5,4) NOT NULL DEFAULT 0.99,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE dark_launch_sample (
id BIGSERIAL PRIMARY KEY,
experiment_id INT NOT NULL REFERENCES dark_launch_experiment(id),
request_hash CHAR(64) NOT NULL,
control_status SMALLINT NOT NULL,
shadow_status SMALLINT,
match_type VARCHAR(32) NOT NULL,
latency_delta_ms INT,
sampled_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_dls_experiment_sampled ON dark_launch_sample (experiment_id, sampled_at DESC);
CREATE INDEX idx_dls_match ON dark_launch_sample (experiment_id, match_type);
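The _store_comparison helper called from mirror_request was left undefined above. A minimal sketch of the persistence step, with an illustrative request_fingerprint helper to fill the request_hash CHAR(64) column (the function names and signatures here are assumptions, not the system's actual API), could be:

```python
import hashlib
from typing import Optional


def request_fingerprint(method: str, path: str, body: Optional[bytes]) -> str:
    # Deterministic SHA-256 over method, path, and body so identical
    # requests hash to the same dark_launch_sample.request_hash value.
    h = hashlib.sha256()
    h.update(method.upper().encode("utf-8"))
    h.update(path.encode("utf-8"))
    if body:
        h.update(body)
    return h.hexdigest()


def store_comparison(conn, experiment_id: int, method: str, path: str,
                     body: Optional[bytes], control_status: int,
                     shadow_status: Optional[int], match_type: str,
                     latency_delta_ms: Optional[int]) -> None:
    # conn is an open psycopg2 connection; one row per sampled comparison.
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO dark_launch_sample
                (experiment_id, request_hash, control_status, shadow_status,
                 match_type, latency_delta_ms)
            VALUES (%s, %s, %s, %s, %s, %s)
            """,
            (experiment_id, request_fingerprint(method, path, body),
             control_status, shadow_status, match_type, latency_delta_ms),
        )
    conn.commit()
```

Hashing rather than storing raw request bodies keeps the sample table compact and avoids persisting PII; divergent requests can be replayed from logs keyed by the same fingerprint.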
Confidence Scoring
import psycopg2
def compute_confidence(experiment_id: int, window_minutes: int = 60) -> dict:
"""
Compute match rate and p99 latency delta for the experiment over the last window_minutes.
Returns a dict with confidence score and promotion readiness.
"""
with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
with conn.cursor() as cur:
cur.execute(
"""
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE match_type IN ('EXACT_MATCH','SEMANTIC_MATCH'))
AS matched,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY ABS(latency_delta_ms))
AS p99_latency_delta
FROM dark_launch_sample
WHERE experiment_id = %s
AND sampled_at > NOW() - INTERVAL '1 minute' * %s
""",
(experiment_id, window_minutes),
)
row = cur.fetchone()
with conn.cursor() as cur:
cur.execute(
"SELECT target_confidence FROM dark_launch_experiment WHERE id = %s",
(experiment_id,),
)
target_row = cur.fetchone()
if not row or row[0] == 0:
return {"confidence": 0.0, "total": 0, "ready_to_promote": False}
total, matched, p99_delta = row
confidence = matched / total if total > 0 else 0.0
target = float(target_row[0]) if target_row else 0.99
    # Promotion criteria: confidence >= target AND p99 latency delta within bounds
    # (latency delta threshold assumed to be 50ms absolute for this example)
    latency_ok = p99_delta is None or p99_delta <= 50
    return {
        "confidence": confidence,
        "total": total,
        "p99_latency_delta_ms": p99_delta,
        "ready_to_promote": confidence >= target and latency_ok,
    }
Promotion Criteria
A shadow service is ready for promotion when:
- Confidence score (match rate) is at or above the target threshold (typically 99%).
- Minimum sample count has been reached (e.g., 10,000 requests) to ensure statistical significance.
- Shadow p99 response latency is within 10% of the control p99.
- Shadow error rate is not higher than the control error rate.
- The experiment has been running for a minimum wall-clock duration (e.g., 24 hours) to cover traffic patterns across different times of day.
When all criteria are met, update dark_launch_experiment.status = 'PROMOTED' and cut over traffic using the existing deployment mechanism (feature flag, weighted routing, or a full canary deploy).
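The promotion gate above can be expressed as a pure predicate that a scheduler evaluates against aggregated metrics. This is a sketch; the PromotionMetrics fields and the default thresholds mirror the criteria listed above but are otherwise illustrative:

```python
from dataclasses import dataclass


@dataclass
class PromotionMetrics:
    match_rate: float          # fraction of EXACT_MATCH + SEMANTIC_MATCH samples
    total_samples: int
    control_p99_ms: float
    shadow_p99_ms: float
    control_error_rate: float
    shadow_error_rate: float
    hours_running: float


def ready_to_promote(m: PromotionMetrics,
                     target_confidence: float = 0.99,
                     min_samples: int = 10_000,
                     max_p99_ratio: float = 1.10,
                     min_hours: float = 24.0) -> bool:
    # Every gate must pass; any single failure blocks promotion.
    return (
        m.match_rate >= target_confidence
        and m.total_samples >= min_samples
        and m.shadow_p99_ms <= m.control_p99_ms * max_p99_ratio
        and m.shadow_error_rate <= m.control_error_rate
        and m.hours_running >= min_hours
    )
```

Keeping the gate as a pure function makes it trivial to unit-test threshold changes before they affect a live experiment.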
FAQ
Q: Why must shadow traffic mirroring be asynchronous?
A: Mirroring must be asynchronous so that shadow service latency or failures never affect the user-facing response time. If mirroring were synchronous, a slow shadow service would add its latency to every user request, and a crashed shadow would block production traffic. Async fire-and-forget mirroring ensures the critical path is never touched. The tradeoff is that you cannot guarantee every request is mirrored; a best-effort mirror is the correct design.
Q: How do I normalize responses for comparison when they contain timestamps or generated IDs?
A: Build a normalization layer that strips or masks volatile fields before comparison. Common volatile fields include timestamps (created_at, updated_at, response_time), request-scoped IDs (request_id, trace_id), and generated identifiers (UUIDs, auto-increment IDs). After normalization, compare the remaining structure. This gives you semantic equality: the responses carry the same business data even if the raw bytes differ.
Q: How is the sample rate controlled to limit shadow load?
A: The experiment config stores a sample_rate between 0.0 and 1.0; a random draw per request determines whether that request is mirrored, keeping shadow traffic proportional to the configured load.
Q: What confidence threshold should I use before promoting from dark launch?
A: A 99% match rate is a common starting threshold, but the right value depends on your risk tolerance and traffic volume. For a high-traffic service, 99% over 100,000 samples means 1,000 divergent responses; investigate whether those divergences are acceptable before promoting. Also enforce a minimum sample count and a minimum experiment duration to cover different traffic patterns. Latency parity (shadow p99 within 10% of control p99) is an independent promotion gate.
Q: How do I promote a service from dark launch to full production rollout?
A: When the confidence and latency criteria are met, promotion is a routing change: shift traffic from the control to the shadow (now the new production). Use a canary deployment first: route 1–5% of real traffic to the new service, monitor error rates and latency for 30 minutes, then ramp to 100%. The dark launch experiment gave you confidence in correctness; the canary ramp manages the risk of issues that only appear at full production load. Keep the old service running for a rollback window.