Canary Deployment System Low-Level Design: Traffic Splitting, Guardrail Evaluation, and Automated Rollback

A canary deployment system routes a small fraction of production traffic to a new version of a service while the rest continues to run the stable version. It monitors error rates, latency, and custom metrics in the canary cohort, and either promotes the canary to 100% or automatically rolls it back if metrics degrade. This design covers traffic splitting, metric collection, automated guardrail evaluation, and the promotion/rollback state machine.
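The promotion/rollback lifecycle can be modeled as a small state machine over the `status` values defined in the schema below. A minimal sketch; the exact transition edges are an assumption, not part of the original design:

```python
# Legal status transitions for a Deployment record. State names match the
# Deployment.status column; the allowed edges are an illustrative assumption.
TRANSITIONS = {
    'canary':       {'promoting', 'rolling_back', 'failed'},
    'promoting':    {'stable', 'rolling_back'},
    'rolling_back': {'rolled_back'},
    'stable':       set(),   # terminal until the next deployment supersedes it
    'rolled_back':  set(),
    'failed':       set(),
}

def can_transition(current: str, target: str) -> bool:
    """Guard to run before any UPDATE that changes Deployment.status."""
    return target in TRANSITIONS.get(current, set())
```

Enforcing a guard like this in application code (or via a trigger) prevents races such as promoting a deployment that is already rolling back.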

Core Data Model

CREATE TABLE Deployment (
    deployment_id  BIGSERIAL PRIMARY KEY,
    service_name   VARCHAR(100) NOT NULL,
    image_tag      VARCHAR(200) NOT NULL,     -- "payments-service:v2.3.4"
    status         VARCHAR(30) NOT NULL DEFAULT 'canary',
        -- canary, promoting, stable, rolling_back, rolled_back, failed
    canary_pct     SMALLINT NOT NULL DEFAULT 5,  -- % of traffic on new version
    target_pct     SMALLINT NOT NULL DEFAULT 100,
    baseline_deployment_id BIGINT REFERENCES Deployment(deployment_id),
    auto_promote   BOOLEAN NOT NULL DEFAULT TRUE,
    promote_after_minutes INT NOT NULL DEFAULT 30,
    started_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    promoted_at    TIMESTAMPTZ,
    rolled_back_at TIMESTAMPTZ,
    rollback_reason TEXT
);

CREATE TABLE CanaryGuardrail (
    guardrail_id   SERIAL PRIMARY KEY,
    service_name   VARCHAR(100) NOT NULL,
    metric_name    VARCHAR(100) NOT NULL,     -- 'error_rate', 'p99_latency_ms', 'custom_metric'
    max_delta_pct  NUMERIC(6,2) NOT NULL,     -- max % degradation vs baseline
    absolute_max   NUMERIC(12,4),            -- hard cap regardless of baseline
    evaluation_window_minutes INT NOT NULL DEFAULT 5,
    is_active      BOOLEAN NOT NULL DEFAULT TRUE
);

CREATE TABLE CanaryMetricSample (
    sample_id      BIGSERIAL PRIMARY KEY,
    deployment_id  BIGINT NOT NULL,
    cohort         VARCHAR(10) NOT NULL,      -- 'canary' or 'baseline'
    metric_name    VARCHAR(100) NOT NULL,
    value          NUMERIC(12,4) NOT NULL,
    sampled_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE CanaryEvaluation (
    eval_id        BIGSERIAL PRIMARY KEY,
    deployment_id  BIGINT NOT NULL,
    evaluated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    result         VARCHAR(20) NOT NULL,      -- pass, fail, insufficient_data
    details        JSONB NOT NULL DEFAULT '{}',
    action_taken   VARCHAR(30)               -- promoted, rolled_back, none
);

CREATE INDEX ON CanaryMetricSample(deployment_id, cohort, metric_name, sampled_at DESC);
CREATE INDEX ON Deployment(service_name, status);

Traffic Splitting

import hashlib

def get_deployment_version(service_name: str, request_id: str) -> str:
    """
    Returns 'canary' or 'baseline' for a given request.
    Deterministic: same request_id always routes to the same version.
    """
    deployment = db.fetchone("""
        SELECT deployment_id, image_tag, canary_pct, status
        FROM Deployment
        WHERE service_name=%s AND status='canary'
        ORDER BY started_at DESC LIMIT 1
    """, (service_name,))

    if not deployment:
        return 'baseline'  # no active canary

    # MD5 is used only as a cheap, stable bucketing hash, not for security.
    bucket = int(hashlib.md5(
        f"{deployment['deployment_id']}:{request_id}".encode()
    ).hexdigest()[:4], 16) % 100

    return 'canary' if bucket < deployment['canary_pct'] else 'baseline'

# In the load balancer / API gateway:
# version = get_deployment_version('payments-service', str(request.user_id))
# if version == 'canary':
#     forward_to(CANARY_UPSTREAM)
# else:
#     forward_to(STABLE_UPSTREAM)
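The hash-bucketing scheme above can be sanity-checked in isolation. A self-contained sketch of the same bucketing (deterministic for a given key, and roughly uniform across the 100 buckets):

```python
import hashlib

def bucket_for(deployment_id: int, request_id: str) -> int:
    """Deterministic bucket in [0, 100) from MD5 of 'deployment_id:request_id'."""
    digest = hashlib.md5(f"{deployment_id}:{request_id}".encode()).hexdigest()
    return int(digest[:4], 16) % 100

# Determinism: the same key always lands in the same bucket.
assert bucket_for(42, "user-123") == bucket_for(42, "user-123")

# Rough uniformity: at canary_pct=5, about 5% of keys route to the canary.
ratio = sum(bucket_for(42, f"user-{i}") < 5 for i in range(100_000)) / 100_000
print(ratio)  # close to 0.05
```

Because the deployment_id is part of the hashed key, each new deployment reshuffles users across buckets, so the same users are not always the canary guinea pigs.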

Guardrail Evaluation

import datetime
import json
import statistics

def evaluate_canary(deployment_id: int) -> dict:
    """
    Compare canary metrics vs baseline metrics.
    Auto-promotes if all guardrails pass + sufficient time elapsed.
    Auto-rolls back if any guardrail fails.
    """
    deployment = db.fetchone(
        "SELECT * FROM Deployment WHERE deployment_id=%s", (deployment_id,)
    )
    if not deployment or deployment['status'] != 'canary':
        return {'result': 'skipped'}

    guardrails = db.fetchall("""
        SELECT * FROM CanaryGuardrail
        WHERE service_name=%s AND is_active=TRUE
    """, (deployment['service_name'],))

    failures = []
    details = {}
    insufficient = False

    for g in guardrails:
        canary_val = _compute_metric(deployment_id, 'canary', g['metric_name'],
                                     g['evaluation_window_minutes'])
        baseline_val = _compute_metric(deployment_id, 'baseline', g['metric_name'],
                                       g['evaluation_window_minutes'])

        if canary_val is None or baseline_val is None:
            details[g['metric_name']] = 'insufficient_data'
            insufficient = True
            continue

        # Check relative degradation vs. the baseline cohort
        if baseline_val > 0:
            delta_pct = (canary_val - baseline_val) / baseline_val * 100
            if delta_pct > g['max_delta_pct']:
                failures.append({
                    'metric': g['metric_name'],
                    'canary': canary_val,
                    'baseline': baseline_val,
                    'delta_pct': round(delta_pct, 2),
                    'threshold_pct': float(g['max_delta_pct']),  # Decimal -> float for JSON
                })

        # Check absolute cap (applies even when the baseline is also degraded)
        if g['absolute_max'] is not None and canary_val > g['absolute_max']:
            failures.append({
                'metric': g['metric_name'],
                'canary': canary_val,
                'absolute_max': float(g['absolute_max']),
            })

        details[g['metric_name']] = {
            'canary': canary_val,
            'baseline': baseline_val,
        }

    if failures:
        result = 'fail'
    elif insufficient:
        result = 'insufficient_data'
    else:
        result = 'pass'
    action = None

    if failures:
        _rollback(deployment_id, str(failures))
        action = 'rolled_back'
    elif result == 'pass':
        # started_at is TIMESTAMPTZ (timezone-aware), so compare against an
        # aware "now"; naive utcnow() would raise on subtraction.
        elapsed_minutes = (
            datetime.datetime.now(datetime.timezone.utc) - deployment['started_at']
        ).total_seconds() / 60
        if deployment['auto_promote'] and elapsed_minutes >= deployment['promote_after_minutes']:
            _promote(deployment_id)
            action = 'promoted'

    db.execute("""
        INSERT INTO CanaryEvaluation (deployment_id, result, details, action_taken)
        VALUES (%s,%s,%s,%s)
    """, (deployment_id, result, json.dumps({**details, 'failures': failures}), action))

    return {'result': result, 'failures': failures, 'action': action}

def _compute_metric(deployment_id, cohort, metric_name, window_minutes):
    rows = db.fetchall("""
        SELECT value FROM CanaryMetricSample
        WHERE deployment_id=%s AND cohort=%s AND metric_name=%s
          AND sampled_at >= NOW() - make_interval(mins => %s)
    """, (deployment_id, cohort, metric_name, window_minutes))
    -- note: a %s placeholder inside a quoted INTERVAL literal is not
    -- parameterized by drivers like psycopg2; make_interval avoids that.
    if len(rows) < 10:  # require minimum sample size
        return None
    values = [float(r['value']) for r in rows]
    if 'p99' in metric_name:
        # Nearest-rank p99; clamp the index so small windows stay in range.
        values.sort()
        return values[min(int(len(values) * 0.99), len(values) - 1)]
    return statistics.mean(values)

def _rollback(deployment_id, reason):
    db.execute("""
        UPDATE Deployment SET status='rolling_back', rollback_reason=%s
        WHERE deployment_id=%s
    """, (reason[:500], deployment_id))
    # Trigger orchestrator to shift all traffic back to stable

def _promote(deployment_id):
    db.execute("""
        UPDATE Deployment SET status='stable', promoted_at=NOW(), canary_pct=100
        WHERE deployment_id=%s
    """, (deployment_id,))

Key Design Decisions

  • Relative delta guardrails (not absolute thresholds): checking only that the canary error rate is below 0.1% misses regressions in a service whose baseline is 0.02%; the canary could be 4× worse (0.08%) and still pass. A delta-percentage comparison (canary < baseline * 1.2) detects degradation relative to the current baseline regardless of the absolute level, while the optional absolute_max column adds a hard cap on top.
  • Deterministic hash routing: hashing request_id against the deployment_id ensures the same user always hits the same version during the canary window — avoiding split-brain user experiences where a user sees different behavior on consecutive requests. Include the deployment_id in the seed so the same user can be in different cohorts for different deployments.
  • Minimum sample size requirement: evaluating a canary with only 3 data points produces unreliable results. Requiring at least 10 samples before evaluating prevents false failures during the initial ramp-up. Scale the minimum with canary_pct: at 1% traffic, you need 10× more time to accumulate the same sample count as at 10%.
  • Gradual promotion vs. instant: instead of jumping from 5% to 100%, advance canary_pct in steps (5% → 10% → 25% → 50% → 100%) with a guardrail evaluation between each step. Each step increases the blast radius of a bad release but provides more data. Implement by updating canary_pct in the Deployment row and re-running traffic splitting.
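The stepped promotion in the last point can be reduced to a pure advancement function. A sketch under illustrative assumptions: the stage schedule and the last_change_at tracking are inventions for this example, not fields from the schema above:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative stage schedule: (canary_pct, minutes to soak at that pct).
STAGES = [(5, 15), (10, 15), (25, 30), (50, 30), (100, 0)]

def next_canary_pct(current_pct: int, last_change_at: datetime,
                    guardrails_pass: bool,
                    now: Optional[datetime] = None) -> int:
    """Return the pct to advance to, or current_pct if the canary should hold."""
    now = now or datetime.now(timezone.utc)
    if not guardrails_pass:
        return current_pct  # a failing evaluation triggers rollback, not advance
    pcts = [p for p, _ in STAGES]
    if current_pct not in pcts or current_pct == pcts[-1]:
        return current_pct  # unknown stage, or already fully promoted
    idx = pcts.index(current_pct)
    if now - last_change_at >= timedelta(minutes=STAGES[idx][1]):
        return pcts[idx + 1]
    return current_pct
```

The periodic evaluation job would call this after a passing guardrail check, write the returned value into Deployment.canary_pct, and reset the change timestamp.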

Frequently Asked Questions

How does a canary deployment differ from blue-green deployment?
Blue-green maintains two identical production environments (blue = current, green = new version) and switches all traffic instantly at cutover; rollback is equally instant (switch back to blue). The cost is double the infrastructure, and the instant switch leaves no gradual validation window. A canary instead routes a small percentage of traffic (1–10%) to the new version, observes metrics for 10–30 minutes, then gradually increases to 100%; rollback shifts the canary traffic back to stable. Canary gives gradual validation against real production traffic with far less infrastructure (the same cluster runs both versions simultaneously), but the mixed-version state must be managed carefully (no backwards-incompatible DB schema changes during a canary). Both are zero-downtime strategies: blue-green is simpler and suits batch jobs or services with stateful sessions; canary suits stateless APIs where gradual traffic shifting is easy.

How do you ensure database schema changes are compatible with canary deployments?
During a canary, old and new application code run simultaneously. A schema change that removes a column breaks the old code; adding a NOT NULL column without a default breaks old code that doesn't supply it. The compatibility rules: (1) only additive changes during a canary window (add nullable columns, tables, indexes); (2) never remove or rename during a canary, because the old code still uses the old name; (3) use the expand/contract pattern: add the new column, deploy the canary (which writes both old and new), promote, then drop the old column in a later migration; (4) the new version's API responses should include every field the old version exposed: add new fields, but don't remove existing ones during the canary window. Gate any breaking schema change behind a feature flag that stays off during the canary and is enabled only after full promotion.

How do you handle stateful sessions during a canary deployment?
Use session stickiness: if a user's first request hits the canary, route all subsequent requests in the same session to the canary until the session expires. Without stickiness, a user might see canary behavior on one page and stable behavior on the next, an inconsistent experience. One implementation: set a cookie (e.g. X-Canary-Version: canary or stable) on the first request, and have the load balancer route subsequent requests by that cookie. Stateless APIs (JWT authentication, no server-side sessions) don't need stickiness: each request is independent, and it is acceptable for the same user to hit different versions on different requests as long as the API response contract is consistent between versions.

What metrics should trigger an automatic rollback during a canary?
In priority order: (1) error rate: canary error rate exceeds baseline by more than 20% relative (max_delta_pct=20, metric_name='error_rate'); errors are the most important signal because they are immediately user-visible; (2) p99 latency: canary p99 exceeds baseline by more than 50% (max_delta_pct=50, metric_name='p99_latency_ms'); (3) business metrics: e.g. roll back if checkout conversion for canary users drops more than 5% relative to control, which requires instrumenting business KPIs as custom metric samples; (4) hard caps regardless of baseline: error rate above 2% (even if the baseline is 2%, a canary at 4% is unacceptable) or p99 above 5,000 ms. Don't roll back on minor latency increases within normal variance (a p99 5% higher is noise) or on CPU/memory spikes without user-visible impact. Require a minimum of 10 samples before evaluating to avoid false positives during the initial traffic ramp-up.

How do you implement gradual promotion (0% → 1% → 5% → 25% → 100%) with automatic advancement?
Define the stages as a schedule in the Deployment record or a separate CanarySchedule table, e.g. [{pct:1, soak_minutes:10}, {pct:5, soak_minutes:15}, {pct:25, soak_minutes:30}, {pct:100, soak_minutes:0}]. The evaluate_canary() job runs every minute; when all guardrails pass and the current stage's soak_minutes have elapsed since the last percentage change (tracked with a last_pct_change_at timestamp), advance canary_pct to the next stage and update the load balancer's routing weights. At 10K RPM and canary_pct=1, the canary receives 100 RPM, enough to accumulate meaningful error-rate statistics within 10 minutes; at 5% it receives 500 RPM and the signal accumulates faster.
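The cookie-based session stickiness described in the FAQ can be sketched as a small routing helper. The cookie name and the dict-shaped cookie jar here are illustrative assumptions:

```python
from typing import Callable, Dict, Tuple

COOKIE = "canary-version"  # illustrative cookie name, not a standard

def resolve_version(cookies: Dict[str, str],
                    assign_version: Callable[[], str]) -> Tuple[str, bool]:
    """Return (version, should_set_cookie).

    Honors an existing stickiness cookie; otherwise calls assign_version()
    (e.g. the hash-bucket router) and tells the caller to set the cookie.
    """
    v = cookies.get(COOKIE)
    if v in ("canary", "baseline"):
        return v, False   # sticky: reuse the assignment from the first request
    return assign_version(), True
```

Values outside the two known cohorts fall through to reassignment, so a stale or tampered cookie cannot pin a user to a removed version.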
