Canary Deployment System Low-Level Design: Traffic Splitting, Guardrail Evaluation, and Automated Rollback

A canary deployment system routes a small fraction of production traffic to a new version of a service while the rest continues to be served by the stable version. It monitors error rates, latency, and custom metrics in the canary cohort, and either promotes the canary to 100% or automatically rolls it back if metrics degrade. This design covers traffic splitting, metric collection, automated guardrail evaluation, and the promotion/rollback state machine.

Core Data Model

CREATE TABLE Deployment (
    deployment_id  BIGSERIAL PRIMARY KEY,
    service_name   VARCHAR(100) NOT NULL,
    image_tag      VARCHAR(200) NOT NULL,     -- "payments-service:v2.3.4"
    status         VARCHAR(30) NOT NULL DEFAULT 'canary',
        -- canary, promoting, stable, rolling_back, rolled_back, failed
    canary_pct     SMALLINT NOT NULL DEFAULT 5,  -- % of traffic on new version
    target_pct     SMALLINT NOT NULL DEFAULT 100,
    baseline_deployment_id BIGINT REFERENCES Deployment(deployment_id),
    auto_promote   BOOLEAN NOT NULL DEFAULT TRUE,
    promote_after_minutes INT NOT NULL DEFAULT 30,
    started_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    promoted_at    TIMESTAMPTZ,
    rolled_back_at TIMESTAMPTZ,
    rollback_reason TEXT
);

CREATE TABLE CanaryGuardrail (
    guardrail_id   SERIAL PRIMARY KEY,
    service_name   VARCHAR(100) NOT NULL,
    metric_name    VARCHAR(100) NOT NULL,     -- 'error_rate', 'p99_latency_ms', 'custom_metric'
    max_delta_pct  NUMERIC(6,2) NOT NULL,     -- max % degradation vs baseline
    absolute_max   NUMERIC(12,4),            -- hard cap regardless of baseline
    evaluation_window_minutes INT NOT NULL DEFAULT 5,
    is_active      BOOLEAN NOT NULL DEFAULT TRUE
);

CREATE TABLE CanaryMetricSample (
    sample_id      BIGSERIAL PRIMARY KEY,
    deployment_id  BIGINT NOT NULL,
    cohort         VARCHAR(10) NOT NULL,      -- 'canary' or 'baseline'
    metric_name    VARCHAR(100) NOT NULL,
    value          NUMERIC(12,4) NOT NULL,
    sampled_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE CanaryEvaluation (
    eval_id        BIGSERIAL PRIMARY KEY,
    deployment_id  BIGINT NOT NULL,
    evaluated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    result         VARCHAR(20) NOT NULL,      -- pass, fail, insufficient_data
    details        JSONB NOT NULL DEFAULT '{}',
    action_taken   VARCHAR(30)               -- promoted, rolled_back, none
);

CREATE INDEX ON CanaryMetricSample(deployment_id, cohort, metric_name, sampled_at DESC);
CREATE INDEX ON Deployment(service_name, status);
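
The status values listed in the Deployment.status comment imply a small state machine. The sketch below shows one way to make the allowed transitions explicit. The edges themselves are an assumption inferred from the status names (the schema does not define them), and ALLOWED_TRANSITIONS and can_transition are hypothetical names.

```python
# Hypothetical transition table. The edges are assumptions inferred from the
# status names in the Deployment.status comment; the schema does not define them.
ALLOWED_TRANSITIONS = {
    "canary": {"promoting", "stable", "rolling_back", "failed"},
    "promoting": {"stable", "rolling_back", "failed"},
    "rolling_back": {"rolled_back", "failed"},
    "stable": set(),        # terminal in this sketch
    "rolled_back": set(),   # terminal in this sketch
    "failed": set(),        # terminal in this sketch
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is an allowed edge."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

A check like this could guard every UPDATE of Deployment.status, so that a late evaluation cannot move a deployment that has already been rolled back into another state.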

Traffic Splitting

import hashlib

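def get_deployment_version(service_name: str, request_id: str) -> str:
    """
    Returns 'canary' or 'baseline' for a given request.
    Deterministic: the same request_id always routes to the same version
    for a given deployment. Pass a key that is stable per caller (for
    example a user ID) if each user should stay on one version.
    """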
    deployment = db.fetchone("""
        SELECT deployment_id, image_tag, canary_pct, status
        FROM Deployment
        WHERE service_name=%s AND status='canary'
        ORDER BY started_at DESC LIMIT 1
    """, (service_name,))

    if not deployment:
        return 'baseline'  # no active canary

    bucket = int(hashlib.md5(
        f"{deployment['deployment_id']}:{request_id}".encode()
    ).hexdigest()[:4], 16) % 100

    return 'canary' if bucket < deployment['canary_pct'] else 'baseline'

# In the load balancer / API gateway:
# version = get_deployment_version('payments-service', str(request.user_id))
# if version == 'canary':
#     forward_to(CANARY_UPSTREAM)
# else:
#     forward_to(STABLE_UPSTREAM)
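
To make the bucketing behavior easier to inspect, here is a minimal sketch that isolates the hash step from the database lookup. It reuses the same MD5-prefix scheme as get_deployment_version; bucket_for, in_canary, the deployment ID 42, and the "user-N" keys are hypothetical and exist only for illustration.

```python
import hashlib


def bucket_for(deployment_id: int, routing_key: str) -> int:
    """Map a routing key to a stable bucket in [0, 100)."""
    digest = hashlib.md5(f"{deployment_id}:{routing_key}".encode()).hexdigest()
    # First 4 hex characters = 16 bits, reduced modulo 100.
    return int(digest[:4], 16) % 100


def in_canary(deployment_id: int, routing_key: str, canary_pct: int) -> bool:
    """A key is in the canary cohort when its bucket is below canary_pct."""
    return bucket_for(deployment_id, routing_key) < canary_pct


# Illustrative data: 10,000 made-up user keys against a made-up deployment 42.
keys = [f"user-{i}" for i in range(10_000)]
at_5 = {k for k in keys if in_canary(42, k, 5)}
at_25 = {k for k in keys if in_canary(42, k, 25)}
share_at_5 = len(at_5) / len(keys)  # should land close to 0.05
```

Because membership is defined as bucket < canary_pct, every key in at_5 is also in at_25: raising the percentage only adds callers to the canary and never moves an existing canary caller back to baseline.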

Guardrail Evaluation

import datetime
import json
import statistics

def evaluate_canary(deployment_id: int) -> dict:
    """
    Compare canary metrics vs baseline metrics.
    Auto-promotes if all guardrails pass + sufficient time elapsed.
    Auto-rolls back if any guardrail fails.
    """
    deployment = db.fetchone(
        "SELECT * FROM Deployment WHERE deployment_id=%s", (deployment_id,)
    )
    if not deployment or deployment['status'] != 'canary':
        return {'result': 'skipped'}

    guardrails = db.fetchall("""
        SELECT * FROM CanaryGuardrail
        WHERE service_name=%s AND is_active=TRUE
    """, (deployment['service_name'],))

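    failures = []
    details = {}
    insufficient = False

    for g in guardrails:
        canary_val = _compute_metric(deployment_id, 'canary', g['metric_name'],
                                     g['evaluation_window_minutes'])
        baseline_val = _compute_metric(deployment_id, 'baseline', g['metric_name'],
                                       g['evaluation_window_minutes'])

        if canary_val is None or baseline_val is None:
            # Too few samples: record it and block promotion for this run
            details[g['metric_name']] = 'insufficient_data'
            insufficient = True
            continue

        # NUMERIC columns typically arrive as Decimal, which json.dumps
        # cannot serialize, so convert the thresholds to float up front
        max_delta_pct = float(g['max_delta_pct'])
        absolute_max = None if g['absolute_max'] is None else float(g['absolute_max'])

        # Check relative degradation. Skipped when baseline is 0 because the
        # ratio is undefined; only the absolute cap applies in that case.
        if baseline_val > 0:
            delta_pct = (canary_val - baseline_val) / baseline_val * 100
            if delta_pct > max_delta_pct:
                failures.append({
                    'metric': g['metric_name'],
                    'canary': canary_val,
                    'baseline': baseline_val,
                    'delta_pct': round(delta_pct, 2),
                    'threshold_pct': max_delta_pct,
                })

        # Check absolute cap; compare against None so a cap of 0 is enforced
        if absolute_max is not None and canary_val > absolute_max:
            failures.append({
                'metric': g['metric_name'],
                'canary': canary_val,
                'absolute_max': absolute_max,
            })

        details[g['metric_name']] = {
            'canary': canary_val,
            'baseline': baseline_val,
        }

    # Failures win over missing data; missing data never counts as a pass
    if failures:
        result = 'fail'
    elif insufficient:
        result = 'insufficient_data'
    else:
        result = 'pass'
    action = None

    if failures:
        _rollback(deployment_id, str(failures))
        action = 'rolled_back'
    elif result == 'pass':
        # started_at is TIMESTAMPTZ (timezone-aware), so use an aware "now";
        # subtracting an aware value from a naive utcnow() raises TypeError
        elapsed_minutes = (
            datetime.datetime.now(datetime.timezone.utc) - deployment['started_at']
        ).total_seconds() / 60
        if deployment['auto_promote'] and elapsed_minutes >= deployment['promote_after_minutes']:
            _promote(deployment_id)
            action = 'promoted'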

    db.execute("""
        INSERT INTO CanaryEvaluation (deployment_id, result, details, action_taken)
        VALUES (%s,%s,%s,%s)
    """, (deployment_id, result, json.dumps({**details, 'failures': failures}), action))

    return {'result': result, 'failures': failures, 'action': action}

def _compute_metric(deployment_id, cohort, metric_name, window_minutes):
    # Multiply a bound integer by a fixed interval instead of placing the
    # placeholder inside a quoted literal, which only works with drivers
    # that interpolate parameters as text.
    rows = db.fetchall("""
        SELECT value FROM CanaryMetricSample
        WHERE deployment_id=%s AND cohort=%s AND metric_name=%s
          AND sampled_at >= NOW() - %s * INTERVAL '1 minute'
    """, (deployment_id, cohort, metric_name, window_minutes))
    if len(rows) < 10:  # require minimum sample size
        return None
    values = [float(r['value']) for r in rows]
    if 'p99' in metric_name:
        # 99th percentile of the samples in the window
        values.sort()
        return values[int(len(values) * 0.99)]
    return statistics.mean(values)

def _rollback(deployment_id, reason):
    db.execute("""
        UPDATE Deployment SET status='rolling_back', rollback_reason=%s
        WHERE deployment_id=%s
    """, (reason[:500], deployment_id))
    # Trigger orchestrator to shift all traffic back to stable

def _promote(deployment_id):
    db.execute("""
        UPDATE Deployment SET status='stable', promoted_at=NOW(), canary_pct=100
        WHERE deployment_id=%s
    """, (deployment_id,))

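The relative and absolute checks inside evaluate_canary can also be read as a single pure function, which makes the pass/fail rule easy to test without a database. This is a minimal sketch, not the evaluator itself: check_guardrail is a hypothetical helper, and the numbers in the usage lines are made up for illustration.

```python
from typing import List, Optional


def check_guardrail(canary: float, baseline: float, max_delta_pct: float,
                    absolute_max: Optional[float] = None) -> List[str]:
    """Return a list of violations; an empty list means the guardrail passes."""
    violations = []
    # Relative check: only defined when the baseline is positive.
    if baseline > 0:
        delta_pct = (canary - baseline) / baseline * 100
        if delta_pct > max_delta_pct:
            violations.append(f"relative: +{delta_pct:.1f}% > {max_delta_pct}%")
    # Absolute check: a hard cap that applies regardless of the baseline.
    if absolute_max is not None and canary > absolute_max:
        violations.append(f"absolute: {canary} > {absolute_max}")
    return violations


# Illustrative error rates in percent, with a 20% relative threshold.
worse = check_guardrail(canary=0.9, baseline=0.5, max_delta_pct=20)
similar = check_guardrail(canary=0.55, baseline=0.5, max_delta_pct=20)
zero_base = check_guardrail(canary=2.0, baseline=0.0, max_delta_pct=20,
                            absolute_max=1.0)
```

In this sketch, worse holds one relative violation, similar is empty, and zero_base holds only an absolute violation, since the relative check is skipped when the baseline is 0.
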
Key Design Decisions

  • Relative delta guardrails (not absolute thresholds alone): a fixed check such as "canary error rate < 1%" misses a service whose baseline is already 0.5%; the canary could be nearly 2× worse at 0.9% and still pass. Comparing against the baseline (canary < baseline * 1.2) detects degradation relative to the current baseline regardless of absolute level. The optional absolute_max column remains available as a hard cap on top of the relative check.
  • Deterministic hash routing: hashing a stable routing key together with the deployment_id keeps the same caller on the same version for the whole canary window, avoiding split-brain user experiences where a user sees different behavior on consecutive requests. This only holds if the request_id argument is stable per user (the gateway example passes user_id); a fresh per-request ID would re-roll the cohort on every call. Including the deployment_id in the hash input lets the same user land in different cohorts for different deployments.
  • Minimum sample size requirement: evaluating a canary with only 3 data points produces unreliable results. Requiring at least 10 samples before evaluating prevents false failures during the initial ramp-up. Scale the minimum with canary_pct: at 1% traffic, you need 10× more time to accumulate the same sample count as at 10%.
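  • Gradual promotion vs. instant: instead of jumping from 5% to 100%, advance canary_pct in steps (5% → 10% → 25% → 50% → 100%) with a guardrail evaluation between each step. Each step increases the blast radius of a bad release but provides more data. Implement by updating canary_pct in the Deployment row: get_deployment_version reads canary_pct on every call, so the new split applies from the next request. Because routing tests bucket < canary_pct, callers already in the canary stay in it as the percentage grows.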
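
The stepped rollout and the traffic-scaled waiting time described in the list above can be sketched as two small helpers. This is an illustration under stated assumptions: next_canary_pct and min_observation_minutes are hypothetical names, the step schedule is the one quoted in the text, and the defaults of 5 minutes at 10% traffic are made up.

```python
import math

# Step schedule quoted in the text.
STEPS = [5, 10, 25, 50, 100]


def next_canary_pct(current_pct: int, steps=STEPS) -> int:
    """Return the first step strictly above current_pct, or 100 if none is left."""
    for step in steps:
        if step > current_pct:
            return step
    return 100


def min_observation_minutes(canary_pct: int, base_minutes: int = 5,
                            base_pct: int = 10) -> int:
    """Scale the observation time inversely with traffic share.

    Assumes samples accumulate in proportion to canary_pct, so a canary at
    1% needs 10x the time of a canary at 10% to collect the same count.
    """
    return max(1, math.ceil(base_minutes * base_pct / canary_pct))
```

A controller could call next_canary_pct after each passing evaluation and wait at least min_observation_minutes at the new percentage before evaluating again.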

