Canary Deployment System: Low-Level Design
A canary deployment system routes a small fraction of production traffic to a new version of a service while the rest continues to run the stable version. It monitors error rates, latency, and custom metrics in the canary cohort, and either promotes the canary to 100% or automatically rolls it back if metrics degrade. This design covers traffic splitting, metric collection, automated guardrail evaluation, and the promotion/rollback state machine.
Core Data Model
CREATE TABLE Deployment (
    deployment_id           BIGSERIAL PRIMARY KEY,
    service_name            VARCHAR(100) NOT NULL,
    image_tag               VARCHAR(200) NOT NULL,  -- "payments-service:v2.3.4"
    status                  VARCHAR(30) NOT NULL DEFAULT 'canary',
    -- canary, promoting, stable, rolling_back, rolled_back, failed
    canary_pct              SMALLINT NOT NULL DEFAULT 5,  -- % of traffic on new version
    target_pct              SMALLINT NOT NULL DEFAULT 100,
    baseline_deployment_id  BIGINT REFERENCES Deployment(deployment_id),
    auto_promote            BOOLEAN NOT NULL DEFAULT TRUE,
    promote_after_minutes   INT NOT NULL DEFAULT 30,
    started_at              TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    promoted_at             TIMESTAMPTZ,
    rolled_back_at          TIMESTAMPTZ,
    rollback_reason         TEXT
);

CREATE TABLE CanaryGuardrail (
    guardrail_id               SERIAL PRIMARY KEY,
    service_name               VARCHAR(100) NOT NULL,
    metric_name                VARCHAR(100) NOT NULL,  -- 'error_rate', 'p99_latency_ms', 'custom_metric'
    max_delta_pct              NUMERIC(6,2) NOT NULL,  -- max % degradation vs baseline
    absolute_max               NUMERIC(12,4),          -- hard cap regardless of baseline
    evaluation_window_minutes  INT NOT NULL DEFAULT 5,
    is_active                  BOOLEAN NOT NULL DEFAULT TRUE
);

CREATE TABLE CanaryMetricSample (
    sample_id      BIGSERIAL PRIMARY KEY,
    deployment_id  BIGINT NOT NULL,
    cohort         VARCHAR(10) NOT NULL,  -- 'canary' or 'baseline'
    metric_name    VARCHAR(100) NOT NULL,
    value          NUMERIC(12,4) NOT NULL,
    sampled_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE CanaryEvaluation (
    eval_id        BIGSERIAL PRIMARY KEY,
    deployment_id  BIGINT NOT NULL,
    evaluated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    result         VARCHAR(20) NOT NULL,  -- pass, fail, insufficient_data
    details        JSONB NOT NULL DEFAULT '{}',
    action_taken   VARCHAR(30)            -- promoted, rolled_back, none
);

CREATE INDEX ON CanaryMetricSample(deployment_id, cohort, metric_name, sampled_at DESC);
CREATE INDEX ON Deployment(service_name, status);
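The status values listed for the Deployment table imply a state machine, but the schema does not say which transitions are legal. The sketch below is one plausible reading, not the author's definition: the edges in ALLOWED_TRANSITIONS and the helper name can_transition are assumptions introduced for illustration.

```python
# Hypothetical transition table for Deployment.status. The edges are an
# assumption; the schema itself does not enforce any of them.
ALLOWED_TRANSITIONS = {
    "canary": {"promoting", "stable", "rolling_back", "failed"},
    "promoting": {"stable", "rolling_back", "failed"},
    "rolling_back": {"rolled_back", "failed"},
    "stable": set(),       # treated as terminal in this sketch
    "rolled_back": set(),  # terminal
    "failed": set(),       # terminal
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


if __name__ == "__main__":
    print(can_transition("canary", "rolling_back"))
    print(can_transition("rolled_back", "canary"))
```

A check like this could guard every status UPDATE so that, for example, a deployment that has already been rolled back is never promoted by a late evaluation.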
Traffic Splitting
import hashlib

def get_deployment_version(service_name: str, request_id: str) -> str:
    """
    Returns 'canary' or 'baseline' for a given request.
    Deterministic: the same request_id always routes to the same version
    for a given deployment. Pass a stable key (such as the user ID) as
    request_id if a user should stay on one version across requests.
    """
    # `db` is assumed to be a thin query helper that returns rows as dicts.
    deployment = db.fetchone("""
        SELECT deployment_id, image_tag, canary_pct, status
        FROM Deployment
        WHERE service_name=%s AND status='canary'
        ORDER BY started_at DESC LIMIT 1
    """, (service_name,))
    if not deployment:
        return 'baseline'  # no active canary

    # Hash (deployment_id, request_id) into a bucket in [0, 100).
    bucket = int(hashlib.md5(
        f"{deployment['deployment_id']}:{request_id}".encode()
    ).hexdigest()[:4], 16) % 100
    return 'canary' if bucket < deployment['canary_pct'] else 'baseline'

# In the load balancer / API gateway:
#   version = get_deployment_version('payments-service', str(request.user_id))
#   if version == 'canary':
#       forward_to(CANARY_UPSTREAM)
#   else:
#       forward_to(STABLE_UPSTREAM)
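To see how the hash bucketing behaves without a database, the following sketch isolates the bucket computation and measures what share of synthetic keys lands in the canary. The names bucket_for and canary_fraction, the deployment ID 42, and the "user-N" keys are all made up for illustration.

```python
import hashlib


def bucket_for(deployment_id: int, routing_key: str) -> int:
    """Map (deployment_id, routing_key) to a stable bucket in [0, 100)."""
    digest = hashlib.md5(f"{deployment_id}:{routing_key}".encode()).hexdigest()
    return int(digest[:4], 16) % 100


def canary_fraction(deployment_id: int, canary_pct: int,
                    n_keys: int = 10_000) -> float:
    """Fraction of synthetic keys that would be routed to the canary."""
    hits = sum(
        1 for i in range(n_keys)
        if bucket_for(deployment_id, f"user-{i}") < canary_pct
    )
    return hits / n_keys


if __name__ == "__main__":
    # Expect a value close to 0.05 for a 5% canary.
    print(canary_fraction(deployment_id=42, canary_pct=5))
```

Because a key's bucket depends only on the deployment ID and the key, the split is stable across gateway instances without any shared session state.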
Guardrail Evaluation
import datetime
import json
import statistics

def evaluate_canary(deployment_id: int) -> dict:
    """
    Compare canary metrics vs baseline metrics.
    Auto-promotes if all guardrails pass + sufficient time elapsed.
    Auto-rolls back if any guardrail fails.
    Takes no action while any guardrail lacks sufficient data.
    """
    deployment = db.fetchone(
        "SELECT * FROM Deployment WHERE deployment_id=%s", (deployment_id,)
    )
    if not deployment or deployment['status'] != 'canary':
        return {'result': 'skipped'}

    guardrails = db.fetchall("""
        SELECT * FROM CanaryGuardrail
        WHERE service_name=%s AND is_active=TRUE
    """, (deployment['service_name'],))

    failures = []
    details = {}
    missing_data = False
    for g in guardrails:
        canary_val = _compute_metric(deployment_id, 'canary', g['metric_name'],
                                     g['evaluation_window_minutes'])
        baseline_val = _compute_metric(deployment_id, 'baseline', g['metric_name'],
                                       g['evaluation_window_minutes'])
        if canary_val is None or baseline_val is None:
            details[g['metric_name']] = 'insufficient_data'
            missing_data = True
            continue

        # Check relative degradation (skipped when baseline is 0, where a
        # percentage delta is undefined; absolute_max covers that case)
        if baseline_val > 0:
            delta_pct = (canary_val - baseline_val) / baseline_val * 100
            if delta_pct > g['max_delta_pct']:
                failures.append({
                    'metric': g['metric_name'],
                    'canary': canary_val,
                    'baseline': baseline_val,
                    'delta_pct': round(delta_pct, 2),
                    # NUMERIC columns may arrive as Decimal (as with
                    # psycopg2); cast so json.dumps can serialize them
                    'threshold_pct': float(g['max_delta_pct']),
                })

        # Check absolute cap (`is not None` so a cap of 0 is still enforced)
        if g['absolute_max'] is not None and canary_val > g['absolute_max']:
            failures.append({
                'metric': g['metric_name'],
                'canary': canary_val,
                'absolute_max': float(g['absolute_max']),
            })

        details[g['metric_name']] = {
            'canary': canary_val,
            'baseline': baseline_val,
        }

    if failures:
        result = 'fail'
    elif missing_data:
        result = 'insufficient_data'  # neither promote nor roll back yet
    else:
        result = 'pass'

    action = None
    if result == 'fail':
        _rollback(deployment_id, str(failures))
        action = 'rolled_back'
    elif result == 'pass':
        # started_at is TIMESTAMPTZ; assuming the driver returns it as a
        # timezone-aware datetime, subtract it from an aware UTC "now"
        # (a naive utcnow() would raise TypeError).
        elapsed_minutes = (
            datetime.datetime.now(datetime.timezone.utc) - deployment['started_at']
        ).total_seconds() / 60
        if deployment['auto_promote'] and elapsed_minutes >= deployment['promote_after_minutes']:
            _promote(deployment_id)
            action = 'promoted'

    db.execute("""
        INSERT INTO CanaryEvaluation (deployment_id, result, details, action_taken)
        VALUES (%s,%s,%s,%s)
    """, (deployment_id, result, json.dumps({**details, 'failures': failures}), action))
    return {'result': result, 'failures': failures, 'action': action}

def _compute_metric(deployment_id, cohort, metric_name, window_minutes):
    # Multiply a bound integer by a one-minute interval instead of placing
    # the placeholder inside a quoted literal, which breaks with drivers
    # that bind parameters server-side.
    rows = db.fetchall("""
        SELECT value FROM CanaryMetricSample
        WHERE deployment_id=%s AND cohort=%s AND metric_name=%s
          AND sampled_at >= NOW() - %s * INTERVAL '1 minute'
    """, (deployment_id, cohort, metric_name, window_minutes))
    if len(rows) < 10:  # require minimum sample size
        return None
    values = [float(r['value']) for r in rows]
    if 'p99' in metric_name:
        # Index-based p99; with fewer than 100 samples this is the maximum.
        values.sort()
        return values[int(len(values) * 0.99)]
    return statistics.mean(values)

def _rollback(deployment_id, reason):
    db.execute("""
        UPDATE Deployment SET status='rolling_back', rollback_reason=%s
        WHERE deployment_id=%s
    """, (reason[:500], deployment_id))
    # Trigger orchestrator to shift all traffic back to stable

def _promote(deployment_id):
    db.execute("""
        UPDATE Deployment SET status='stable', promoted_at=NOW(), canary_pct=100
        WHERE deployment_id=%s
    """, (deployment_id,))
Key Design Decisions
- Relative delta guardrails (not only absolute thresholds): a fixed check such as "canary error rate < 1%" misses a service whose baseline is 0.5%; the canary could be nearly 2× worse (0.9%) and still pass. A delta comparison (canary < baseline * 1.2, i.e. max_delta_pct = 20) detects degradation relative to the current baseline regardless of its absolute level. The optional absolute_max column adds a hard cap on top of the relative check.
- Deterministic hash routing: hashing the routing key together with the deployment_id ensures the same key always hits the same version during the canary window. When the key is the user ID, as in the gateway example, this avoids split-brain user experiences where a user sees different behavior on consecutive requests. Including the deployment_id in the seed lets the same user fall into different cohorts for different deployments, so the same users are not always the first exposed to a new release.
- Minimum sample size requirement: evaluating a canary with only 3 data points produces unreliable results. Requiring at least 10 samples per cohort before evaluating prevents false failures during the initial ramp-up. Factor canary_pct into the evaluation window: if sample volume scales with traffic, a canary at 1% needs 10× longer to accumulate the same sample count as one at 10%.
- Gradual promotion vs. instant: instead of jumping from 5% to 100%, advance canary_pct in steps (5% → 10% → 25% → 50% → 100%) with a guardrail evaluation between each step. Each step increases the blast radius of a bad release but provides more data. Implement by updating canary_pct in the Deployment row; the traffic splitter reads it on each routing decision, and because a key's bucket is fixed for a given deployment, raising canary_pct only moves additional keys onto the canary.
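The relative-delta check and the stepped promotion described in the list above can be sketched together as pure functions, with no database involved. This is a minimal illustration under stated assumptions: check_guardrail, next_canary_pct, advance, and PROMOTION_STEPS are hypothetical names, and rolling back by setting the percentage to 0 is a simplification of the rollback path.

```python
from typing import List, Optional

# Step schedule taken from the text: 5% -> 10% -> 25% -> 50% -> 100%.
PROMOTION_STEPS = (5, 10, 25, 50, 100)


def check_guardrail(canary: float, baseline: float, max_delta_pct: float,
                    absolute_max: Optional[float] = None) -> List[str]:
    """Return a list of violations; an empty list means the guardrail passes."""
    violations = []
    if baseline > 0:
        delta_pct = (canary - baseline) / baseline * 100
        if delta_pct > max_delta_pct:
            violations.append(f"relative delta {delta_pct:.1f}% > {max_delta_pct}%")
    # `is not None` so that a cap of 0 is still enforced.
    if absolute_max is not None and canary > absolute_max:
        violations.append(f"value {canary} > absolute max {absolute_max}")
    return violations


def next_canary_pct(current_pct: int) -> Optional[int]:
    """Return the next step above current_pct, or None at 100%."""
    for step in PROMOTION_STEPS:
        if step > current_pct:
            return step
    return None


def advance(current_pct: int, violations: List[str]) -> int:
    """One promotion decision: drop to 0 on any violation, else step up."""
    if violations:
        return 0  # all traffic returns to the stable version
    nxt = next_canary_pct(current_pct)
    return current_pct if nxt is None else nxt


if __name__ == "__main__":
    # p99 latency in ms: 230 vs 200 is +15%, within a 20% guardrail.
    ok = check_guardrail(canary=230.0, baseline=200.0, max_delta_pct=20.0)
    print(advance(5, ok))    # steps up to 10
    # 260 vs 200 is +30%, which violates the guardrail.
    bad = check_guardrail(canary=260.0, baseline=200.0, max_delta_pct=20.0)
    print(advance(25, bad))  # rolls back to 0
```

Keeping the decision logic free of I/O makes it straightforward to unit-test the thresholds before wiring them to the Deployment and CanaryGuardrail tables.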