Question 1

How does a canary deployment differ from blue-green deployment?

Accepted Answer

Blue-green deployment: maintain two identical production environments (blue = current version, green = new version). All traffic switches from blue to green at cutover, and rollback is equally fast (switch back to blue). Drawbacks: it requires double the infrastructure, and because the switch is all-at-once there is no gradual validation window.

Canary deployment: route a small percentage of traffic (1–10%) to the new version, observe metrics for 10–30 minutes, then gradually increase to 100%. Rollback sends the canary traffic back to the stable version. Advantages: gradual validation with real production traffic, and much less extra infrastructure, since the same cluster runs both versions simultaneously. Disadvantage: the mixed-version state must be managed carefully (no backwards-incompatible DB schema changes during a canary).

Both are zero-downtime deployment strategies. Blue-green is simpler and better suited to batch jobs or services with stateful sessions; canary is better suited to stateless APIs, where gradual traffic shifting is easy.
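To make the routing difference concrete, here is a minimal sketch of one common way to split traffic: hash a stable request key into one of 100 buckets and send the lowest buckets to the canary. The function names (bucket, pick_canary, pick_blue_green) and the choice of SHA-256 are assumptions for illustration, not part of the answer above.

```python
import hashlib


def bucket(key: str) -> int:
    """Map a stable key (user ID, session ID) to a bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_canary(key: str, canary_pct: int) -> str:
    """Canary: buckets below canary_pct go to the new version."""
    return "canary" if bucket(key) < canary_pct else "stable"


def pick_blue_green(active_env: str) -> str:
    """Blue-green: every request goes to whichever environment is active."""
    return active_env


if __name__ == "__main__":
    for user in ("user-1", "user-2", "user-3"):
        print(user, pick_canary(user, canary_pct=10))
    print(pick_blue_green("green"))
```

Because the bucket depends only on the key, raising canary_pct from 5 to 25 keeps every key that was already on the canary there, which avoids users flipping between versions as the rollout advances.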

Question 2

How do you ensure database schema changes are compatible with canary deployments?

Accepted Answer

During a canary, both old and new application code run simultaneously. A schema change that removes a column breaks the old code; adding a NOT NULL column without a default breaks old code, because its INSERTs don't supply a value for it. Compatibility rules for canary deployments:

(1) Only additive changes during a canary window: add columns (nullable, or with a default), add tables, add indexes.
(2) Never remove or rename during a canary: the old code still uses the old name.
(3) Use the expand/contract pattern: add new column → deploy canary (writes both old and new) → promote → drop old column in a later migration.
(4) API responses: the new version's responses should include all fields the old version's clients expect. Add new fields, but don't remove existing ones during the canary window.

Gate any code path that depends on a breaking schema change behind a feature flag that stays OFF during the canary and is enabled only after full promotion.
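The expand phase of expand/contract can be sketched with an in-memory SQLite database. The table and column names (users, full_name, display_name) and the helper functions are hypothetical; the point is only that old-style writes keep working after an additive, nullable column is introduced, while the canary dual-writes.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Additive change: a nullable column that old code can safely ignore.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_insert(conn: sqlite3.Connection, name: str) -> None:
    # Stable version: knows only the original column.
    conn.execute("INSERT INTO users (full_name) VALUES (?)", (name,))


def new_insert(conn: sqlite3.Connection, name: str) -> None:
    # Canary version: dual-writes so stable readers still see full_name.
    conn.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)",
        (name, name),
    )


def backfill(conn: sqlite3.Connection) -> None:
    # Run after promotion, before the old column is dropped in a later migration.
    conn.execute(
        "UPDATE users SET display_name = full_name WHERE display_name IS NULL"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
)
old_insert(conn, "Ada")    # before the migration
expand(conn)
old_insert(conn, "Grace")  # old code still works after the expand step
new_insert(conn, "Linus")  # canary code writes both columns
backfill(conn)
rows = conn.execute(
    "SELECT full_name, display_name FROM users ORDER BY id"
).fetchall()
print(rows)
```

The contract step (dropping full_name) is deliberately left out: it belongs in a separate migration once no running version reads the old column.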

Question 3

How do you handle stateful sessions during a canary deployment?

Accepted Answer

Session stickiness: if a user's first request hits the canary, route all subsequent requests in the same session to the canary until the session expires. Without stickiness, a user might get canary behavior on one page and stable behavior on the next, which creates an inconsistent experience. Implementation: on the first request, set a cookie (for example X-Canary-Version, with the value canary or stable); the load balancer checks that cookie on every later request and routes to the same version. Stateless APIs (JWT authentication, no server-side sessions) don't need stickiness: each request is independent, and it is acceptable for the same user to hit different versions on different requests as long as the API response contract is consistent between versions.
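A minimal sketch of the cookie check a load balancer or edge proxy might perform is shown below. The route function, its dictionary-based cookie handling, and the injectable random source are assumptions made for illustration; a real proxy would express the same logic in its own configuration language.

```python
import random

COOKIE_NAME = "X-Canary-Version"  # cookie name taken from the answer above


def route(cookies: dict, canary_pct: float, rng=random):
    """Return (version, cookies_to_set) for one incoming request."""
    pinned = cookies.get(COOKIE_NAME)
    if pinned in ("canary", "stable"):
        # Returning visitor: honor the earlier assignment, set nothing new.
        return pinned, {}
    # First request: assign a version by weight and pin it with a cookie.
    version = "canary" if rng.random() * 100 < canary_pct else "stable"
    return version, {COOKIE_NAME: version}


if __name__ == "__main__":
    version, to_set = route({}, canary_pct=5)
    print(version, to_set)
    # The follow-up request carries the cookie and lands on the same version.
    print(route(to_set, canary_pct=5))
```

One design point this sketch leaves open: after a rollback, cookies that still say canary would need to be ignored or overwritten, otherwise pinned users keep being sent to a version that no longer receives traffic.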

Question 4

What metrics should trigger an automatic rollback during a canary?

Accepted Answer

Rollback guardrails in priority order:

(1) Error rate: roll back when the canary error rate exceeds the baseline error rate by more than 20% relative. This is the most important signal, because errors are immediately user-visible. Threshold: max_delta_pct=20, metric_name='error_rate'.
(2) p99 latency: roll back when the canary p99 exceeds the baseline by more than 50% (users are experiencing slow responses). Threshold: max_delta_pct=50, metric_name='p99_latency_ms'.
(3) Business metrics: compare conversion rate for canary users against control. If checkout conversion drops by more than 5% relative, roll back. This requires instrumenting business KPIs as custom metric samples.
(4) Hard caps: absolute maximums that apply regardless of baseline: error rate > 2%, p99 > 5,000ms. These catch the case where the baseline itself is degraded: with a baseline error rate of 4%, a canary at 4.5% passes the 20% relative check but is still unacceptable.

Don't roll back on minor latency increases within normal variance (a p99 that is 5% higher is noise) or on CPU/memory spikes without user-visible impact. Require a minimum of 10 samples before evaluating, to avoid false positives during the initial traffic ramp-up.
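The guardrail rules above can be sketched as a small evaluator. The field names max_delta_pct and metric_name come from the answer; the Guardrail class, the hard_cap and lower_is_worse fields, the checkout_conversion metric name, and the evaluate function are assumptions for illustration. Skipping the relative check when the baseline is zero is also an assumption, since a relative delta is undefined there.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MIN_SAMPLES = 10  # from the text: do not evaluate on fewer samples


@dataclass(frozen=True)
class Guardrail:
    metric_name: str
    max_delta_pct: float               # allowed relative change vs. baseline (%)
    hard_cap: Optional[float] = None   # absolute ceiling, ignores the baseline
    lower_is_worse: bool = False       # True for metrics such as conversion


GUARDRAILS = [
    Guardrail("error_rate", max_delta_pct=20, hard_cap=0.02),
    Guardrail("p99_latency_ms", max_delta_pct=50, hard_cap=5000),
    Guardrail("checkout_conversion", max_delta_pct=5, lower_is_worse=True),
]


def is_breached(g: Guardrail, baseline: float, canary: float) -> bool:
    if g.hard_cap is not None and canary > g.hard_cap:
        return True
    if baseline <= 0:
        return False  # relative delta undefined; only the hard cap applies
    delta_pct = (canary - baseline) / baseline * 100
    if g.lower_is_worse:
        delta_pct = -delta_pct  # a drop counts as the regression
    return delta_pct > g.max_delta_pct


def evaluate(baseline: Dict[str, float], canary: Dict[str, float],
             sample_count: int) -> str:
    """Return 'wait', 'rollback', or 'pass' for one evaluation tick."""
    if sample_count < MIN_SAMPLES:
        return "wait"
    for g in GUARDRAILS:  # list order doubles as priority order
        if g.metric_name in baseline and g.metric_name in canary:
            if is_breached(g, baseline[g.metric_name], canary[g.metric_name]):
                return "rollback"
    return "pass"
```

For example, a baseline error rate of 1% against a canary at 1.3% is a 30% relative increase and returns rollback, while a canary at 1.1% stays within the 20% budget.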

Question 5

How do you implement a gradual promotion (0% → 1% → 5% → 25% → 100%) with automatic advancement?

Accepted Answer

Progressive delivery: define the advancement stages as a schedule in the Deployment record or in a separate CanarySchedule table: [{pct:1, soak_minutes:10}, {pct:5, soak_minutes:15}, {pct:25, soak_minutes:30}, {pct:100, soak_minutes:0}]. The evaluate_canary() job runs every minute. When all guardrails pass and the current stage's soak_minutes have elapsed, it advances to the next stage by updating canary_pct and the load balancer's routing weights. Implementation: the advancement check computes elapsed_minutes = (now - last_pct_change_at).total_seconds() / 60. If elapsed_minutes >= stage.soak_minutes and the guardrails pass, advance to the next stage and reset last_pct_change_at. If the service handles 10K RPM and canary_pct=1, the canary gets 100 RPM, or about 1,000 requests over the 10-minute soak, which is enough to accumulate meaningful error rate statistics. At 5%, 500 RPM accumulates signal five times faster.
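The advancement check can be sketched as a pure function that evaluate_canary() would call once per minute. The schedule and the elapsed-time formula come from the answer; the function names next_canary_pct and canary_rpm, and the choice to keep the current percentage when guardrails fail (leaving rollback to a separate path), are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Schedule from the answer above.
STAGES = [
    {"pct": 1, "soak_minutes": 10},
    {"pct": 5, "soak_minutes": 15},
    {"pct": 25, "soak_minutes": 30},
    {"pct": 100, "soak_minutes": 0},
]


def next_canary_pct(current_pct: int, last_pct_change_at: datetime,
                    now: datetime, guardrails_pass: bool) -> int:
    """Return the canary percentage that should be in effect after this tick."""
    if not guardrails_pass:
        return current_pct  # never advance on a failing check
    if current_pct == 0:
        return STAGES[0]["pct"]  # start of the rollout
    pcts = [s["pct"] for s in STAGES]
    idx = pcts.index(current_pct)  # raises ValueError for an unknown stage
    if idx == len(STAGES) - 1:
        return current_pct  # already fully promoted
    elapsed_minutes = (now - last_pct_change_at).total_seconds() / 60
    if elapsed_minutes >= STAGES[idx]["soak_minutes"]:
        return pcts[idx + 1]
    return current_pct


def canary_rpm(total_rpm: float, canary_pct: float) -> float:
    """Requests per minute the canary receives at a given percentage."""
    return total_rpm * canary_pct / 100


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    print(next_canary_pct(1, start, start + timedelta(minutes=9), True))
    print(next_canary_pct(1, start, start + timedelta(minutes=10), True))
    print(canary_rpm(10_000, 1))
```

When the returned value differs from current_pct, the caller would persist the new canary_pct, set last_pct_change_at to now, and push the new weights to the load balancer.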

Canary Deployment System Low-Level Design: Traffic Splitting, Guardrail Evaluation, and Automated Rollback

Core Data Model

Traffic Splitting

Guardrail Evaluation

Key Design Decisions