Low Level Design: Blue-Green and Canary Deployments

Deployment strategies control how new software versions are rolled out to production. Blue-green and canary deployments minimize risk by limiting the blast radius of bad releases. The choice between them depends on whether you want instant rollback capability or gradual validation with real traffic.

Blue-Green Deployment

Maintain two identical production environments: blue (current version) and green (new version). Deploy the new version to the green environment. Run smoke tests and verification against green. Switch the load balancer to route all traffic to green. If problems are detected, switch back to blue instantly. Blue remains on standby until green is confirmed stable, then blue becomes the next deployment target.
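
The cutover and rollback steps can be sketched in a few lines. This is a toy illustration, not a real load balancer: the `BlueGreenRouter` class, its method names, and the callable "backends" are all hypothetical stand-ins for two environments and the switch in front of them.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Backend = Callable[[str], str]


@dataclass
class BlueGreenRouter:
    """Toy stand-in for a load balancer that sends all traffic to one environment."""

    backends: Dict[str, Backend]
    live: str = "blue"

    def idle(self) -> str:
        # The environment that is not currently serving traffic.
        return "green" if self.live == "blue" else "blue"

    def handle(self, request: str) -> str:
        # Every request goes to whichever environment is live.
        return self.backends[self.live](request)

    def cutover(self, smoke_test: Callable[[Backend], bool]) -> bool:
        """Switch to the idle environment only if its smoke test passes."""
        candidate = self.idle()
        if not smoke_test(self.backends[candidate]):
            return False  # live environment is left untouched
        self.live = candidate
        return True

    def rollback(self) -> None:
        # The previous environment is still running, so rollback is just a flip.
        self.live = self.idle()
```

The point of the sketch is that both cutover and rollback are a single pointer change; nothing is redeployed in either direction.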

Blue-Green Tradeoffs

Advantages: instant cutover, instant rollback, zero-downtime deployment, and the new version can be tested in the real production environment before it receives live traffic. Disadvantages: it requires double the infrastructure capacity while both environments are running, database schema changes must be compatible with both versions at once, and stateful services (sessions, long-lived connections) must handle the cutover without dropping requests, for example by draining in-flight connections before the old environment goes idle.

Canary Deployment

Route a small percentage of traffic (1-5%) to the new version while keeping the rest on the stable version. Monitor error rates, latency, and business metrics for the canary cohort. If metrics are healthy, gradually increase the percentage (5% → 20% → 50% → 100%). If problems are detected at any stage, route traffic back to the stable version. The blast radius is limited to the canary percentage.
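
The staged promotion described above can be sketched as a loop over weights. This is a minimal sketch: `progressive_rollout` and the `is_healthy` callback are hypothetical names, and a real system would wait and collect metrics at each stage rather than call a function once.

```python
from typing import Callable, List, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[int],
    is_healthy: Callable[[int], bool],
) -> Tuple[int, List[int]]:
    """Walk through canary percentages.

    Returns the final canary weight and the history of weights applied.
    On the first unhealthy stage, the weight is set back to 0 (all traffic
    on the stable version) and the rollout stops.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    history: List[int] = []
    for pct in stages:
        history.append(pct)          # shift this share of traffic to the canary
        if not is_healthy(pct):      # check metrics for the canary cohort
            history.append(0)        # route everything back to stable
            return 0, history
    return stages[-1], history
```

With stages `[5, 20, 50, 100]`, a failure detected at 50% has affected at most half of the traffic, which is the blast-radius limit the paragraph describes.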

Traffic Splitting

Traffic splitting is implemented at the load balancer or API gateway. Common approaches are weighted round-robin (1% to canary, 99% to stable), header-based routing (X-Canary: true routes to canary for internal testers), and user-ID-based routing (a consistent hash ensures a given user always hits the same version). Service meshes also provide first-class canary routing: Istio through VirtualService route weights, and Linkerd through its own traffic-split resources.
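
Header-based and user-ID-based routing can be combined in one decision function. The sketch below is an illustration under assumptions: `choose_version` and `bucket_for` are hypothetical names, the header lookup assumes header names have already been normalized to the exact spelling `X-Canary`, and SHA-256 is just one reasonable choice of stable hash.

```python
import hashlib
from typing import Mapping

BUCKETS = 10_000  # resolution of 0.01 percent


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def choose_version(user_id: str, headers: Mapping[str, str], canary_percent: float) -> str:
    """Return "canary" or "stable" for one request."""
    # Explicit opt-in for internal testers wins over the percentage split.
    if headers.get("X-Canary", "").lower() == "true":
        return "canary"
    # Users whose bucket falls below the threshold go to the canary.
    threshold = round(canary_percent * BUCKETS / 100)
    return "canary" if bucket_for(user_id) < threshold else "stable"
```

Because the bucket depends only on the user ID, a user who is in the canary at 5% stays in it when the weight is raised to 20%, so nobody flips back and forth between versions as the rollout progresses.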

Automated Canary Analysis

Automated canary analysis (for example Kayenta, the canary analysis service used with Spinnaker) compares metrics between the canary and a baseline: error rate, p99 latency, conversion rate. A statistical comparison determines whether the canary is significantly worse than the baseline. If the canary passes the analysis threshold, promotion continues automatically. If it fails, automatic rollback is triggered. This removes most manual judgment calls from progressive rollouts, although the metrics and thresholds themselves still have to be chosen carefully.
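
As an illustration of what "significantly worse" can mean for one metric, the sketch below runs a one-sided two-proportion z-test on error rates. This is a deliberately simple stand-in chosen for the example, not the algorithm used by Kayenta or any other tool; the function name and the default `alpha` are assumptions.

```python
import math


def canary_error_rate_verdict(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    alpha: float = 0.05,
) -> str:
    """Return "fail" if the canary error rate is significantly higher than baseline."""
    if baseline_total <= 0 or canary_total <= 0:
        raise ValueError("both groups need at least one request")
    p_base = baseline_errors / baseline_total
    p_canary = canary_errors / canary_total
    # Pooled error rate under the null hypothesis that both versions behave the same.
    pooled = (baseline_errors + canary_errors) / (baseline_total + canary_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / canary_total))
    if se == 0:
        return "pass"  # both groups have identical 0% or 100% error rates
    z = (p_canary - p_base) / se
    # One-sided p-value: probability of a z at least this large under the null.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return "fail" if p_value < alpha else "pass"
```

A one-sided test fits the question being asked: the rollout should stop only when the canary is worse, not when it happens to be better than the baseline.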

Database Schema Compatibility

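Both blue-green and canary require the database schema to be compatible with the old and new application versions at the same time, because both versions run against the same database during the rollout. The expand-contract migration pattern achieves this in stages: first expand the schema (add the new column and make it nullable so the old version can keep writing rows), deploy the new version that writes to both old and new columns, backfill existing rows, and only then contract (remove the old column) once every instance runs the new version and nothing reads the old column. Never ship a breaking schema change and the application change that depends on it as a single step, because a rollback of the application would then leave it running against a schema it cannot use.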
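
The expand-contract sequence can be walked through with Python's built-in sqlite3 module. This is a sketch under assumptions: the `users` table and the `fullname` and `display_name` columns are hypothetical, the backfill of existing rows is included as a usual part of the pattern, and the contract step rebuilds the table so the example also runs on older SQLite versions that lack column dropping.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Step 1: add the new column as nullable, so old code that never sets it keeps working.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def write_user_dual(conn: sqlite3.Connection, user_id: int, name: str) -> None:
    # Step 2: the new application version writes both the old and the new column.
    conn.execute(
        "INSERT INTO users (id, fullname, display_name) VALUES (?, ?, ?)",
        (user_id, name, name),
    )


def backfill(conn: sqlite3.Connection) -> None:
    # Step 3: copy data for rows written before or by the old version.
    conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def contract(conn: sqlite3.Connection) -> None:
    # Step 4: remove the old column once no running instance uses it.
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
        INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )
```

Between `expand` and `contract`, an old-version insert that only sets `fullname` still succeeds, which is what allows blue and green (or stable and canary) to share one database during the rollout.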

Feature Flags vs Canary

Feature flags and canary deployments are complementary. Canary controls which deployment version serves traffic. Feature flags control which code paths execute within a version. Canary is an infrastructure-level mechanism; feature flags are application-level. A feature can be deployed to 100% of instances (deployment) but only enabled for 1% of users (feature flag). Separating deployment from release reduces coordination overhead.
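
The "deployed to 100% of instances, enabled for 1% of users" case can be sketched as a percentage flag evaluated inside the application. The names `flag_enabled`, `checkout`, and the `new-checkout` flag are hypothetical, and real flag systems usually read the percentage from a remote configuration service rather than a function argument.

```python
import hashlib


def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically enable a flag for roughly rollout_percent of users."""
    # Including the flag name means different flags select different user subsets.
    key = (flag_name + ":" + user_id).encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 100
    return bucket < rollout_percent


def checkout(user_id: str, rollout_percent: int) -> str:
    # Both code paths ship in the same deployed version; the flag picks one at runtime.
    if flag_enabled("new-checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "legacy checkout flow"
```

Turning the feature off is a configuration change (set the percentage to 0), with no traffic shift and no redeployment involved.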

Rollback Strategy

Blue-green rollback: flip the load balancer back to blue. Canary rollback: reduce canary weight to 0%. Both require that the previous version is still running and ready to serve traffic. Automated rollback triggers: error rate exceeds threshold for N consecutive minutes, p99 latency increases by X% relative to baseline, canary analysis fails. Rollback should be a one-click or automated operation, not a manual redeployment.
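
The first two automated triggers can be sketched as a small stateful check fed once per minute. The `RollbackTrigger` class and its field names are hypothetical, and the thresholds in the usage note are example values, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class RollbackTrigger:
    """Decides whether to roll back, given one metrics sample per minute."""

    error_rate_threshold: float      # e.g. 0.01 for 1%
    consecutive_minutes: int         # the "N" in "N consecutive minutes"
    max_p99_increase_pct: float      # the "X" in "increases by X%"
    _breaches: int = 0               # current run of minutes over the error threshold

    def observe(self, error_rate: float, p99_ms: float, baseline_p99_ms: float) -> bool:
        """Record one minute of metrics; return True when rollback should fire."""
        if error_rate > self.error_rate_threshold:
            self._breaches += 1
        else:
            self._breaches = 0  # a healthy minute resets the streak
        sustained_errors = self._breaches >= self.consecutive_minutes
        latency_limit = baseline_p99_ms * (1 + self.max_p99_increase_pct / 100)
        latency_regressed = p99_ms > latency_limit
        return sustained_errors or latency_regressed
```

For example, `RollbackTrigger(0.01, 3, 20.0)` fires after three consecutive minutes above a 1% error rate, or as soon as p99 latency exceeds the baseline by more than 20%. Requiring consecutive breaches keeps a single noisy minute from triggering a rollback.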