System Design: Zero-Downtime Deployment — Blue-Green, Canary, Rolling Updates, Kubernetes, Feature Flags

Zero-downtime deployment means users experience no interruption when new code is released. It is effectively a requirement for production systems with SLA commitments. This guide covers the deployment strategies that companies like Google, Netflix, and Amazon use to ship code hundreds of times per day without downtime, which makes it useful background for system design and SRE interviews.

Blue-Green Deployment

Blue-green deployment maintains two identical production environments. The blue environment runs the current version and serves all traffic; the green environment is idle. Deployment process: (1) Deploy the new version to the green environment. (2) Run smoke tests against green. (3) Switch the load balancer to route traffic from blue to green. (4) Monitor green for errors. If problems are detected, switch traffic back to blue (instant rollback). (5) After a confidence period, decommission blue, or keep it as the idle environment for the next deployment cycle. Advantages: instant rollback (just switch the load balancer back), full environment testing before the traffic switch, and no mixed-version traffic. Disadvantages: it requires double the infrastructure (expensive); database migrations must be compatible with both versions, because blue and green connect to the same database; and the traffic switch is all-or-nothing, so 100% of users get the new version at once with no gradual rollout.
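The five steps above can be sketched as a small simulation. This is a minimal sketch, not a real load balancer integration: `BlueGreenRouter`, `release`, and the `smoke_test` / `healthy_after_switch` callbacks are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Toy stand-in for a load balancer that sends all traffic to one environment."""
    versions: dict = field(default_factory=lambda: {"blue": "v1", "green": None})
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # Step 1: the new version is installed on the idle environment only.
        self.versions[self.idle] = version

    def switch(self) -> None:
        # Step 3 (and rollback): move 100% of traffic in a single operation.
        if self.versions[self.idle] is None:
            raise RuntimeError("idle environment has nothing deployed")
        self.live = self.idle

    def serving_version(self) -> str:
        return self.versions[self.live]


def release(router, version, smoke_test, healthy_after_switch) -> bool:
    """Return True if the new version stays live, False if rejected or rolled back."""
    router.deploy_to_idle(version)
    if not smoke_test(version):            # Step 2: test before any user sees it.
        return False
    router.switch()                        # Step 3: cut traffic over.
    if not healthy_after_switch(version):  # Step 4: monitor the new environment.
        router.switch()                    # Rollback is just another switch.
        return False
    return True
```

Because rollback is the same operation as the cutover, the previous environment has to stay deployed and untouched until the confidence period ends.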

Canary Deployment

Canary deployment gradually shifts traffic to the new version. Process: (1) Deploy the new version to a small subset of instances (1-5% of capacity). (2) Route 1-5% of traffic to the canary instances. (3) Monitor key metrics: error rate, latency P50/P95/P99, and business metrics (conversion rate, revenue). (4) If metrics are healthy after the observation window (10-30 minutes), increase traffic to 25%, then 50%, then 100%. (5) If metrics degrade at any stage, route all traffic back to the old version and investigate. Canary advantages: it limits blast radius (a bug affects only 1-5% of users during the canary phase), provides real production validation, and allows early detection through gradual rollout. Implementation: a plain Kubernetes Deployment and Service can only approximate a traffic split through the ratio of old to new pods, so precise percentage-based splitting is usually added with a service mesh such as Istio or Linkerd, or with a progressive delivery controller such as Argo Rollouts. Argo Rollouts automates the canary process: define traffic percentages and observation windows in a Rollout resource, and Argo promotes or rolls back automatically based on metric thresholds.
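The promote-or-abort loop described above can be sketched in a few lines. This is an illustrative sketch only: `run_canary` and the `metrics_healthy` callback are hypothetical names, and a real controller would wait out the observation window and query a metrics backend at each step.

```python
from typing import Callable, List, Sequence, Tuple


def run_canary(
    steps: Sequence[int],
    metrics_healthy: Callable[[int], bool],
) -> Tuple[int, List[int]]:
    """Walk through canary traffic percentages.

    Returns (final_percent, history). final_percent is 0 if the rollout
    was aborted and all traffic went back to the old version.
    """
    if not steps:
        raise ValueError("at least one traffic step is required")
    history: List[int] = []
    for percent in steps:
        history.append(percent)           # shift this share of traffic to the canary
        if not metrics_healthy(percent):  # evaluated after the observation window
            history.append(0)             # abort: route everything to the old version
            return 0, history
    return history[-1], history
```

For example, `run_canary([5, 25, 50, 100], check)` promotes to 100% only if `check` passes at every step; a single failure sends the canary share back to 0.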

Rolling Updates in Kubernetes

Rolling updates replace old pods with new pods incrementally, a few at a time. Two settings in the Kubernetes Deployment configuration control the pace: maxSurge (how many extra pods may be created above the desired replica count during the update, default 25%) and maxUnavailable (how many pods may be unavailable during the update, default 25%). For a deployment with 10 replicas: maxSurge=2 means up to 12 pods can exist during the rollout, and maxUnavailable=1 means at least 9 pods must be available at all times. The rollout creates new pods, waits for them to pass readiness checks, then terminates old pods. Readiness probes are critical: they determine when a new pod is ready to receive traffic. A pod that has started but is not yet connected to the database, or has not loaded its cache, should not receive traffic. Liveness probes restart containers that have become unhealthy. Graceful shutdown: when a pod is terminated, Kubernetes sends SIGTERM, waits up to terminationGracePeriodSeconds (default 30s), and then sends SIGKILL to anything still running. The application should stop accepting new requests, finish in-flight requests, close database connections, and exit. A preStop hook runs before SIGTERM is sent and its run time counts against the grace period; a common use is a short delay so the pod is removed from Service endpoints before the application begins shutting down. If the application needs more time to shut down, increase terminationGracePeriodSeconds.
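The maxSurge and maxUnavailable arithmetic above can be checked with a short calculation. This sketch assumes the rounding rule from the Kubernetes Deployment documentation (a percentage maxSurge is rounded up, a percentage maxUnavailable is rounded down); `resolve` and `rollout_bounds` are hypothetical helper names, not part of any Kubernetes client.

```python
def resolve(value, replicas, round_up):
    """Turn an absolute count (int) or a percentage string like "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count limits a rolling update must stay within."""
    surge = resolve(max_surge, replicas, round_up=True)                # rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)   # rounds down
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }
```

With 10 replicas, `rollout_bounds(10, 2, 1)` gives at most 12 pods and at least 9 available, matching the example in the text, while the 25% defaults resolve to at most 13 pods and at least 8 available.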

Feature Flags for Deployment Decoupling

Feature flags separate deployment from release. Deploy code to production with the feature hidden behind a flag. Enable the flag for internal users first, then beta users, then 1% of all users, then 100%. This decouples the deployment schedule from the feature release schedule, so teams can merge code continuously without coordinating release dates. Implementation: a feature flag service (LaunchDarkly, Unleash, or custom) stores flag state. The application checks the flag before executing the new code path. The flag evaluation context includes the user ID and user attributes (plan, country, device), and rules can combine attribute matching with percentage-based targeting. Kill switch: if a newly released feature causes problems, disable the flag immediately; no deployment is required. The latency from flag change to effect is typically seconds (the SDK polls or receives a push update). Technical debt: feature flags must be cleaned up after the feature is fully rolled out. Stale flags add complexity and dead code. Track each flag's creation date and set a cleanup reminder.
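Flag evaluation with a kill switch, group targeting, and a percentage rollout can be sketched as follows. This is a minimal custom evaluator under assumed conventions, not the API of LaunchDarkly or Unleash: the flag dictionary layout and the `bucket` and `is_enabled` functions are hypothetical.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag: dict, user: dict) -> bool:
    if not flag["enabled"]:
        # Kill switch: turning the flag off wins over every targeting rule.
        return False
    if user.get("group") in flag.get("allowed_groups", ()):
        # Attribute targeting, e.g. internal or beta users.
        return True
    # Percentage rollout: hashing keeps each user's answer stable, and users
    # already enabled at 1% stay enabled when the percentage is raised.
    return bucket(flag["key"], user["id"]) < flag.get("rollout_percent", 0)
```

Hashing the flag key together with the user ID means different flags select different subsets of users, so the same people are not always first to receive every new feature.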

Database Migrations During Zero-Downtime Deploys

The database is the hardest part of zero-downtime deployment. During a rolling update, both old and new application versions run simultaneously and connect to the same database, so migrations must be compatible with both versions. Safe deployment order: (1) Deploy a migration that adds new columns or tables (backward-compatible). (2) Deploy the new application version that writes to both old and new columns. (3) Run a backfill to populate the new column for existing rows. (4) Deploy the application version that reads from the new column and no longer depends on the old one. (5) Deploy a migration that drops the old column (after verifying no code references it). Each step is a separate deployment with its own monitoring window. Never combine schema changes and application changes in the same deployment: if you need to roll back the application, the schema change may prevent the old version from working.
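Steps (1) to (3) above can be demonstrated against an in-memory SQLite database. This is a sketch under assumed names: the `users` table, the `fullname` and `display_name` columns, and the helper functions are hypothetical, and a production backfill would normally run in batches.

```python
import sqlite3


def old_app_insert(db, name):
    # The old application version only knows the original column.
    db.execute("INSERT INTO users (fullname) VALUES (?)", (name,))


def new_app_insert(db, name):
    # Step 2: the new version writes both the old and the new column.
    db.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


def migrate_expand(db):
    # Step 1: additive, nullable column, so old code keeps working.
    db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(db):
    # Step 3: populate the new column for rows written by the old version.
    cur = db.execute(
        "UPDATE users SET display_name = fullname WHERE display_name IS NULL"
    )
    return cur.rowcount


def run_demo():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT NOT NULL)")
    old_app_insert(db, "Ada")
    migrate_expand(db)
    old_app_insert(db, "Grace")   # old version still works after the migration
    new_app_insert(db, "Linus")   # both versions run side by side
    filled = backfill(db)
    rows = db.execute("SELECT display_name FROM users ORDER BY id").fetchall()
    db.close()
    return filled, [r[0] for r in rows]
```

Dropping `fullname` is deliberately left out: it belongs in a later, separate deployment once no running version reads or writes that column.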

Deployment Observability

Deployment observability answers: “is the new version healthy?” Required signals: (1) Error rate: compare the new version's error rate against the old version's baseline. A pipeline might, for example, treat a 2x increase in error rate as an automatic rollback trigger. (2) Latency: compare P50, P95, and P99 latency. A latency regression may indicate a performance bug, a missing index, or an N+1 query. (3) Business metrics: conversion rate, order completion rate, revenue per minute. A code change that reduces error rate but drops conversion rate is still a regression. (4) Resource usage: CPU, memory, and network usage of the new version. A memory leak shows up as gradually increasing memory usage after deployment. (5) Deployment annotations: mark deployment events on Grafana dashboards so any metric change can be correlated with a specific deployment. Automated rollback: Argo Rollouts, Spinnaker, and custom deployment pipelines support metric-based automatic rollback. Define rollback thresholds (for example, error rate > 1% or P99 > 500ms) and the system reverts without human intervention.
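The rollback thresholds mentioned above (error rate > 1%, P99 > 500ms, 2x the baseline error rate) can be expressed as a small decision function. This is a sketch of a custom pipeline check, not the analysis API of Argo Rollouts or Spinnaker; `Snapshot` and `rollback_reasons` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    error_rate: float  # fraction of failed requests, so 0.01 means 1%
    p99_ms: float      # 99th percentile latency in milliseconds


def rollback_reasons(
    baseline: Snapshot,
    candidate: Snapshot,
    max_error_rate: float = 0.01,
    max_p99_ms: float = 500.0,
    max_error_ratio: float = 2.0,
) -> List[str]:
    """Return the violated thresholds; an empty list means keep the new version."""
    reasons = []
    if candidate.error_rate > max_error_rate:
        reasons.append("error_rate_absolute")
    # Relative check against the old version; skipped when the baseline is 0
    # because any ratio against zero is meaningless.
    if baseline.error_rate > 0 and candidate.error_rate >= max_error_ratio * baseline.error_rate:
        reasons.append("error_rate_vs_baseline")
    if candidate.p99_ms > max_p99_ms:
        reasons.append("p99_latency")
    return reasons
```

Returning the list of reasons rather than a bare boolean lets the pipeline record why a rollback happened alongside the deployment annotation.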
