Zero-downtime deployment ensures users experience no interruption when new code is released. This is a requirement for any production system with SLA commitments. This guide covers the deployment strategies used by companies like Google, Netflix, and Amazon to ship code hundreds of times per day without downtime — essential knowledge for system design and SRE interviews.
Blue-Green Deployment
Blue-green deployment maintains two identical production environments. The blue environment runs the current version; the green environment is idle.

Deployment process:
1. Deploy the new version to the green environment.
2. Run smoke tests against green.
3. Switch the load balancer to route traffic from blue to green.
4. Monitor green for errors. If problems are detected, switch traffic back to blue (instant rollback).
5. After a confidence period, decommission blue (or keep it for the next deployment cycle).

Advantages: instant rollback (just switch the load balancer back), full-environment testing before the traffic switch, and no mixed-version traffic.

Disadvantages: requires double the infrastructure (expensive); database migrations must be compatible with both versions (blue and green connect to the same database); and the traffic switch is all-or-nothing (100% of users get the new version at once, with no gradual rollout).
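In Kubernetes, the traffic switch in step 3 can be implemented with a Service whose selector points at the active environment. A minimal sketch, with hypothetical names and labels:

```yaml
# One Service fronts both environments; its selector decides which
# set of pods receives traffic. Names and labels are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
    version: blue      # flip to "green" to cut all traffic over
  ports:
  - port: 80
    targetPort: 8080
```

Switching is a single patch, e.g. `kubectl patch service web -p '{"spec":{"selector":{"app":"web","version":"green"}}}'`; patching the selector back to `blue` is the instant rollback.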
Canary Deployment
Canary deployment gradually shifts traffic to the new version.

Process:
1. Deploy the new version to a small subset of instances (1-5% of capacity).
2. Route 1-5% of traffic to the canary instances.
3. Monitor key metrics: error rate, P50/P95/P99 latency, and business metrics (conversion rate, revenue).
4. If metrics are healthy after the observation window (10-30 minutes), increase traffic to 25%, then 50%, then 100%.
5. If metrics degrade at any stage, route all traffic back to the old version and investigate.

Advantages: limits blast radius (a bug affects only 1-5% of users during the canary phase), validates against real production traffic, and the gradual rollout allows early detection.

Implementation: Kubernetes supports canary deployment with traffic splitting using Istio, Linkerd, or Argo Rollouts. Argo Rollouts automates the canary process: define traffic percentages and observation windows in a Rollout resource, and Argo automatically promotes or rolls back based on metric thresholds.
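The staged rollout above maps directly onto an Argo Rollouts `Rollout` resource. A minimal sketch, with hypothetical service names and an illustrative 15-minute observation window per stage:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service        # hypothetical name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
      - name: checkout
        image: registry.example.com/checkout:v2   # hypothetical image
  strategy:
    canary:
      steps:
      - setWeight: 5                # 5% of traffic to the canary
      - pause: {duration: 15m}      # observation window
      - setWeight: 25
      - pause: {duration: 15m}
      - setWeight: 50
      - pause: {duration: 15m}
      # after the last step, Argo promotes the canary to 100%
```

Pausing without a duration (`- pause: {}`) instead requires a manual promotion, which some teams prefer for the first traffic increase.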
Rolling Updates in Kubernetes
Rolling updates replace old pods with new pods a few at a time. A Kubernetes Deployment controls the pace with two settings: maxSurge (how many extra pods may be created during the update, default 25%) and maxUnavailable (how many pods may be down during the update, default 25%). For a deployment with 10 replicas, maxSurge=2 means up to 12 pods may exist during the rollout, and maxUnavailable=1 means at least 9 pods must be running at all times. The rollout creates new pods, waits for them to pass readiness checks, then terminates old pods.

Readiness probes are critical: they determine when a new pod is ready to receive traffic. A pod that has started but is not yet connected to the database, or has not loaded its cache, should not receive traffic. Liveness probes restart unhealthy pods.

Graceful shutdown: when a pod is terminated, Kubernetes sends SIGTERM and waits for terminationGracePeriodSeconds (default 30s). The application should stop accepting new requests, finish in-flight requests, close database connections, and exit. Configure a preStop hook if the application needs extra shutdown time.
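The settings above fit together in one Deployment manifest. A minimal sketch, with hypothetical names, image, ports, and probe paths:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                      # hypothetical name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2                # up to 12 pods may exist during rollout
      maxUnavailable: 1          # at least 9 pods stay running
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: api
        image: registry.example.com/api:v2    # hypothetical image
        readinessProbe:          # gates traffic: checks dependencies
          httpGet: {path: /health/ready, port: 8080}
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:           # restarts on deadlock: basic process health
          httpGet: {path: /health/live, port: 8080}
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "5"]   # let endpoint removal propagate
```

Note the separate endpoints for the two probes: readiness checks dependencies, liveness checks only that the process responds.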
Feature Flags for Deployment Decoupling
Feature flags separate deployment from release. Deploy code to production with the feature hidden behind a flag; enable the flag for internal users first, then beta users, then 1% of all users, then 100%. This decouples the deployment schedule from the feature release schedule, so teams can merge code continuously without coordinating release dates.

Implementation: a feature flag service (LaunchDarkly, Unleash, or custom) stores flag state. The application checks the flag before executing the new code path. Flag evaluation context includes user ID, user attributes (plan, country, device), and percentage-based targeting.

Kill switch: if a newly released feature causes problems, disable the flag immediately, with no deployment required. The latency from flag change to effect is seconds (the SDK polls or receives a push update).

Technical debt: feature flags must be cleaned up after the feature is fully rolled out. Stale flags add complexity and dead code. Track flag creation dates and set cleanup reminders.
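The percentage-based targeting described above usually works by hashing the user ID into a stable bucket, so the same user stays on the same side of the rollout as the percentage ramps up. A minimal self-contained sketch in Go; the flag name, user IDs, and hashing scheme are illustrative, not any particular vendor's implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket deterministically maps a (flag, user) pair to a value in 0-99.
// Hashing the flag key together with the user ID gives each flag an
// independent bucketing, so one user is not always in every canary.
func bucket(flagKey, userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(flagKey + ":" + userID))
	return h.Sum32() % 100
}

// isEnabled returns true when the user falls inside the rollout percentage.
func isEnabled(flagKey, userID string, rolloutPercent uint32) bool {
	return bucket(flagKey, userID) < rolloutPercent
}

func main() {
	// Ramping 1% -> 25% -> 100% never flips a user back to the old
	// path, because the bucket for a given user never changes.
	fmt.Println(isEnabled("new_checkout_flow", "user-42", 100))
}
```

Real flag services layer user-attribute targeting and remote configuration on top of this, but stable hashing is the core of percentage rollout.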
Database Migrations During Zero-Downtime Deploys
The database is the hardest part of zero-downtime deployment. During a rolling update, old and new application versions run simultaneously and connect to the same database, so migrations must be compatible with both versions.

Safe deployment order:
1. Deploy a migration that adds new columns or tables (backward-compatible).
2. Deploy the application version that writes to both old and new columns.
3. Run a backfill to populate the new column for existing rows.
4. Deploy the application version that reads from the new column.
5. Deploy a migration that drops the old column (after verifying no code references it).

Each step is a separate deployment with its own monitoring window. Never combine schema changes and application changes in the same deployment: if you need to roll back the application, the schema change may prevent the old version from working.
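The schema-change half of this expand/backfill/contract sequence might look like the following sketch; the table and column names are hypothetical, and batching syntax varies by database:

```sql
-- Step 1 (expand): add the new column. Old application code ignores it,
-- so this is safe to run while the old version is live.
ALTER TABLE users ADD COLUMN display_name text;

-- Step 3 (backfill): populate existing rows after the dual-writing
-- application version is deployed. In production, run this in small
-- batches to avoid long-held locks.
UPDATE users SET display_name = full_name WHERE display_name IS NULL;

-- Step 5 (contract): drop the old column only after every running
-- application version has stopped reading and writing it.
ALTER TABLE users DROP COLUMN full_name;
```

Steps 2 and 4 are application deployments, not migrations, which is why each gets its own rollout and monitoring window.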
Deployment Observability
Deployment observability answers one question: is the new version healthy? Required signals:
1. Error rate: compare the new version's error rate against the old version's baseline. A 2x increase in error rate triggers automatic rollback.
2. Latency: compare P50, P95, and P99. A latency regression may indicate a performance bug, a missing index, or an N+1 query.
3. Business metrics: conversion rate, order completion rate, revenue per minute. A code change that reduces error rate but drops conversion rate has a bug.
4. Resource usage: CPU, memory, and network usage of the new version. A memory leak shows up as gradually increasing memory usage after deployment.
5. Deployment annotations: mark deployment events on Grafana dashboards so any metric change can be correlated with a specific deployment.

Automated rollback: Argo Rollouts, Spinnaker, and custom deployment pipelines support metric-based automatic rollback. Define rollback thresholds (error rate > 1%, P99 > 500ms) and the system reverts without human intervention.
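With Argo Rollouts, the error-rate check can be expressed as an AnalysisTemplate that queries Prometheus during the canary. A sketch under assumptions: the Prometheus address, job label, and metric names below are illustrative and must match your own setup:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check        # hypothetical name
spec:
  metrics:
  - name: error-rate
    interval: 1m                # re-evaluate every minute
    failureLimit: 3             # abort the rollout after 3 failed checks
    # successCondition is evaluated against the query result;
    # here: 5xx responses must stay under 1% of all requests.
    successCondition: result[0] < 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090   # hypothetical address
        query: |
          sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="checkout"}[5m]))
```

Referencing this template from a Rollout's canary steps makes promotion and rollback automatic: the rollout only advances while the analysis keeps succeeding.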
Frequently Asked Questions

What is the difference between blue-green and canary deployments?

Blue-green deployment switches 100% of traffic from the old version (blue) to the new version (green) at once. You maintain two complete production environments. After deploying to green and running smoke tests, the load balancer switches all traffic. Rollback is instant: switch back to blue. Disadvantage: no gradual validation; all users get the new version simultaneously, so a bug affects everyone. Canary deployment gradually shifts traffic: 1% to the new version, then 5%, 10%, 25%, 50%, 100%. At each stage, metrics are compared between the canary (new version) and the baseline (old version). If the canary shows higher error rates or latency, traffic on the new version is rolled back to 0%. Advantage: limits blast radius; a bug in the canary affects only a small percentage of users. Disadvantage: both versions run simultaneously for an extended period, which complicates stateful operations and database migrations. Choose blue-green for simpler applications with fast smoke tests. Choose canary for high-traffic services where gradual validation reduces risk. Many organizations use canary as the default and reserve blue-green for infrastructure changes.

How do Kubernetes readiness and liveness probes prevent downtime during deployments?

Readiness probes determine when a pod is ready to receive traffic. During a rolling update, Kubernetes creates new pods but does not route traffic to them until the readiness probe succeeds. This prevents users from hitting a pod that is still starting (loading configuration, warming caches, establishing database connections). Configure the readiness probe to check a health endpoint that verifies all dependencies are connected: GET /health/ready returns 200 only when the database connection pool is established, the cache is warmed, and the application is fully initialized. Liveness probes determine when a pod is unhealthy and needs to be restarted. If a liveness probe fails failureThreshold consecutive times, Kubernetes kills and restarts the pod. This handles deadlocks, memory leaks, and stuck processes. Critical settings: initialDelaySeconds (wait before the first probe, giving the application time to start), periodSeconds (probe frequency), and failureThreshold (consecutive failures before action). Common mistake: using the same endpoint for both probes. The liveness probe should check basic process health (is the HTTP server responding?). The readiness probe should check dependency connectivity (is the database reachable?). A database outage should make pods not-ready (stop receiving traffic) but not restart them (restarting the application does not fix a down database).

How do you handle graceful shutdown during Kubernetes pod termination?

When Kubernetes terminates a pod (during a rolling update or scale-down), it sends SIGTERM to the process and waits for terminationGracePeriodSeconds (default 30 seconds). During this period, the application must: (1) Stop accepting new requests: the pod is removed from the Service endpoints, but in-flight requests from the load balancer may still arrive for a few seconds due to propagation delay; add a preStop hook with a short sleep (5 seconds) to let the endpoint removal propagate. (2) Complete in-flight requests: finish processing requests that are already being handled; do not drop active connections. (3) Close resources cleanly: close database connection pools, flush log buffers, finish writing to message queues, and acknowledge pending messages. (4) Exit with code 0. If the process does not exit within terminationGracePeriodSeconds, Kubernetes sends SIGKILL (which cannot be caught). For long-running operations (video transcoding, large file uploads), increase terminationGracePeriodSeconds or implement checkpointing so the operation can be resumed by another pod. Node.js: handle SIGTERM with process.on('SIGTERM', handler). Java/Spring: use @PreDestroy. Go: use signal.Notify with os.Interrupt and syscall.SIGTERM.

How do feature flags decouple deployment from release?

Feature flags (feature toggles) allow code to be deployed to production in a disabled state and enabled independently of the deployment process. Deployment is pushing code to servers; release is making functionality available to users. With feature flags, these are separate events. Deployment happens continuously (multiple times per day via CI/CD); release happens when the team decides the feature is ready, by toggling a flag, not by deploying. Implementation: wrap new code paths in a flag check: if feature_flags.is_enabled("new_checkout_flow", user_context), execute the new code path; otherwise execute the old one. Flag evaluation can target specific users (internal testers), a percentage of traffic (1% canary), user attributes (premium plan, specific country), or all users (full rollout). Kill switch: if the new feature causes problems after release, disable the flag immediately; the effect is near-instant (seconds) compared to a rollback deployment (minutes to hours). Lifecycle: create the flag, develop behind it, test internally, canary to 1%, ramp to 100%, then remove the flag and the dead code. Critical: set a cleanup deadline when creating the flag. Stale flags accumulate as technical debt: unused code paths that are never executed but must be maintained.