Zero-downtime deployment updates production services without dropping user requests. Modern techniques — rolling updates, blue-green deployments, and canary releases — decouple code deployment from traffic switching, allowing verification before full exposure and instant rollback without re-deploying the old version. The key challenges are draining in-flight requests, maintaining backward compatibility during the transition period, and managing database schema changes.
Rolling Updates
Rolling updates replace instances one (or a few) at a time. A Kubernetes rolling update replaces 25% of pods with the new version, waits for them to become ready, then replaces the next 25%, and so on. During the rollout, both old and new versions are live simultaneously and the load balancer routes traffic to both. This requires backward compatibility: the new version must handle requests and data produced by the old version, and the old version must tolerate data written by the new version. Database changes must be compatible with both versions during the transition. Rolling updates are simple to operate, but the period where two versions coexist requires careful compatibility management.
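The batch-replace loop can be sketched in Go. This is a minimal simulation of the orchestration logic, not the Kubernetes controller itself; the names (rollingUpdate, ready) are illustrative assumptions.

```go
package main

import "fmt"

// rollingUpdate replaces instances in batches, waiting for each batch to
// pass its readiness check before touching the next one. During the loop
// the fleet holds a mix of old and new versions, which is why both must
// stay mutually compatible.
func rollingUpdate(fleet []string, newVersion string, batch int, ready func(i int) bool) {
	for start := 0; start < len(fleet); start += batch {
		end := start + batch
		if end > len(fleet) {
			end = len(fleet)
		}
		for i := start; i < end; i++ {
			fleet[i] = newVersion // replace this instance
		}
		for i := start; i < end; i++ {
			if !ready(i) { // block the rollout until the batch is ready
				panic("batch failed readiness; halt the rollout")
			}
		}
		fmt.Printf("instances %d-%d updated; fleet: %v\n", start, end-1, fleet)
	}
}

func main() {
	fleet := []string{"v1", "v1", "v1", "v1"}
	// batch=1 mirrors a 25% step on a 4-instance fleet.
	rollingUpdate(fleet, "v2", 1, func(i int) bool { return true })
}
```

A real controller would also support maxSurge (temporary extra capacity) and halt automatically when readiness probes fail.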
Blue-Green Deployment
Blue-green deployment runs two identical production environments: blue (current) and green (new version). Deploy the new version to the green environment, run smoke tests, then switch traffic from blue to green at the load balancer or DNS level. All traffic switches instantly — there is no transition period where both versions coexist. Rollback: switch traffic back to blue (seconds, no re-deployment). Cost: blue-green requires double the infrastructure during the transition. Database: both environments must share the same database (or the database migration must be compatible with both versions). Best for services where any version coexistence is unacceptable.
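The instant, reversible cutover can be modeled as an atomic pointer swap in Go. This is a sketch of the routing idea (in production the switch happens at the load balancer or DNS, not in application code); the env type and names are illustrative.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// env represents one of the two identical production environments.
type env struct{ name, version string }

// live is the single pointer all traffic follows; swapping it is the
// blue-green cutover: atomic, instant, and trivially reversible.
var live atomic.Pointer[env]

func handle() string { return live.Load().version } // every request reads the current pointer

func main() {
	blue := &env{"blue", "v1"}
	green := &env{"green", "v2"}

	live.Store(blue)
	fmt.Println(handle()) // served from blue

	// Green is deployed and smoke-tested out of band, then traffic cuts over.
	live.Store(green)
	fmt.Println(handle()) // served from green

	// Rollback is the same operation in reverse: seconds, no re-deployment.
	live.Store(blue)
}
```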
Canary Deployment
Canary deployment routes a small percentage of production traffic to the new version before full rollout. Route 1% of requests to v2 (the canary); monitor error rate, latency, and business metrics. If metrics are healthy, increase to 5%, 10%, 25%, 50%, 100%. If metrics degrade, route 0% to v2 (instant rollback) without re-deployment. Canary is the safest deployment strategy: real production traffic validates the new version at small scale before full exposure. Implement with a service mesh (Istio weight-based routing) or an API gateway that supports weighted routing. Canary-specific metrics: compare v1 vs v2 error rates side by side on the same dashboard.
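Weight-based routing can be sketched in Go by hashing a stable request attribute, so each user consistently lands on the same version as the percentage ramps up. This mirrors what a mesh or gateway does internally; the function and its parameters are illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeCanary sends roughly `percent`% of traffic to v2, keyed on the user
// ID so a given user sees a consistent version throughout a session.
func routeCanary(userID string, percent uint32) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	if h.Sum32()%100 < percent {
		return "v2" // canary
	}
	return "v1" // stable
}

func main() {
	// Ramp: 1% -> 5% -> ... -> 100%, checking metrics at each step.
	canary := 0
	for i := 0; i < 1000; i++ {
		if routeCanary(fmt.Sprintf("user-%d", i), 10) == "v2" {
			canary++
		}
	}
	fmt.Printf("%d of 1000 users routed to the canary at 10%%\n", canary)
}
```

Setting percent back to 0 is the instant rollback the text describes: no pods change, only the routing weight.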
Database Schema Migrations
Database migrations are the hardest part of zero-downtime deployments. The expand-contract pattern handles backward-incompatible schema changes in three phases. Expand: add the new column/table (backward compatible — the old version ignores new columns; the new version populates them). Transition: both versions run simultaneously; data is written to both old and new columns. Contract: remove the old column/table once the old version is fully replaced. Never rename or remove a column in the same release as the code deploy that stops using it — the old version, still running during the rollout, breaks immediately when it references a now-deleted column.
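The three phases can be sketched in Go for a hypothetical rename of a `name` column to `full_name`. The SQL statements and table/column names are illustrative assumptions, not a prescribed schema.

```go
package main

import "fmt"

// Expand-contract for renaming users.name to users.full_name.
const (
	expandSQL   = `ALTER TABLE users ADD COLUMN full_name TEXT` // phase 1: additive, old code ignores it
	contractSQL = `ALTER TABLE users DROP COLUMN name`          // phase 3: only after the old version is gone
)

// user mirrors a row that carries both columns during the transition.
type user struct{ name, fullName string }

// saveUser is the transition-phase write path: the new version writes both
// the old and the new column so either version can read the row.
func saveUser(u *user, fullName string) {
	u.fullName = fullName // new column, read by v2
	u.name = fullName     // old column, still read by v1
}

func main() {
	u := &user{}
	saveUser(u, "Ada Lovelace")
	fmt.Println(u.name == u.fullName) // both columns populated during the transition
	fmt.Println(expandSQL)            // runs before the v2 deploy
	fmt.Println(contractSQL)          // runs long after v1 is fully retired
}
```

The dual write is what makes rollback safe: if v2 must be rolled back mid-transition, v1 still finds every row's data in the old column.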
Graceful Shutdown
When a pod is terminated during a rolling update, it must drain in-flight requests before shutting down. Kubernetes sends SIGTERM to the container, then waits terminationGracePeriodSeconds (default 30s) before SIGKILL. The application should: stop accepting new connections on SIGTERM, finish processing in-flight requests, and exit cleanly. Configure the HTTP server with a shutdown timeout: server.Shutdown(ctx) with a 25-second timeout (leave 5s margin before SIGKILL). The load balancer removes the pod from rotation before SIGTERM (preStop hook delay of 5-10 seconds) to prevent new traffic from being routed to a shutting-down pod.
Health Checks for Deployment Readiness
Kubernetes readiness probes determine when a new pod is ready to receive traffic. A pod is not added to the load balancer pool until its readiness probe passes. Readiness probe: an HTTP endpoint (/ready) that returns 200 only when the service is fully initialized (caches warmed, database connections established, configuration loaded). Liveness probe: returns 200 as long as the process is alive (used to restart stuck pods, not to remove them from rotation). Startup probe: allows slow-starting applications to have a longer initial health check window without triggering liveness restarts. Correct readiness probe configuration prevents traffic from routing to pods that are still initializing.
Feature Flags for Deployment Safety
Feature flags decouple code deployment from feature activation, providing a safety layer for zero-downtime deployments. Deploy the new code with the feature disabled (flag off) — the new code is inert. Verify the deployment is healthy (no new errors, latency unchanged). Enable the flag for internal users to test with real data. Gradually enable for external users. If issues arise, disable the flag instantly — no re-deployment needed. This is the safest deployment pattern: the deployment risk (new code in production) is separated from the feature risk (new behavior exposed to users). Rollback is a configuration change, not a code change.
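The flag check can be sketched in Go with a runtime-mutable rollout percentage. Real systems read this from a flag service or config store; the variable and function names here are illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync/atomic"
)

// rolloutPercent is the flag state: 0 means the new code is deployed but
// inert; 100 means the feature is on for everyone. Changing it is a
// configuration update, not a code deployment.
var rolloutPercent atomic.Uint32

// enabled decides per user, hashing the ID so each user gets a stable
// answer as the percentage ramps from 0 toward 100.
func enabled(userID string) bool {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()%100 < rolloutPercent.Load()
}

func main() {
	rolloutPercent.Store(0)
	fmt.Println(enabled("alice")) // false: new code live in production, feature inert

	rolloutPercent.Store(100)
	fmt.Println(enabled("alice")) // true: feature exposed to all users

	rolloutPercent.Store(0) // rollback: instant, a config change only
}
```

Intermediate values (5, 25, 50) give the gradual external ramp; a real flag service would also support targeting internal users by ID before any percentage rollout.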