Why Zero-Downtime Deployments
Planned downtime windows are increasingly unacceptable — global users span every time zone, and even brief outages cost revenue and erode trust. Netflix, Amazon, and Google deploy thousands of times per day with zero planned downtime. Zero-downtime deployment requires careful coordination between application servers, databases, load balancers, and feature flags. Each component has its own challenge.
Rolling Updates
The simplest approach. Instead of taking all servers offline and deploying simultaneously, update servers one at a time (or in small batches). The load balancer removes a server from rotation, deploys the new version, health-checks it, returns it to rotation, then moves to the next server.
- Kubernetes supports rolling updates natively: strategy.type: RollingUpdate with maxSurge: 1 (one extra pod) and maxUnavailable: 0 (never reduce capacity)
- During the rollout, old and new versions run simultaneously — the application must be backward-compatible
- Risk: a bug in the new version gradually impacts more users before it is caught
- Rollback: update the deployment image back to the previous version — Kubernetes handles the reverse rolling update
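Expressed as a Deployment manifest, the strategy above might look like this (names and the image tag are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # one extra pod allowed during the rollout
      maxUnavailable: 0  # never drop below the desired replica count
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:v2  # rollback: point back to v1
```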
Blue-Green Deployments
Maintain two identical environments: Blue (currently serving traffic) and Green (idle). Deploy the new version to Green, run tests, then switch the load balancer to route all traffic to Green. Blue becomes the standby.
# Nginx upstream switch (simplified):
upstream app {
server green-cluster:8080; # switch between blue/green here
}
# AWS ALB target group switch:
aws elbv2 modify-listener --listener-arn $ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TG_ARN
Advantages: instant cutover (no gradual rollout); instant rollback (switch back to Blue); full validation before any user sees new code. Disadvantages: requires double the infrastructure during deployment; long-running in-flight requests can be dropped at the moment of the switch; the database schema must support both versions simultaneously (since Blue may still hold active sessions/transactions).
Canary Deployments
Gradually shift traffic to the new version, starting with a small percentage (1-5%) and increasing as confidence grows. Monitor error rates, latency, and business metrics at each step.
# Kubernetes traffic splitting with Nginx Ingress:
nginx.ingress.kubernetes.io/canary: "true"
nginx.ingress.kubernetes.io/canary-weight: "5" # 5% to new version
# AWS App Mesh / Istio VirtualService:
spec:
  http:
    - route:
        - destination:
            host: app-v2
            port:
              number: 8080
          weight: 5
        - destination:
            host: app-v1
            port:
              number: 8080
          weight: 95
Canary is safer than blue-green for high-traffic systems — a bug affects 5% of users instead of 100%. Automated canary analysis (Flagger, Spinnaker) monitors error rates and latency at each percentage and automatically rolls back if thresholds are exceeded. Increase weight: 5% → 20% → 50% → 100% over 30-60 minutes.
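The promotion logic an automated canary analyzer applies at each step can be reduced to a small decision function. This is a sketch: the step ladder, threshold, and function name are illustrative, not a real Flagger or Spinnaker API.

```ruby
# Illustrative promotion ladder and error budget.
CANARY_STEPS = [5, 20, 50, 100]
ERROR_THRESHOLD = 0.01 # roll back if more than 1% of canary requests fail

# Given the current canary weight and the observed error rate for this step,
# decide whether to roll back, promote to the next weight, or finish.
def next_canary_action(current_weight, error_rate)
  return :rollback if error_rate > ERROR_THRESHOLD
  return :done if current_weight >= 100
  next_weight = CANARY_STEPS.find { |w| w > current_weight }
  [:promote, next_weight]
end
```

A real analyzer would also watch latency percentiles and business metrics, and would hold at each weight long enough to collect a statistically meaningful sample.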
Feature Flags
Deploy code to production but control which users see the new feature via runtime flags. The deployment is separate from the release. This allows: dark launches (code in prod, feature off), percentage rollouts (enable for 1% of users), A/B testing (different users see different variants), emergency kill switches (disable a problematic feature without deploying).
# LaunchDarkly SDK example:
if ld_client.variation("new-checkout-flow", user, false)
render :new_checkout
else
render :checkout
end
# Internal flag evaluation (database or Redis):
def flag_enabled?(flag_name, user_id)
flag = FlagStore.get(flag_name)
return false unless flag.enabled
# Hash-based deterministic rollout: same user always sees same variant
bucket = Digest::MD5.hexdigest("#{flag_name}#{user_id}").to_i(16) % 100
bucket < flag.rollout_percentage
end
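The evaluation above can be made self-contained for experimentation by swapping FlagStore for a plain in-memory Hash. A sketch, not production code; the flag name and percentage are illustrative:

```ruby
require "digest"

# In-memory stand-in for a database- or Redis-backed flag store.
FLAGS = {
  "new-checkout-flow" => { enabled: true, rollout_percentage: 20 }
}

def flag_enabled?(flag_name, user_id)
  flag = FLAGS[flag_name]
  return false unless flag && flag[:enabled]
  # Hash-based deterministic rollout: same user always sees same variant,
  # and roughly rollout_percentage% of users land in the enabled bucket.
  bucket = Digest::MD5.hexdigest("#{flag_name}#{user_id}").to_i(16) % 100
  bucket < flag[:rollout_percentage]
end
```

Because the bucket is derived from a hash of flag name plus user id, a user never flips between variants on refresh, and raising the percentage only ever adds users to the enabled set.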
Database Schema Migrations: The Hardest Part
Application code can be rolled back in seconds, but database schema changes are harder to reverse and must be compatible with both old and new code during the deployment window. The Expand-Contract (Parallel Change) pattern:
Adding a column (safe)
- Expand: add the new column as nullable with no default. This is a metadata-only change in PostgreSQL (no table rewrite); since PostgreSQL 11, adding a column with a default is also instant. Old code ignores it; new code writes to it.
- Migrate: backfill existing rows in small batches with a background job. Do not do a single UPDATE on a large table — it locks.
- Contract: once all rows are backfilled and only new code is deployed, add the NOT NULL constraint.
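Sketched in SQL, the three steps might look like this (table and column names are illustrative):

```sql
-- Expand: nullable, no default; a metadata-only change in PostgreSQL.
ALTER TABLE users ADD COLUMN email_verified boolean;

-- Migrate: backfill in small batches from a background job; repeat until
-- no rows remain, instead of one table-locking UPDATE.
UPDATE users
SET email_verified = false
WHERE id IN (
  SELECT id FROM users WHERE email_verified IS NULL LIMIT 1000
);

-- Contract: once the backfill is done and only new code is deployed.
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
```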
Renaming a column (safe via double-write)
- Add new column (name_v2)
- Deploy code that writes to both old and new column
- Backfill name_v2 from name
- Deploy code that reads from new column (name_v2)
- Deploy code that writes only to new column
- Drop old column
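The six-step sequence above interleaves SQL with code-only deploys; the SQL parts might look like this (table and column names are illustrative):

```sql
-- Step 1: add the new column (steps 2, 4 and 5 are code-only deploys).
ALTER TABLE users ADD COLUMN name_v2 text;

-- Step 3: backfill historical rows in small batches; repeat until done.
UPDATE users
SET name_v2 = name
WHERE id IN (
  SELECT id FROM users WHERE name_v2 IS NULL LIMIT 1000
);

-- Step 6: only after no running code references the old column.
ALTER TABLE users DROP COLUMN name;
```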
Unsafe operations to avoid during live deployments
- ALTER TABLE ADD COLUMN NOT NULL without default — scans all rows (use nullable first)
- Adding a unique index without CONCURRENTLY — locks the table
- Dropping a column still referenced by running application code
- Changing a column type — usually requires a full table rewrite (a few changes, such as widening a varchar, do not)
-- PostgreSQL: always use CONCURRENTLY for index creation on live tables
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Takes longer but does not lock the table
Health Checks and Readiness Probes
Load balancers and Kubernetes use health checks to determine when a new instance is ready to serve traffic and when to stop sending traffic to a failing instance:
- Liveness probe: is the process alive? If not, Kubernetes restarts the container. Use a lightweight endpoint (/healthz) that returns 200 as long as the process is running.
- Readiness probe: is the instance ready to serve traffic? Check that database connections are available, caches are warmed, and critical dependencies are reachable. Remove the pod from the Service endpoints if readiness fails — no new requests. Restore when ready.
- Startup probe: for slow-starting applications (JVM, model loading), allows a longer initial startup period before liveness checks begin.
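The three probes might be wired up on a container like this (paths, port, and timings are illustrative):

```yaml
containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready       # checks DB pool, cache warmth, dependencies
        port: 8080
      periodSeconds: 5
      failureThreshold: 3  # remove from Service endpoints after 3 failures
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30 # allow up to 300s before liveness checks begin
```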
Graceful Shutdown
When a pod receives SIGTERM (Kubernetes termination signal), it should: stop accepting new connections, finish processing in-flight requests, close database connections, then exit. Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds (default 30s), then sends SIGKILL. Configure terminationGracePeriodSeconds to match your longest expected request duration. Application code should catch SIGTERM:
# Node.js graceful shutdown:
process.on('SIGTERM', () => {
  server.close(() => {
    db.end();        // close DB connections
    process.exit(0); // clean exit
  });
  // If connections have not drained within 10s, force exit; unref() keeps
  // the timer from holding the process open after a clean shutdown:
  setTimeout(() => process.exit(1), 10000).unref();
});
Key Interview Points
- Rolling updates are simplest; blue-green gives instant cutover; canary minimizes blast radius
- Old and new code run simultaneously during any zero-downtime deployment — API and DB changes must be backward-compatible
- Use Expand-Contract for schema migrations: add nullable column → backfill → add NOT NULL constraint
- Always create indexes with CONCURRENTLY in PostgreSQL to avoid locking
- Feature flags decouple deployment from release — enable kill switches in production
- Readiness probes prevent traffic reaching pods that are not fully initialized
Frequently Asked Questions
What is the difference between blue-green and canary deployments?
Blue-green deployment maintains two complete, identical environments. At cutover, 100% of traffic switches from the old (blue) environment to the new (green) environment instantaneously. This provides instant rollback (switch back to blue) and complete isolation (green is fully tested before any user sees it). Downside: requires double infrastructure during deployment; the switch is all-or-nothing so if there is a bug, 100% of users are affected until rollback. Canary deployment gradually shifts traffic from old to new version — starting with a small percentage (1-5%) and increasing as confidence builds. A bug in the new version only affects the canary percentage of users. Automated analysis (Flagger, Spinnaker) can roll back automatically if error rate or latency degrades. Canary is safer for high-traffic systems because the blast radius of a bad deploy is limited. Downside: takes longer (30-60 minutes for a full rollout vs instant blue-green cutover); old and new versions must coexist for an extended period, requiring API backward compatibility. Most large-scale systems prefer canary; blue-green is simpler and works well for smaller systems or where instant rollback is paramount.
How do you handle database schema changes safely during a zero-downtime deployment?
The core challenge is that during a rolling or canary deployment, old and new application code run simultaneously against the same database. If new code expects a column that old code does not write, or if you drop a column still read by old code, you get errors. The Expand-Contract (parallel change) pattern solves this: to add a required column, first add it as nullable (instant DDL in PostgreSQL 11+, no table lock). Old code runs fine — it ignores the new column. New code writes to the new column. Once all traffic runs on new code, backfill null rows and add the NOT NULL constraint. To rename a column: add the new column, deploy code that double-writes to both, backfill historical rows, deploy code that reads from the new column, verify, then drop the old column. Critical PostgreSQL rules: always create indexes with CREATE INDEX CONCURRENTLY (non-blocking, takes longer but does not lock). Never run ALTER TABLE operations that require full table rewrites (type changes, adding NOT NULL without default) on large tables during a deployment — schedule these during low-traffic windows or use pg_repack.
What is a readiness probe and how does it enable zero-downtime deployments in Kubernetes?
A readiness probe is a health check that Kubernetes uses to determine whether a pod is ready to receive traffic. Unlike a liveness probe (is the process alive?), a readiness probe asks: is this instance fully initialized and capable of serving requests correctly? During a zero-downtime rolling deployment: Kubernetes creates a new pod, but does NOT add it to the Service's endpoint list until the readiness probe succeeds. The probe might check: is the database connection pool established? Is the cache warmed? Are all dependencies reachable? Only when these pass does Kubernetes route traffic to the new pod. Simultaneously, the old pod is kept in rotation until the new one is ready. If the readiness probe fails after the pod starts serving traffic (e.g., database becomes unreachable), Kubernetes removes the pod from rotation — no more traffic until it recovers. This prevents the all-too-common issue of traffic routing to a pod that is starting up (before it has loaded its code and connected to databases) and returning 502 errors to users. Configure readiness probes with appropriate initialDelaySeconds (skip early checks during startup) and failureThreshold (how many consecutive failures before removing from rotation).