Why Zero-Downtime Deployments
Planned downtime windows are increasingly unacceptable — global users span every time zone, and even brief outages cost revenue and erode trust. Netflix, Amazon, and Google deploy thousands of times per day with zero planned downtime. Zero-downtime deployment requires careful coordination between application servers, databases, load balancers, and feature flags. Each component has its own challenge.
Rolling Updates
The simplest approach. Instead of taking all servers offline and deploying simultaneously, update servers one at a time (or in small batches). The load balancer removes a server from rotation, deploys the new version, health-checks it, returns it to rotation, then moves to the next server.
- Kubernetes supports rolling updates natively: strategy.type: RollingUpdate with maxSurge: 1 (one extra pod) and maxUnavailable: 0 (never reduce capacity)
- During the rollout, old and new versions run simultaneously — the application must be backward-compatible
- Risk: a bug in the new version gradually impacts more users before it is caught
- Rollback: update the deployment image back to the previous version — Kubernetes handles the reverse rolling update
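Expressed as a Deployment manifest, the strategy above might look like this (names and the image tag are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # one extra pod allowed during the rollout
      maxUnavailable: 0  # never drop below the desired replica count
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:v2  # rollback: point back to v1
```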
Blue-Green Deployments
Maintain two identical environments: Blue (currently serving traffic) and Green (idle). Deploy the new version to Green, run tests, then switch the load balancer to route all traffic to Green. Blue becomes the standby.
# Nginx upstream switch (simplified):
upstream app {
server green-cluster:8080; # switch between blue/green here
}
# AWS ALB target group switch:
aws elbv2 modify-listener --listener-arn $ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TG_ARN
Advantages: instant cutover (no gradual rollout); instant rollback (switch back to Blue); full validation before any user sees new code. Disadvantages: requires double the infrastructure during deployment; long-running in-flight requests can be dropped at the moment of the switch; the database schema must support both versions simultaneously (since Blue may still hold active sessions/transactions).
Canary Deployments
Gradually shift traffic to the new version, starting with a small percentage (1-5%) and increasing as confidence grows. Monitor error rates, latency, and business metrics at each step.
# Kubernetes traffic splitting with Nginx Ingress:
nginx.ingress.kubernetes.io/canary: "true"
nginx.ingress.kubernetes.io/canary-weight: "5" # 5% to new version
# AWS App Mesh / Istio VirtualService:
spec:
  http:
    - route:
        - destination:
            host: app-v2
            port:
              number: 8080
          weight: 5
        - destination:
            host: app-v1
            port:
              number: 8080
          weight: 95
Canary is safer than blue-green for high-traffic systems — a bug affects 5% of users instead of 100%. Automated canary analysis (Flagger, Spinnaker) monitors error rates and latency at each percentage and automatically rolls back if thresholds are exceeded. Increase weight: 5% → 20% → 50% → 100% over 30-60 minutes.
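The promotion logic an automated canary analyzer applies at each step can be reduced to a small decision function. This is a sketch: the step ladder, threshold, and function name are illustrative, not a real Flagger or Spinnaker API.

```ruby
# Illustrative promotion ladder and error budget.
CANARY_STEPS = [5, 20, 50, 100]
ERROR_THRESHOLD = 0.01 # roll back if more than 1% of canary requests fail

# Given the current canary weight and the observed error rate for this step,
# decide whether to roll back, promote to the next weight, or finish.
def next_canary_action(current_weight, error_rate)
  return :rollback if error_rate > ERROR_THRESHOLD
  return :done if current_weight >= 100
  next_weight = CANARY_STEPS.find { |w| w > current_weight }
  [:promote, next_weight]
end
```

A real analyzer would also watch latency percentiles and business metrics, and would hold at each weight long enough to collect a statistically meaningful sample.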
Feature Flags
Deploy code to production but control which users see the new feature via runtime flags. The deployment is separate from the release. This allows: dark launches (code in prod, feature off), percentage rollouts (enable for 1% of users), A/B testing (different users see different variants), emergency kill switches (disable a problematic feature without deploying).
# LaunchDarkly SDK example:
if ld_client.variation("new-checkout-flow", user, false)
render :new_checkout
else
render :checkout
end
# Internal flag evaluation (database or Redis):
def flag_enabled?(flag_name, user_id)
flag = FlagStore.get(flag_name)
return false unless flag.enabled
# Hash-based deterministic rollout: same user always sees same variant
bucket = Digest::MD5.hexdigest("#{flag_name}#{user_id}").to_i(16) % 100
bucket < flag.rollout_percentage
end
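The evaluation above can be made self-contained for experimentation by swapping FlagStore for a plain in-memory Hash. A sketch, not production code; the flag name and percentage are illustrative:

```ruby
require "digest"

# In-memory stand-in for a database- or Redis-backed flag store.
FLAGS = {
  "new-checkout-flow" => { enabled: true, rollout_percentage: 20 }
}

def flag_enabled?(flag_name, user_id)
  flag = FLAGS[flag_name]
  return false unless flag && flag[:enabled]
  # Hash-based deterministic rollout: same user always sees same variant,
  # and roughly rollout_percentage% of users land in the enabled bucket.
  bucket = Digest::MD5.hexdigest("#{flag_name}#{user_id}").to_i(16) % 100
  bucket < flag[:rollout_percentage]
end
```

Because the bucket is derived from a hash of flag name plus user id, a user never flips between variants on refresh, and raising the percentage only ever adds users to the enabled set.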
Database Schema Migrations: The Hardest Part
Application code can be rolled back in seconds, but database schema changes are harder to reverse and must be compatible with both old and new code during the deployment window. The Expand-Contract (Parallel Change) pattern:
Adding a column (safe)
- Expand: add the new column as nullable with no default. This is a metadata-only change in PostgreSQL (no table rewrite); since PostgreSQL 11, adding a column with a default is also instant. Old code ignores it; new code writes to it.
- Migrate: backfill existing rows in small batches with a background job. Do not do a single UPDATE on a large table — it locks.
- Contract: once all rows are backfilled and only new code is deployed, add the NOT NULL constraint.
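Sketched in SQL, the three steps might look like this (table and column names are illustrative):

```sql
-- Expand: nullable, no default; a metadata-only change in PostgreSQL.
ALTER TABLE users ADD COLUMN email_verified boolean;

-- Migrate: backfill in small batches from a background job; repeat until
-- no rows remain, instead of one table-locking UPDATE.
UPDATE users
SET email_verified = false
WHERE id IN (
  SELECT id FROM users WHERE email_verified IS NULL LIMIT 1000
);

-- Contract: once the backfill is done and only new code is deployed.
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
```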
Renaming a column (safe via double-write)
- Add new column (name_v2)
- Deploy code that writes to both old and new column
- Backfill name_v2 from name
- Deploy code that reads from new column (name_v2)
- Deploy code that writes only to new column
- Drop old column
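The six-step sequence above interleaves SQL with code-only deploys; the SQL parts might look like this (table and column names are illustrative):

```sql
-- Step 1: add the new column (steps 2, 4 and 5 are code-only deploys).
ALTER TABLE users ADD COLUMN name_v2 text;

-- Step 3: backfill historical rows in small batches; repeat until done.
UPDATE users
SET name_v2 = name
WHERE id IN (
  SELECT id FROM users WHERE name_v2 IS NULL LIMIT 1000
);

-- Step 6: only after no running code references the old column.
ALTER TABLE users DROP COLUMN name;
```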
Unsafe operations to avoid during live deployments
- ALTER TABLE ADD COLUMN NOT NULL without default — scans all rows (use nullable first)
- Adding a unique index without CONCURRENTLY — locks the table
- Dropping a column still referenced by running application code
- Changing a column type — usually requires a full table rewrite (a few changes, such as widening a varchar, do not)
-- PostgreSQL: always use CONCURRENTLY for index creation on live tables
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Takes longer but does not lock the table
Health Checks and Readiness Probes
Load balancers and Kubernetes use health checks to determine when a new instance is ready to serve traffic and when to stop sending traffic to a failing instance:
- Liveness probe: is the process alive? If not, Kubernetes restarts the container. Use a lightweight endpoint (/healthz) that returns 200 as long as the process is running.
- Readiness probe: is the instance ready to serve traffic? Check that database connections are available, caches are warmed, and critical dependencies are reachable. Remove the pod from the Service endpoints if readiness fails — no new requests. Restore when ready.
- Startup probe: for slow-starting applications (JVM, model loading), allows a longer initial startup period before liveness checks begin.
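The three probes might be wired up on a container like this (paths, port, and timings are illustrative):

```yaml
containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready       # checks DB pool, cache warmth, dependencies
        port: 8080
      periodSeconds: 5
      failureThreshold: 3  # remove from Service endpoints after 3 failures
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30 # allow up to 300s before liveness checks begin
```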
Graceful Shutdown
When a pod receives SIGTERM (Kubernetes termination signal), it should: stop accepting new connections, finish processing in-flight requests, close database connections, then exit. Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds (default 30s), then sends SIGKILL. Configure terminationGracePeriodSeconds to match your longest expected request duration. Application code should catch SIGTERM:
# Node.js graceful shutdown:
process.on('SIGTERM', () => {
  server.close(() => {
    db.end();        // close DB connections
    process.exit(0); // clean exit
  });
  // If connections have not drained within 10s, force exit; unref() keeps
  // the timer from holding the process open after a clean shutdown:
  setTimeout(() => process.exit(1), 10000).unref();
});
Key Interview Points
- Rolling updates are simplest; blue-green gives instant cutover; canary minimizes blast radius
- Old and new code run simultaneously during any zero-downtime deployment — API and DB changes must be backward-compatible
- Use Expand-Contract for schema migrations: add nullable column → backfill → add NOT NULL constraint
- Always create indexes with CONCURRENTLY in PostgreSQL to avoid locking
- Feature flags decouple deployment from release — enable kill switches in production
- Readiness probes prevent traffic reaching pods that are not fully initialized
Frequently Asked Questions
What is the difference between blue-green and canary deployments?
Blue-green deployment maintains two complete, identical environments. At cutover, 100% of traffic switches from the old (blue) environment to the new (green) environment instantaneously. This provides instant rollback (switch back to blue) and complete isolation (green is fully tested before any user sees it). Downside: requires double infrastructure during deployment; the switch is all-or-nothing so if there is a bug, 100% of users are affected until rollback. Canary deployment gradually shifts traffic from old to new version — starting with a small percentage (1-5%) and increasing as confidence builds. A bug in the new version only affects the canary percentage of users. Automated analysis (Flagger, Spinnaker) can roll back automatically if error rate or latency degrades. Canary is safer for high-traffic systems because the blast radius of a bad deploy is limited. Downside: takes longer (30-60 minutes for a full rollout vs instant blue-green cutover); old and new versions must coexist for an extended period, requiring API backward compatibility. Most large-scale systems prefer canary; blue-green is simpler and works well for smaller systems or where instant rollback is paramount.
How do you handle database schema changes safely during a zero-downtime deployment?
The core challenge is that during a rolling or canary deployment, old and new application code run simultaneously against the same database. If new code expects a column that old code does not write, or if you drop a column still read by old code, you get errors. The Expand-Contract (parallel change) pattern solves this: to add a required column, first add it as nullable (instant DDL in PostgreSQL 11+, no table lock). Old code runs fine — it ignores the new column. New code writes to the new column. Once all traffic runs on new code, backfill null rows and add the NOT NULL constraint. To rename a column: add the new column, deploy code that double-writes to both, backfill historical rows, deploy code that reads from the new column, verify, then drop the old column. Critical PostgreSQL rules: always create indexes with CREATE INDEX CONCURRENTLY (non-blocking, takes longer but does not lock). Never run ALTER TABLE operations that require full table rewrites (type changes, adding NOT NULL without default) on large tables during a deployment — schedule these during low-traffic windows or use pg_repack.
What is a readiness probe and how does it enable zero-downtime deployments in Kubernetes?
A readiness probe is a health check that Kubernetes uses to determine whether a pod is ready to receive traffic. Unlike a liveness probe (is the process alive?), a readiness probe asks: is this instance fully initialized and capable of serving requests correctly? During a zero-downtime rolling deployment: Kubernetes creates a new pod, but does NOT add it to the Service's endpoint list until the readiness probe succeeds. The probe might check: is the database connection pool established? Is the cache warmed? Are all dependencies reachable? Only when these pass does Kubernetes route traffic to the new pod. Simultaneously, the old pod is kept in rotation until the new one is ready. If the readiness probe fails after the pod starts serving traffic (e.g., database becomes unreachable), Kubernetes removes the pod from rotation — no more traffic until it recovers. This prevents the all-too-common issue of traffic routing to a pod that is starting up (before it has loaded its code and connected to databases) and returning 502 errors to users. Configure readiness probes with appropriate initialDelaySeconds (skip early checks during startup) and failureThreshold (how many consecutive failures before removing from rotation).