Question 1

What is the difference between blue-green and canary deployments?

Accepted Answer

Blue-green deployment switches 100% of traffic from the old version (blue) to the new version (green) at once. You maintain two complete production environments. After you deploy to green and run smoke tests, the load balancer switches all traffic to green. Rollback is near-instant: switch the load balancer back to blue. Disadvantage: there is no gradual validation. All users get the new version simultaneously, so a bug that the smoke tests miss affects everyone until you switch back.

Canary deployment shifts traffic gradually, for example 1% to the new version, then 5%, 10%, 25%, 50%, and 100%. At each stage, metrics are compared between the canary (new version) and the baseline (old version). If the canary shows a higher error rate or higher latency, the canary's share of traffic is set back to 0%. Advantage: limited blast radius, since a bug in the canary affects only a small percentage of users. Disadvantage: both versions run simultaneously for an extended period, which complicates stateful operations and database migrations.

Choose blue-green for simpler applications with fast smoke tests. Choose canary for high-traffic services where gradual validation reduces risk. A common pattern is to use canary as the default and reserve blue-green for infrastructure changes.
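The stage-by-stage comparison of canary and baseline metrics can be sketched in a few lines of Python. This is a minimal illustration, not a production controller: the `Metrics` fields, the threshold values, and the `observe` callback are all assumptions made for the example. Real tooling would usually pull these numbers from a metrics system.

```python
from dataclasses import dataclass

# Traffic percentages taken from the stages listed in the answer.
CANARY_STAGES = [1, 5, 10, 25, 50, 100]


@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests, e.g. 0.01 = 1%
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def canary_is_healthy(canary: Metrics, baseline: Metrics,
                      max_error_delta: float = 0.005,
                      max_latency_ratio: float = 1.2) -> bool:
    """Return True if the canary is not meaningfully worse than the baseline.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if canary.error_rate > baseline.error_rate + max_error_delta:
        return False
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return False
    return True


def run_canary(observe, stages=CANARY_STAGES) -> int:
    """Walk through the stages and return the final canary traffic percentage.

    `observe(percent)` is a hypothetical callback that sets the canary's
    traffic share, waits for data, and returns (canary, baseline) metrics.
    Returns 100 on full rollout, or 0 if any stage triggered a rollback.
    """
    for percent in stages:
        canary, baseline = observe(percent)
        if not canary_is_healthy(canary, baseline):
            return 0  # roll back: all traffic returns to the old version
    return 100
```

Comparing against the live baseline rather than a fixed threshold means a traffic spike that slows both versions does not, by itself, trigger a rollback.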

Question 2

How do Kubernetes readiness and liveness probes prevent downtime during deployments?

Accepted Answer

Readiness probes determine when a pod is ready to receive traffic. During a rolling update, Kubernetes creates new pods but does not route traffic to them until the readiness probe succeeds. This prevents users from hitting a pod that is still starting (loading configuration, warming caches, establishing database connections). Configure the readiness probe to check a health endpoint that verifies all dependencies are connected: GET /health/ready returns 200 only when the database connection pool is established, the cache is warmed, and the application is fully initialized. Readiness probes keep running after startup, so a pod that later loses a dependency is also taken out of rotation.

Liveness probes determine when a container is unhealthy and needs to be restarted. If a liveness probe fails failureThreshold times in a row, the kubelet kills the container and restarts it. This handles deadlocks, stuck processes, and processes that a memory leak has degraded to the point of no longer responding.

Critical probe settings: initialDelaySeconds (how long to wait before the first probe, to give the application time to start), periodSeconds (probe frequency), and failureThreshold (consecutive failures before Kubernetes acts).

Common mistake: using the same endpoint for both probes. The liveness probe should check basic process health (is the HTTP server responding?). The readiness probe should check dependency connectivity (is the database reachable?). A database outage should make pods not-ready, so they stop receiving traffic, but it should not restart them, because restarting the application does not fix the database.
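A minimal sketch of the two separate endpoints, using only the Python standard library. The /health/ready path comes from the answer; the /health/live path, the `AppState` fields, and port 8080 are assumptions chosen for illustration. The application would be expected to update the state flags itself as dependencies connect or drop.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class AppState:
    """Hypothetical flags the application updates as it starts and runs."""
    initialized: bool = False
    database_connected: bool = False
    cache_warmed: bool = False


STATE = AppState()


def probe_status(path: str, state: AppState) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/health/live":
        # Liveness: answering at all shows the HTTP server is responsive.
        # Dependencies are deliberately NOT checked here.
        return 200
    if path == "/health/ready":
        ready = (state.initialized and state.database_connected
                 and state.cache_warmed)
        # 503 fails the readiness probe, so traffic stops without a restart.
        return 200 if ready else 503
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Serve the probe endpoints; call this from the application entry point."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

With this split, a database outage flips only the readiness result to 503, while the liveness endpoint keeps returning 200 and the container is left running.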

Question 3

How do you handle graceful shutdown during Kubernetes pod termination?

Accepted Answer

When Kubernetes terminates a pod (during a rolling update or scale-down), it runs the preStop hook if one is defined, then sends SIGTERM to the main process of each container. The whole sequence must finish within terminationGracePeriodSeconds (default 30 seconds), and time spent in the preStop hook counts against that budget. During this period, the application must:

(1) Stop accepting new requests. The pod is removed from the Service endpoints in parallel with termination, but new requests from the load balancer may still arrive for a few seconds due to propagation delay. Add a preStop hook with a short sleep (about 5 seconds) so the endpoint removal can propagate before the process receives SIGTERM.
(2) Complete in-flight requests. Finish processing requests that are already being handled. Do not drop active connections.
(3) Close resources cleanly. Close database connection pools, flush log buffers, finish writing to message queues, and acknowledge pending messages.
(4) Exit with code 0.

If the process does not exit within terminationGracePeriodSeconds, Kubernetes sends SIGKILL, which cannot be caught. For long-running operations (video transcoding, large file uploads), increase terminationGracePeriodSeconds or implement checkpointing so the operation can be resumed by another pod.

Node.js: handle SIGTERM with process.on('SIGTERM', handler). Java/Spring: use @PreDestroy. Go: use signal.Notify for os.Interrupt and syscall.SIGTERM.
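The same idea in Python, as a minimal sketch of the steps above: refuse new work after SIGTERM, then wait for in-flight requests to drain. The `GracefulShutdown` class and its method names are hypothetical, and a real server framework would normally provide its own hooks for this.

```python
import signal
import threading
import time


class GracefulShutdown:
    """Tracks in-flight requests and coordinates a drain after SIGTERM."""

    def __init__(self) -> None:
        self.stopping = False
        self._in_flight = 0
        self._lock = threading.Lock()

    def install(self) -> None:
        # Must be called from the main thread; Python only allows signal
        # handlers to be registered there.
        signal.signal(signal.SIGTERM, self.on_sigterm)

    def on_sigterm(self, signum, frame) -> None:
        # Only set a flag here; the real work happens outside the handler.
        self.stopping = True

    def try_begin_request(self) -> bool:
        """Return False once shutdown has started, so new work is refused."""
        with self._lock:
            if self.stopping:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def drain(self, timeout_s: float, poll_s: float = 0.05) -> bool:
        """Wait for in-flight requests to finish; True if all completed."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll_s)
        with self._lock:
            return self._in_flight == 0
```

In an application, the main loop would call `install()` at startup, wrap each request in `try_begin_request()` and `end_request()`, and once `stopping` is set call `drain()` with a timeout shorter than terminationGracePeriodSeconds before closing resources and exiting with code 0.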

Question 4

How do feature flags decouple deployment from release?

Accepted Answer

Feature flags (feature toggles) allow code to be deployed to production in a disabled state and enabled independently of the deployment process. Deployment is pushing code to servers. Release is making functionality available to users. With feature flags, these are separate events. Deployment happens continuously (multiple times per day via CI/CD). Release happens when the team decides the feature is ready, by toggling a flag rather than by deploying.

Implementation: wrap new code paths in a flag check. If feature_flags.is_enabled("new_checkout_flow", user_context) returns true, execute the new code path; otherwise execute the old code path. The flag evaluation can target specific users (internal testers), a percentage of traffic (1% canary), user attributes (premium plan, specific country), or all users (full rollout).

Kill switch: if the new feature causes problems after release, disable the flag immediately. The effect is near-instant (seconds) compared to a rollback deployment (minutes to hours).

Lifecycle: create the flag, develop behind it, test internally, canary to 1%, ramp to 100%, then remove the flag and the dead code. Critical: set a cleanup deadline when creating the flag. Stale flags accumulate as technical debt, leaving code paths that are never executed but must still be maintained.
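A minimal in-memory sketch of such a flag check, covering the kill switch, user targeting, and percentage rollout. The `Flag` and `FeatureFlags` classes are hypothetical, and the sketch takes a plain user ID where the answer's pseudocode passes a richer user_context. A real system would load flag state from a flag service or config store so it can change without a deploy.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    enabled: bool = False     # kill switch: False disables the flag for everyone
    allowed_users: set = field(default_factory=set)  # e.g. internal testers
    rollout_percent: int = 0  # 0-100, share of all other users


@dataclass
class FeatureFlags:
    flags: dict = field(default_factory=dict)

    def is_enabled(self, name: str, user_id: str) -> bool:
        flag = self.flags.get(name)
        if flag is None or not flag.enabled:
            return False  # unknown or switched-off flags use the old path
        if user_id in flag.allowed_users:
            return True
        return self._bucket(name, user_id) < flag.rollout_percent

    @staticmethod
    def _bucket(name: str, user_id: str) -> int:
        """Map (flag, user) to a stable bucket in the range 0..99."""
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % 100


feature_flags = FeatureFlags({
    "new_checkout_flow": Flag(enabled=True,
                              allowed_users={"tester-1"},
                              rollout_percent=1),
})


def checkout(user_id: str) -> str:
    if feature_flags.is_enabled("new_checkout_flow", user_id):
        return "new checkout flow"
    return "old checkout flow"
```

Hashing the flag name together with the user ID keeps each user's result stable between requests, and ramping from 1% upward only adds users rather than reshuffling them.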

System Design: Zero-Downtime Deployment — Blue-Green, Canary, Rolling Updates, Kubernetes, Feature Flags

Blue-Green Deployment

Canary Deployment

Rolling Updates in Kubernetes

Feature Flags for Deployment Decoupling

Database Migrations During Zero-Downtime Deploys

Deployment Observability