Blue-Green Deployment Low-Level Design: Environment Swap, Database Migration, and Cutover

Blue-green deployment eliminates downtime by maintaining two identical production environments and switching traffic between them with a single routing change. This design covers environment management, database migration strategy, health gating before cutover, and instant rollback mechanics.

Requirements

Functional

Maintain two named environments (Blue and Green) with identical infrastructure.
Deploy the new version to the idle environment without affecting live traffic.
Run database migrations in backward-compatible phases before cutover.
Validate the idle environment via health gates before switching traffic.
Cut over in under 10 seconds and roll back in under 10 seconds.

Non-Functional

Zero dropped requests during cutover (connection draining).
Immutable environment records for compliance audit.
DNS TTL (or load balancer routing propagation time) under 30 seconds, to bound the window in which clients can still be routed to the stale environment.

Data Model

Environment — envId, color (BLUE or GREEN), serviceVersion, infraManifestRef, status (IDLE, WARMING, ACTIVE, DRAINING), healthGatePassed, promotedAt.
DeploymentPlan — planId, targetColor, newVersion, migrationSteps (ordered list), healthGates, cutoverPolicy (INSTANT or GRADUAL), status.
MigrationStep — stepId, planId, phase (EXPAND or CONTRACT), ddlScript, rollbackScript, appliedAt, checksum.
CutoverEvent — eventId, planId, fromColor, toColor, actor, occurredAt, durationMs.
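
A minimal sketch of the data model as Python dataclasses, assuming snake_case field names and in-memory types. The "CREATED" plan status and the list-of-gate-names shape for health_gates are hypothetical, since the design does not define them. CutoverEvent is frozen to mirror the immutable-audit-record requirement.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Color(Enum):
    BLUE = "BLUE"
    GREEN = "GREEN"


class EnvStatus(Enum):
    IDLE = "IDLE"
    WARMING = "WARMING"
    ACTIVE = "ACTIVE"
    DRAINING = "DRAINING"


class Phase(Enum):
    EXPAND = "EXPAND"
    CONTRACT = "CONTRACT"


class CutoverPolicy(Enum):
    INSTANT = "INSTANT"
    GRADUAL = "GRADUAL"


@dataclass
class Environment:
    env_id: str
    color: Color
    service_version: str
    infra_manifest_ref: str
    status: EnvStatus = EnvStatus.IDLE
    health_gate_passed: bool = False
    promoted_at: Optional[datetime] = None


@dataclass
class MigrationStep:
    step_id: str
    plan_id: str
    phase: Phase
    ddl_script: str
    rollback_script: str
    checksum: str  # placed before applied_at: fields with defaults must come last
    applied_at: Optional[datetime] = None


@dataclass
class DeploymentPlan:
    plan_id: str
    target_color: Color
    new_version: str
    migration_steps: List[MigrationStep] = field(default_factory=list)  # ordered
    health_gates: List[str] = field(default_factory=list)  # gate names (assumed shape)
    cutover_policy: CutoverPolicy = CutoverPolicy.INSTANT
    status: str = "CREATED"  # hypothetical initial status


@dataclass(frozen=True)  # frozen: cutover events are immutable audit records
class CutoverEvent:
    event_id: str
    plan_id: str
    from_color: Color
    to_color: Color
    actor: str
    occurred_at: datetime
    duration_ms: int
```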

Database Migration Strategy

Both environments share the same database during the transition window, so every schema change must be backward-compatible with the old and new application versions simultaneously. Follow the expand-contract pattern. In the EXPAND phase (before cutover), add new columns that are nullable or carry defaults, so inserts from the old version that omit them still succeed; add new tables; and create new indexes concurrently (for example, CREATE INDEX CONCURRENTLY in PostgreSQL) so writes are not blocked. When EXPAND runs, only the old version is serving traffic, and it never references the new structures, so it keeps working unchanged. In the CONTRACT phase (after the rollback window expires), drop old columns, rename tables, and add NOT NULL constraints. Never combine expand and contract in a single deployment.
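
One way to enforce the two expand-contract rules in code, as a sketch: a deployment's migration steps must not mix phases, and CONTRACT may run only once the rollback window after cutover has elapsed. The orders table and its columns in the sample DDL are hypothetical; the statements use PostgreSQL syntax.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Iterable, Optional


class Phase(Enum):
    EXPAND = "EXPAND"
    CONTRACT = "CONTRACT"


# Hypothetical PostgreSQL DDL for an illustrative `orders` table.
EXPAND_EXAMPLE = [
    # The default keeps INSERTs from the old version (which omit the column) valid.
    "ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'",
    # CONCURRENTLY avoids blocking writes; it cannot run inside a transaction block.
    "CREATE INDEX CONCURRENTLY idx_orders_currency ON orders (currency)",
]
CONTRACT_EXAMPLE = [
    "ALTER TABLE orders DROP COLUMN legacy_currency_code",
    "ALTER TABLE orders ALTER COLUMN currency SET NOT NULL",
]


def validate_single_phase(phases: Iterable[Phase]) -> Optional[Phase]:
    """Reject migration steps that mix EXPAND and CONTRACT; None means no migrations."""
    distinct = set(phases)
    if len(distinct) > 1:
        raise ValueError("EXPAND and CONTRACT steps must ship in separate deployments")
    return distinct.pop() if distinct else None


def contract_allowed(cutover_at: Optional[datetime], rollback_window: timedelta, now: datetime) -> bool:
    """CONTRACT may run only after a cutover and once its rollback window has fully elapsed."""
    return cutover_at is not None and now >= cutover_at + rollback_window
```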

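Track each migration step with its checksum in MigrationStep. Before applying a step, recompute the checksum of its ddlScript and verify that it matches the value recorded in the deployment plan, so a script edited after the plan was approved is caught before it drifts the schema.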
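
A sketch of the checksum guard, assuming SHA-256 over the UTF-8 bytes of ddlScript; the design does not name a hash function, and script_checksum, verify_before_apply, and MigrationDriftError are hypothetical names.

```python
import hashlib


class MigrationDriftError(RuntimeError):
    """Raised when a script no longer matches the checksum recorded in the plan."""


def script_checksum(ddl_script: str) -> str:
    """SHA-256 hex digest of the script text (hash choice is an assumption)."""
    return hashlib.sha256(ddl_script.encode("utf-8")).hexdigest()


def verify_before_apply(recorded_checksum: str, ddl_script: str) -> None:
    """Refuse to apply a step whose script differs from what the plan recorded."""
    actual = script_checksum(ddl_script)
    if actual != recorded_checksum:
        raise MigrationDriftError(
            f"checksum mismatch: plan has {recorded_checksum[:12]}, script hashes to {actual[:12]}"
        )
```

Note that any byte-level change, including whitespace or line endings, changes the digest; normalizing scripts before hashing would be a separate design decision.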

Core Algorithms

Health Gate Evaluation

Before cutover, run a synthetic health-check suite against the idle environment through internal DNS or a staging VIP. Gates include: the HTTP health endpoint returns 200, all downstream dependencies are reachable from the idle environment, a warm-up request set completes within the latency SLA, and the database connection pool is fully established. All gates must pass within a configurable timeout; a gate that has not passed by then counts as failed. Gate results are stored and linked to the deployment plan for audit.
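
A sketch of gate evaluation against one shared deadline, assuming each gate is a zero-argument callable that returns True on success. Gate implementations are left abstract; real ones would call the idle environment through its internal DNS name or staging VIP. evaluate_gates and all_gates_passed are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict

# A gate is any zero-argument check that returns True when it passes (assumed shape).
Gate = Callable[[], bool]


def evaluate_gates(gates: Dict[str, Gate], timeout_s: float) -> Dict[str, bool]:
    """Run all gates concurrently against one shared deadline.

    A gate that raises, returns a falsy value, or is still running when the
    deadline passes is recorded as failed.
    """
    results: Dict[str, bool] = {}
    deadline = time.monotonic() + timeout_s
    pool = ThreadPoolExecutor(max_workers=max(1, len(gates)))
    try:
        futures = {name: pool.submit(check) for name, check in gates.items()}
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = bool(future.result(timeout=remaining))
            except FutureTimeout:
                results[name] = False  # deadline passed; the thread may still be running
            except Exception:
                results[name] = False  # the check itself raised
    finally:
        # Do not wait for hung gates; Python threads cannot be killed.
        pool.shutdown(wait=False)
    return results


def all_gates_passed(results: Dict[str, bool]) -> bool:
    """Cutover requires at least one gate and every gate to pass."""
    return bool(results) and all(results.values())
```

Because worker threads cannot be cancelled once started, a hung gate may keep running in the background after evaluate_gates returns; the deadline only bounds how long the cutover decision waits.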

Cutover Mechanics

Cutover updates a single routing record. For load-balancer-based switching, point the target group or upstream block at the idle environment. For DNS-based switching, lower the A record's TTL to 10 seconds ahead of time and wait at least one full previous TTL, so resolvers have discarded answers cached under the long TTL; then update the record. The old environment enters the DRAINING state: it finishes in-flight requests (connection draining) for drainTimeoutSeconds, then moves to IDLE. Keep the old environment running for the rollback window (typically one hour) before decommissioning it.
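
A minimal in-memory sketch of the swap and drain transitions, assuming Router.active stands in for the single routing record (a target group or DNS A record). It also encodes the DNS timing rule: after lowering the TTL, the earliest safe record change is one previous TTL later. Router, BlueGreen, and earliest_dns_cutover are hypothetical names.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Router:
    """Stand-in for the single routing record (LB target group or DNS A record)."""
    active: str = "BLUE"


@dataclass
class BlueGreen:
    router: Router
    status: Dict[str, str] = field(default_factory=lambda: {"BLUE": "ACTIVE", "GREEN": "IDLE"})

    def cutover(self, to_color: str) -> int:
        """Flip the routing record and mark the old color DRAINING; returns duration in ms."""
        from_color = self.router.active
        if to_color == from_color:
            raise ValueError(f"{to_color} is already active")
        start = time.perf_counter()
        self.router.active = to_color  # the single routing write
        self.status[to_color] = "ACTIVE"
        self.status[from_color] = "DRAINING"  # keeps serving in-flight requests
        return int((time.perf_counter() - start) * 1000)

    def finish_drain(self, color: str) -> None:
        """Called after drainTimeoutSeconds; processes stay up for the rollback window."""
        if self.status[color] == "DRAINING":
            self.status[color] = "IDLE"


def earliest_dns_cutover(ttl_lowered_at: float, previous_ttl_s: int) -> float:
    """Resolvers may cache the old answer for up to the previous TTL after the change."""
    return ttl_lowered_at + previous_ttl_s
```

For example, if a 3600-second TTL is lowered at time t, the A record should not be switched before t + 3600; from then on, stale routing is bounded by the new 10-second TTL.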

Rollback

Rollback is identical to cutover in the reverse direction. Because the old environment is still DRAINING (or IDLE with its processes running), re-routing takes the same sub-10-second path. The key invariant is that the old version must remain compatible with the current (expanded) database schema. If a CONTRACT migration was already applied, rollback is blocked until the DBA confirms compatibility or the old schema is re-expanded.
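
A sketch of the rollback precondition, assuming the caller already knows whether the old environment is still running and whether any CONTRACT step has been applied since cutover. RollbackContext and rollback_blocked_reason are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RollbackContext:
    old_color: str
    old_env_running: bool  # still DRAINING, or IDLE with its processes running
    contract_applied: bool  # a CONTRACT step has run since the cutover
    dba_confirmed_compatible: bool = False
    schema_re_expanded: bool = False


def rollback_blocked_reason(ctx: RollbackContext) -> Optional[str]:
    """Return why an instant rollback is blocked, or None if re-routing may proceed."""
    if not ctx.old_env_running:
        return f"{ctx.old_color} is no longer running; there is nothing to route back to"
    if ctx.contract_applied and not (ctx.dba_confirmed_compatible or ctx.schema_re_expanded):
        return "a CONTRACT migration was applied; the DBA must confirm compatibility or re-expand the schema"
    return None
```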

API Design

POST /deployment-plans — create a plan with version, migration steps, and health gates.
POST /deployment-plans/{id}/deploy — provision the idle environment and run EXPAND migrations.
POST /deployment-plans/{id}/gate-check — run health gates against the idle environment on demand.
POST /deployment-plans/{id}/cutover — execute traffic switch after gates pass.
POST /deployment-plans/{id}/rollback — switch traffic back to the old environment.
GET /environments/{color}/status — current health, version, and request rate.
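
The lifecycle endpoints imply an order: deploy, then gate-check, then cutover, and rollback only after a cutover. A sketch of that ordering as a small state machine, where the status values and the Conflict exception (which a REST layer might map to HTTP 409) are assumptions, since the design does not enumerate plan statuses.

```python
class Conflict(Exception):
    """Action not allowed in the plan's current status (could map to HTTP 409)."""


class PlanLifecycle:
    """Hypothetical plan statuses guarding the POST /deployment-plans/{id}/... actions."""

    def __init__(self) -> None:
        self.status = "CREATED"

    def _require(self, action: str, *allowed: str) -> None:
        if self.status not in allowed:
            raise Conflict(f"cannot {action} while plan is {self.status}")

    def deploy(self) -> None:  # POST .../deploy
        self._require("deploy", "CREATED")
        self.status = "DEPLOYED"  # idle environment provisioned, EXPAND applied

    def gate_check(self, passed: bool) -> None:  # POST .../gate-check (re-runnable)
        self._require("gate-check", "DEPLOYED", "GATES_FAILED", "GATES_PASSED")
        self.status = "GATES_PASSED" if passed else "GATES_FAILED"

    def cutover(self) -> None:  # POST .../cutover
        self._require("cutover", "GATES_PASSED")
        self.status = "CUT_OVER"

    def rollback(self) -> None:  # POST .../rollback
        self._require("rollback", "CUT_OVER")
        self.status = "ROLLED_BACK"
```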

Scalability and Observability

Use infrastructure-as-code (Terraform or Pulumi) to provision idle environments from a versioned template, ensuring parity with the active environment.
Monitor requests_routed_to{color} at the load balancer to confirm cutover atomicity: after cutover, the two colors should never both receive new production traffic; only in-flight requests finishing on the draining color are expected.
Emit a deployment_cutover_duration_ms histogram to track cutover speed over time and alert when it degrades.
Allow at most one active deployment plan per service, enforced with a lock, so that two teams cannot cut over the same service simultaneously.
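
A sketch of the one-active-plan-per-service rule as an in-process lock table. This only holds within a single orchestrator process; across processes the same check would need shared storage (for example, a unique constraint on an active-plan row), which is an assumption about the backing store. PlanLock is a hypothetical name.

```python
import threading
from typing import Dict, Optional


class PlanLock:
    """In-process stand-in for a per-service deployment lock."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._active: Dict[str, str] = {}  # service -> plan_id holding the lock

    def acquire(self, service: str, plan_id: str) -> bool:
        """Grant the lock if free or already held by this plan; refuse other plans."""
        with self._guard:
            holder = self._active.get(service)
            if holder is not None and holder != plan_id:
                return False
            self._active[service] = plan_id
            return True

    def release(self, service: str, plan_id: str) -> None:
        """Release only if this plan holds the lock; otherwise do nothing."""
        with self._guard:
            if self._active.get(service) == plan_id:
                del self._active[service]

    def holder(self, service: str) -> Optional[str]:
        with self._guard:
            return self._active.get(service)
```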
