Blue-green deployment eliminates downtime by maintaining two identical production environments. At any moment, one environment (blue) serves 100% of live traffic while the other (green) sits idle and ready to receive the next release. The load balancer is the single toggle point: flipping traffic from blue to green takes milliseconds and, crucially, can be reversed just as fast.
Environment Model
Within a deployment cycle, blue denotes the currently active environment and green the staging target for the next release. Both environments are identical in infrastructure: same instance types, same configuration, same database connection strings. The only difference is the application version running on each. After a successful switch, the roles conceptually swap: the environment that was serving traffic becomes the idle standby for the next deployment cycle.
Deployment Pipeline
The pipeline follows a strict sequence:
- Identify inactive environment — query the DeploymentEnvironment table for the environment not currently receiving traffic.
- Deploy new version — push artifacts to inactive environment, restart services.
- Health check gate — poll the health endpoint. Green must return HTTP 200 for N consecutive checks (default: 5, interval: 10 seconds) before proceeding. Any failure resets the counter.
- Smoke tests — run a synthetic transaction suite against green directly (bypassing the load balancer).
- Switch traffic — atomic update to the load balancer. For nginx, reload the upstream block. For AWS ALB, update target group weights to 0/100.
- Monitor — watch 5xx error rate for 5 minutes. Trigger auto-rollback if threshold is breached.
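Tied together, the sequence above reduces to a short driver. The sketch below injects each step as a callable so only the strict ordering and the two abort gates are encoded; every name here (`run_pipeline`, `gate_passed`, and so on) is illustrative, not taken from a real pipeline.

```python
def run_pipeline(version, get_inactive, deploy, gate_passed,
                 smoke_passed, switch, stable_after_switch):
    """Run the blue-green sequence; return the final outcome as a string.

    Each argument after `version` is a callable implementing one pipeline
    step, so this driver only encodes the ordering and the gates.
    """
    env = get_inactive()                      # 1. find the idle environment
    deploy(env, version)                      # 2. push artifacts, restart
    if not gate_passed(env):                  # 3. N consecutive health checks
        return "aborted: health gate failed"
    if not smoke_passed(env):                 # 4. synthetic transaction suite
        return "aborted: smoke tests failed"
    switch(env)                               # 5. atomic load balancer flip
    if not stable_after_switch():             # 6. watch 5xx rate post-switch
        return "rolled back"
    return "released"
```

Because the steps are injected, the abort paths are easy to exercise with stubs before wiring in real deployment and load balancer calls.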
Health Check Gate
A single passing health check is not sufficient. Transient startup issues, connection pool warm-up, and cache misses can cause intermittent failures immediately after boot. The gate requires N consecutive successes. The health endpoint must check: application process is up, database connectivity, and any critical downstream dependency.
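The reset-on-failure rule can be isolated as a pure function over a stream of check results. A minimal sketch (`gate_passed` is an illustrative name, not part of the implementation later in this article):

```python
def gate_passed(results, required_consecutive=5):
    """Return True once `required_consecutive` successes occur in a row.

    Any failure resets the counter, so intermittent flapping during
    startup never satisfies the gate.
    """
    consecutive = 0
    for ok in results:
        consecutive = consecutive + 1 if ok else 0
        if consecutive >= required_consecutive:
            return True
    return False
```

Note that nine passes with a single failure in the middle do not satisfy a gate of five: the failure zeroes the streak, which is exactly the behavior that filters out connection-pool warm-up blips.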
Traffic Switch Implementation
For nginx, the switch rewrites the upstream block and sends SIGHUP:
upstream active_backend {
server green-env:8080;
}
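One way to script the nginx side is to render the upstream block from the target color, write it to an included config file, and reload. This is a hedged sketch assuming backends named `<color>-env` on port 8080; the include path and both function names are illustrative.

```python
import subprocess

def render_upstream(color: str, port: int = 8080) -> str:
    """Render the upstream block pointing at the given environment."""
    return (
        "upstream active_backend {\n"
        f"    server {color}-env:{port};\n"
        "}\n"
    )

def switch_nginx(color: str,
                 conf_path: str = "/etc/nginx/conf.d/active_backend.conf") -> None:
    """Write the new upstream and reload nginx (equivalent to SIGHUP)."""
    with open(conf_path, "w") as f:
        f.write(render_upstream(color))
    subprocess.run(["nginx", "-s", "reload"], check=True)
```

Separating rendering from the reload keeps the config generation testable without a running nginx.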
For AWS ALB, update target group weights via the API:
aws elbv2 modify-listener --listener-arn "$ARN" \
  --default-actions "[{
    \"Type\": \"forward\",
    \"ForwardConfig\": {
      \"TargetGroups\": [
        {\"TargetGroupArn\": \"$GREEN_TG\", \"Weight\": 100},
        {\"TargetGroupArn\": \"$BLUE_TG\", \"Weight\": 0}
      ]
    }
  }]"
The switch must be atomic from the load balancer's perspective. No request should be split between environments.
Error Rate Monitor and Auto-Rollback
Immediately after the switch, a monitoring process polls the metrics endpoint every 30 seconds. If the 5xx error rate exceeds the configured threshold (e.g., 1%) within the first 5 minutes, an automatic rollback fires: the traffic switch is reversed and a DeploymentEvent is recorded with event_type = ROLLBACK_AUTO. The rollback itself is the same atomic load balancer update, just in reverse.
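The error rate itself is typically derived from two snapshots of monotonically increasing request counters (Prometheus-style). A minimal sketch of that arithmetic, with illustrative field names:

```python
def error_rate_5xx(prev: dict, curr: dict) -> float:
    """5xx error rate over the interval between two counter snapshots.

    Both snapshots carry cumulative totals, e.g.
    {"requests_total": 120000, "responses_5xx_total": 340}.
    """
    requests_delta = curr["requests_total"] - prev["requests_total"]
    errors_delta = curr["responses_5xx_total"] - prev["responses_5xx_total"]
    if requests_delta <= 0:
        return 0.0  # no traffic in the window; nothing to judge yet
    return errors_delta / requests_delta
```

Using deltas between snapshots rather than raw totals is what makes the measurement a rolling window: each 30-second poll compares against the previous poll, so a burst of errors is visible immediately instead of being diluted by hours of healthy traffic.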
Database Migration Strategy
Database migrations are the hardest part of blue-green deployments. The rule is strict: all migrations must be backward-compatible. Green must be able to operate against the schema before and after the migration runs. This means:
- Additive changes only at switch time — add columns as nullable, add new tables.
- No destructive changes at switch time — column renames and drops happen in a separate, later deployment after blue is also running the new code.
- Expand-contract pattern — Phase 1: add new column (both blue and green write to old column; green also writes to new). Phase 2: switch traffic. Phase 3: drop old column once green is stable.
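Concretely, the three phases for replacing a column (say `email` with `contact_email` on a hypothetical `users` table) look like this in SQL; each phase ships in a separate deployment:

```sql
-- Phase 1 (expand): ship before the switch. Additive and nullable,
-- so both blue and green can run against this schema.
ALTER TABLE users ADD COLUMN contact_email TEXT;

-- Between phases: green writes both columns; a backfill copies
-- historical rows so reads from the new column are complete.
UPDATE users SET contact_email = email WHERE contact_email IS NULL;

-- Phase 3 (contract): a later deployment, only once green is proven
-- stable and no running code reads the old column.
ALTER TABLE users DROP COLUMN email;
```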
SQL Schema
CREATE TABLE deployment_environment (
id SERIAL PRIMARY KEY,
color VARCHAR(10) NOT NULL CHECK (color IN ('blue', 'green')),
version VARCHAR(100) NOT NULL,
status VARCHAR(30) NOT NULL DEFAULT 'idle',
-- status: idle | deploying | health_checking | active | rolling_back
health_check_url TEXT NOT NULL,
deployed_at TIMESTAMPTZ,
activated_at TIMESTAMPTZ,
UNIQUE (color)
);
CREATE TABLE deployment_event (
id BIGSERIAL PRIMARY KEY,
env_id INT NOT NULL REFERENCES deployment_environment(id),
event_type VARCHAR(50) NOT NULL,
-- DEPLOY_START | HEALTH_CHECK_PASS | HEALTH_CHECK_FAIL |
-- SMOKE_TEST_PASS | TRAFFIC_SWITCH | ROLLBACK_AUTO | ROLLBACK_MANUAL
actor VARCHAR(100),
metadata JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON deployment_event (env_id, created_at DESC);
CREATE TABLE traffic_switch (
id SERIAL PRIMARY KEY,
from_color VARCHAR(10) NOT NULL,
to_color VARCHAR(10) NOT NULL,
switched_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
rolled_back_at TIMESTAMPTZ,
rollback_reason TEXT
);
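Against this schema, step 1 of the pipeline (identify the inactive environment) is a single query. A sketch assuming exactly two rows, one per color:

```sql
-- The inactive environment is whichever row is not serving traffic.
SELECT id, color, version, health_check_url
FROM deployment_environment
WHERE status <> 'active'
ORDER BY activated_at ASC NULLS FIRST
LIMIT 1;
```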
Python Implementation
import time

import boto3
import psycopg2  # used by the DB helpers (get_inactive_environment, record_event, ...)
import requests

def deploy_to_inactive(version: str) -> dict:
    """Deploy new version to the inactive environment and return its record."""
    env = get_inactive_environment()
    update_env_status(env["id"], "deploying")
    record_event(env["id"], "DEPLOY_START", metadata={"version": version})
    # Trigger deployment (Ansible, Kubernetes rollout, etc.)
    trigger_deployment(env["color"], version)
    update_env_status(env["id"], "health_checking")
    return env

def run_health_checks(env: dict, required_consecutive: int = 5,
                      interval: int = 10, max_attempts: int = 60) -> bool:
    """Poll health endpoint; return True only after N consecutive successes."""
    consecutive = 0
    url = env["health_check_url"]
    for _ in range(max_attempts):
        try:
            healthy = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            healthy = False
        if healthy:
            consecutive += 1
            record_event(env["id"], "HEALTH_CHECK_PASS")
        else:
            consecutive = 0  # any failure resets the counter
            record_event(env["id"], "HEALTH_CHECK_FAIL")
        if consecutive >= required_consecutive:
            return True
        time.sleep(interval)
    return False  # gate not passed within max_attempts

def switch_traffic(from_color: str, to_color: str) -> int:
    """Atomically switch ALB traffic; return switch_id."""
    alb = boto3.client("elbv2")
    alb.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": TARGET_GROUPS[to_color], "Weight": 100},
                    {"TargetGroupArn": TARGET_GROUPS[from_color], "Weight": 0},
                ]
            }
        }]
    )
    switch_id = record_traffic_switch(from_color, to_color)
    record_event(get_env_by_color(to_color)["id"], "TRAFFIC_SWITCH",
                 metadata={"switch_id": switch_id})
    return switch_id

def monitor_and_rollback(switch_id: int, threshold: float = 0.01,
                         window_seconds: int = 300, poll_interval: int = 30) -> bool:
    """Monitor error rate post-switch; auto-rollback if threshold breached."""
    deadline = time.time() + window_seconds
    switch = get_traffic_switch(switch_id)
    while time.time() < deadline:
        rate = get_5xx_error_rate()  # poll the metrics endpoint
        if rate > threshold:
            # Rollback is the same atomic switch, in reverse.
            switch_traffic(switch["to_color"], switch["from_color"])
            mark_switch_rolled_back(
                switch_id,
                reason=f"5xx rate {rate:.2%} exceeded {threshold:.2%}")
            record_event(get_env_by_color(switch["to_color"])["id"],
                         "ROLLBACK_AUTO",
                         metadata={"error_rate": rate, "switch_id": switch_id})
            return False
        time.sleep(poll_interval)
    return True  # stable
Database Connection Handling During Switch
Both environments maintain their own connection pools to the database. During the switch, in-flight requests on the old environment complete normally — the load balancer drains connections with a configurable timeout (typically 30 seconds) before fully removing the old environment from rotation. No connections are dropped mid-request.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Why must database migrations be backward-compatible in blue-green deployments?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "During the traffic switch, both environments may briefly handle requests, and after a rollback the old environment resumes. If green drops or renames a column, blue breaks immediately. Migrations must use the expand-contract pattern: add new structures before the switch, remove old ones only after the new version is proven stable."
      }
    },
    {
      "@type": "Question",
      "name": "What is the health check gate and why require consecutive successes?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The health check gate blocks traffic switching until the new environment passes N consecutive health checks. A single pass is insufficient because services often have transient failures during startup (connection pool warm-up, cache misses). Requiring N consecutive successes (typically 5 at 10-second intervals) confirms the environment is genuinely stable."
      }
    },
    {
      "@type": "Question",
      "name": "What error rate threshold triggers automatic rollback?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A common default is 1% 5xx error rate measured over a 60-second rolling window, monitored for 5 minutes post-switch. The threshold depends on the baseline error rate of the service. Teams often set it at baseline + 0.5%. If breached, the rollback fires the same atomic traffic switch in reverse."
      }
    },
    {
      "@type": "Question",
      "name": "How are database connections handled during the traffic switch?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Each environment maintains its own connection pool. The load balancer uses connection draining: in-flight requests on the outgoing environment are allowed to complete (up to a 30-second drain timeout) before that environment is removed from rotation. No active requests are interrupted. New requests immediately go to the new environment."
      }
    }
  ]
}