Database schema migrations are one of the riskiest operations in production systems. A poorly executed migration can lock tables for hours, corrupt data, or cause downtime. This guide covers production-proven migration strategies that keep your application running while the schema evolves — essential knowledge for senior engineering interviews and real-world operations.
The Expand-Contract Pattern
The expand-contract pattern (also called parallel change) is the safest approach for breaking schema changes. Three phases: (1) Expand — add the new column or table alongside the existing one. Both old and new schemas coexist. The application writes to both old and new columns. (2) Migrate — backfill existing data from the old column to the new column. Run in batches to avoid locking: UPDATE users SET email_normalized = LOWER(email) WHERE email_normalized IS NULL LIMIT 1000. (3) Contract — after all data is migrated and the application no longer reads the old column, drop it. Each phase is a separate deployment. If something goes wrong, you roll back only the current phase without data loss. Example: renaming a column from username to user_handle. Expand: add user_handle column, write to both. Migrate: backfill user_handle from username. Contract: stop reading username, drop the column. This takes 3 deployments instead of 1, but eliminates downtime risk.
Online DDL Tools: gh-ost and pt-online-schema-change
Large table alterations in MySQL lock the table for the duration of the ALTER TABLE operation. A 500M-row table may be locked for hours. Online DDL tools solve this. gh-ost (GitHub Online Schema Tool): (1) Creates a ghost table with the new schema. (2) Copies rows from the original table to the ghost table in small batches (100-1000 rows). (3) Simultaneously tails the MySQL binary log to capture any writes to the original table during the copy and applies them to the ghost table. (4) When the copy is complete and the ghost table is caught up with the binlog, performs an atomic rename: RENAME TABLE original TO _old, ghost TO original. The application experiences only a brief (sub-second) lock during the rename. gh-ost advantages: does not use triggers (unlike pt-online-schema-change), which means no write amplification on the original table. Throttling: gh-ost monitors replication lag and pauses copying if lag exceeds a threshold, preventing replica drift. PostgreSQL: use CREATE INDEX CONCURRENTLY for index creation without locks. For column additions, ALTER TABLE ADD COLUMN with a default value is instant in PostgreSQL 11+ (the default is stored in the catalog, not written to each row).
Backward-Compatible Migrations
During a rolling deployment, old and new application versions run simultaneously. Migrations must be backward-compatible with both versions. Safe operations: adding a nullable column (old code ignores it), adding a new table (old code does not reference it), adding an index (transparent to application code), adding a column with a default value. Unsafe operations: dropping a column (old code still references it — will crash), renaming a column (old code uses the old name), changing a column type (old code expects the old type), adding a NOT NULL constraint without a default (existing rows violate it). For unsafe operations, use the expand-contract pattern to make them safe across multiple deployments. Migration ordering rule: deploy code that handles both schemas first, then run the migration. Never run a migration that removes something the currently-deployed code depends on.
Data Backfill Strategies
Backfilling data into a new column on a large table requires careful execution. Naive approach: UPDATE users SET new_col = compute(old_col). This locks the entire table in a single transaction and may timeout or exhaust memory. Production approach: batch processing with throttling. Process 1000 rows per batch with a 100ms sleep between batches. Track progress with a cursor: UPDATE users SET new_col = compute(old_col) WHERE id > last_processed_id AND id <= last_processed_id + 1000 AND new_col IS NULL. This is resumable (if interrupted, restart from last_processed_id), throttled (does not overwhelm the database), and observable (log progress every 10,000 rows). For very large tables (billions of rows), run the backfill on a read replica, then promote it. Or use a Spark job to read from the replica, compute new values, and write back to the primary in batches via JDBC.
Migration Rollback Strategies
Every migration should have a tested rollback plan. Rollback approaches: (1) Reverse migration — for additive changes (add column, add index), the rollback is the inverse operation (drop column, drop index). Write both up and down migrations. (2) For destructive changes (drop column), the rollback is not possible after execution — this is why the expand-contract pattern exists. During the expand phase, both columns exist, so rollback means simply stopping writes to the new column. (3) Point-in-time recovery (PITR) — for catastrophic migration failures, restore the database from a backup to a point before the migration. This loses all data written after the backup. PITR is the last resort. (4) Blue-green database pattern — maintain two database instances. Run the migration on the green instance while traffic hits the blue instance. Switch traffic to green after verifying the migration. Rollback: switch back to blue. This requires application-level database routing and careful replication management.
Testing Migrations Before Production
Migration testing checklist: (1) Run the migration against a production-sized dataset. A migration that takes 1 second on 1000 rows may take 6 hours on 500M rows. Restore a production backup to a staging environment and time the migration. (2) Verify backward compatibility by running the previous application version against the new schema and the new application version against the old schema. Both must work. (3) Test the rollback migration. Apply the up migration, then apply the down migration, and verify data integrity. (4) Check for lock conflicts. On MySQL, monitor SHOW PROCESSLIST during the migration to detect blocked queries. On PostgreSQL, check pg_stat_activity for waiting locks. (5) Measure replication lag during the migration. If the migration generates a large volume of binlog events, replicas may fall behind. Monitor Seconds_Behind_Master (MySQL) or pg_stat_replication (PostgreSQL).