System Design: Database Migration Strategies — Zero-Downtime Schema Changes, Online DDL, gh-ost, Expand-Contract Pattern

Database schema migrations are one of the riskiest operations in production systems. A poorly executed migration can lock tables for hours, corrupt data, or cause downtime. This guide covers production-proven migration strategies that keep your application running while the schema evolves — essential knowledge for senior engineering interviews and real-world operations.

The Expand-Contract Pattern

The expand-contract pattern (also called parallel change) is the safest approach for breaking schema changes. Three phases: (1) Expand — add the new column or table alongside the existing one. Both old and new schemas coexist. The application writes to both old and new columns. (2) Migrate — backfill existing data from the old column to the new column. Run in batches to avoid locking: UPDATE users SET email_normalized = LOWER(email) WHERE email_normalized IS NULL LIMIT 1000. (3) Contract — after all data is migrated and the application no longer reads the old column, drop it. Each phase is a separate deployment. If something goes wrong, you roll back only the current phase without data loss. Example: renaming a column from username to user_handle. Expand: add user_handle column, write to both. Migrate: backfill user_handle from username. Contract: stop reading username, drop the column. This takes 3 deployments instead of 1, but eliminates downtime risk.
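The three phases above can be sketched end to end. A minimal illustration using an in-memory SQLite database as a stand-in for the production store (it follows the username → user_handle example; the batch size of 1 and the table-rebuild contract step are for demonstration only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany("INSERT INTO users (username) VALUES (?)",
                 [("alice",), ("bob",)])

# Phase 1 -- Expand: add the new column; app code now writes to both.
conn.execute("ALTER TABLE users ADD COLUMN user_handle TEXT")

# Phase 2 -- Migrate: backfill in small batches (batch size 1 here).
while conn.execute(
        "SELECT COUNT(*) FROM users WHERE user_handle IS NULL").fetchone()[0]:
    conn.execute(
        "UPDATE users SET user_handle = username "
        "WHERE id IN (SELECT id FROM users WHERE user_handle IS NULL LIMIT 1)")

# Phase 3 -- Contract: once nothing reads username, remove it
# (rebuild shown for portability; MySQL/PostgreSQL use DROP COLUMN).
conn.executescript("""
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, user_handle TEXT);
    INSERT INTO users_new SELECT id, user_handle FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
""")

rows = conn.execute("SELECT user_handle FROM users ORDER BY id").fetchall()
print(rows)
```

Each phase here corresponds to one of the three separate deployments described above; in production, each would ship and be verified independently before the next begins.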

Online DDL Tools: gh-ost and pt-online-schema-change

Many ALTER TABLE operations on large MySQL tables rewrite the table and block writes for the duration of the change; a 500M-row table may be unavailable for hours. Online DDL tools solve this. gh-ost (GitHub's online schema migration tool for MySQL): (1) Creates a ghost table with the new schema. (2) Copies rows from the original table to the ghost table in small batches (100-1000 rows). (3) Simultaneously tails the MySQL binary log to capture any writes to the original table during the copy and applies them to the ghost table. (4) When the copy is complete and the ghost table is caught up with the binlog, performs an atomic rename: RENAME TABLE original TO _old, ghost TO original. The application experiences only a brief (sub-second) lock during the rename. gh-ost advantages: it does not use triggers (unlike pt-online-schema-change), which means no write amplification on the original table. Throttling: gh-ost monitors replication lag and pauses copying if lag exceeds a threshold, preventing replica drift. PostgreSQL: use CREATE INDEX CONCURRENTLY to build indexes without blocking writes. For column additions, ALTER TABLE ADD COLUMN with a default value is instant in PostgreSQL 11+ (the default is stored in the catalog, not written to each row).
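The copy-and-catch-up mechanics can be illustrated with a toy simulation. This sketch uses SQLite, with a plain Python list standing in for the MySQL binlog; the table names (_orders_gho, _orders_old) mirror gh-ost's naming convention, but the scenario is hypothetical, and SQLite's two renames are not atomic the way MySQL's single RENAME TABLE statement is:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 10) for i in range(1, 6)])

# 1. Ghost table with the new schema (extra 'note' column).
conn.execute("CREATE TABLE _orders_gho "
             "(id INTEGER PRIMARY KEY, amount INTEGER, note TEXT)")

binlog = []  # stand-in for DML captured from the binary log during the copy

# 2. Batched row copy (chunk size 2).
last_id = 0
while True:
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id LIMIT 2",
        (last_id,)).fetchall()
    if not rows:
        break
    conn.executemany(
        "INSERT OR REPLACE INTO _orders_gho (id, amount) VALUES (?, ?)", rows)
    last_id = rows[-1][0]
    if last_id == 2:
        # A concurrent write lands mid-copy on an already-copied row;
        # gh-ost would observe it in the binlog stream.
        conn.execute("UPDATE orders SET amount = 999 WHERE id = 1")
        binlog.append(("UPDATE", 1, 999))

# 3. Apply captured events so the ghost table catches up.
for op, row_id, amount in binlog:
    if op == "UPDATE":
        conn.execute("UPDATE _orders_gho SET amount = ? WHERE id = ?",
                     (amount, row_id))

# 4. Cut-over: swap table names (one atomic RENAME TABLE in real MySQL).
conn.executescript("ALTER TABLE orders RENAME TO _orders_old;"
                   "ALTER TABLE _orders_gho RENAME TO orders;")

new_amount = conn.execute(
    "SELECT amount FROM orders WHERE id = 1").fetchone()[0]
print(new_amount)
```

Note how step 3 is what makes the cut-over safe: without replaying the captured write, the already-copied row id=1 would silently lose its update.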

Backward-Compatible Migrations

During a rolling deployment, old and new application versions run simultaneously. Migrations must be backward-compatible with both versions. Safe operations: adding a nullable column (old code ignores it), adding a new table (old code does not reference it), adding an index (transparent to application code), adding a column with a default value. Unsafe operations: dropping a column (old code still references it — will crash), renaming a column (old code uses the old name), changing a column type (old code expects the old type), adding a NOT NULL constraint without a default (existing rows violate it). For unsafe operations, use the expand-contract pattern to make them safe across multiple deployments. Migration ordering rule: deploy code that handles both schemas first, then run the migration. Never run a migration that removes something the currently-deployed code depends on.
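One way to honor the ordering rule is application code that tolerates both schemas before the migration runs. A hypothetical sketch (SQLite stand-in; the get_handle helper and table layout are illustrative, not from any framework):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO users (username) VALUES ('alice')")

def get_handle(conn, user_id):
    """Schema-tolerant read: prefers user_handle, falls back to username."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(users)")}
    col = ("COALESCE(user_handle, username)"
           if "user_handle" in cols else "username")
    return conn.execute(
        f"SELECT {col} FROM users WHERE id = ?", (user_id,)).fetchone()[0]

h_before = get_handle(conn, 1)   # works against the old schema
conn.execute("ALTER TABLE users ADD COLUMN user_handle TEXT")  # safe, additive
h_after = get_handle(conn, 1)    # still works against the new schema
print(h_before, h_after)
```

Because the reader handles both shapes, this code can be deployed first, and the additive migration can run at any point afterward without coordinating with the rollout.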

Data Backfill Strategies

Backfilling data into a new column on a large table requires careful execution. Naive approach: UPDATE users SET new_col = compute(old_col). This locks the entire table in a single transaction and may time out or exhaust memory. Production approach: batch processing with throttling. Process 1000 rows per batch with a 100ms sleep between batches. Track progress with a cursor: UPDATE users SET new_col = compute(old_col) WHERE id > last_processed_id AND id <= last_processed_id + 1000 AND new_col IS NULL. This is resumable (if interrupted, restart from last_processed_id), throttled (does not overwhelm the database), and observable (log progress every 10,000 rows). For very large tables (billions of rows), run the backfill on a read replica, then promote it, or use a Spark job to read from the replica, compute new values, and write back to the primary in batches via JDBC.
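The cursor-based batch loop can be sketched as follows (SQLite stand-in; the batch size, sleep duration, and email_normalized example are illustrative — production values depend on observed load):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users "
             "(id INTEGER PRIMARY KEY, email TEXT, email_normalized TEXT)")
conn.executemany("INSERT INTO users (id, email) VALUES (?, ?)",
                 [(i, f"User{i}@Example.com") for i in range(1, 26)])

BATCH = 10
last_id = 0  # persist this in a progress table to make the job resumable
while True:
    cur = conn.execute(
        "UPDATE users SET email_normalized = LOWER(email) "
        "WHERE id > ? AND id <= ? AND email_normalized IS NULL",
        (last_id, last_id + BATCH))
    conn.commit()
    max_id = conn.execute("SELECT MAX(id) FROM users").fetchone()[0]
    if cur.rowcount == 0 and last_id >= max_id:
        break  # past the end of the table: backfill complete
    last_id += BATCH
    time.sleep(0.01)  # throttle between batches (100ms+ in production)

remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email_normalized IS NULL").fetchone()[0]
print(remaining)
```

The `new_col IS NULL` guard makes each batch idempotent, so restarting from a stale last_id after an interruption re-processes nothing.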

Migration Rollback Strategies

Every migration should have a tested rollback plan. Rollback approaches: (1) Reverse migration — for additive changes (add column, add index), the rollback is the inverse operation (drop column, drop index). Write both up and down migrations. (2) Expand-contract — destructive changes (drop column) cannot be rolled back once executed, which is why the expand-contract pattern exists. During the expand phase, both columns exist, so rollback means simply stopping writes to the new column. (3) Point-in-time recovery (PITR) — for catastrophic migration failures, restore the database from a backup to a point before the migration. This loses all data written after the backup; PITR is the last resort. (4) Blue-green database pattern — maintain two database instances. Run the migration on the green instance while traffic hits the blue instance, switch traffic to green after verifying the migration, and roll back by switching back to blue. This requires application-level database routing and careful replication management.
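A paired up/down migration for an additive change might look like this (plain-Python sketch, not tied to any migration framework; the table rebuild in the down step mirrors how older SQLite versions remove a column):

```python
import sqlite3

def up(conn):
    """Additive change: new column plus supporting index."""
    conn.execute("ALTER TABLE users ADD COLUMN last_login TEXT")
    conn.execute("CREATE INDEX idx_users_last_login ON users(last_login)")

def down(conn):
    """Inverse operations, applied in reverse order of up()."""
    conn.execute("DROP INDEX idx_users_last_login")
    conn.executescript("""
        CREATE TABLE users_tmp (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO users_tmp SELECT id, name FROM users;
        DROP TABLE users;
        ALTER TABLE users_tmp RENAME TO users;
    """)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

up(conn)
cols_after_up = [r[1] for r in conn.execute("PRAGMA table_info(users)")]
down(conn)
cols_after_down = [r[1] for r in conn.execute("PRAGMA table_info(users)")]
name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]
print(cols_after_up, cols_after_down, name)
```

Running up() then down() and verifying the schema and data match the starting state is exactly the rollback test described in the next section's checklist.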

Testing Migrations Before Production

Migration testing checklist: (1) Run the migration against a production-sized dataset. A migration that takes 1 second on 1000 rows may take 6 hours on 500M rows. Restore a production backup to a staging environment and time the migration. (2) Verify backward compatibility by running the previous application version against the new schema and the new application version against the old schema. Both must work. (3) Test the rollback migration. Apply the up migration, then apply the down migration, and verify data integrity. (4) Check for lock conflicts. On MySQL, monitor SHOW PROCESSLIST during the migration to detect blocked queries. On PostgreSQL, check pg_stat_activity for waiting locks. (5) Measure replication lag during the migration. If the migration generates a large volume of binlog events, replicas may fall behind. Monitor Seconds_Behind_Master (MySQL) or pg_stat_replication (PostgreSQL).
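Checklist items (1) and (3) lend themselves to scripting. A toy harness that times a migration against a larger dataset, applies the down migration, and verifies data integrity (sizes, names, and the index migration are illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 (("u%d@example.com" % i,) for i in range(100_000)))

# Item (1): time the up migration on a production-sized dataset.
start = time.perf_counter()
conn.execute("CREATE INDEX idx_users_email ON users(email)")   # up
elapsed = time.perf_counter() - start
print(f"up migration took {elapsed:.3f}s on 100k rows")

# Item (3): apply the down migration and verify data integrity.
conn.execute("DROP INDEX idx_users_email")                     # down
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)
```

Against a restored production backup, the same timing measurement is what tells you whether the real migration needs an online DDL tool instead of a plain ALTER.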
