Zero-Downtime Database Migrations
Running schema migrations on a production PostgreSQL database that serves thousands of requests per second is one of the most nerve-wracking tasks in backend engineering. A single ALTER TABLE ... ADD COLUMN ... NOT NULL can lock your table for minutes, cascade into connection pool exhaustion, and bring your entire platform down. This article covers battle-tested strategies for executing migrations with zero user-facing impact.
The Expand-Contract Pattern
The expand-contract pattern (sometimes called parallel change) splits every destructive migration into three safe, independently deployable phases:
- Expand -- Add the new structure alongside the old one. Both coexist.
- Migrate -- Backfill data, switch application reads/writes to the new structure.
- Contract -- Drop the old structure once nothing references it.
Each phase is a separate deployment. If anything goes wrong, you roll back only that phase, not the entire migration.
Example: Renaming a Column
Suppose you need to rename users.name to users.full_name. A naive ALTER TABLE users RENAME COLUMN takes an ACCESS EXCLUSIVE lock -- the most restrictive lock PostgreSQL offers.
Phase 1: Expand
-- Migration V1: Add the new column (takes ACCESS EXCLUSIVE but instant on PG 11+)
ALTER TABLE users ADD COLUMN full_name TEXT;
-- Create a trigger to keep both columns in sync
CREATE OR REPLACE FUNCTION sync_user_name() RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' OR NEW.name IS DISTINCT FROM OLD.name THEN
NEW.full_name := NEW.name;
END IF;
IF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
NEW.name := NEW.full_name;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_sync_user_name
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_user_name();Phase 2: Migrate
-- Backfill in batches to avoid long-running transactions
DO $$
DECLARE
batch_size INT := 10000;
rows_updated INT;
BEGIN
LOOP
UPDATE users
SET full_name = name
WHERE full_name IS NULL
AND id IN (
SELECT id FROM users
WHERE full_name IS NULL
LIMIT batch_size
FOR UPDATE SKIP LOCKED
);
GET DIAGNOSTICS rows_updated = ROW_COUNT;
RAISE NOTICE 'Updated % rows', rows_updated;
EXIT WHEN rows_updated = 0;
COMMIT;
PERFORM pg_sleep(0.1); -- brief pause to reduce load
END LOOP;
END $$;The key details: FOR UPDATE SKIP LOCKED ensures the backfill does not block application queries on the same rows. Batching with a sleep interval keeps replication lag and WAL generation under control.
Phase 3: Contract
After your application code no longer references users.name anywhere (verified via query logging or pg_stat_statements):
-- Drop the sync trigger
DROP TRIGGER trg_sync_user_name ON users;
DROP FUNCTION sync_user_name();
-- Drop the old column
ALTER TABLE users DROP COLUMN name;Blue-Green Database Strategy
For truly large structural changes (splitting a table, changing a primary key type from INT to UUID), the expand-contract approach at the column level may not suffice. Instead, use a blue-green database strategy powered by logical replication.
- Provision a "green" replica with the new schema.
- Set up pg_logical replication from the "blue" (current) database to the "green" one, applying a transform function that maps old rows into the new schema.
- Validate data parity using checksums or row-count comparisons.
- Switch application traffic to the green database. Connection strings are managed via PgBouncer or a DNS CNAME flip.
- Decommission the blue database after a bake-in period.
-- On the BLUE (source) database
CREATE PUBLICATION blue_pub FOR TABLE users, orders, products;
-- On the GREEN (target) database with new schema
CREATE SUBSCRIPTION green_sub
CONNECTION 'host=blue-db port=5432 dbname=app user=replicator'
PUBLICATION blue_pub
WITH (copy_data = true, synchronous_commit = off);Avoiding Common Lock Pitfalls
PostgreSQL acquires different lock levels for different DDL statements. Here is a quick reference for the most common operations:
| Operation | Lock Level | Safe? | |---|---|---| | ADD COLUMN (nullable, no default) | ACCESS EXCLUSIVE (instant) | Yes | | ADD COLUMN with DEFAULT (PG 11+) | ACCESS EXCLUSIVE (instant) | Yes | | ADD COLUMN NOT NULL (no default) | ACCESS EXCLUSIVE (full rewrite) | No | | CREATE INDEX | SHARE (blocks writes) | No | | CREATE INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | Yes | | DROP COLUMN | ACCESS EXCLUSIVE (instant) | Yes* |
DROP COLUMN only marks the column as invisible in the catalog; it does not rewrite the table. However, subsequent VACUUM FULL will reclaim the space.
Setting a Lock Timeout
Always set a lock_timeout before running DDL in production. If the lock cannot be acquired within the timeout, the migration fails fast instead of queuing behind long-running queries and causing a cascading pileup.
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN tracking_number TEXT;
-- If the lock isn't acquired within 5 seconds, this fails
-- and you can retry during a quieter window.Tooling Recommendations
- pgroll -- Xata's open-source tool that automates the expand-contract pattern with version-aware views.
- reshape -- Another zero-downtime migration tool that creates shadow columns and triggers automatically.
- pg-osc -- Stripe's online schema change tool, inspired by MySQL's pt-online-schema-change.
- sqitch -- Dependency-aware migration manager that works well with the phased approach.
Key Takeaways
- Never run destructive DDL in a single step on a live database.
- Split every migration into expand, migrate, and contract phases.
- Always set
lock_timeoutandstatement_timeoutin production. - Use
CREATE INDEX CONCURRENTLY-- never the blocking variant. - Batch backfills with
SKIP LOCKEDto avoid contention. - For major schema overhauls, consider logical replication and a blue-green database cutover.