Zero-Downtime Database Migrations

Running schema migrations on a production PostgreSQL database that serves thousands of requests per second is one of the most nerve-wracking tasks in backend engineering. A single ALTER TABLE ... ADD COLUMN ... NOT NULL can lock your table for minutes, cascade into connection pool exhaustion, and bring your entire platform down. This article covers battle-tested strategies for executing migrations with zero user-facing impact.

The Expand-Contract Pattern

The expand-contract pattern (sometimes called parallel change) splits every destructive migration into three safe, independently deployable phases:

Expand -- Add the new structure alongside the old one. Both coexist.
Migrate -- Backfill data, switch application reads/writes to the new structure.
Contract -- Drop the old structure once nothing references it.

Each phase is a separate deployment. If anything goes wrong, you roll back only that phase, not the entire migration.

Example: Renaming a Column

Suppose you need to rename users.name to users.full_name. A naive ALTER TABLE users RENAME COLUMN takes an ACCESS EXCLUSIVE lock -- the most restrictive lock PostgreSQL offers.

Phase 1: Expand

-- Migration V1: Add the new column (takes ACCESS EXCLUSIVE but instant on PG 11+)
ALTER TABLE users ADD COLUMN full_name TEXT;
 
-- Create a trigger to keep both columns in sync
CREATE OR REPLACE FUNCTION sync_user_name() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' OR NEW.name IS DISTINCT FROM OLD.name THEN
        NEW.full_name := NEW.name;
    END IF;
    IF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
        NEW.name := NEW.full_name;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER trg_sync_user_name
    BEFORE INSERT OR UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION sync_user_name();

Phase 2: Migrate

-- Backfill in batches to avoid long-running transactions
DO $$
DECLARE
    batch_size INT := 10000;
    rows_updated INT;
BEGIN
    LOOP
        UPDATE users
        SET full_name = name
        WHERE full_name IS NULL
          AND id IN (
              SELECT id FROM users
              WHERE full_name IS NULL
              LIMIT batch_size
              FOR UPDATE SKIP LOCKED
          );
 
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        RAISE NOTICE 'Updated % rows', rows_updated;
        EXIT WHEN rows_updated = 0;
 
        COMMIT;
        PERFORM pg_sleep(0.1);  -- brief pause to reduce load
    END LOOP;
END $$;

The key details: FOR UPDATE SKIP LOCKED ensures the backfill does not block application queries on the same rows. Batching with a sleep interval keeps replication lag and WAL generation under control.

Phase 3: Contract

After your application code no longer references users.name anywhere (verified via query logging or pg_stat_statements):

-- Drop the sync trigger
DROP TRIGGER trg_sync_user_name ON users;
DROP FUNCTION sync_user_name();
 
-- Drop the old column
ALTER TABLE users DROP COLUMN name;

Blue-Green Database Strategy

For truly large structural changes (splitting a table, changing a primary key type from INT to UUID), the expand-contract approach at the column level may not suffice. Instead, use a blue-green database strategy powered by logical replication.

Provision a "green" replica with the new schema.
Set up pg_logical replication from the "blue" (current) database to the "green" one, applying a transform function that maps old rows into the new schema.
Validate data parity using checksums or row-count comparisons.
Switch application traffic to the green database. Connection strings are managed via PgBouncer or a DNS CNAME flip.
Decommission the blue database after a bake-in period.

-- On the BLUE (source) database
CREATE PUBLICATION blue_pub FOR TABLE users, orders, products;
 
-- On the GREEN (target) database with new schema
CREATE SUBSCRIPTION green_sub
    CONNECTION 'host=blue-db port=5432 dbname=app user=replicator'
    PUBLICATION blue_pub
    WITH (copy_data = true, synchronous_commit = off);

Avoiding Common Lock Pitfalls

PostgreSQL acquires different lock levels for different DDL statements. Here is a quick reference for the most common operations:

| Operation | Lock Level | Safe? | |---|---|---| | ADD COLUMN (nullable, no default) | ACCESS EXCLUSIVE (instant) | Yes | | ADD COLUMN with DEFAULT (PG 11+) | ACCESS EXCLUSIVE (instant) | Yes | | ADD COLUMN NOT NULL (no default) | ACCESS EXCLUSIVE (full rewrite) | No | | CREATE INDEX | SHARE (blocks writes) | No | | CREATE INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | Yes | | DROP COLUMN | ACCESS EXCLUSIVE (instant) | Yes* |

DROP COLUMN only marks the column as invisible in the catalog; it does not rewrite the table. However, subsequent VACUUM FULL will reclaim the space.

Setting a Lock Timeout

Always set a lock_timeout before running DDL in production. If the lock cannot be acquired within the timeout, the migration fails fast instead of queuing behind long-running queries and causing a cascading pileup.

SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN tracking_number TEXT;
-- If the lock isn't acquired within 5 seconds, this fails
-- and you can retry during a quieter window.

Tooling Recommendations

pgroll -- Xata's open-source tool that automates the expand-contract pattern with version-aware views.
reshape -- Another zero-downtime migration tool that creates shadow columns and triggers automatically.
pg-osc -- Stripe's online schema change tool, inspired by MySQL's pt-online-schema-change.
sqitch -- Dependency-aware migration manager that works well with the phased approach.

Key Takeaways

Never run destructive DDL in a single step on a live database.
Split every migration into expand, migrate, and contract phases.
Always set lock_timeout and statement_timeout in production.
Use CREATE INDEX CONCURRENTLY -- never the blocking variant.
Batch backfills with SKIP LOCKED to avoid contention.
For major schema overhauls, consider logical replication and a blue-green database cutover.