Safe Migrations

Problem / Context

You need to add columns, indexes, and constraints to production tables that serve live traffic. The goal is to change the schema safely with minimal blocking, clear rollback or roll‑forward options, and good communication.

Core Concept

Prefer two-phase changes: declare fast metadata first, move data gradually, validate last.
Avoid long ACCESS EXCLUSIVE locks on hot tables; use CONCURRENTLY and NOT VALID where possible.
Batch backfills, watch locks and replication lag, and be ready to roll forward instead of rolling back.

Goal: Apply DDL in production with minimal locks, clear rollback or roll-forward, and good communication.

1. Core Principles

Small steps beat giant scripts.
Wrap changes in a transaction when allowed (not for CREATE INDEX CONCURRENTLY, some ALTER TYPE).
Avoid long ACCESS EXCLUSIVE locks on hot tables.
Be able to pause or roll forward. Rolling back large partial changes is risky.

2. Standard Playbook

Plan (list risk ops: type changes, defaulted columns, big table rewrites)
Test on staging clone (row counts similar)
Backup or confirm PITR window healthy
Communicate window + fallback
Run migration tool (Flyway, Sqitch, Rails, etc.)
Monitor: locks, errors, replication lag
Validate post-change (schema, sample queries)

3. Change Catalog (where to find details)

This chapter is the high-level playbook. Detailed, copyable SQL lives in focused chapters:

Add/rename/drop columns, backfills, SELECT-list safety → see Safe Column Changes
Table-level ops (rename table, split hot/cold, lock watching) → see Safe Table Changes
Type evolution (enums, int→bigint, precision) → see Safe Type Changes
Constraints (NOT VALID/VALIDATE, UNIQUE via index, NOT NULL safely) → see Constraints
Roles & users for migration separation of duties → see Roles & Users

Quick references

Indexes on hot tables: CREATE INDEX CONCURRENTLY (details in Safe Table Changes)
Foreign keys on existing data: ADD … NOT VALID then VALIDATE (details in Constraints)
Large type changes: shadow column + swap (details in Safe Type Changes)
Backfills: PK-ordered batches with small commits (details in Safe Column Changes)

4. Rollback vs Roll Forward

DDL rollback can be messy. Prefer forward fixes: create a missing index, adjust a new column, etc. Keep a quick revert script only for trivial additions if disaster strikes early.

5. Monitoring During Migration

Watch:

pg_stat_activity for waiting queries
pg_locks for long ACCESS EXCLUSIVE
replication lag (if streaming replicas) to catch cascading delay

6. Checklist Before You Run

Tested on realistic data size?
Backup / PITR window verified?
Scripts idempotent (re-run safe)?
Long rewrites split into shadow columns?
FKs / CHECK added with NOT VALID when large?
Plan to backfill in batches documented?
Communication + monitoring in place?

7. Common Pitfalls

Mistake	Pain
Adding column WITH default on huge table (old versions)	Full table rewrite & lock
Single transaction with many long steps	Lock pileups, hard rollback
No batch size limit in backfill	Long running transaction, bloat
Dropping index before verifying replacement	Sudden query regression

Start minimal, measure, iterate. Safety is structure + patience.

8. Variations and Trade‑Offs

Tooling: Framework migrations vs. declarative tools (Sqitch/Flyway). Declarative reduces drift; framework integrates with app deploys.
Zero-downtime app changes: Write‑through compatibility (support both schemas for a deploy), then cut over.
Blue/green vs in‑place: Blue/green isolates risk but costs infra; in‑place is cheaper but needs careful lock control.

9. Pitfalls

Adding defaults on very large tables in older Postgres versions triggers a table rewrite and long lock.
Running CREATE INDEX (non‑CONCURRENTLY) on hot tables blocks writes.
Single huge UPDATE backfills cause long transactions and bloat; always batch.

10. Recap (Short Summary)

Declare structure fast, migrate data gradually, validate correctness later. Use CONCURRENTLY and NOT VALID to keep locks short, and favor roll‑forward fixes.

11. Optional Exercises

Add a NULLable column on a big table, backfill in batches, then set NOT NULL and DEFAULT.
Build a UNIQUE constraint by creating the index CONCURRENTLY and attaching it.
Add a FOREIGN KEY with NOT VALID, fix violations, then VALIDATE in a quiet window.

12. Summary (Cheat Sheet)

Planning: Classify risky ops (rewrite, long scan) upfront. Impact: none. Mitigation: avoid surprise long locks. Tip: estimate sizes with pg_relation_size.
Adding Column: Add NULLable without default; batch backfill; add default + NOT NULL last. Impact: metadata then short locks. Risk: table rewrite/long lock (older versions). Tip: PG 11+ skips rewrite for constant defaults.
Backfill: Use small ordered batches (PK range/LIMIT) with commits. Impact: short transactions. Risk: bloat, autovacuum pressure. Tip: stop when 0 rows updated.
Creating Index: Use CREATE INDEX CONCURRENTLY on hot/big tables. Impact: minimal blocking; two scans. Risk: read/write blocking avoided. Tip: not inside a transaction.
Unique Constraint: Build unique index CONCURRENTLY then attach constraint. Impact: same as above. Risk: avoid long ACCESS EXCLUSIVE lock.
Foreign Key: Add NOT VALID then VALIDATE later. Impact: fast add; later read-only scan. Risk: long blocking validation avoided. Tip: validate in low-traffic window.
Type Change (Large): Shadow column + batch copy + swap. Impact: bounded locks. Risk: table-wide rewrite avoided. Tip: keep old column until confident.
Monitoring: Watch pg_stat_activity, pg_locks, replication lag. Impact: early detection. Risk: silent lock pileups.
Rollback Strategy: Prefer roll forward fixes. Impact: avoids complex undo. Risk: half-applied states mitigated. Tip: keep tiny early revert script only.
Checklist Use: Run pre-flight checklist before executing. Impact: none. Risk: missing safety steps. Tip: automate where possible.
Pitfalls: WITH DEFAULT on huge table (older PG) → split add + default later; massive single UPDATE → batch + commit; no validation of new constraint → schedule VALIDATE and track.

Key Principle: Declare structure fast (metadata), migrate data gradually, validate correctness late.

References

CREATE INDEX CONCURRENTLY: https://www.postgresql.org/docs/current/sql-createindex.html#createindex-concurrently
VALIDATE CONSTRAINT: https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-VALIDATE-CONSTRAINT
Explicit locking overview: https://www.postgresql.org/docs/current/explicit-locking.html
Active sessions (pg_stat_activity): https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY
Locks view (pg_locks): https://www.postgresql.org/docs/current/view-pg-locks.html
Replication lag (pg_stat_replication): https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION