MIGRATION SAFETY GUIDE - nself-org/nchat GitHub Wiki
Task 139: Data migration and rollback rehearsals
This guide provides comprehensive procedures for safely executing database migrations in production environments. All migrations must be tested, reviewed, and executed following these protocols.
- Pre-Migration Checklist
- Migration Execution Procedure
- Rollback Procedures
- Zero-Downtime Migrations
- Data Integrity Verification
- Emergency Procedures
- Post-Migration Checklist
- Migration tested in local development environment
- Unit tests pass for affected database interactions
- Integration tests pass
- Rollback script generated and tested
- Performance impact assessed
- Migration applied to staging environment
- Application functions correctly after migration
- Rollback tested successfully in staging
- Data integrity verified post-migration
- Performance metrics collected
- Migration complexity assessed (Low/Medium/High)
- Breaking changes identified
- Affected tables and relationships documented
- Downtime requirement estimated
- Rollback plan documented
- Stakeholders notified of planned migration
- Maintenance window scheduled (if needed)
- Rollback decision tree prepared
- Support team briefed
- Full database backup completed
- Backup integrity verified
- Backup restoration tested
- Backup retention policy confirmed
- Point-in-time recovery (PITR) enabled
- Application code compatible with old schema (if zero-downtime)
- Application code compatible with new schema
- Feature flags configured (if needed)
- Deployment order documented
# 1. Create backup
pg_dump -Fc nself_production > backup_$(date +%Y%m%d_%H%M%S).dump
# 2. Verify backup
pg_restore --list backup_*.dump | wc -l
# 3. Put application in maintenance mode
kubectl scale deployment nself-chat --replicas=0
# 4. Wait for all connections to close
SELECT count(*) FROM pg_stat_activity WHERE datname = 'nself_production';
# 5. Apply migration
psql nself_production < migrations/XXX_migration.sql
# 6. Verify migration
psql nself_production -c "SELECT * FROM information_schema.tables WHERE table_name = 'new_table';"
# 7. Run data integrity checks
npm run migration:verify
# 8. Restore application
kubectl scale deployment nself-chat --replicas=3
# 9. Monitor application logs
kubectl logs -f deployment/nself-chatPhase 1: Additive Changes (No Downtime)
-- Add new columns as nullable
ALTER TABLE users ADD COLUMN new_field TEXT NULL;
-- Create new tables
CREATE TABLE new_feature_table (...);
-- Create new indexes concurrently
CREATE INDEX CONCURRENTLY idx_users_new_field ON users(new_field);Phase 2: Data Migration (Background)
-- Migrate data in batches
UPDATE users
SET new_field = legacy_field
WHERE new_field IS NULL
AND id >= $1 AND id < $2;
-- Use advisory locks to prevent conflicts
SELECT pg_advisory_lock(123456);
-- ... perform migration ...
SELECT pg_advisory_unlock(123456);Phase 3: Application Deployment
# Deploy new application version that uses new schema
kubectl set image deployment/nself-chat nself-chat=nself-chat:v2.0.0
# Monitor rollout
kubectl rollout status deployment/nself-chatPhase 4: Cleanup (After Verification)
-- Make columns NOT NULL if needed
ALTER TABLE users ALTER COLUMN new_field SET NOT NULL;
-- Drop old columns
ALTER TABLE users DROP COLUMN legacy_field;
-- Drop old indexes
DROP INDEX idx_users_legacy_field;Execute rollback immediately if:
- Migration fails with errors
- Application health checks fail after migration
- Data integrity checks fail
- Error rate exceeds 5% in first 10 minutes
- Performance degrades by >50%
# 1. Put application in maintenance mode (if not already)
kubectl scale deployment nself-chat --replicas=0
# 2. Execute rollback script
psql nself_production < migrations/XXX_migration.rollback.sql
# 3. If rollback fails, restore from backup
pg_restore -d nself_production backup_*.dump
# 4. Verify database state
npm run migration:verify
# 5. Restore previous application version
kubectl set image deployment/nself-chat nself-chat=nself-chat:v1.9.0
# 6. Restore application
kubectl scale deployment nself-chat --replicas=3
# 7. Monitor and verify
kubectl logs -f deployment/nself-chatMigration Applied
├─ Errors during migration?
│ └─ YES → Rollback immediately
└─ NO → Continue
├─ Application starts successfully?
│ ├─ NO → Rollback immediately
│ └─ YES → Continue
│ ├─ Health checks passing?
│ │ ├─ NO → Rollback immediately
│ │ └─ YES → Continue
│ │ ├─ Error rate < 5%?
│ │ │ ├─ NO → Rollback immediately
│ │ │ └─ YES → Continue
│ │ │ └─ Monitor for 1 hour
│ │ │ ├─ Issues detected? → Rollback
│ │ │ └─ All clear → Success!
Safe Changes (Zero Downtime)
- Adding nullable columns
- Creating new tables
- Creating new indexes (CONCURRENTLY)
- Adding CHECK constraints (NOT VALID, then VALIDATE)
- Adding foreign keys (NOT VALID, then VALIDATE)
Example: Adding a Column
-- Phase 1: Add column as nullable
ALTER TABLE users ADD COLUMN email_verified BOOLEAN NULL DEFAULT false;
-- Application can now be deployed (handles both null and boolean)
-- Phase 2: Backfill data
UPDATE users SET email_verified = true WHERE email_confirmed_at IS NOT NULL;
-- Phase 3: Make NOT NULL
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;Example: Adding a Foreign Key
-- Phase 1: Add FK as NOT VALID (doesn't check existing data)
ALTER TABLE messages
ADD CONSTRAINT fk_messages_user_id
FOREIGN KEY (user_id) REFERENCES users(id)
NOT VALID;
-- Phase 2: Fix any invalid data
UPDATE messages m
SET user_id = (SELECT id FROM users WHERE username = 'deleted_user' LIMIT 1)
WHERE NOT EXISTS (SELECT 1 FROM users u WHERE u.id = m.user_id);
-- Phase 3: Validate constraint
ALTER TABLE messages VALIDATE CONSTRAINT fk_messages_user_id;Requires Downtime or Multi-Phase
- Dropping columns (unless unused)
- Renaming columns
- Changing column types
- Adding NOT NULL constraints to existing columns
- Dropping tables
Multi-Phase Approach for Renaming
-- Phase 1: Add new column
ALTER TABLE users ADD COLUMN display_name TEXT NULL;
-- Phase 2: Dual-write (application writes to both)
-- Deploy app version that writes to both columns
-- Phase 3: Backfill
UPDATE users SET display_name = full_name WHERE display_name IS NULL;
-- Phase 4: Switch reads
-- Deploy app version that reads from display_name
-- Phase 5: Remove old column
ALTER TABLE users DROP COLUMN full_name;-- Check foreign key constraints
SELECT
conname,
conrelid::regclass AS table_name,
confrelid::regclass AS foreign_table,
pg_get_constraintdef(oid) AS constraint_def
FROM pg_constraint
WHERE contype = 'f'
AND connamespace::regnamespace::text NOT IN ('pg_catalog', 'information_schema')
ORDER BY conrelid::regclass::text, conname;
-- Check for orphaned records
SELECT
'messages' AS table_name,
COUNT(*) AS orphaned_count
FROM nchat.nchat_messages m
WHERE NOT EXISTS (
SELECT 1 FROM nchat.nchat_users u WHERE u.id = m.user_id
);
-- Check unique constraints
SELECT
table_name,
constraint_name,
constraint_type
FROM information_schema.table_constraints
WHERE constraint_type = 'UNIQUE'
AND table_schema = 'nchat';
-- Check NOT NULL constraints
SELECT
table_name,
column_name,
is_nullable
FROM information_schema.columns
WHERE table_schema = 'nchat'
AND is_nullable = 'NO'
ORDER BY table_name, column_name;
-- Check index validity
SELECT
schemaname,
tablename,
indexname,
indexdef
FROM pg_indexes
WHERE schemaname = 'nchat'
AND indexname NOT LIKE 'pg_%'
ORDER BY tablename, indexname;# Run integrity check script
npm run migration:verify
# Run integration tests against production database
npm run test:integration:production
# Check data counts
psql -c "SELECT 'users' AS table, COUNT(*) FROM nchat.nchat_users
UNION ALL
SELECT 'messages', COUNT(*) FROM nchat.nchat_messages
UNION ALL
SELECT 'channels', COUNT(*) FROM nchat.nchat_channels;"Symptoms:
- Database errors preventing application startup
- Data corruption detected
- Complete service outage
Response:
# 1. IMMEDIATE: Scale down application
kubectl scale deployment nself-chat --replicas=0
# 2. Assess database state
psql nself_production -c "SELECT version();"
psql nself_production -c "SELECT COUNT(*) FROM nchat.nchat_users;"
# 3. Attempt rollback
psql nself_production < migrations/XXX_migration.rollback.sql
# 4. If rollback fails, restore from backup
pg_restore -d nself_production -c backup_latest.dump
# 5. Verify restoration
npm run migration:verify
# 6. Restore previous app version
kubectl set image deployment/nself-chat nself-chat=nself-chat:previous
# 7. Scale up
kubectl scale deployment nself-chat --replicas=3
# 8. Notify stakeholders
# Post incident report with timelineSymptoms:
- Some tables/indexes created, others failed
- Application partially functional
- Some features broken
Response:
# 1. Document current state
psql nself_production -c "\dt nchat.*" > migration_state.txt
psql nself_production -c "\di nchat.*" >> migration_state.txt
# 2. Analyze partial migration
diff migrations/XXX_migration.sql migration_state.txt
# 3. Create custom rollback for partial state
# (Manual SQL based on what was applied)
# 4. Execute custom rollback
psql nself_production < custom_rollback.sql
# 5. Verify and restore
npm run migration:verify- Application health checks passing
- Error rate within normal range (<1%)
- Response times within SLA
- Database connection pool stable
- No deadlocks or lock timeouts
- Critical user flows tested manually
- Performance metrics compared to baseline
- Database size/growth monitored
- Query performance analyzed
- User-reported issues tracked
- Automated tests passing
- Backup schedule resumed
- No regression detected in analytics
- Resource utilization stable
- Cost impact assessed
- Documentation updated
- Lessons learned documented
- Rollback scripts archived
- Adding nullable columns
- Creating new tables (unused initially)
- Creating indexes on small tables (<1M rows)
- Adding comments
- Creating views
Risk Level: 🟢 Low Downtime: None Testing Required: Development + Staging
- Adding NOT NULL columns with defaults
- Creating indexes on large tables (>1M rows)
- Adding foreign keys
- Modifying non-critical functions
- Batch data updates
Risk Level: 🟡 Medium Downtime: 0-5 minutes Testing Required: Development + Staging + Performance testing
- Dropping columns with data
- Renaming columns
- Changing column types
- Dropping tables
- Major schema restructuring
- Large data migrations
Risk Level: 🔴 High Downtime: 5-60 minutes Testing Required: Development + Staging + Load testing + Rehearsal
# Database connections
SELECT count(*) FROM pg_stat_activity;
# Long-running queries
SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start ASC;
# Lock waits
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.usename AS blocked_user,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
# Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'nchat'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;# Health endpoint
curl -f http://localhost:3000/api/health
# Error rates (from logs)
kubectl logs -l app=nself-chat --tail=1000 | grep -i error | wc -l
# Response times
kubectl logs -l app=nself-chat --tail=1000 | grep "response_time" | awk '{sum+=$NF; count++} END {print sum/count}'# Generate rollback scripts
npm run migration:generate-rollbacks
# Test migrations in isolated environment
npm run migration:test
# Test specific migration
npm run migration:test -- 051_user_settings_tables.sql
# Performance test
npm run migration:test:perf- Forward migration succeeds
- Rollback migration succeeds
- Data integrity maintained
- Application functions with new schema
- Application functions after rollback
- Performance acceptable (<30s for migration)
- No orphaned data
- All constraints valid
-- Safe: nullable with default
ALTER TABLE nchat.nchat_users
ADD COLUMN feature_enabled BOOLEAN NULL DEFAULT false;
-- Make NOT NULL after backfill
ALTER TABLE nchat.nchat_users
ALTER COLUMN feature_enabled SET NOT NULL;-- Phase 1: Add new columns
ALTER TABLE users ADD COLUMN first_name TEXT NULL;
ALTER TABLE users ADD COLUMN last_name TEXT NULL;
-- Phase 2: Backfill
UPDATE users
SET
first_name = split_part(full_name, ' ', 1),
last_name = split_part(full_name, ' ', 2)
WHERE first_name IS NULL;
-- Phase 3: Drop old column
ALTER TABLE users DROP COLUMN full_name;-- Use CONCURRENTLY to avoid blocking writes
CREATE INDEX CONCURRENTLY idx_messages_created_at
ON nchat.nchat_messages(created_at DESC);
-- Check index is valid
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE indexname = 'idx_messages_created_at';-- Rollback for XXX_migration_name.sql
-- Generated: YYYY-MM-DD
-- Complexity: [LOW|MEDIUM|HIGH]
-- Estimated Duration: XX minutes
-- WARNING: Test this rollback script thoroughly before use!
BEGIN;
-- Create backup of affected tables (if needed)
CREATE TABLE backup_table_name AS SELECT * FROM original_table;
-- Revert changes in reverse order
-- 1. Drop triggers
DROP TRIGGER IF EXISTS trigger_name ON table_name CASCADE;
-- 2. Drop views
DROP VIEW IF EXISTS view_name CASCADE;
-- 3. Revert table alterations
ALTER TABLE table_name DROP COLUMN IF EXISTS new_column CASCADE;
ALTER TABLE table_name DROP CONSTRAINT IF EXISTS constraint_name CASCADE;
-- 4. Drop indexes
DROP INDEX IF EXISTS index_name CASCADE;
-- 5. Drop functions
DROP FUNCTION IF EXISTS function_name CASCADE;
-- 6. Drop tables (WARNING: This deletes data!)
DROP TABLE IF EXISTS table_name CASCADE;
-- Verify rollback
SELECT COUNT(*) FROM original_table;
COMMIT;
-- Rollback complete
-- Next steps:
-- 1. Verify application functionality
-- 2. Check data integrity
-- 3. Monitor error rates- Approval Required: No
- Review Required: No
- Documentation: Optional
- Approval Required: Technical Lead
- Review Required: Peer review
- Documentation: Migration plan
- Approval Required: Technical Lead
- Review Required: Peer review + QA
- Documentation: Migration plan + rollback script
- Approval Required: Technical Lead + Product Owner
- Review Required: Peer review + QA + Load testing
- Documentation: Full migration playbook
- Approval Required: Technical Lead + Product Owner + CTO
- Review Required: Architecture review + Security review
- Documentation: Full migration playbook + rehearsal report + stakeholder communication
-
scripts/test-migrations.ts- Automated migration testing framework -
scripts/generate-rollbacks.ts- Automatic rollback script generation -
scripts/verify-integrity.ts- Database integrity verification
# Test all migrations
npm run migration:test
# Generate rollback scripts
npm run migration:generate-rollbacks
# Verify data integrity
npm run migration:verify
# Performance test
npm run migration:test:perfP0 - Critical
- Complete service outage
- Data corruption
- Security breach
- Response Time: Immediate
- Rollback Decision: Automatic
P1 - High
- Major feature broken
- Performance degradation >50%
- Error rate >10%
- Response Time: 15 minutes
- Rollback Decision: After 30 min assessment
P2 - Medium
- Minor feature broken
- Performance degradation 25-50%
- Error rate 5-10%
- Response Time: 1 hour
- Rollback Decision: After 2 hour assessment
P3 - Low
- Cosmetic issues
- Performance degradation <25%
- Error rate <5%
- Response Time: Next business day
- Rollback Decision: Not required
✅ Test migrations in development first ✅ Generate and test rollback scripts ✅ Use transactions for atomic changes ✅ Create database backups before migrations ✅ Monitor application after deployment ✅ Document all changes ✅ Use feature flags for risky changes ✅ Communicate with stakeholders ✅ Verify data integrity post-migration ✅ Keep migrations small and focused
❌ Skip testing in staging ❌ Apply migrations during peak hours (unless zero-downtime) ❌ Drop columns without backup ❌ Make multiple unrelated changes in one migration ❌ Assume rollback will always work ❌ Deploy code before database migration ❌ Ignore performance impact on large tables ❌ Forget to update documentation ❌ Rush through post-migration verification ❌ Skip the rollback rehearsal
- Technical Lead: [Contact info]
- DevOps Team: [Contact info]
- On-Call Engineer: [On-call rotation]
- On-Call Engineer (immediate)
- Technical Lead (if unresolved in 15 min)
- CTO (if unresolved in 1 hour)
-- 054_add_message_bookmarks.sql
-- Add bookmarking feature to messages
-- Complexity: LOW
-- Estimated Duration: 30 seconds
BEGIN;
-- Create bookmarks table
CREATE TABLE IF NOT EXISTS nchat.nchat_message_bookmarks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
message_id UUID NOT NULL REFERENCES nchat.nchat_messages(id) ON DELETE CASCADE,
user_id UUID NOT NULL REFERENCES nchat.nchat_users(id) ON DELETE CASCADE,
notes TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(message_id, user_id)
);
-- Create indexes
CREATE INDEX idx_message_bookmarks_user
ON nchat.nchat_message_bookmarks(user_id, created_at DESC);
CREATE INDEX idx_message_bookmarks_message
ON nchat.nchat_message_bookmarks(message_id);
-- Add comment
COMMENT ON TABLE nchat.nchat_message_bookmarks IS 'User bookmarks for messages';
COMMIT;-- 054_add_message_bookmarks.rollback.sql
-- Rollback for message bookmarking feature
-- Generated: 2026-02-09
BEGIN;
-- Drop indexes
DROP INDEX IF EXISTS nchat.idx_message_bookmarks_user CASCADE;
DROP INDEX IF EXISTS nchat.idx_message_bookmarks_message CASCADE;
-- Drop table
DROP TABLE IF EXISTS nchat.nchat_message_bookmarks CASCADE;
COMMIT;- v1.0.0 (2026-02-09): Initial version
- Task 139: Data migration and rollback rehearsals
- PostgreSQL Documentation: https://www.postgresql.org/docs/
- Zero-Downtime Migrations: https://fly.io/ruby-dispatch/zero-downtime-postgres-migrations/
- nself CLI Documentation: ~/Sites/nself/README.md
- Project Database Schema: .backend/migrations/