Production Database Migration Guide - sgajbi/portfolio-analytics-system GitHub Wiki

Overview

Database migrations in production must be safe, tested, and reversible. This guide outlines:

  • Pre-checks before migration.
  • Applying migrations with minimal downtime.
  • Rollback and recovery strategy.

Migration Philosophy

  • Forward-only migrations in production (downgrades are for dev/staging only).
  • Zero downtime goal — migrations should not block event processing for long periods.
  • Pre-tested migrations — all changes must be validated in staging with production-like data before live deployment.

1️⃣ Pre-Migration Checklist

Code & Migration Readiness

  • Confirm migration script exists in alembic/versions/ with descriptive message.
  • Validate migration correctness by running in staging.
  • Check models match DB schema after applying migration.

Data Safety

  • Ensure database backups are taken:
pg_dump -Fc -h <prod-host> -U <user> <db-name> > backup_before_migration.dump

Communication

  • Notify:

    • DevOps team
    • Application owners
    • Operations team (Ops will monitor during migration)
  • Schedule maintenance window if downtime is expected.


2️⃣ Applying the Migration in Production

Step 1: Put Services in Safe Mode

  • For schema changes that affect write operations:

    • Pause Kafka consumers (to avoid writes during migration).
    • Disable CI/CD auto-deploy temporarily.

Step 2: Run Migration

  • Execute Alembic upgrade:
alembic upgrade head

Step 3: Validate Schema

  • Check table definitions:
\d processed_events;
  • Confirm new columns/indexes exist.
  • Check existing records for integrity.

Step 4: Resume Services

  • Restart Kafka consumers.
  • Re-enable CI/CD deployments.

3️⃣ Post-Migration Verification

Functional Validation

  • Trigger a test transaction in ingestion.
  • Confirm correlation ID and idempotent processing still work.
  • Query API to confirm results:
curl -H "X-Correlation-ID: TEST:uuid" https://<api>/portfolios/1001/positions

Operational Validation

  • Monitor logs in Splunk for errors.
  • Check Kafka consumer lag.
  • Verify no increase in DB error rates.

4️⃣ Rollback Strategy

Direct downgrade is discouraged in production. Instead:

  1. Apply forward-only fix migration if rollback is needed.

  2. If data corruption occurs:

    • Restore from backup.
    • Redeploy previous stable application version.

Best Practices

  • Group non-breaking changes (adding columns, indexes) separately from breaking changes (renaming/dropping columns).
  • Use default values + nullable columns for safe schema evolution.
  • Large data changes (backfills) should be handled via batch jobs, not Alembic migrations.

⚠️ **GitHub.com Fallback** ⚠️