DB Migration Checklist - sul-dlss/preservation_catalog GitHub Wiki

Checklist for doing a Database Migration

Suggestion: "edit" this page and copy the raw text to an issue comment on YOUR migration issue or PR to create a checklist specific to your migration.

  • have a clean rollback path with the migration (e.g. the database migration can be rolled back, plus any additional steps needed)

First try it on stage or QA, including the rollback

  • dry run on QA
    • pause Google Books
    • watch Honeybadger for errors
    • stop pres robots, pres cat workers, and pres cat web servers
      • pres cat workers (from preservation_catalog project): bundle exec cap qa sidekiq_systemd:stop
      • pres robots workers (from preservation_robots project): bundle exec cap qa resque:pool:stop
      • pres cat web servers (ksued to root on each VM): service apache2 stop <-- this is "just in case", as only the Nagios OkComputer check should be querying prescat at this point
    • take backup snapshot
      • from DB server, ksued to root: su - postgres -c "pg_dump -Fc pres > ~/useful_title_premigration_snapshot_$(date --iso-8601=seconds).sql"
      • will produce a custom-format archive (despite the .sql extension) at ~postgres/'useful_title_premigration_snapshot_2022-11-30T17:51:57-08:00.sql'
    • start pres cat web servers (ksued to root from each VM): service apache2 start
    • deploy migration with cap (bundle exec cap qa deploy from preservation_catalog, your-migration-branch)
    • start pres robots (deployment should've restarted pres cat workers automatically)
      • start pres robots workers (from preservation_robots project): bundle exec cap qa resque:pool:hot_swap
    • test the migration
      • run preassembly_image_accessioning_spec.rb and preassembly_hfs_accessioning_spec.rb integration tests
        • SDR_ENV=qa bundle exec rspec spec/features/preassembly_image_accessioning_spec.rb spec/features/preassembly_hfs_accessioning_spec.rb
        • note the druids
      • on the test objects created by those runs, run checksum validation (CV) and the replication audit, and manually examine the Moabs in the storage roots (a quick listing to make sure they look right)
    • now revert and test the rollback, too!
    • stop pres robots, pres cat workers, and pres cat web servers
      • pres cat workers (from preservation_catalog project): bundle exec cap qa sidekiq_systemd:stop
      • pres robots workers (from preservation_robots project): bundle exec cap qa resque:pool:stop
      • pres cat web servers (ksued to root from each VM): service apache2 stop
    • restore the DB from snapshot
      • from a pres qa application server, drop the database
      • from DB server, ksued to root: re-apply the puppet production branch so that the pres DB is properly recreated (see here)
      • from DB server, ksued to root: su - postgres -c "pg_restore -d pres ~postgres/'useful_title_premigration_snapshot_2022-11-30T17:51:57-08:00.sql'"
    • re-deploy main (bundle exec cap qa deploy from preservation_catalog, main branch). Capistrano may complain about Passenger not running, but the deploy should otherwise be fine, and the symlink to the latest deployment should remain in place.
    • if the deployment didn't do so, re-start pres cat web servers (ksued to root from each VM): service apache2 start
    • if the deployment didn't do so, start pres cat workers: bundle exec cap qa sidekiq_systemd:restart from the preservation_catalog project.
    • start pres robots workers (bundle exec cap qa resque:pool:hot_swap from preservation_robots project)
    • start Google Books again
    • run CatalogUtils.check_existence_for_druid on each of the two test druids from above. Both should be added to the catalog (yes, even after the db:rollback).
    • run the same two integration tests against QA and make sure they pass
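
A couple of the steps above call for a quick manual listing of the Moabs on the storage roots. A minimal sketch of getting from a bare druid to its druid-tree path, assuming the standard SDR druid-tree layout (the storage root path and the example druid below are placeholders):

```shell
# Split a bare druid into the standard druid-tree path,
# e.g. bj102hs9687 -> bj/102/hs/9687/bj102hs9687
druid_tree_path() {
  local druid="$1"
  echo "${druid:0:2}/${druid:2:3}/${druid:5:2}/${druid:7:4}/${druid}"
}

# Hypothetical usage (adjust the storage root to your environment):
druid_tree_path bj102hs9687
# then: ls -R "/storage_root/sdr2objects/$(druid_tree_path bj102hs9687)"
```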

AFTER it runs cleanly on stage/qa, including the rollback

Deploy to Prod

  • Deploy to Prod
    • rebase branch as needed
    • coordinate with ops (at a minimum to let them know; you may want to request their help for the db backup)
    • merge PR to main
    • pause Google Books
    • watch Honeybadger for errors
    • stop pres robots, pres cat workers, and pres cat web servers (same commands as the QA dry run, substituting prod for qa)
    • take backup snapshot
      • from DB server, ksued to root: su - postgres -c "pg_dump -Fc pres > ~/useful_title_premigration_snapshot_$(date --iso-8601=seconds).sql"
      • the file should be at ~postgres/'useful_title_premigration_snapshot_2022-12-02T08:09:28-08:00.sql'
    • start pres cat web servers (ksued to root from each VM): service apache2 start
    • deploy migration (bundle exec cap prod deploy from preservation_catalog, your-migration branch (or main))
      • if there's no error, proceed; if there's an error, restore the DB from the snapshot and abort
    • verify prescat workers restarted with the deploy: https://preservation-catalog-web-prod-01.stanford.edu/queues/busy
    • start pres robots (from preservation_robots project): bundle exec cap prod resque:pool:hot_swap
    • start Google Books again
    • create a new version on a test object in prod (search on "test object" in Argo).
      • run checksum validation for this object from rails console
      • run replication audit on this object from rails console. If you want to be super thorough about investigating replication success, you can use the code snippets in this wiki entry to check the cloud manually/synchronously and to get relevant query results.
      • manually examine the Moabs in the storage roots (just a quick listing to make sure they look right).
      • if no problems detected, all good, proceed.
      • if problems
        • consider aborting: shut down gbooks/presbots/prescat, restore the DB from snapshot, re-deploy the weekly release tag, and run CatalogUtils.check_existence_for_druid on the recently updated druid. If you do that, turn this into a real sub-checklist first to keep track of things.
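
If you do have to restore (here or on QA), it's cheap to first confirm the snapshot file is a real pg_dump custom-format archive rather than an empty or truncated file. A minimal sketch — custom-format archives begin with the magic bytes "PGDMP"; the wildcard filename below is a placeholder:

```shell
# Succeeds only if the file is non-empty and begins with pg_dump's
# custom-format magic bytes ("PGDMP").
looks_like_custom_dump() {
  [ -s "$1" ] && [ "$(head -c 5 "$1")" = "PGDMP" ]
}

# Hypothetical usage on the DB server:
# looks_like_custom_dump ~postgres/useful_title_premigration_snapshot_*.sql && echo "snapshot OK"
```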

Deploy to QA and Stage

If prod deploy succeeded, deploy main to QA and Stage

  • QA
  • Stage