Storage Migration Additional Information

General Notes for Storage Migration

Goal: Migrate Storage Roots Without Introducing New Errors

What follows are some more detailed notes and advice that used to be inline in the checklists. Given that we're doing the 1x roots in a single migration weekend, having lots of text in the separate checklists can be more annoying than helpful.

Keep an eye out for migration-related Honeybadger errors from preservation_catalog and from preservation_robots.

About shared_configs PRs

The instructions recommend PRs for shared_configs changes. It would be great to get those PRs reviewed before merging via the usual process, but if no one is available to review them, use your judgment. Don't let the migration stall while waiting for review -- just cross-check your work thoroughly yourself (a PR helps with this). For extra assurance, try copying the path for a new storage root entry from the shared_configs PR and then running ls on it from a pres cat VM to make sure it doesn't contain typos, as sketched below.
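A minimal sketch of that cross-check (the path shown is an illustrative placeholder -- paste the real value from your shared_configs PR):

    # on a pres cat VM; /services-disk99/sdr2objects is a placeholder path
    $ ls -d /services-disk99/sdr2objects
    # prints the path if it exists; errors out if the PR value has a typo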

Notes & advice for running validation and reporting commands

From where do I run reporting and validation rake tasks?

  • Reporting and validation rake tasks should be run from a production preservation catalog VM. It may be easiest to always use preservation-catalog-prod-01 for consistency.

What string am I supposed to use for a storage root?

  • The rake tasks take the storage root name (stor_root_name is used as the placeholder in the instructions). Storage root names are the shorter, more readable aliases for the full paths to the storage roots (i.e. the keys in storage_root_map in shared_configs, used for the name field in the moab_storage_roots table). See the sketch below.
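As a concrete sketch of the name-to-path mapping (column names are from the prescat schema; the database name here is a placeholder -- use whatever the environment's database.yml points at, e.g. via rails dbconsole on a prod VM):

    # run from a prescat prod VM
    $ psql pres -c "SELECT name, storage_location FROM moab_storage_roots ORDER BY name;"
    # name is the short alias you pass to the rake tasks;
    # storage_location is the full path on disk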

What should I check for runtime problems while validations are running?

Keep an eye on Honeybadger (see the note above about migration-related errors from preservation_catalog and preservation_robots).

Can I see validation errors while validations are running?

Yes. Validation errors are stored in the prescat database.

  • You may run audit error reports to see what errors, if any, are known to the database. There are also status queries in the DB README.
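For instance, a minimal status-query sketch (table and column names are assumed from the prescat schema; the database name and storage root name are placeholders), counting CompleteMoabs by status for one root:

    # run from a prescat prod VM
    $ psql pres -c "SELECT cm.status, COUNT(*)
                      FROM complete_moabs cm
                      JOIN moab_storage_roots msr ON msr.id = cm.moab_storage_root_id
                     WHERE msr.name = 'services-disk02'
                     GROUP BY cm.status
                     ORDER BY cm.status;"
    # status is an ActiveRecord enum, so values may display as integers; see the
    # DB README for the mapping and for canned versions of queries like this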

What actually runs the validation checks?

  • The validation rake tasks queue jobs that do the actual validation work:
    • Queueing alone can still take tens of minutes on large storage roots, so you may want to run those rake tasks from a screen session (see the sketch after this list).
    • The rake task will report errors queueing the jobs, but not the results of the jobs.
    • The rake task will likely finish well before the jobs it queues are all finished.
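A minimal sketch of the screen workflow (the rake task name below is a placeholder -- use the actual validation task from the checklist or README):

    # on a prescat prod VM, from the 'current' directory
    $ screen -S validation
    $ RAILS_ENV=production bundle exec rake prescat:some_validation_task[stor_root_name]
    # detach with Ctrl-a d; the queued jobs keep running in the workers regardless
    $ screen -r validation   # reattach later to check on the rake task itself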

Do audit check errors step on each other?

Not all checks are able to validate CompleteMoabs with just any status. For example, M2C and C2M will decline to validate a CompleteMoab with an invalid_checksum status. This is why we want to run these checks sequentially, not simultaneously, and why we want error reports generated after each check completes, before the next one starts.

Do reports trigger auditing?

No. Running reports will not actually trigger any auditing; a report just queries the database for the status info that was set during the most recently run audits (which is why it should run quickly).

Do we need more workers for reports?

If you're not confident that CV and M2C will finish in the migration window, giving them more resque workers may help, up to a point -- talk to Ops. Note that CV requires CPU to generate checksums, so upping those workers could peg out the worker VMs. Additionally, file reads require network bandwidth (because storage roots are mounted via NFS), and so may eventually hit IO limits. Discuss with Ops if this is a concern.
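If you want to gauge whether CPU or the NFS mounts are the bottleneck while CV/M2C run, something like the following can help inform that conversation with Ops (a sketch, assuming the sysstat and nfs-utils tools are installed on the worker VMs):

    # on a worker VM: per-mount NFS throughput and per-CPU load, sampled every 5 seconds
    $ nfsiostat 5
    $ mpstat -P ALL 5
    # sustained near-100% CPU suggests the checksum (CV) workers are the limit;
    # flat NFS read throughput under load suggests IO/network is the limit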

What if CV doesn't automatically kick off workers?

  • After migration, we expect CV to automatically kick off workers.
  • If this doesn't happen, we will need to run CV manually for the new storage brick.
    • When testing the checklist, CV jobs failed to queue automatically -- we suspect because CV had just been run on the druids in question, and the resque lock for de-duping jobs was still hanging around.
  • It is also possible to run CV for a list of druids -- this is documented in the prescat README.
    • The list of druids for each migrated OLD storage root is the 'druids_b4' report for the old storage root in /opt/app/pres/preservation_catalog/current/log/reports (see the sketch after this list).
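For example, to eyeball the druid list for an old root (the report filename below is illustrative -- use whatever the reporting task actually wrote):

    # on a prescat prod VM
    $ cd /opt/app/pres/preservation_catalog/current/log/reports
    $ ls | grep druids_b4                      # find the report for the old root
    $ wc -l druids_b4_services-disk02.csv      # sanity-check the druid count
    $ head -3 druids_b4_services-disk02.csv    # spot-check the format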

CV -- why we won't run this anew right before the migration

We had originally planned to run all objects to be migrated through checksum validation the week before the cutover. Testing indicates that this won't be possible in the time we have with the computational and storage resources we have. Additionally, checksum validation has been running as expected on production (as of 2020-04-15, our least recently checksum validated object on production was checked at 2020-01-13 11:03:05 UTC). We have other safeguards and assurances in the migration process and in our normal workflow that should give peace of mind re: integrity of preserved objects: C2M before and after the move, FS block level integrity checking built into the rsync invocation that copies content from old to new storage, FS block level integrity checking on the originating NetApp filer, ZFS on the new storage brick, and independent rsyncing of the same content to both the primary and backup storage bricks.
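For reference, the kind of rsync invocation implied by "integrity checking built into the rsync invocation" looks roughly like the sketch below. This is illustrative only -- the actual copy commands, hostnames, and paths are owned by Ops:

    # --checksum forces content comparison rather than trusting size/mtime;
    # --itemize-changes logs exactly what was transferred. new-brick is a placeholder host.
    $ rsync -a --checksum --itemize-changes /services-disk02/sdr2objects/ new-brick:/sdr2objects/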

Cron jobs are deployed on preservation-catalog-prod-02

Cron jobs are deployed on preservation-catalog-prod-02 (technically, there are also cron jobs defined on -04, but we're not concerned about them here).
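A quick way to confirm this for yourself (a sketch only -- the ssh user, hostname form, and whether the entries live in that user's crontab are all assumptions about the environment):

    # from your laptop
    $ ssh pres@preservation-catalog-prod-02 'crontab -l | head'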

What About the Actual Migration?

  • If we add a new storage root (i.e. the first time we migrate data to a particular new storage brick), we also need to add the root to the prescat db (run this from your laptop):

    bundle exec cap prod db_seed
    

    (yes, it's db_seed via capistrano)

  • We will run a rake migration task to tell prescat that Moabs have moved (as with the reporting and auditing rake tasks, provide the storage root names, not their full paths).

    • NOTE: You should run this task in screen, as it can take a couple of minutes, and you don't want to interrupt it in the middle.

      # from your laptop:
      $ bundle exec cap prod ssh
      
      # on the server, from the 'current' directory ('from' and 'to' are storage root names):
      $ RAILS_ENV=production bundle exec rake prescat:migrate_storage_root[from,to]
      
  • After the rake task, the migrated moabs should no longer be associated with their old location in pres cat's database.

    • Ops will be in charge of the actual storage cutover. Before Preservation Catalog's workers are re-enabled, Ops will make the old storage invisible to pres cat by un-mounting its NFS share and mounting the target storage's NFS share. From pres cat's perspective, an empty stub storage root directory will be left in place of the original storage root (e.g., if migrating off of services-disk02, /services-disk02/sdr2objects should be present, but empty -- its moabs should only be available from the migration's target storage location). This prevents pres cat from seeing the same moab in two places, which is currently unsupported. A post-cutover sanity check is sketched after this list.
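A post-cutover sanity check might look like the following sketch (the old-root path is from the example above; the new mount point is a placeholder):

    # on a prescat prod VM, after Ops has swapped the mounts
    $ ls /services-disk02/sdr2objects | wc -l    # the stubbed old root: expect 0
    $ mount | grep sdr2objects                   # confirm which NFS shares are mounted now
    $ ls /new-storage-brick/sdr2objects | head   # moabs should be visible from the new root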