Validate moab step fails during preservationIngestWF - sul-dlss/preservation_catalog GitHub Wiki

Note: this remediation involves working with the preservation-robots server rather than prescat, but it is documented in the prescat wiki because:

the issue happens at the point where a moab is created and handed off for prescat to pick up
there's no wiki for preservation robots and it seems easier to put the document here for now

Description of the issue

The validate-moab step stops at an error because one or more files in the most recent version folder in the moab do not match the checksums on the moab manifest. The error will look something like this:

"Problem with Moab validation run on preservation-catalog-prod-03.stanford.edu: [{"druid:gp362tm4122-v0001: version_additions: file_differences"=>{"digital_object_id"=>"druid:gp362tm4122|", "difference_count"=>1, "basis"=>"v1", "other"=>"/pres-03/sdr2objects/gp/362/tm/4122/gp362tm4122/v0001/data/content|/pres-03/sdr2objects/gp/362/tm/4122/gp362tm4122/v0001/data/metadata", "report_datetime"=>"2021-12-09T05:55:51Z", "group_differences"=>{"content"=>{"group_id"=>"content", "difference_count"=>1, "identical"=>4, "modified"=>1, "subsets"=>{"modified"=>{"change"=>"modified", "count"=>1, "files"=>{0=>{"change"=>"modified", "basis_path"=>"36105062472159-gb-jp2.zip", "other_path"=>"same", "signatures"=>{0=>{"size"=>376008908, "md5"=>"136e0ba1ad9aa1a3755a6a93fc389f0e", "sha1"=>"7098f91cec01490451398be324a2430256fb9be1", "sha256"=>"5737e26ab4c87dc347762d58aed5c0e3c8d347e1c6ab71e197e1d555baeacdc3"}, 1=>{"size"=>376008908, "md5"=>"8fe5db9211ea97dcbb05020afb316f06", "sha1"=>nil, "sha256"=>nil}}}}}}}, "metadata"=>{"group_id"=>"metadata", "difference_count"=>0, "identical"=>7}}}}]"

The problem here is that one or more of the files has been corrupted while being copied into the new moab version for that druid. When this has happened, it has corresponded to an issue with Ceph that required Ops to reboot one or more preservation-related servers.
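When the error payload is long, it can be easier to walk the hash than to read it by eye. The following is a minimal sketch, assuming the payload has been pasted into Ruby as a hash literal (trimmed here to the keys from the example above that the sketch uses). `modified_files` is a hypothetical helper name for illustration, not part of prescat or any moab library.

```ruby
# Trimmed copy of the example error payload above (only the keys used below).
report = {
  "group_differences" => {
    "content" => {
      "difference_count" => 1,
      "subsets" => {
        "modified" => {
          "files" => {
            0 => {
              "basis_path" => "36105062472159-gb-jp2.zip",
              "signatures" => {
                0 => { "size" => 376008908, "md5" => "136e0ba1ad9aa1a3755a6a93fc389f0e" },
                1 => { "size" => 376008908, "md5" => "8fe5db9211ea97dcbb05020afb316f06" }
              }
            }
          }
        }
      }
    },
    "metadata" => { "difference_count" => 0 }
  }
}

# Hypothetical helper: returns [group, path, md5 of signature 0, md5 of signature 1]
# for every file reported as modified, across all groups (content, metadata, ...).
def modified_files(report)
  report.fetch("group_differences", {}).flat_map do |group, diff|
    files = diff.dig("subsets", "modified", "files") || {}
    files.values.map do |file|
      signatures = file.fetch("signatures", {})
      [group, file["basis_path"], signatures.dig(0, "md5"), signatures.dig(1, "md5")]
    end
  end
end

modified_files(report).each do |group, path, md5_first, md5_second|
  puts "#{group}/#{path}: md5 #{md5_first} vs #{md5_second}"
end
```

Groups with no modified files (such as metadata in this example) simply contribute nothing to the list.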

Context: How a moab is made and updated

Remediating this issue requires cleaning up (deleting) some files from the preservation storage system, so it's worth reviewing how the files got there. The preservationIngestWF does the following to create a moab:

transfer-object: copies the files to be stored in the latest version of a druid from the /dor/export/{druid} folder in common-accessioning to a deposit folder on the appropriate preservation mount (e.g. /pres-##/deposit/{druid}). The files in this folder are structured according to the BagIt specification.
validate-bag: runs BagIt validation to make sure that all files in the bag were copied correctly
update-moab: transforms the data contained in the bag (including some of the checksum info in the manifests) into the next moab version of the druid. At this point, data is copied from /pres-##/deposit/{druid} to the new version folder for the moab following this pattern /pres-##/sdr2objects/{druidtree}/{druid}/v####. This copy is made by hard linking the files in the deposit folder at /pres-##/deposit/{druid} to the new version folder in the moab at /pres-##/sdr2objects/{druidtree}/{druid}/v####.
The intention seems to have been to carry out this operation as a "move", which doesn't involve writing new bytes, but the current preservation filesystem processes this operation as a "copy" and writes to the new location, making it necessary to validate the moab in the following step. A side effect of this hard linking seems to be that the files in /pres-##/deposit/{druid} get replaced with the data written to the moab folder, so if a file gets corrupted in this process, it's corrupted at both file paths since it's the same file.
validate-moab: validates the files copied into the new moab version at /pres-##/sdr2objects/{druidtree}/{druid}/v#### against the checksums in the moab manifests. If the validation passes, the hard links between /pres-##/deposit/{druid} and the moab folder are removed, and then the deposit folder is cleaned up.
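The hard-link behavior described under update-moab can be seen in a small experiment. This is a sketch of ordinary hard-link semantics on a local POSIX filesystem, not of the pres-robots code or of how the preservation mounts behave; the file names and the `hard_link_demo` method are made up for illustration.

```ruby
require "tmpdir"

# Creates a file, hard links it under a second name, then overwrites the
# contents through the second name and reports what the first name sees.
def hard_link_demo
  Dir.mktmpdir do |dir|
    deposit_path = File.join(dir, "deposit.bin") # stands in for the bag copy
    moab_path    = File.join(dir, "moab.bin")    # stands in for the moab version copy

    File.write(deposit_path, "original bytes")
    File.link(deposit_path, moab_path)           # hard link: a second name, no bytes copied

    same_file = File.identical?(deposit_path, moab_path)

    # Simulate corruption by writing through the moab path only.
    File.write(moab_path, "corrupted bytes")

    { same_file: same_file, deposit_contents: File.read(deposit_path) }
  end
end

result = hard_link_demo
puts result.inspect
```

Because both names point at the same underlying file, a write through one path is visible through the other, which is consistent with the deposit bag and the moab version being corrupted together.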

Remediation

Remediating a single Moab

What this means is that when validate-moab fails, it's because some files were corrupted when the new moab version was created. The corruption could be to one or more of the content files, the metadata files, or the manifest files. Unfortunately, the deposit folder bag is usually also corrupted at this point, possibly because of the way the robots use hard links to copy the files between folders.

Since the moab is not valid and you generally can't trust the files in the bag either, remediating this issue means re-running the whole preservationIngestWF from the start, to re-copy the files from the /dor/export/{druid} folder, which should have been unaffected by any corruption in the preservation system.

First, remove the invalid moab version folder, i.e. the one that update-moab wrote and validate-moab then rejected:

ssh into the preservation-robots server. You have to use this server because it mounts the preservation system read-write. You can't use prescat because its mounts are read-only.
cd into the moab folder for the affected druid, for example cd /pres-03/sdr2objects/yy/889/cc/1416/yy889cc1416/
Delete the folder for the invalid moab version. Be careful not to delete the whole moab, just the one (the most recent) version that failed validation. This is a risky operation since it's just you typing rm commands, but there may not be a way to make it safer without updating the pres-robots code to handle this type of failure in a different way.
So far the items requiring this remediation have all been the initial versions of objects, so the v0001 folder was the one to delete. But if the version that failed validation was version 3, then you'd remove only the v0003 folder. The earlier version folders should be fine because that data was at rest when the problem occurred and should not have been corrupted. Only the in-flight data should be at risk during the update-moab step.
Once the moab version has been cleaned up, pull up a Rails console on the workflow-server-rails VM and restart the preservationIngestWF from the beginning by re-setting all steps to waiting and then running the workflow again. transfer-object will re-copy the data from /dor/export/{druid} and, assuming that data is still valid, the workflow should complete on its own.

You can use an update statement like: WorkflowStep.where(druid: druid, workflow: 'preservationIngestWF', version: version).update_all(status: 'waiting') (where druid is the prefixed druid, and version is the version being remediated for the mid-ingest error)
Once you've run the database update, you'll need to get the workflow started. The easiest way to do this is to pull up the druid in Argo, confirm that the start-ingest step of preservationIngestWF is waiting, and then hit "Set to completed" on that start-ingest step. The start-ingest step is a no-op triggered when accessionWF hands off to preservationIngestWF, and marking it completed will kick off the rest of the workflow. To do this programmatically, see the code in the workflow:step rake task in the workflow-server-rails codebase.
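Since the deletion step depends on typing the right path, it may help to compute the path and check it against what is on disk before running rm. The sketch below only builds strings and compares version folder names. The helper names are hypothetical, and the druid pattern is simplified to two letters, three digits, two letters, four digits.

```ruby
# Hypothetical helpers for illustration; not part of the pres-robots codebase.

# "yy889cc1416" -> "yy/889/cc/1416"
def druid_tree(bare_druid)
  match = bare_druid.match(/\A([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/)
  raise ArgumentError, "not a bare druid: #{bare_druid.inspect}" unless match

  match.captures.join("/")
end

# 3 -> "v0003"
def version_dir(version)
  format("v%04d", version)
end

# Full path of one moab version folder under a storage root such as "/pres-03".
def moab_version_path(storage_root, bare_druid, version)
  File.join(storage_root, "sdr2objects", druid_tree(bare_druid), bare_druid, version_dir(version))
end

# True only when `version` is the highest version folder in the listing,
# i.e. the in-flight version rather than older data at rest.
def latest_version?(version_dirs, version)
  version_dirs.max == version_dir(version)
end

puts moab_version_path("/pres-03", "yy889cc1416", 1)
puts latest_version?(%w[v0001 v0002 v0003], 3)
```

Comparing the computed path with the output of ls in the moab folder before deleting gives a second check that only the most recent version is targeted.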

To be extra safe, you may want to run a preservation audit on the druid rather than wait for the automated check to run in 90 days, although if it passed the whole preservationIngestWF cleanly the second time, it should be as valid as any other newly-accessioned data. To run the audit, pull up a Rails console on a preservation_catalog VM and run MoabRecord.by_druid(druid).validate_checksums!, where druid is the unprefixed ("bare") druid, unlike on WFS. This will queue a checksum validation job. If there is no backlog of checksum validation jobs, it should get worked immediately. If an error is detected, a Honeybadger alert will be fired, the preservationAuditWF status will be updated in WFS and reflected in Argo, and the status will be updated in prescat's DB.

Remediating numerous Moabs in bulk [WIP, still needs some proof-reading]

⚠️ You should pair with someone when doing this, as it's good to have someone watching over your shoulder when deleting content from the preservation system. You should also consider doing the following in a screen session, in case you lose your connection while working.

Go to the Argo workflow grid: https://argo.stanford.edu/report/workflow_grid
Scroll down to preservationIngestWF and click the link for the validate-moab errors facet: https://argo.stanford.edu/catalog?f%5Bwf_wps_ssim%5D%5B%5D=preservationIngestWF%3Avalidate-moab%3Aerror
Click the Columns button and select only the Druid and Status columns.
Click Download, giving the .csv a descriptive name.
Using your favorite approach to text wrangling (e.g. sed, or multi-selection and edit in your favorite text editor): remove the header row of the CSV, and turn the v1 In accessioning ... style strings in the second column into v0001 style Moab version directory names. Confirm that the file you saved uses Unix-style line breaks (you might get Mac-style line breaks if you used Excel on Mac to do your text wrangling)! dos2unix can fix your line break problem. Note: if we switch away from using Bash for the parts of this where we use Bash, we can probably be less particular about some of this CSV generation.
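As one alternative to hand-editing, the CSV clean-up in the last step could be scripted. This is a minimal sketch under assumptions: the export has one header row, the druid is in the first column (prefixed or bare), and the status column starts with a version like v1. The method name is hypothetical, and fields are split on the first comma rather than parsed with a full CSV library.

```ruby
# Converts Argo export text into lines like "gp362tm4122,v0001".
# Assumptions (not confirmed by the source): one header row, druid in the
# first column, status text beginning with "v<number>" in the second.
def argo_rows_to_moab_lines(csv_text)
  # Normalize Windows (\r\n) and old Mac (\r) line breaks to Unix (\n).
  lines = csv_text.gsub(/\r\n?/, "\n").lines(chomp: true).reject(&:empty?)

  lines.drop(1).map do |line| # drop(1) skips the header row
    druid, status = line.split(",", 2).map { |field| field.delete('"').strip }
    version = status.to_s[/\Av(\d+)/, 1]
    raise ArgumentError, "no version found in #{line.inspect}" unless version

    format("%s,v%04d", druid.delete_prefix("druid:"), version.to_i)
  end
end

sample = "Druid,Status\r\ndruid:gp362tm4122,v1 In accessioning\r\n"
output = argo_rows_to_moab_lines(sample).join("\n") + "\n"
print output
```

Writing `output` to a file gives Unix line breaks regardless of how the export was saved. It is still worth eyeballing the result before feeding it to anything that deletes files.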
Script for removing only specified Moab versions from preservation storage

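#!/bin/bash
# Removes only the listed Moab version folders from preservation storage.
# Argument: path to a file of lines like "gp362tm4122,v0001"
# (bare druid, Moab version folder name, Unix line endings).

druid_list=$1

while IFS= read -r line
do
        druid=$(echo "$line" | cut -d ',' -f1)
        version=$(echo "$line" | cut -d ',' -f2)
        # gp362tm4122 -> gp/362/tm/4122
        druid_tree=$(echo "$druid" | sed -r 's/([0-9])([a-zA-Z])/\1\/\2/g; s/([a-zA-Z])([0-9])/\1\/\2/g')

        echo "$druid,$druid_tree,$version"

        # Highest version folder currently on disk for this druid
        max_moab_version_path=$(find /pres-0*/sdr2objects/"$druid_tree/$druid" -mindepth 1 -maxdepth 1 -type d | sort | tail -n 1)
        max_moab_version=$(basename "$max_moab_version_path")

        # Only delete when the listed version is the most recent one on disk
        if [ -n "$max_moab_version_path" ] && [ "$max_moab_version" = "$version" ]
        then
                echo "versions match"
                rm -rv "$max_moab_version_path"
        else
                echo "version mismatch, quitting"
                exit 1
        fi

done < "$druid_list"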

Resetting/rewinding preservationIngestWF for the druid versions from the report

TODO: add the line of code for getting druid_versions from your report

# Run in a Rails console on the workflow-server-rails VM.
# druid_versions is an array of [prefixed druid, version] pairs.
RabbitFactory.start_global

druid_versions.each do |druid, version|
  steps = WorkflowStep.where(druid: druid, workflow: 'preservationIngestWF', version: version)
  # Uncomment to inspect the steps before changing them:
  # puts steps.order(:druid, :workflow, :process).pluck(:druid, :workflow, :process, :status, :version)

  # Rewind every step of the workflow for this version.
  steps.update_all(status: 'waiting')

  # start-ingest is a no-op; marking it completed kicks off the rest of the workflow.
  step = steps.find_by(process: 'start-ingest')
  step.update(status: 'completed')

  next_step = steps.find_by(process: 'transfer-object')
  NextStepService.enqueue_next_steps(step: next_step)
  SendUpdateMessage.publish(step: step)
end
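The TODO above leaves open how druid_versions is built. One possible approach, sketched here under assumptions, is to reuse the same bare_druid,vNNNN file that was fed to the Bash script and convert each line to the prefixed druid and a numeric version. That WFS expects an integer version is an assumption based on the single-druid update statement earlier on this page, and `parse_druid_versions` is a hypothetical name.

```ruby
# Turns lines like "gp362tm4122,v0001" into [["druid:gp362tm4122", 1], ...].
# Assumption: WFS wants the "druid:" prefix and an integer version.
def parse_druid_versions(text)
  text.each_line(chomp: true).map(&:strip).reject(&:empty?).map do |line|
    druid, version_dir = line.split(",", 2)
    digits = version_dir.to_s[/\Av(\d+)\z/, 1]
    raise ArgumentError, "unexpected line: #{line.inspect}" unless digits

    # to_i parses "0001" as decimal 1 (no octal interpretation).
    ["druid:#{druid.delete_prefix('druid:')}", digits.to_i]
  end
end

# In the Rails console this might be fed from the file used by the Bash script,
# e.g. parse_druid_versions(File.read(path_to_your_list)); a literal is used here.
druid_versions = parse_druid_versions("gp362tm4122,v0001\nyy889cc1416,v0003\n")
puts druid_versions.inspect
```

Printing the resulting pairs and spot-checking a few against Argo before running the reset loop is a cheap safeguard.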

watching for issues/running audits