Fixing an Incomplete Moab Upload - sul-dlss/preservation_catalog GitHub Wiki

Possible Symptoms (only a subset may apply)

  • A large Moab (>10 GB) has had only a portion of its archive files uploaded.
  • PartReplicationAuditJob detects a problem with a replicated Moab (e.g. https://app.honeybadger.io/projects/54415/faults/64996758)
  • The Settings.zip_storage volume (e.g. sdr-transfers) ran out of space during transfers/upload.
  • The parts_count value recorded for the uploaded zip segments is incorrect.
  • There are entries in the delivery failure queues (*_delivery_failed, e.g. s3_us_east_1_delivery_failed), especially for a druid version that had to be split into multiple (10 GB) segments.
    • The most common cause of this is a network blip when trying to reach the AWS or IBM S3 endpoint(s).
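
If you want to check how a suspect druid currently looks in the catalog (e.g. to spot an incorrect parts_count, or parts still marked unreplicated), a minimal Rails console sketch like the following can help. The druid is just an example, and the association and column names (zipped_moab_versions, zip_endpoint.endpoint_name, zip_parts, suffix/status/parts_count) are assumed from the log output further down this page, so double-check them against the models if anything looks off.

po = PreservedObject.find_by(druid: 'zy046vr4617')   # example druid - substitute the suspect one
po.zipped_moab_versions.includes(:zip_endpoint, :zip_parts).each do |zmv|
  parts = zmv.zip_parts.map { |zp| [zp.suffix, zp.status, zp.parts_count] }   # one entry per zip part
  puts "v#{zmv.version} on #{zmv.zip_endpoint.endpoint_name}: #{parts.inspect}"
end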

Fix

Easier fix: automatic retries

Usually Sidekiq automatic retries handle this, because the problem is typically a transient network issue.

First: if you see an alert about a delivery failure, check whether Sidekiq successfully retried it (e.g. the job is no longer in the Retries queue or the Dead queue, and may be in progress under Busy).
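
If you'd rather do that check from a Rails console on one of the worker boxes than from the Sidekiq web UI, you can scan the retry and dead sets with Sidekiq's standard API. This is a minimal sketch, assuming the delivery jobs' arguments include the druid:

require 'sidekiq/api'
druid = 'bb001zc5754'   # example druid from the alert - substitute your own
retry_jobs = Sidekiq::RetrySet.new.select { |job| job.args.to_s.include?(druid) }
dead_jobs = Sidekiq::DeadSet.new.select { |job| job.args.to_s.include?(druid) }
puts "jobs mentioning #{druid} - retries: #{retry_jobs.size}, dead: #{dead_jobs.size}"

If both counts are zero and nothing relevant shows under Busy, the retries have either succeeded or were never needed; confirming full replication (next paragraph) tells you which.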

If the replication completed after a successful retry of the failed delivery attempt, you are done. If you want to make sure, you can use the info here to help determine whether something was fully and successfully replicated: Replication errors

After Settings.zip_cache_expiry_time without being accessed, the archive zip will have been automatically cleaned from the zip generation temp space (and retrying the delivery job won't re-create it, because zip creation is an earlier step in the replication pipeline). If retrying failed delivery jobs didn't resolve the issue (e.g. because the zip files aged out of temp space between retries, or because the issue stems at least in part from something other than failed delivery attempts), read on...

More involved manual fix

  1. Delete the zip_parts and zipped_moab_versions database records for failed/partial replication attempts. ⚠️ Please use the rake task described below, since it has safeguards against overly broad deletions. If you'd like to remediate in bulk from Rails console, please use the CatalogRemediator class method that the rake task wraps.
  2. Delete any/all remaining pieces of the Moab's zip from /sdr-transfers (or wherever Settings.zip_storage points); see the shell sketch after this list.
  • this can be done by dev or ops from any of the pres cat prod worker boxes, i.e. any box other than -01
  • you can also just wait 7 days for the auto-cleanup to purge any remaining pieces - the rake task used below won't act on any ZMVs younger than zip_cache_expiry_time anyway.
  3. Delete from that S3 endpoint any/all pieces of the failed Moab-version upload that made it there. (ops task; delete access is restricted)
  4. Once the above steps have been completed for a mis-replicated druid, re-trigger the creation of any missing zipped_moab_versions and zip_parts. (dev task, see below)
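
For step 2, a shell sketch like the following (run on one of the prod worker boxes) can be used to find and then remove leftover zip pieces. The druid-tree path layout and the .vNNNN file naming are inferred from the zip_cmd visible in the log excerpt further down, so verify both - and where Settings.zip_storage actually points - before deleting anything:

ls -lh /sdr-transfers/zy/046/vr/4617/                    # see what pieces are left for the example druid
rm -i /sdr-transfers/zy/046/vr/4617/zy046vr4617.v0001*   # interactively remove the leftover v1 zip pieces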

Example: using the rake task to clean up for a druid where zip parts were never created

pres@preservation-catalog-prod-02:~/preservation_catalog/current$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9]
pruned zipped moab version 9 on aws_s3_west_2
pruned zipped moab version 9 on aws_s3_east_1
pruned zipped moab version 9 on ibm_us_south

Example: using the rake task to clean up for a druid where zip parts were created

pres@preservation-catalog-prod-02:~/preservation_catalog/current$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[zy046vr4617,1]
pruned zipped moab version 1 on ibm_us_south

You can pass along the above info to ops, e.g.: "Please delete the zip parts for version 1 of zy046vr4617 from ibm_us_south, if any exist."

If you want to see more detail from the logs about what was cleaned up, you can do something like:

pres@preservation-catalog-prod-02:~/preservation_catalog/current$ grep -B1 'Destroying zip parts' log/production.log
I, [2022-03-16T08:27:09.586829 #3477151]  INFO -- : Replication failure error(s) found with bb001zc5754 (v9): [{:zip_parts_not_created=>"9 on aws_s3_west_2: no zip_parts exist yet for this ZippedMoabVersion"}]
Destroying zip parts ([]) and zipped moab version (20706447)
I, [2022-03-16T08:27:09.756150 #3477151]  INFO -- : Replication failure error(s) found with bb001zc5754 (v9): [{:zip_parts_not_created=>"9 on aws_s3_east_1: no zip_parts exist yet for this ZippedMoabVersion"}]
Destroying zip parts ([]) and zipped moab version (20706448)
I, [2022-03-16T08:27:09.764788 #3477151]  INFO -- : Replication failure error(s) found with bb001zc5754 (v9): [{:zip_parts_not_created=>"9 on ibm_us_south: no zip_parts exist yet for this ZippedMoabVersion"}]
Destroying zip parts ([]) and zipped moab version (20706449)
--
I, [2022-03-16T09:22:40.579252 #3483337]  INFO -- : Replication failure error(s) found with zy046vr4617 (v1): [{:zip_parts_not_all_replicated=>"1 on ibm_us_south: not all ZippedMoabVersion parts are replicated yet: [#<ZipPart id: 13381599, size: 10737418240, zipped_moab_version_id: 13294700, created_at: \"2019-03-26 15:30:13.268487000 +0000\", updated_at: \"2019-03-26 15:30:13.268487000 +0000\", md5: \"438ebd78b335f8015ec8895cb9fb1346\", create_info: \"{:zip_cmd=>\\\"zip -r0X -s 10g /sdr-transfers/zy/046/...\", parts_count: 34, suffix: \".z30\", status: \"unreplicated\", last_existence_check: nil, last_checksum_validation: nil>]"}]
Destroying zip parts ([13381434, 13381411, 13381418, 13381426, 13381627, 13381438, 13381441, 13381445, 13381453, 13381457, 13381460, 13381464, 13381470, 13381483, 13381491, 13381499, 13381507, 13381521, 13381528, 13381531, 13381534, 13381538, 13381541, 13381544, 13381552, 13381561, 13381577, 13381585, 13381590, 13381599, 13381604, 13381612, 13381622, 13381408]) and zipped moab version (13294700)

In the above example, you can see that zy046vr4617 v1 is a 34-part druid version (note parts_count: 34 in the log) of which one part failed to replicate successfully. Once ops cleans up the other 33 (😢), it can be pushed through replication again (see below). bb001zc5754 can be pushed through immediately, as no zip parts were pushed to S3 in the first place.

Example: Triggering replication for a druid where partial replication problems have been cleaned up

NOTE: if any zip_parts records were cleaned up for the druid version, confirm that any partially replicated content for that version has also been cleaned up from S3. The most common way this situation occurs is when a large (> 10 GB) druid ran into network issues on delivery attempts for some but not all zip parts, e.g. when the failure queues show entries for a .z02 zip part but not for the .zip or .z01 parts.

[1] pry(main)> PreservedObject.find_by(druid: 'dc156hp0190').create_zipped_moab_versions!

The create_zipped_moab_versions! call forces PresCat to re-create any missing zipped_moab_versions for the druid, including anything that was cleaned up by the remediation described above. This also triggers the rest of the replication pipeline, causing fresh archive zips to be generated and pushed to the S3 endpoints (which is why any partial uploads must be cleaned up from S3 before that's done, since pres cat won't overwrite zip parts that have already been uploaded to S3).
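
Once that call has been made, you can keep an eye on the catalog records until every part on every endpoint reports a happy status. A minimal sketch, again assuming the association and column names shown in the log output above:

po = PreservedObject.find_by(druid: 'dc156hp0190')
po.zipped_moab_versions.includes(:zip_endpoint).each do |zmv|
  # tally of zip part statuses per endpoint/version; expect everything to move out of "unreplicated" once replication completes
  puts "v#{zmv.version} on #{zmv.zip_endpoint.endpoint_name}: #{zmv.zip_parts.map(&:status).tally}"
end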