Replication errors - sul-dlss/preservation_catalog GitHub Wiki

Overview

Each moab has one or more versions. A zip of each version should be replicated to 3 endpoints. The zip for a version may be split into multiple parts depending on its size, so a single version can have multiple zip files.
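The sample keys in the diagnosis output on this page (for example dn/073/hg/2651/dn073hg2651.v0001.zip) hint at how a version maps to zip part keys. Here is a minimal Ruby sketch of that mapping. It assumes the key layout is a druid tree plus a zero-padded version, and that split parts use the .z01, .z02, ... suffixes that zip's split mode produces, with .zip last. The helper names are hypothetical and not taken from preservation_catalog.

```ruby
# Hypothetical helpers; the key layout is inferred from sample output, not from the codebase.

# Split a druid such as "dn073hg2651" into its tree segments: dn/073/hg/2651.
def druid_tree(druid)
  match = druid.match(/\A([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/)
  raise ArgumentError, "unexpected druid format: #{druid}" unless match

  match.captures.join("/")
end

# Suffixes for a zip split into parts_count files: .z01, .z02, ..., with .zip last.
def part_suffixes(parts_count)
  (1...parts_count).map { |n| format(".z%02d", n) } + [".zip"]
end

# Keys for every part of one version of one druid.
def zip_part_keys(druid, version, parts_count)
  base = format("%s/%s.v%04d", druid_tree(druid), druid, version)
  part_suffixes(parts_count).map { |suffix| base + suffix }
end

puts zip_part_keys("dn073hg2651", 1, 1)
puts zip_part_keys("dn073hg2651", 2, 3)
```

Each key returned here would correspond to one row per endpoint in the diagnosis output.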

This page has a Diagnosis section followed by a Troubleshooting section with recipes for fixing problems.

Diagnosis: Finding Which Druids Are Impacted

Run this from the rails console:

druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq

Diagnosis: Data for a Single Druid

Use the prescat:diagnose_replication rake task, passing the druid as its argument, to get the replication state:

RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]

If all zip parts for a given endpoint and version are "ok", then replication is correct for that endpoint and version.

druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint

If some or all zip parts for a given endpoint and version are "unreplicated", then that version is unreplicated for that endpoint.

druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint

If an endpoint has no rows at all for a given version (aws_s3_west_2 in the example below), then that version is "unreplicated" for that endpoint.

druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
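The three rules above can be applied mechanically to the task's CSV output. The following is a rough sketch rather than part of preservation_catalog: the function name is hypothetical, and the list of expected endpoints is an assumption taken from the sample output on this page.

```ruby
require "csv"

# Endpoint names as they appear in the sample output; adjust if the configured endpoints differ.
EXPECTED_ENDPOINTS = %w[aws_s3_east_1 aws_s3_west_2 ibm_us_south].freeze

# Returns { [version, endpoint] => "ok" | "unreplicated" | "missing" }.
def replication_states(csv_text)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
  pairs = rows.group_by { |row| row["zipped moab version"] }.flat_map do |version, version_rows|
    EXPECTED_ENDPOINTS.map do |endpoint|
      statuses = version_rows.select { |row| row["endpoint"] == endpoint }
                             .map { |row| row["zip part status"] }
      state = if statuses.empty?
                "missing"        # no row at all for this endpoint
              elsif statuses.all?("ok")
                "ok"             # every zip part is ok
              else
                "unreplicated"   # some or all zip parts are unreplicated
              end
      [[version, endpoint], state]
    end
  end
  pairs.to_h
end

SAMPLE = <<~OUTPUT
  druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
  dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
  dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
OUTPUT

replication_states(SAMPLE).each do |(version, endpoint), state|
  puts "version #{version} #{endpoint}: #{state}"
end
```

The sample combines two of the situations above: one endpoint with an unreplicated zip part and one endpoint with no row.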

Sometimes the problem may resolve on its own. To verify this:

Re-run diagnose_replication:

RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]

Run a Moab Replication Audit (which will report problems to Honeybadger):

RAILS_ENV=production bin/rake prescat:audit:replication_single[dn073hg2651]

Diagnosis: Data for a List of Druids

Diagnosis can also be performed in bulk from the Rails console using a list of druids. The following gets debug info for unreplicated zip parts:

druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq
debug_infos = Audit::ReplicationSupport.zip_part_debug_info(druids)
# Wait for it ...
CSV.open('debug_info.csv', 'wb') {|csv| debug_infos.each {|debug_info| csv << debug_info }}

The CSV output does not include column names. They are defined in app/services/audit/replication_support.rb and are the same as for the diagnose_replication task:

druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
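Because debug_info.csv has no header row, it can help to attach the column names when reading it back. This sketch assumes that each row has its fields in the column order listed above; the helper names are hypothetical.

```ruby
require "csv"

# Column order copied from the list above.
COLUMNS = [
  "druid", "preserved object version", "zipped moab version", "endpoint",
  "zip part status", "zip part suffix", "zipped moab parts count", "zip part size",
  "zip part md5", "zip part id", "zip part created at", "zip part updated at",
  "zip part s3 key", "zip part endpoint status", "zip part endpoint md5"
].freeze

# Turn header-less CSV text into one hash per row, keyed by column name.
def labelled_rows(csv_text)
  CSV.parse(csv_text).map { |fields| COLUMNS.zip(fields).to_h }
end

# Count unreplicated zip parts per endpoint.
def unreplicated_by_endpoint(rows)
  rows.select { |row| row["zip part status"] == "unreplicated" }
      .map { |row| row["endpoint"] }
      .tally
end

SAMPLE = <<~OUTPUT
  bn435ff2092,2,2,aws_s3_east_1,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892115,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
  bn435ff2092,2,2,aws_s3_west_2,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892109,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
  bh230tw3168,2,2,aws_s3_east_1,unreplicated,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805720,2023-04-13 01:25:54 UTC,2023-04-13 01:25:54 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
OUTPUT

rows = labelled_rows(SAMPLE)
puts rows.map { |row| row["druid"] }.uniq.inspect
puts unreplicated_by_endpoint(rows).inspect
```

To run it against the real file, pass File.read('debug_info.csv') instead of the inline sample.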

Troubleshooting

Note that Sidekiq retries replication failures automatically. If a ZipPart is "not found at endpoint" per the diagnose_replication rake task and the unreplicated ZipParts are "recent", then the problem will likely resolve itself. How recent is recent enough is an open question (possibly 1 week); it depends on how long Sidekiq keeps retrying, which by Sidekiq's defaults is 25 attempts over roughly three weeks unless the application configures otherwise. The ZipPart datestamps indicate how recently the record was updated.
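As a rough illustration of the "recent" check, the sketch below compares a "zip part updated at" value from the diagnosis output against a window in days. The 7-day default is only an assumption based on the guess above, and the function name is hypothetical.

```ruby
require "time"

SECONDS_PER_DAY = 24 * 60 * 60

# True when the timestamp falls within the last window_days days.
# window_days is an assumption; set it to match how long Sidekiq retries actually run.
def recently_updated?(updated_at, now: Time.now.utc, window_days: 7)
  (now - Time.parse(updated_at)) < window_days * SECONDS_PER_DAY
end

# Fixed "now" so the example gives the same answers whenever it is run.
now = Time.utc(2023, 4, 14)
puts recently_updated?("2023-04-13 04:05:30 UTC", now: now)
puts recently_updated?("2023-01-12 21:10:44 UTC", now: now)
```

A record that is not recent by this measure is a candidate for the remediation cases below.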

Case 1: ALL of the zips for a version are "not found at endpoint" and the zip part status is "unreplicated"

The output from rake prescat:diagnose_replication might look like this:

$ RAILS_ENV=production bin/rake prescat:diagnose_replication[bn435ff2092]
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
bn435ff2092,2,1,aws_s3_east_1,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468306,2022-05-23 20:48:49 UTC,2023-04-13 22:31:49 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,1,aws_s3_west_2,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468305,2022-05-23 20:48:49 UTC,2023-04-13 22:31:48 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,1,ibm_us_south,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468308,2022-05-23 20:48:49 UTC,2023-04-13 22:31:50 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,2,aws_s3_east_1,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892115,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
bn435ff2092,2,2,aws_s3_west_2,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892109,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
bn435ff2092,2,2,ibm_us_south,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892112,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,

Note that in this case, zipped moab version 1 has status "ok" for all endpoints. Zipped moab version 2 has status "unreplicated" for all endpoints, and its zip part endpoint status is "not found at endpoint" for all endpoints as well.
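The distinction between "all" and "some" zips missing for a version can be checked in code. This is a sketch under the assumption that the diagnosis output is available as CSV text with its header row; the function name and return values are hypothetical labels for the cases on this page.

```ruby
require "csv"

# Returns :case_1 when every row for the version is unreplicated and not found at its endpoint,
# :case_2 when only some rows are, and nil when no row is in that state.
def missing_zip_case(csv_text, version)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
            .select { |row| row["zipped moab version"] == version.to_s }
  missing = rows.select do |row|
    row["zip part status"] == "unreplicated" &&
      row["zip part endpoint status"] == "not found at endpoint"
  end
  return nil if missing.empty?

  missing.size == rows.size ? :case_1 : :case_2
end

SAMPLE = <<~OUTPUT
  druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
  bn435ff2092,2,1,aws_s3_east_1,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468306,2022-05-23 20:48:49 UTC,2023-04-13 22:31:49 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
  bn435ff2092,2,2,aws_s3_east_1,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892115,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
  bn435ff2092,2,2,aws_s3_west_2,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892109,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
  bn435ff2092,2,2,ibm_us_south,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892112,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
OUTPUT

puts missing_zip_case(SAMPLE, 1).inspect
puts missing_zip_case(SAMPLE, 2).inspect
```

Note that the sketch only looks at the rows present in the output; an endpoint with no row at all for a version needs the separate check described in the Diagnosis section.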

Case 1 Remediation: Step 1 - Prune the database records

Prune the database records for the zip_parts that are not at the endpoints.

There are three arguments to the rake task: the druid, the version to prune, and a flag controlling whether recent records are protected from pruning (a recent record may belong to a replication that is still in progress or waiting in the Sidekiq retry queue). By default, the task does not prune any recent ZippedMoabVersions, to avoid deleting records whose jobs may still be in process. To override this, pass "false" as the last argument:

$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[bn435ff2092,2,false]
pruned zipped moab version 2 on aws_s3_west_2
pruned zipped moab version 2 on ibm_us_south
pruned zipped moab version 2 on aws_s3_east_1

Note that pruning provides a message for each record pruned.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.
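Since pruning prints one message per record, the captured output can be checked against the endpoints you expected to be pruned. A small sketch, assuming the message format shown in the example above and the three endpoint names from the sample output; the helper name is hypothetical.

```ruby
# Endpoint names as they appear in the sample output on this page.
EXPECTED_ENDPOINTS = %w[aws_s3_east_1 aws_s3_west_2 ibm_us_south].freeze

# Matches lines such as "pruned zipped moab version 2 on aws_s3_west_2".
PRUNE_LINE = /\Apruned zipped moab version (\d+) on (\S+)\z/

# Endpoints for which the given version was reported as pruned.
def pruned_endpoints(output, version)
  output.each_line(chomp: true).filter_map do |line|
    match = PRUNE_LINE.match(line.strip)
    match[2] if match && match[1].to_i == version
  end
end

OUTPUT_TEXT = <<~OUTPUT
  pruned zipped moab version 2 on aws_s3_west_2
  pruned zipped moab version 2 on ibm_us_south
  pruned zipped moab version 2 on aws_s3_east_1
OUTPUT

not_pruned = EXPECTED_ENDPOINTS - pruned_endpoints(OUTPUT_TEXT, 2)
puts not_pruned.empty? ? "all expected endpoints pruned" : "not pruned: #{not_pruned.join(', ')}"
```

An empty difference only confirms what the task reported; the diagnose_replication re-run remains the real check.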

Case 1 Remediation: Step 2 - Replicate Missing Version with Backfill

Re-run the replication audit for the druid. We have backfill on by default, so once the database records are removed, the audit will do the replication.

$ RAILS_ENV=production bin/rake prescat:audit:replication_single[bn435ff2092]

This will report errors to Honeybadger.

NOTE: If there are a lot of versions in your object, or if the files are large (the diagnose_replication output has a size column), you might want to run only the needed backfill with this rake task:

RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]

Replication jobs may take some amount of time to complete.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the replication worked.
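To confirm the result without reading the CSV by eye, the re-run output can be checked for every version and endpoint. The sketch below is an assumption-laden illustration: it treats versions 1 through the "preserved object version" as the versions that should exist, uses the three endpoints from the sample output, and expects the number of rows per endpoint to equal "zipped moab parts count". The function name is hypothetical.

```ruby
require "csv"

EXPECTED_ENDPOINTS = %w[aws_s3_east_1 aws_s3_west_2 ibm_us_south].freeze

# Lists [version, endpoint] pairs that are not yet confirmed as fully replicated.
def outstanding_replications(csv_text)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
  current_version = rows.map { |row| row["preserved object version"].to_i }.max || 0

  (1..current_version).to_a.product(EXPECTED_ENDPOINTS).reject do |version, endpoint|
    parts = rows.select do |row|
      row["zipped moab version"].to_i == version && row["endpoint"] == endpoint
    end
    parts.any? &&
      parts.size == parts.first["zipped moab parts count"].to_i &&
      parts.all? do |row|
        row["zip part status"] == "ok" && row["zip part endpoint status"] == "found at endpoint"
      end
  end
end

# Output as it might look right after pruning version 2: only version 1 rows remain.
SAMPLE = <<~OUTPUT
  druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
  bn435ff2092,2,1,aws_s3_east_1,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468306,2022-05-23 20:48:49 UTC,2023-04-13 22:31:49 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
  bn435ff2092,2,1,aws_s3_west_2,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468305,2022-05-23 20:48:49 UTC,2023-04-13 22:31:48 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
  bn435ff2092,2,1,ibm_us_south,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468308,2022-05-23 20:48:49 UTC,2023-04-13 22:31:50 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
OUTPUT

outstanding_replications(SAMPLE).each do |version, endpoint|
  puts "version #{version} still outstanding on #{endpoint}"
end
```

An empty list means every expected version and endpoint is "ok" and "found at endpoint".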

Case 2: SOME of the zips for a version are "not found at endpoint" and the zip part status is "unreplicated"

In this case, we only want to remediate replication for the missing endpoint(s) for the version; we can leave the zip_parts with "ok" status and "found at endpoint" alone.

The output from RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651] might look like this:

druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint

Note that one of the endpoints (aws_s3_east_1, for version 1, the only version here) has a ZipPart status of "unreplicated" and an endpoint status of "not found at endpoint", while the other two have a status of "ok" and are "found at endpoint".

[TBD: THIS MAY BE THE SAME AS CASE 1 IF THE DATABASE PRUNING AND THE BACKFILL TASKS LEAVE THE 'ok' ZIP_PARTS ALONE]

Case 2 Remediation: Step 1 - Prune the database record(s)

Prune the database records for the zip_parts that are not at the endpoints.

[TBD: I HAVE NOT TESTED THE RAKE TASK FOR THIS CASE - WILL IT LEAVE THE ONES WITH "ok" STATUS ALONE?]

$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[dn073hg2651,1,false]
pruned zipped moab version 1 on aws_s3_east_1

Note that pruning provides a message for each record pruned.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.

Case 2 Remediation: Step 2 - Replicate To Missing Endpoint(s) with Backfill

Re-run the replication audit for the druid. We have backfill on by default, so once the database records are removed, the audit will do the replication.

$ RAILS_ENV=production bin/rake prescat:audit:replication_single[dn073hg2651]

This will report errors to Honeybadger.

[TBD: I HAVE NOT TESTED THE RAKE TASK BELOW FOR THIS CASE - WILL IT LEAVE THE ONES WITH "ok" STATUS ALONE?]

NOTE: If there are a lot of versions in your object, or if the files are large (the diagnose_replication output has a size column), you might want to run only the needed backfill with this rake task:

RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]

Replication jobs may take some amount of time to complete.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the replication worked.

Case 3: a zip_part status is "unreplicated" but it is "found at endpoint" and the checksums match

If a ZipPart status is "unreplicated" but it IS "found at endpoint" and the checksum at the endpoint matches the checksum stored in ZipPart, then we can just change the ZipPart status to "ok."

# in shell, at rails root, look up the diagnosis for the druid:
$ RAILS_ENV=production bin/rake prescat:diagnose_replication[bh230tw3168]
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
bh230tw3168,2,1,aws_s3_east_1,ok,.zip,1,38285937,c2caf028a9da60b0b30a4a8a27cb4b04,261796826,2022-10-31 09:32:40 UTC,2023-04-13 22:30:28 UTC,bh/230/tw/3168/bh230tw3168.v0001.zip,found at endpoint,c2caf028a9da60b0b30a4a8a27cb4b04
bh230tw3168,2,1,aws_s3_west_2,ok,.zip,1,38285937,c2caf028a9da60b0b30a4a8a27cb4b04,261796824,2022-10-31 09:32:39 UTC,2023-04-13 22:30:25 UTC,bh/230/tw/3168/bh230tw3168.v0001.zip,found at endpoint,c2caf028a9da60b0b30a4a8a27cb4b04
bh230tw3168,2,1,ibm_us_south,ok,.zip,1,38285937,c2caf028a9da60b0b30a4a8a27cb4b04,261796825,2022-10-31 09:32:39 UTC,2023-04-13 22:30:31 UTC,bh/230/tw/3168/bh230tw3168.v0001.zip,found at endpoint,c2caf028a9da60b0b30a4a8a27cb4b04
bh230tw3168,2,2,aws_s3_east_1,unreplicated,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805720,2023-04-13 01:25:54 UTC,2023-04-13 01:25:54 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
bh230tw3168,2,2,aws_s3_west_2,ok,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805718,2023-04-13 01:25:54 UTC,2023-04-13 22:30:26 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
bh230tw3168,2,2,ibm_us_south,ok,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805719,2023-04-13 01:25:54 UTC,2023-04-13 22:30:33 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243

Case 3 Remediation: Step 1 - Compare the ZipPart and endpoint checksums

Use a computer to compare the checksums, not your eyeballs. You can do it in the rails console:

> s1 = 'ef96d8272c1e18a36723e68fb9b34243' # <- cut and pasted from ZipPart checksum column
> s2 = 'ef96d8272c1e18a36723e68fb9b34243' # <- cut and pasted from endpoint checksum column
> s1 == s2
 => true

If the checksums do NOT match, it is not a Case 3 problem - go to Case 4.
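The same comparison can be made directly on the diagnosis output instead of on pasted strings. This is a sketch with a hypothetical function name; it assumes the "zip part md5" and "zip part endpoint md5" columns hold hex digests that may differ only in case or surrounding whitespace.

```ruby
require "csv"

# For a row that is unreplicated but found at the endpoint, returns :case_3 when the
# checksums match and :case_4 when they do not. Returns nil for any other row.
def checksum_case(row)
  return nil unless row["zip part status"] == "unreplicated" &&
                    row["zip part endpoint status"] == "found at endpoint"

  stored = row["zip part md5"].to_s.strip.downcase
  at_endpoint = row["zip part endpoint md5"].to_s.strip.downcase
  stored == at_endpoint ? :case_3 : :case_4
end

SAMPLE = <<~OUTPUT
  druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
  bh230tw3168,2,2,aws_s3_east_1,unreplicated,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805720,2023-04-13 01:25:54 UTC,2023-04-13 01:25:54 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
  bh230tw3168,2,2,aws_s3_west_2,ok,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805718,2023-04-13 01:25:54 UTC,2023-04-13 22:30:26 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
OUTPUT

CSV.parse(SAMPLE, headers: true).map(&:to_h).each do |row|
  puts "zip part #{row['zip part id']}: #{checksum_case(row).inspect}"
end
```

Rows that come back as nil are either already "ok" or belong to Case 1 or Case 2.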

Case 3 Remediation: Step 2 - Update the ZipPart Status

The ZipPart id is in the diagnose_replication output for the druid. Let's say it is 262805720.

From the rails console:

> z = ZipPart.find(262805720)
 => 
#<ZipPart:0x000055ea91b55de8
 id: 262805720,
 size: 45844,
 zipped_moab_version_id: 33360781,
 created_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
 updated_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
 md5: "ef96d8272c1e18a36723e68fb9b34243",
 create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/bh/230/tw/3168/bh230tw3168.v0002.zip bh230tw3168/v0002\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
 parts_count: 1,
 suffix: ".zip",
 status: "unreplicated",
 last_existence_check: nil,
 last_checksum_validation: nil> 
> z.status = "ok"              # <-- updating the status to 'ok'
 => "ok" 
> z.save                       # <-- saving the updated status
 => true 
> z = ZipPart.find(262805720)  # <-- double checking the object has been updated
 => 
#<ZipPart:0x000055ea91b55de8
 id: 262805720,
 size: 45844,
 zipped_moab_version_id: 33360781,
 created_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
 updated_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
 md5: "ef96d8272c1e18a36723e68fb9b34243",
 create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/bh/230/tw/3168/bh230tw3168.v0002.zip bh230tw3168/v0002\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
 parts_count: 1,
 suffix: ".zip",
 status: "ok",
 last_existence_check: nil,
 last_checksum_validation: nil>

You can also double check with the diagnose_replication rake task.

Case 4: a ZipPart status is "unreplicated" but it is "found at endpoint" and the checksums don't match

Case 4 Remediation: Step 1 - Delete the Bad Zips

If any of the zip parts are unreplicated but found at the endpoint with a checksum that does not match, then delete those zips from the endpoint. See S3 Deletions.
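To build the list of zips to delete, the diagnosis output can be filtered for rows in this state and grouped by endpoint. The sketch below only produces a list of keys; it does not delete anything. The function name is hypothetical, and the sample row is the Case 3 row with its endpoint checksum altered to zeros purely for illustration.

```ruby
require "csv"

# Returns { endpoint => [s3 keys] } for rows that are unreplicated, found at the
# endpoint, and whose endpoint checksum differs from the stored checksum.
def zips_to_delete(csv_text)
  bad_rows = CSV.parse(csv_text, headers: true).map(&:to_h).select do |row|
    row["zip part status"] == "unreplicated" &&
      row["zip part endpoint status"] == "found at endpoint" &&
      row["zip part md5"] != row["zip part endpoint md5"]
  end
  bad_rows.group_by { |row| row["endpoint"] }
          .transform_values { |rows| rows.map { |row| row["zip part s3 key"] } }
end

# Illustrative data only: the endpoint md5 on the first row is made up to force a mismatch.
SAMPLE = <<~OUTPUT
  druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
  bh230tw3168,2,2,aws_s3_east_1,unreplicated,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805720,2023-04-13 01:25:54 UTC,2023-04-13 01:25:54 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,00000000000000000000000000000000
  bh230tw3168,2,2,aws_s3_west_2,ok,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805718,2023-04-13 01:25:54 UTC,2023-04-13 22:30:26 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
OUTPUT

zips_to_delete(SAMPLE).each do |endpoint, keys|
  keys.each { |key| puts "#{endpoint}: #{key}" }
end
```

The keys listed per endpoint are the inputs for the deletion procedure referenced above.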

Case 4 Remediation: Step 2 - Prune the bad data from the database

Prune the database records for the zip_parts that are no longer at the endpoints.

There is a rake task for this, but it will delete ZipPart and ZippedMoabVersion for ALL the endpoints. If you don't wish to do this, you will need to read the code to determine how to remove the database records only for the endpoint(s) needed.

There are three arguments to the prune_failed_replication rake task: the druid, the version to prune, and a flag controlling whether recent records are protected from pruning (a recent record may belong to a replication that is still in progress or waiting in the Sidekiq retry queue). By default, the task does not prune any recent ZippedMoabVersions, to avoid deleting records whose jobs may still be in process. To override this, pass "false" as the last argument:

$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[bn435ff2092,2,false]
pruned zipped moab version 2 on aws_s3_west_2
pruned zipped moab version 2 on ibm_us_south
pruned zipped moab version 2 on aws_s3_east_1

Note that pruning provides a message for each record pruned.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.

Case 4 Remediation: Step 3 - Backfill the pruned database records and initiate replication

Re-run the replication audit for the druid with rake. We have backfill on by default, so once the database records are removed, the audit will do the replication.

$ RAILS_ENV=production bin/rake prescat:audit:replication_single[dn073hg2651]

This will report errors to Honeybadger.

NOTE: If there are a lot of versions in your object, or if the files are large (the diagnose_replication output has a size column), you might want to run only the needed backfill with this rake task:

RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]

Replication jobs may take some amount of time to complete.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the replication worked.