ActiveRecord and Replication intro - sul-dlss/preservation_catalog GitHub Wiki

ActiveRecord Usage

As an ORM for the Preservation Catalog's underlying PostgreSQL database, ActiveRecord, combined with the Rails console, gives us a way to interact with Preservation Catalog (to do things like replicate Moabs to cloud archives).

https://guides.rubyonrails.org/active_record_basics.html

Launching Rails Console

After SSHing into the production VM, launch the console:

cd preservation_catalog/current/
bundle exec rails console -e production

Performing checksum validation (CV) on a druid

CompleteMoab.by_druid(druid).each(&:validate_checksums!)

Firing off Moab to Catalog (M2C) on a specific druid

You might want to do this if you find a specific druid that you know has an outdated version in the catalog.

root = MoabStorageRoot.find_by(…)
MoabToCatalogJob.perform_now(root, druid, path) # or perform_later

Sending non-replicated druids to ZipMaker

For the simplest case of backfilling as-yet unreplicated versions of a single Moab, you can kick off replication by pasting code similar to the following into the Rails console:

# by_druid scope could theoretically return multiple rows, though there
# is only one CompleteMoab per druid as of this writing
CompleteMoab.by_druid('jf301dx7536').first.create_zipped_moab_versions!

If you have a list of druids you want to replicate, you could load them up into an array of strings (e.g. by reading the lines out of a text file), and do the following:

# druid_list is an array of strings representing the druids to replicate. providing
# a list to a parameter in the where clause of an ActiveRecord query will automatically
# generate SQL with an IN clause query instead of a single-value equality comparison.
CompleteMoab.by_druid(druid_list).find_each(&:create_zipped_moab_versions!)
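
As a minimal sketch of building druid_list from a text file, the helper below reads one druid per line, strips surrounding whitespace and an optional "druid:" prefix, and drops blank lines and duplicates. The helper name load_druid_list is hypothetical (it is not part of Preservation Catalog), and the one-druid-per-line file layout is an assumption:

```ruby
# Hypothetical helper (not part of Preservation Catalog): build an array of
# bare druid strings from a text file that lists one druid per line.
def load_druid_list(path)
  File.readlines(path, chomp: true)
      .map { |line| line.strip.delete_prefix('druid:') } # tolerate "druid:xx..." entries
      .reject(&:empty?)                                  # skip blank lines
      .uniq                                              # avoid asking for the same druid twice
end
```

The resulting array could then be passed to the by_druid scope as shown above, e.g. druid_list = load_druid_list('druids_to_replicate.txt'), where the file name is only an example.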

@julianmorley asked in Slack how to do the following:

sending zipmaker only druids that prescat doesn't think are replicated to us-west-2, limited to druids < 10G, limit 10K druids

The following Ruby code can be pasted into the Rails console and run:

moab_size_limit = 10_000_000_000 # in bytes
result_limit = 10_000
endpoint = ZipEndpoint.find_by!(endpoint_name: 'aws_s3_west_2')
CompleteMoab.where.not(
  id: endpoint.zipped_moab_versions.select(:complete_moab_id)
).where("size < ?", moab_size_limit).limit(result_limit).find_each(&:create_zipped_moab_versions!)

Alternatively, you could split the above into two separately run queries, since it's fine to fetch the list of CompleteMoab IDs to exclude once, before the outer query runs. It's generally better practice to get everything into one query if you can, but Julian ran into some performance trouble with the all-in-one version, and a little kludging is fine for one-off manual scripting:

moab_size_limit = 10_000_000_000 # in bytes
result_limit = 10_000
endpoint = ZipEndpoint.find_by!(endpoint_name: 'aws_s3_west_2')
# id_list could come from anywhere, e.g. the result of endpoint.zipped_moab_versions.pluck(:complete_moab_id).
# note the `pluck` on the separate query, which returns an array. we don't want to pass the AR relation into the
# outer query, which is the same thing the above is doing; we want to pass an already retrieved list of ids instead.
id_list = endpoint.zipped_moab_versions.pluck(:complete_moab_id).uniq
CompleteMoab.where.not(
  id: id_list
).where("size < ?", moab_size_limit).limit(result_limit).find_each(&:create_zipped_moab_versions!)

Explanation of what is happening

Using ActiveRecord, the create_zipped_moab_versions! call in the Ruby code above creates the ZippedMoabVersion records in the DB that represent the things we're shipping off to the cloud. A hook on the ZippedMoabVersion model then initiates the replication process by invoking the zip maker job.
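
The "create a record, and a hook enqueues a job" pattern can be sketched in plain Ruby as below. This is only an illustration of the idea: the class name SketchZippedMoabVersion, the JOB_QUEUE array, and the :zipmaker symbol are hypothetical, and the real model uses ActiveRecord callbacks and a background job whose details may differ.

```ruby
# Hypothetical plain-Ruby model of an "after create" hook. It is NOT the real
# ZippedMoabVersion model; it only shows that creating a record triggers a job.
class SketchZippedMoabVersion
  JOB_QUEUE = [] # stand-in for the background job queue

  attr_reader :druid, :version

  # Build the record, then run the hook, as a create callback would.
  def self.create!(druid:, version:)
    record = new(druid, version)
    record.send(:after_create_hook)
    record
  end

  def initialize(druid, version)
    @druid = druid
    @version = version
  end

  private

  # Stand-in for the hook that invokes the zip maker job.
  def after_create_hook
    JOB_QUEUE << [:zipmaker, druid, version]
  end
end
```

Calling SketchZippedMoabVersion.create!(druid: 'jf301dx7536', version: 2) in this sketch leaves one entry in JOB_QUEUE, which is the analogue of replication being kicked off as a side effect of record creation.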

The following SQL for selecting the druids that meet these conditions was automatically generated by ActiveRecord (from the all-in-one query with the AR select subquery):

-- just for illustration/context, the Rails console must be used to initiate replication
SELECT  "complete_moabs".* 
FROM "complete_moabs" 
WHERE ("complete_moabs"."id" NOT IN (
          SELECT "zipped_moab_versions"."complete_moab_id" 
          FROM "zipped_moab_versions" 
          WHERE "zipped_moab_versions"."zip_endpoint_id" = 1)) 
AND (size < 10000000000) 
LIMIT 10000
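
To make the selection logic concrete, the following plain-Ruby sketch mimics the query on in-memory data instead of the database. The Moab and ZippedVersion structs, the size_in_bytes field, and the select_unreplicated method are hypothetical stand-ins, not Preservation Catalog classes. As with the NOT IN subquery, a Moab is skipped as soon as it has at least one zipped version for the endpoint, and zipped versions for other endpoints do not count.

```ruby
# Hypothetical in-memory stand-ins for rows of complete_moabs and
# zipped_moab_versions; these are not the real ActiveRecord models.
Moab = Struct.new(:id, :size_in_bytes)
ZippedVersion = Struct.new(:complete_moab_id, :zip_endpoint_id)

# Mirrors the SQL: id NOT IN (ids zipped for the endpoint) AND size < limit, LIMIT n.
def select_unreplicated(moabs, zipped_versions, endpoint_id:, size_limit:, result_limit:)
  replicated_ids = zipped_versions
                   .select { |zv| zv.zip_endpoint_id == endpoint_id }
                   .map(&:complete_moab_id)
                   .uniq
  moabs.reject { |m| replicated_ids.include?(m.id) } # the NOT IN subquery
       .select { |m| m.size_in_bytes < size_limit }  # the size condition
       .first(result_limit)                          # the LIMIT
end

moabs = [Moab.new(1, 5_000), Moab.new(2, 20_000_000_000), Moab.new(3, 7_000), Moab.new(4, 9_000)]
zipped = [ZippedVersion.new(1, 1), ZippedVersion.new(3, 2)]

picked = select_unreplicated(moabs, zipped, endpoint_id: 1,
                             size_limit: 10_000_000_000, result_limit: 10_000)
puts picked.map(&:id).inspect # prints [3, 4]
```

Moab 1 is excluded because it already has a zipped version for endpoint 1, Moab 2 is excluded for exceeding the size limit, and Moab 3 is kept because its only zipped version belongs to a different endpoint.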

Find out which disk a moab is on

PreservedObject.find_by(druid: 'th060bv0250').complete_moabs.first.moab_storage_root.name