Overview - sul-dlss/preservation_catalog GitHub Wiki

Preservation Catalog Overview

What does Preservation Catalog do?

  • Maintains a catalog of all objects on all known storage roots
    • A storage root contains a druid tree. Each full druid path contains a Moab. Example storage roots are available in spec/fixtures/.
      • Moab is our preservation format. It uses forward-delta versioning to minimize archive size. Each version directory has metadata about its contents, including checksums that can be used for fixity checking. More information on the Moab preservation format is available here: http://journal.code4lib.org/articles/8482
  • Regularly and continuously checks the integrity of the Moabs it manages, including validation of Moab directory structure and fixity checking of content.
  • Replicates Moabs to S3 compatible cloud buckets.
    • Our robot infrastructure messages Preservation Catalog when an object has landed on a storage root after ingest or versioning; Pres Cat then replicates the object.
      • If needed, prior versions of an object will be backfilled when the current version is replicated.
      • If needed, missing parts will be backfilled by a replication audit.
    • Much of the replication happens automatically via ActiveRecord hooks -- e.g. when a CompleteMoab is created, the associated ZippedMoabVersion records are automatically created. Upon creation of those records, a job is queued for each, so that a zip file for each version eventually gets created. After that, each worker calls the next worker in the chain upon success (ZipmakerJob -> PlexerJob -> [S3WestDeliveryJob, IbmSouthDeliveryJob] -> ResultsRecorderJob).
      • The worker that makes zips and the workers that perform delivery are intentionally designed not to need the DB. Though they are Rails workers in the pres cat app at the moment, we could rewrite them in another codebase later if we felt that would be more performant. PlexerJob and ResultsRecorderJob must query and update the DB, and so have to know about it.
  • Regularly and continuously audits whether each Moab it knows about is also replicated to the endpoints specified by the applicable preservation policy.
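The forward-delta versioning and fixity checking described above can be sketched in plain Ruby: each new version records checksums for all of its files but physically stores only files that are new or changed relative to the prior version, and the recorded checksums support later fixity checks. This is an illustrative simplification, not the actual Moab implementation; the module and method names are made up.

```ruby
require "digest"

# Sketch of forward-delta versioning (illustrative only -- not the real Moab
# code). Each version's manifest covers every file; `stored` lists only the
# files whose content is new or changed versus the previous version.
module DeltaSketch
  # versions: array of hashes mapping file path => file content
  # returns: one { manifest:, stored: } hash per version
  def self.build(versions)
    prev_manifest = {}
    versions.map do |files|
      manifest = files.transform_values { |content| Digest::MD5.hexdigest(content) }
      # store only files whose checksum differs from the prior version
      stored = manifest.keys.select { |path| manifest[path] != prev_manifest[path] }
      prev_manifest = manifest
      { manifest: manifest, stored: stored }
    end
  end

  # Fixity check: recompute checksums and compare against the recorded manifest.
  def self.fixity_ok?(files, manifest)
    manifest.all? { |path, md5| Digest::MD5.hexdigest(files[path].to_s) == md5 }
  end
end
```

For example, if version 2 changes only b.txt, then `DeltaSketch.build` reports only b.txt in version 2's `stored` list, while its manifest still covers every file.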
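The replication chain above (ZipmakerJob -> PlexerJob -> [S3WestDeliveryJob, IbmSouthDeliveryJob] -> ResultsRecorderJob) can be sketched as workers that each invoke the next step on success. The real app wires this up with ActiveJob/Sidekiq queues and ActiveRecord; the sketch below is plain Ruby with a shared log standing in for the queues and the DB, and the example druid is hypothetical.

```ruby
# Plain-Ruby sketch of the replication worker chain. In the real app each
# "perform" would be an enqueued ActiveJob; here each step just logs its work
# and calls the next step directly.
LOG = []

class ResultsRecorderJob
  def self.perform(druid, version, endpoint)
    LOG << [:recorded, druid, version, endpoint] # in the real app: update the DB
  end
end

class S3WestDeliveryJob
  def self.perform(druid, version)
    LOG << [:delivered, druid, version, :s3_west]
    ResultsRecorderJob.perform(druid, version, :s3_west)
  end
end

class IbmSouthDeliveryJob
  def self.perform(druid, version)
    LOG << [:delivered, druid, version, :ibm_south]
    ResultsRecorderJob.perform(druid, version, :ibm_south)
  end
end

class PlexerJob
  def self.perform(druid, version)
    LOG << [:plexed, druid, version] # record zip parts, then fan out to deliverers
    [S3WestDeliveryJob, IbmSouthDeliveryJob].each { |job| job.perform(druid, version) }
  end
end

class ZipmakerJob
  def self.perform(druid, version)
    LOG << [:zipped, druid, version] # make the zip for this version
    PlexerJob.perform(druid, version)
  end
end

ZipmakerJob.perform("bj102hs9687", 1) # hypothetical druid
```

Note how only ResultsRecorderJob (and, in the real app, PlexerJob) touches the DB; the zip-making and delivery steps need nothing but their arguments, which is what makes them candidates for reimplementation elsewhere.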

What's the stack?

  • Postgres: The database of record. Has metadata about storage locations (Moab Storage Roots) and replication endpoints (Zip Endpoints), as well as the policy info for which Moabs should get replicated to which archive endpoints. Also tracks the state of all the Moabs on all the known Moab Storage Roots, as well as their replicated copies on Zip Endpoints.
  • Rails: We use ActiveRecord heavily for interacting with Postgres, ActiveJob heavily for implementing workers, and a bit of the web stack for receiving info from workers.
  • Redis, Sidekiq: For managing workers.

What sends info to pres cat?

The robots notify pres cat when an object is ingested or versioned, so that it can be added to the catalog, and eventually replicated to the appropriate cloud providers.

What does pres cat send info to?

  • The replication results queue.
  • Workflows
    • Solely for the purpose of exposing audit results. Pres Cat does not act directly on workflow states (though it may act on input from robots that do).

What can a user do manually with pres cat?

  • View the dashboard
  • Query the catalog to see what state a particular object or set of objects is in, how many objects are in a given state for a given storage root, how many objects are in a given state for a given replication endpoint, etc.
  • Integrity check individual objects, lists of objects, objects on a given storage root, etc.
  • Replicate objects. This should only be a manual process when backfilling versions for objects that were added to the catalog before an applicable Zip Endpoint was added.
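The queries above boil down to looking up and counting cataloged objects by status. In the real app this is ActiveRecord against Postgres; the following in-memory sketch just illustrates the shape of those queries, with made-up druids, statuses, and storage-root names.

```ruby
# In-memory sketch of the kinds of status queries the catalog supports.
# (The real app queries Postgres via ActiveRecord; all values here are
# made up for illustration.)
CatalogEntry = Struct.new(:druid, :storage_root, :status)

ENTRIES = [
  CatalogEntry.new("bj102hs9687", "services-disk01", "ok"),
  CatalogEntry.new("bz514sm9647", "services-disk01", "invalid_checksum"),
  CatalogEntry.new("jj925bx9565", "services-disk02", "ok")
].freeze

# What state is a particular object in?
def status_for(entries, druid)
  entries.find { |e| e.druid == druid }&.status
end

# How many objects are in each state on a given storage root?
def status_counts(entries, storage_root)
  entries.select { |e| e.storage_root == storage_root }
         .group_by(&:status)
         .transform_values(&:size)
end
```

The same grouping could be run per Zip Endpoint instead of per storage root to answer the replication-side questions.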