Zip Creation

Use Case

Our sole use case is atypical: uncompressed, segmented zip files. Most users want compression; we don't. Given the criticality of long-term archival preservation to this app, we also care more than most users about the stability of how the zipfiles are generated. Ideally, this app should produce the same binary output (and therefore the same checksums) for the same input for decades.

Zip Parts

Product ownership required large objects to be segmented into 10GB "zip parts", essentially to compartmentalize any momentary failure, e.g. to prevent uploading (or checksumming) 349 GB of a 350 GB object before failing, achieving nothing at great cost. The aws-sdk-s3 library already supports multi-part, multi-threaded uploads/downloads, so the concern was more with the logic and performance on our side of the application. That library-level parallelism is still used, but now with another level of parallelism across segments/parts.
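
For illustration, a minimal sketch of the library-level parallelism mentioned above, using the aws-sdk-s3 gem's upload_file helper; the bucket name, key, and local path are placeholders, not the app's actual configuration:

```ruby
require 'aws-sdk-s3'

# upload_file transparently switches to a multipart upload for files above
# multipart_threshold and sends the chunks on parallel threads.
object = Aws::S3::Resource.new
                          .bucket('example-preservation-bucket')
                          .object('bj/102/hs/9687/bj102hs9687.v0001.zip')

object.upload_file('/tmp/bj102hs9687.v0001.zip',
                   multipart_threshold: 100 * 1024 * 1024, # 100 MB
                   thread_count: 10)
```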

Using zip parts introduces logical complexity that did not exist previously. In addition to simply requiring more commands (more connections) to complete a single upload, it complicates subsequent questions of status, completeness and retrieval. To enable a multi-part object to be reconstructed successfully, we must store additional metadata about parts.
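
As a hypothetical illustration (the field names are assumptions, not the app's actual schema), each part carries enough metadata to verify it individually and to know how many sibling parts a complete set contains:

```ruby
part_metadata = {
  suffix: '.z01',                          # which segment of the split this is
  size: 10_737_418_240,                    # bytes, to confirm a complete transfer
  md5: '1b2cf535f27731c974343645a3985328', # fixity for this individual part
  parts_count: 35                          # total segments; identical on every part
}
```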

Threat: Implementation Drift

If the way we build zips and their parts changes at a technical level over time, it is plausible that:

  1. we get different files (different checksums) for the same input
  2. we get a different number of parts

We address #1 by modeling a ZippedMoabVersion per endpoint, so the checksums for each endpoint's copy are recorded independently.

We address #2 by including additional metadata, including parts_count. The S3 keys for a complete set of parts can be generated from any one part using this count. See DruidVersionZip#expected_part_keys.
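
A hypothetical sketch of that derivation, assuming Info-ZIP's split-archive naming convention (earlier segments get .z01, .z02, ..., and the final segment keeps .zip); the actual DruidVersionZip#expected_part_keys may differ in detail:

```ruby
# Derive the full set of S3 keys from any one part's base key plus
# the stored parts_count.
def expected_part_keys(base_key, parts_count)
  return [base_key] if parts_count == 1

  keys = (1...parts_count).map { |n| base_key.sub(/\.zip\z/, format('.z%02d', n)) }
  keys << base_key # the final segment keeps the .zip extension
end

expected_part_keys('bj/102/hs/9687/bj102hs9687.v0001.zip', 3)
# => ["bj/102/hs/9687/bj102hs9687.v0001.z01",
#     "bj/102/hs/9687/bj102hs9687.v0001.z02",
#     "bj/102/hs/9687/bj102hs9687.v0001.zip"]
```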

Implementation Choices

The team considered several different implementation options, including:

  • the rubyzip Ruby library
  • other Ruby compression libraries
  • zipruby (Ruby bindings for the C libzip library)
  • the zip shell executable on the system

We opted for the last of these. We recognize that shelling out to the system is normally not the preferred pattern, since it introduces performance cost and additional logical complexity, so this document explains our motivations.

Comparisons

  • rubyzip is "official" and maintained, but volatile. Using it would require us to include substantially more application logic around the construction of zip parts. Given its volatility, future versions of rubyzip would likely break the core function of the app; indeed, breaking changes were released even during our period of active development. That would necessitate pinning rubyzip to a known version, introducing an anchor that would eventually prevent other upgrades to the system, including security fixes.
  • zipruby is unmaintained and never reached a 1.0 release.
  • other Ruby libraries suffer from one or both of the same liabilities.
  • the system zip executable has been stable since 2008 (the Info-ZIP Zip 3.0 release).
  • When creating sizable zipfiles, the additional "cost" of a shell process is negligible.

In short, the stability, determinism, and reproducibility of the common zip executable won out.

Details

Zip files to be preserved are created by ZipmakerJob calling DruidVersionZip#create_zip!.
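
For illustration only, a sketch of what shelling out looks like; this is not the actual create_zip! implementation, though the Info-ZIP flags shown are the kind that serve the use case described above:

```ruby
require 'open3'

# Illustrative sketch, assuming Info-ZIP's zip(1):
#   -r recurses into the version directory
#   -0 stores files without compression
#   -X omits platform-specific extra file attributes (aids reproducibility)
#   -s 10g splits the archive into 10GB segments
def create_zip!(zip_path, moab_version_path)
  _stdout, stderr, status = Open3.capture3(
    'zip', '-r0X', '-s', '10g', zip_path, moab_version_path
  )
  raise "zip failed for #{zip_path}: #{stderr}" unless status.success?
end
```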

We attacked performance issues by parallelizing zip creation via jobs/workers. Additionally, ZipmakerJob's responsibilities are designed to be completely severable from the rest of the system: the task itself does not require database access, and the worker could be replaced by an optimized implementation in any language that can talk to Resque/Redis.
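
To illustrate that severability (the class internals here are simplified assumptions, not the app's actual code), a Resque-style job needs only Redis connectivity and its payload arguments:

```ruby
class ZipmakerJob
  @queue = :zipmaker

  # Resque calls perform with the args the job was enqueued with;
  # nothing here touches the database.
  def self.perform(druid, version)
    dvz = DruidVersionZip.new(druid, version)
    dvz.create_zip! unless File.exist?(dvz.file_path)
    # Downstream delivery jobs (one per endpoint) would be enqueued here.
  end
end
```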