# Zip Creation

_sul-dlss/preservation_catalog GitHub Wiki_

## Use Case
Our sole use case is atypical: uncompressed segmented files. Most people want compression; we don't. Given the criticality of long-term archival preservation to this app, we also care more than most users about the stability of how the zipfiles are generated. Ideally, the same binary output (and thereby the same checksums) can be achieved by this app for decades.
## Zip Parts
Product ownership required large objects to be segmented into 10GB "zip parts", essentially to compartmentalize any momentary failure, e.g. to prevent uploading (or checksumming) 349 of 350 GB before failing, achieving nothing at great cost. The `aws-sdk-s3` library already supports multi-part, multi-threaded uploads/downloads, so this concern was more about logic and performance on our side of the application. That parallelism is still used, but now with another level of parallelism across segments/parts.
Using zip parts introduces logical complexity that did not exist previously. In addition to simply requiring more commands (more connections) to complete a single upload, it complicates subsequent questions of status, completeness and retrieval. To enable a multi-part object to be reconstructed successfully, we must store additional metadata about parts.
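To make this concrete, the kind of per-part metadata needed for reconstruction can be sketched as follows (the struct and field names here are illustrative only, not the app's actual schema):

```ruby
# Hypothetical sketch of per-part metadata (field names are illustrative,
# not the app's actual model).
ZipPart = Struct.new(:suffix, :size, :md5, :parts_count, keyword_init: true)

# A set of parts is reconstructable only if all parts agree on parts_count
# and that many parts are actually present.
def reconstructable?(parts)
  counts = parts.map(&:parts_count).uniq
  counts.size == 1 && parts.size == counts.first
end

parts = [
  ZipPart.new(suffix: '.zip', size: 123, md5: 'aaa', parts_count: 2),
  ZipPart.new(suffix: '.z01', size: 456, md5: 'bbb', parts_count: 2)
]
reconstructable?(parts)          # complete set
reconstructable?(parts.take(1))  # incomplete set
```

Storing `parts_count` on every part is what lets any single part answer the completeness question on its own.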
## Threat: Implementation Drift
If the way we build zips (parts) changes on a technical level over time, it is plausible that:
- we get different files (different checksums) for the same input
- we get a different number of parts
We address #1 by modeling the `ZippedMoabVersion` per endpoint.

We address #2 by including additional metadata, including `parts_count`. The `s3_key`s for a complete set of parts can be generated from any one object using this count. See `DruidVersionZip#expected_part_keys`.
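As a sketch of that derivation, assuming Info-ZIP's split naming convention (segments `.z01`, `.z02`, …, with the final segment keeping the `.zip` extension); the real `DruidVersionZip#expected_part_keys` may differ in detail:

```ruby
# Hypothetical re-derivation of the full key set from one key plus parts_count.
# Assumes Info-ZIP split naming: base.z01 .. base.zNN, plus the final base.zip.
def expected_part_keys(zip_key, parts_count)
  return [zip_key] if parts_count == 1
  base = zip_key.sub(/\.zip\z/, '')
  (1...parts_count).map { |n| format('%s.z%02d', base, n) } << zip_key
end

# Illustrative key; the app's actual key layout may differ.
expected_part_keys('bj102hs9687.v0001.zip', 3)
# => ["bj102hs9687.v0001.z01", "bj102hs9687.v0001.z02", "bj102hs9687.v0001.zip"]
```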
## Implementation Choices

The team considered several different implementation options, including:

- `rubyzip` ruby library
- other ruby compression libraries
- `zipruby` ruby bindings for C `libzip`
- `zip` shell executable on system
We opted for the last option: the system `zip` executable. We recognize that shelling out to the system is not normally the preferred pattern, since it introduces performance cost and additional logical complexity. This document explains our motivations.
## Comparisons

- `rubyzip` is "official" and maintained, but volatile. Using it would require us to include substantially more application logic around the construction of zip parts. Given its volatility, future versions of `rubyzip` would likely break the core purpose of the app. Indeed, breaking changes were released even during the period of our active development. That would necessitate pinning `rubyzip` to a known version, introducing an anchor that would eventually prevent other upgrades to the system, including security fixes.
- `zipruby` is unmaintained, with no 1.0 release.
- Other ruby libraries suffer from one or both of the same liabilities.
- System `zip` is unchanged since 2008.
- When creating sizable zipfiles, the additional "cost" of a shell process is negligible.

Basically, the stability, determinism and reproducibility of the common `zip` executable won out.
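For illustration, an uncompressed, split invocation can be built along these lines. The flags are our assumption of what matters for this use case, not necessarily the app's exact command: `-r` recurse, `-0` store only (no compression), `-X` omit extra file attributes that could vary between runs, `-s 10g` split into 10GB segments.

```ruby
# Hypothetical builder for the shell command (the app's actual flags may differ).
# -r:     recurse into the source directory
# -0:     store only (no compression), per the use case above
# -X:     omit extended file attributes, for more reproducible output
# -s 10g: split the archive into 10GB "zip parts"
def zip_command(zip_path, src_dir)
  ['zip', '-r0X', '-s', '10g', zip_path, src_dir].join(' ')
end

zip_command('/tmp/bj102hs9687.v0001.zip', 'bj102hs9687/v0001')
# => "zip -r0X -s 10g /tmp/bj102hs9687.v0001.zip bj102hs9687/v0001"
```

Because the executable itself is stable, a fixed command line like this is what makes byte-identical output plausible across years of runs.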
## Details

Zip files to be preserved are created by `ZipmakerJob` calling `DruidVersionZip#create_zip!`.
We attacked performance issues by parallelizing zip creation via jobs/workers. Additionally, `ZipmakerJob`'s responsibilities are designed to be completely severable from the rest of the system, i.e., the task itself does not require database access, and the worker could be replaced by an optimized implementation in any language that can talk to Resque/Redis.