Streams and their files - pod4lib/aggregator GitHub Wiki

Streams and their files

This guidance (in particular, the expectation that a default stream will include a full dump) can now be considered a standard expectation for POD data providers. We may continue to update it with clarifications, examples, and/or diagrams, but the basic requirements will continue to hold. See the "Getting Help" section of our Wiki Home if you have questions.

Streams represent a set of records associated with an organization. For the POD-Reshare project, the default stream is assumed to represent the full holdings an organization intends to expose to Reshare and other peers. The full set of holdings is represented by the files in the stream that contain new records, changed records, and deleted records, processed in the order in which the files were uploaded. Default streams, and the mechanics of creating streams and uploading files to them, are further discussed under Uploading data using the API.

Full dumps and deltas (files with changed, added, and/or deleted records) are displayed in the aggregator user interface with their timestamps, ordered from oldest to newest. Likewise, ResourceSync's response to a request for a resource list of normalized data lists the available full dumps and deltas from oldest to newest, and includes a lastmod field whose timestamp can be used to determine processing order.
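As a minimal sketch of how a consumer might derive processing order from a resource list, the Python below parses a Sitemap-style ResourceSync document and sorts its entries by lastmod. It assumes that each entry carries its own loc and lastmod elements in the standard Sitemap namespace; the sample URLs and the function name are hypothetical and not taken from the aggregator.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Standard Sitemap namespace, which ResourceSync documents build on.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical resource list; the URLs are placeholders.
SAMPLE_RESOURCE_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.org/delta-2.xml.gz</loc>
    <lastmod>2021-03-01T00:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.org/full-dump.xml.gz</loc>
    <lastmod>2021-01-01T00:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.org/delta-1.xml.gz</loc>
    <lastmod>2021-02-01T00:00:00Z</lastmod>
  </url>
</urlset>
"""


def resources_in_processing_order(resource_list_xml):
    """Return (lastmod, loc) pairs sorted from oldest to newest."""
    root = ET.fromstring(resource_list_xml)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None or lastmod is None:
            continue  # skip entries that cannot be ordered
        # Normalize a trailing "Z" so older Python versions can parse it.
        stamp = datetime.fromisoformat(lastmod.strip().replace("Z", "+00:00"))
        entries.append((stamp, loc.strip()))
    return sorted(entries)


ordered_locs = [loc for _stamp, loc in resources_in_processing_order(SAMPLE_RESOURCE_LIST)]
```

Sorting explicitly by lastmod, rather than trusting document order, keeps the consumer correct even if a list is ever returned in a different order.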

The order of processing of multiple files embedded in the same uploaded file (such as a tar or zip file) is not well-defined. We therefore recommend only including a single file in an uploaded file's package unless the uploader can be sure that none of the files have any record IDs in common. For processing efficiency, we also recommend that file packagings be processable as a stream (gzipped files are; tar files might not be), so that they do not need to be completely unpacked or parsed before they can be processed.
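To illustrate what "processable as a stream" means in practice, here is a rough sketch that reads record identifiers out of a gzipped MARCXML file one record at a time, without decompressing or parsing the whole file first. It assumes the record ID lives in the 001 control field, which is an assumption made for illustration; the function name and sample data are hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def iter_record_ids(gz_stream):
    """Yield the 001 control number of each MARCXML record as it is parsed."""
    # gzip.open decompresses incrementally; iterparse consumes it incrementally.
    with gzip.open(gz_stream, "rb") as xml_stream:
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if elem.tag == f"{{{MARC_NS}}}record":
                field = elem.find(f"{{{MARC_NS}}}controlfield[@tag='001']")
                yield field.text if field is not None else None
                elem.clear()  # drop the record's children to limit memory use


# Hypothetical two-record file, gzipped in memory for the example.
sample_xml = (
    f'<collection xmlns="{MARC_NS}">'
    '<record><controlfield tag="001">a1</controlfield></record>'
    '<record><controlfield tag="001">a2</controlfield></record>'
    "</collection>"
)
packaged = io.BytesIO(gzip.compress(sample_xml.encode("utf-8")))
record_ids = list(iter_record_ids(packaged))
```

A single gzipped file like this also sidesteps the ordering ambiguity described above, since the package contains only one file.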

Full dumps and incremental updates

Normally, the oldest files in a stream represent a full dump, and later files indicate incremental additions, changes, and deletions. There is, however, no technical difference between files uploaded as part of a full dump, and files uploaded as incremental updates. Accepted formats for uploaded files can be found under Data requirements.

A default stream should not contain only an incremental update. If it does, the incremental update will be interpreted as a full dump. Incremental updates to a full dump should be placed in the same stream as the full dump to be properly understood.

Names of files in streams are considered reusable labels. If a file is uploaded with the same name as a file already in the stream, it will be considered a new file unrelated to the old one, and processed separately based on its arrival date. However, for clarity we recommend that new files be given unique names in the stream if feasible.

A second full dump placed into a stream may not be interpreted as expected unless records that appear in previous files but not in the new full dump are explicitly deleted. For example, a full dump with IDs 1, 2, 3, and 4, followed by a full dump in the same stream with IDs 1, 2, and 5, will be understood as a set of records with IDs 1, 2, 3, 4, and 5 (where the contents of the records with IDs 1, 2, and 5 match their contents in the second full dump, and the contents of the records with IDs 3 and 4 match their contents in the first full dump). Adding a file that deletes IDs 3 and 4 will make the stream understood as the set of records in the second full dump. Alternatively, a new full dump may be uploaded to a new stream, with that stream then made the default, which requires no deletions.
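The example with IDs 1 through 5 can be modeled with a short Python sketch. This is an illustration of the described semantics, not the aggregator's implementation: the UploadedFile class and effective_holdings function are hypothetical names, and applying a file's deletions after its additions is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class UploadedFile:
    """Hypothetical model of one file in a stream."""
    uploaded_at: str  # ISO 8601 UTC, so string order matches time order
    records: dict = field(default_factory=dict)  # record ID -> contents
    deletes: set = field(default_factory=set)    # record IDs to remove


def effective_holdings(files):
    """Fold a stream's files, oldest upload first, into one record set."""
    holdings = {}
    for f in sorted(files, key=lambda f: f.uploaded_at):
        holdings.update(f.records)    # additions and changes overwrite by ID
        for record_id in f.deletes:   # assumption: deletes apply after adds
            holdings.pop(record_id, None)
    return holdings


first_dump = UploadedFile(
    "2021-01-01T00:00:00Z",
    records={"1": "v1", "2": "v1", "3": "v1", "4": "v1"},
)
second_dump = UploadedFile(
    "2021-02-01T00:00:00Z",
    records={"1": "v2", "2": "v2", "5": "v2"},
)
cleanup = UploadedFile("2021-03-01T00:00:00Z", deletes={"3", "4"})

without_deletes = effective_holdings([second_dump, first_dump])
with_deletes = effective_holdings([cleanup, second_dump, first_dump])
```

Without the cleanup file, records 3 and 4 survive with their contents from the first dump; with it, the result matches the second dump exactly. Because the fold sorts on the upload timestamp, the order in which the files are passed in does not matter.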

Errors

An uploaded file's status is exposed only via the UI. The ResourceSync API lists all the uploaded files (including files with problems) and all the normalized full and delta dumps (which would not include files/records/deletes with problems).

Background jobs that encounter processing errors (Sidekiq goes down, uncaught error, etc.) may be retried multiple times before giving up. Generation of full and delta dumps relies on each file's upload timestamp, not on the order in which Sidekiq finishes processing the files.

Files that cannot be identified as MARCXML or MARC21 will be ignored when full dumps and deltas are generated.
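Uploaders who want a rough local check before uploading could use a heuristic like the following Python sketch. It is not the aggregator's detection logic, which is not described here; guess_format is a hypothetical helper that looks for the MARCXML namespace or for a plausible MARC21 leader (five leading digits for the record length and "4500" in leader positions 20 to 23).

```python
MARCXML_NS = b"http://www.loc.gov/MARC21/slim"


def guess_format(head):
    """Guess 'marcxml', 'marc21', or 'unknown' from a file's first bytes."""
    if head.lstrip().startswith(b"<") and MARCXML_NS in head:
        return "marcxml"
    # A MARC21 leader is 24 bytes: record length in 0-4, entry map in 20-23.
    if len(head) >= 24 and head[:5].isdigit() and head[20:24] == b"4500":
        return "marc21"
    return "unknown"


xml_head = b'<collection xmlns="http://www.loc.gov/MARC21/slim"><record>'
binary_head = b"00714cam a2200205 a 4500001001300000"
csv_head = b"id,title,author\n1,Example,Someone\n"

guesses = [guess_format(h) for h in (xml_head, binary_head, csv_head)]
```

Pass in the first few hundred uncompressed bytes of the file. A result of "unknown" suggests the file may be ignored, but a positive guess does not guarantee that the file will be processed without errors.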

Uploaders who are not sure that their files will be processed as intended may want to test them by uploading them to a non-default stream and seeing how they are processed. If files with errors are uploaded to a default stream, uploaders may want to see how they have been interpreted in the normalized files, and then upload further updates to correct any missed or misinterpreted changes. If all else fails, a new full dump can be uploaded to a new stream and that stream then made the default.