Data requirements - pod4lib/aggregator GitHub Wiki

POD accepts data in several different formats for MARC full dumps/deltas and deletes.

General requirements

All data MUST be syntactically valid.
All data provided by data contributors is understood to comply with the POD Data Provider & Usage Framework.
Data uploads are packaged as files, and placed in streams. See Streams and their files for more details on the interpretation of files in streams.
We recommend contributing files that contain no more than 200,000 records per file for more efficient processing by POD.
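
One way to stay under the recommended file size is to split a large MARC21 binary export before uploading. The following is a minimal sketch, not part of POD itself: the function names `iter_marc_records` and `chunk_records` are hypothetical, and it assumes the input is well-formed MARC21 binary in which leader positions 00-04 hold each record's length. It does not try to keep oversized records that were split across several MARC records in the same chunk.

```python
from typing import Iterator, List


def iter_marc_records(data: bytes) -> Iterator[bytes]:
    """Yield each raw MARC21 record, using the length stored in the leader."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 5])  # leader positions 00-04: record length
        if length <= 0:
            raise ValueError(f"invalid record length at byte {pos}")
        yield data[pos:pos + length]
        pos += length


def chunk_records(data: bytes, max_records: int = 200_000) -> List[bytes]:
    """Group records into byte strings holding at most max_records each."""
    chunks: List[bytes] = []
    current: List[bytes] = []
    for record in iter_marc_records(data):
        current.append(record)
        if len(current) == max_records:
            chunks.append(b"".join(current))
            current = []
    if current:
        chunks.append(b"".join(current))
    return chunks
```

Each returned chunk can then be written to its own file and uploaded separately.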

MARC data

Primary requirements

All MARC records MUST contain a 001 field with a unique record identifier. The record identifier MUST be unique within an institution. The number should be your ILS's system number for the record, not an OCLC number.
MARC data MAY contain non-standard fields or subfield codes. These fields or subfields MAY be removed or normalized for downstream use.
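
The 001 requirement can be checked locally before upload. The sketch below is an illustration only, with hypothetical function names (`control_number`, `check_001`). It assumes well-formed MARC21 binary input and reads the directory by hand instead of relying on a MARC library. It reports every repeated 001 value, so oversized records that were deliberately split into adjacent records sharing a 001 will also appear as duplicates.

```python
from collections import Counter
from typing import Iterator, List, Optional, Tuple

FIELD_TERMINATOR = b"\x1e"  # ends the directory and every field


def iter_marc_records(data: bytes) -> Iterator[bytes]:
    """Yield each raw MARC21 record, using the length stored in the leader."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 5])  # leader positions 00-04
        if length <= 0:
            raise ValueError(f"invalid record length at byte {pos}")
        yield data[pos:pos + length]
        pos += length


def control_number(record: bytes) -> Optional[str]:
    """Return the value of the first 001 field, or None if there is none."""
    base = int(record[12:17])        # leader positions 12-16: base address of data
    directory = record[24:base - 1]  # drop the terminator that ends the directory
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]  # tag (3) + field length (4) + start (5)
        if entry[0:3] == b"001":
            start = base + int(entry[7:12])
            field = record[start:start + int(entry[3:7])]
            return field.rstrip(FIELD_TERMINATOR).decode("utf-8")
    return None


def check_001(data: bytes) -> Tuple[int, List[str]]:
    """Return (number of records without a 001, sorted list of repeated 001s)."""
    ids = [control_number(r) for r in iter_marc_records(data)]
    missing = sum(1 for i in ids if not i)
    counts = Counter(i for i in ids if i)
    return missing, sorted(i for i, n in counts.items() if n > 1)
```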

Full dump formats

Format	Notes
MARC21 binary	Records SHOULD use the UTF-8 character set. Records longer than the 99,999 byte limit MAY be split into multiple MARC records as long as they are physically adjacent in the file and have the same MARC 001 value.
MARC21 binary; gzipped	See MARC21 binary
MARC21 binary; chunked	Multiple files MAY be concatenated together into a single file or uploaded as separate files
MARCXML	The file MUST be valid XML and use the MARC21 XML namespace (`http://www.loc.gov/MARC21/slim`). The file MUST start with an XML declaration (e.g. `<?xml version="1.0" ?>`)
MARCXML; gzipped	See MARCXML
MARCXML; chunked	See MARCXML
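
The two MARCXML requirements in the table above (an XML declaration at the start, and use of the MARC21 XML namespace) can be checked with the Python standard library. This is a rough sketch under stated assumptions: `check_marcxml` is a hypothetical helper, it loads the whole file into memory, and it only tests well-formedness and namespace use, not validity against the MARCXML schema.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def check_marcxml(data: bytes) -> dict:
    """Report declaration presence and record counts for a MARCXML document."""
    body = data[3:] if data.startswith(b"\xef\xbb\xbf") else data  # skip UTF-8 BOM
    has_declaration = body.startswith(b"<?xml")

    root = ET.fromstring(data)  # raises ET.ParseError if not well-formed

    # Records in the MARC21 namespace are the ones expected to be counted.
    namespaced = sum(1 for _ in root.iter(f"{{{MARC_NS}}}record"))
    # Records with no namespace at all are a likely cause of "Records count is 0".
    unnamespaced = sum(1 for _ in root.iter("record"))

    return {
        "has_declaration": has_declaration,
        "namespaced_records": namespaced,
        "unnamespaced_records": unnamespaced,
    }
```

A file that reports zero namespaced records but some un-namespaced ones most likely needs `xmlns="http://www.loc.gov/MARC21/slim"` on its root element.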

Common errors

Error message	Description
Records count is 0	If you provided a MARCXML file, check that the document declares and uses the MARCXML namespace (http://www.loc.gov/MARC21/slim)
XML parsing error: XML declaration allowed only at the start of the document	Some systems export "MARCXML" as concatenated XML files. Ensure your XML file is valid.
XML parsing error: Unescaped '<' not allowed in attributes values	Some systems fail to perform XML encoding on tags or subfield codes. Ensure your XML file is valid.
XML parsing error: Input is not proper UTF-8, indicate encoding !Bytes: 0xA0 0x4D 0x75 0x73	This is likely caused by non-UTF-8 data appearing in MARC21 records (even ones that claim to use UTF-8). Correct any character encoding issues present in the file.
MARC::DataField objects can't have ControlField tag '000')	MARC fields 000 - 009 MUST be control fields, and MARC fields 010 - 999 MUST be data fields
unacceptable file format	File cannot be identified as MARCXML, MARC21, or a delete. Ensure the files you upload conform to the data specifications.
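
Encoding problems like the UTF-8 error above can often be located before upload. The sketch below is one possible approach, not POD's own validation: `invalid_utf8_records` is a hypothetical helper that assumes well-formed MARC21 binary input and treats a record as claiming UTF-8 when leader position 09 is `a`. Records that do not claim UTF-8 are skipped.

```python
from typing import Iterator, List, Tuple


def iter_marc_records(data: bytes) -> Iterator[bytes]:
    """Yield each raw MARC21 record, using the length stored in the leader."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 5])  # leader positions 00-04
        if length <= 0:
            raise ValueError(f"invalid record length at byte {pos}")
        yield data[pos:pos + length]
        pos += length


def invalid_utf8_records(data: bytes) -> List[Tuple[int, int]]:
    """Return (record index, byte offset within the record) for each record
    that claims UTF-8 but contains bytes that do not decode as UTF-8."""
    problems: List[Tuple[int, int]] = []
    for index, record in enumerate(iter_marc_records(data)):
        if record[9:10] != b"a":  # leader position 09: 'a' = UCS/Unicode
            continue
        try:
            record.decode("utf-8")
        except UnicodeDecodeError as err:
            problems.append((index, err.start))  # offset of the first bad byte
    return problems
```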

Record delete data

The Data Lake also supports the ability to upload a "delete" file for a stream. This delete file will specify MARC records that have been deleted.

Delete formats

Format	Notes
`text/plain`	A newline-delimited text file, either uploaded via the application user interface with a file name ending in `.del.txt`, `.del`, or `.delete`, or uploaded via the API (e.g. using curl) with the `text/plain` MIME type (e.g. `curl -F 'upload[files][]=@<file>;type=text/plain'`, where `<file>` stands for the path to your delete file). Each line should consist of the MARC 001 identifier of a record that should be deleted. POD does not support compressed `text/plain` deletes.
MARC21 binary (`application/marc`)	File with a `.mrc` file extension. A record is marked as deleted by using `d` in position `05` of the MARC Leader. At a minimum, each record should also contain a MARC 001 field identifying the record to delete. Deletes sent as MARC21 binary records may be included in a file with added and updated records. See the other MARC data notes above.
MARCXML (`application/marcxml+xml`)	File with a `.xml` file extension. A record is marked as deleted by using `d` in position `05` of the MARC Leader. At a minimum, each record should also contain a MARC 001 field identifying the record to delete. Deletes sent as MARCXML records may be included in a file with added and updated records. See the other MARC data notes above.
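
For the `text/plain` delete format, the file body is just one MARC 001 identifier per line. The following sketch shows one way to produce it. The helper name `build_delete_body` is hypothetical, and ending every line (including the last) with a newline and dropping repeated identifiers are assumptions made here, not documented POD requirements.

```python
from typing import Iterable, List


def build_delete_body(record_ids: Iterable[str]) -> str:
    """Return newline-delimited text with one MARC 001 identifier per line."""
    seen = set()
    lines: List[str] = []
    for record_id in record_ids:
        if not record_id or "\n" in record_id or "\r" in record_id:
            # An empty or multi-line value could not be read back as one identifier.
            raise ValueError(f"unusable identifier: {record_id!r}")
        if record_id in seen:
            continue  # assumption: listing an identifier once is enough
        seen.add(record_id)
        lines.append(record_id)
    return "".join(line + "\n" for line in lines)
```

The result can be written uncompressed to a file whose name ends in `.del.txt`, `.del`, or `.delete` before uploading.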
