Data requirements - pod4lib/aggregator GitHub Wiki

POD accepts data in several different formats for MARC full dumps/deltas and deletes.

General requirements

  • All data MUST be syntactically valid.
  • All data provided by data contributors is understood to comply with the POD Data Provider & Usage Framework.
  • Data uploads are packaged as files, and placed in streams. See Streams and their files for more details on the interpretation of files in streams.
  • We recommend contributing files that contain no more than 200,000 records per file for more efficient processing by POD.
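The 200,000-record recommendation above can be met by splitting a large MARC21 binary export before upload. The following is a minimal sketch, not part of POD itself. It assumes the whole export fits in memory and relies only on the MARC21 record terminator byte (0x1D) to find record boundaries. The function names `iter_records` and `chunk_records` are hypothetical.

```python
RECORD_TERMINATOR = b"\x1d"  # MARC21 end-of-record byte


def iter_records(data):
    """Yield each raw MARC21 record, including its trailing terminator."""
    start = 0
    while True:
        end = data.find(RECORD_TERMINATOR, start)
        if end == -1:
            return  # any trailing bytes without a terminator are ignored
        yield data[start:end + 1]
        start = end + 1


def chunk_records(data, max_records=200_000):
    """Yield byte strings that each hold at most max_records records."""
    batch = []
    for record in iter_records(data):
        batch.append(record)
        if len(batch) == max_records:
            yield b"".join(batch)
            batch = []
    if batch:
        yield b"".join(batch)
```

Each yielded chunk can be written to its own file and uploaded separately. Note that this sketch does not keep split records (records sharing a 001 value) together in the same chunk.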

MARC data

Primary requirements

  • All MARC records MUST contain a 001 field with a unique record identifier. The record identifier MUST be unique within an institution. The identifier should be your ILS's system number for the record, not an OCLC number.
  • MARC data MAY contain non-standard fields or subfield codes. These fields or subfields MAY be removed or normalized for downstream use.
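Before uploading, it may help to confirm the 001 requirement locally. The sketch below is one possible pre-flight check for a MARCXML file, using only the Python standard library. It is not POD's own validation, and the function name `check_001` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MARC_NS = "http://www.loc.gov/MARC21/slim"


def check_001(marcxml):
    """Return (count of records missing a 001, sorted duplicated 001 values)."""
    root = ET.fromstring(marcxml)
    missing = 0
    seen = Counter()
    for record in root.iter(f"{{{MARC_NS}}}record"):
        ids = [
            field.text.strip()
            for field in record.findall(f"{{{MARC_NS}}}controlfield")
            if field.get("tag") == "001" and field.text and field.text.strip()
        ]
        if ids:
            seen[ids[0]] += 1
        else:
            missing += 1
    duplicates = sorted(value for value, n in seen.items() if n > 1)
    return missing, duplicates
```

A result of `(0, [])` means every record has a 001 value and no value repeats within the file. Records intentionally split across multiple MARC records share a 001 value, so they would show up as duplicates here.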

Full dump formats

| Format | Notes |
| --- | --- |
| MARC21 binary | Records SHOULD use the UTF-8 character set. Records longer than the 99,999-byte limit MAY be split into multiple MARC records, as long as they are physically adjacent in the file and have the same MARC 001 value. |
| MARC21 binary; gzipped | See MARC21 binary. |
| MARC21 binary; chunked | Multiple files MAY be concatenated together into a single file or uploaded as separate files. |
| MARCXML | The file MUST be valid XML and use the MARC21 XML namespace (http://www.loc.gov/MARC21/slim). The file SHOULD start with an XML declaration (e.g. <?xml version="1.0" ?>). |
| MARCXML; gzipped | See MARCXML. |
| MARCXML; chunked | See MARCXML. |

Common errors

| Error message | Description |
| --- | --- |
| Records count is 0 | If you provided a MARCXML file, check that the document declares and uses the MARCXML namespace (http://www.loc.gov/MARC21/slim). |
| XML parsing error: XML declaration allowed only at the start of the document | Some systems export "MARCXML" as concatenated XML files. Ensure your XML file is valid. |
| XML parsing error: Unescaped '<' not allowed in attributes values | Some systems fail to perform XML encoding on tags or subfield codes. Ensure your XML file is valid. |
| XML parsing error: Input is not proper UTF-8, indicate encoding !Bytes: 0xA0 0x4D 0x75 0x73 | This is likely caused by non-UTF-8 data appearing in MARC21 records (even records that claim to use UTF-8). Correct any character encoding issues present in the file. |
| MARC::DataField objects can't have ControlField tag '000') | MARC fields 000 - 009 MUST be control fields, and MARC fields 010 - 999 MUST be data fields. |
| unacceptable file format | The file cannot be identified as MARCXML, MARC21, or a delete. Ensure the files you upload conform to the data specifications. |
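The first two errors above can often be reproduced locally before uploading. The sketch below uses Python's standard XML parser as a stand-in. It is not the parser POD uses, so the error text will differ, and the function names are hypothetical. It illustrates why a file without the MARC21 slim namespace yields zero records, and why concatenated XML documents fail to parse.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def count_marcxml_records(xml_text):
    """Count <record> elements that are in the MARC21 slim namespace."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter(f"{{{MARC_NS}}}record"))


def is_well_formed(xml_text):
    """Return False if the text is not a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


with_ns = '<collection xmlns="http://www.loc.gov/MARC21/slim"><record/></collection>'
without_ns = "<collection><record/></collection>"

print(count_marcxml_records(with_ns))     # namespaced record is counted
print(count_marcxml_records(without_ns))  # un-namespaced record is not
print(is_well_formed(with_ns + with_ns))  # two documents in one file
```

If the count is 0 for a file you know contains records, the root element is most likely missing the `xmlns` declaration.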

Record delete data

The Data Lake also supports uploading a "delete" file for a stream. A delete file specifies the MARC records that have been deleted.

Delete formats

| Format | Notes |
| --- | --- |
| text/plain | A newline-delimited text file, either uploaded via the application user interface with a file name ending in .del.txt, .del, or .delete, or uploaded via the API (e.g. using curl) with the text/plain MIME type (e.g. curl -F 'upload[files][]=@FILENAME;type=text/plain'). Each line should consist of the MARC 001 identifier of a record to delete. POD does not support compressed text/plain deletes. |
| MARC21 binary (application/marc) | File with a .mrc file extension. A record is marked as deleted by using d in position 05 of the MARC Leader. At a minimum, each record should also contain a MARC 001 field identifying the record to delete. Deletes sent as MARC21 binary records may be included in a file with added and updated records. See the other MARC data notes above. |
| MARCXML (application/marcxml+xml) | File with a .xml file extension. A record is marked as deleted by using d in position 05 of the MARC Leader. At a minimum, each record should also contain a MARC 001 field identifying the record to delete. Deletes sent as MARCXML records may be included in a file with added and updated records. See the other MARC data notes above. |
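The two delete conventions above can be sketched as follows. This is an illustration under stated assumptions rather than POD's implementation. `build_delete_file` and `is_marc21_delete` are hypothetical helper names, and the leader check assumes a raw MARC21 binary record whose first 24 bytes are the Leader.

```python
def build_delete_file(record_ids):
    """Return newline-delimited text with one MARC 001 identifier per line."""
    cleaned = [rid.strip() for rid in record_ids if rid and rid.strip()]
    if not cleaned:
        return ""
    return "\n".join(cleaned) + "\n"


def is_marc21_delete(record):
    """True if Leader position 05 (record status) of a raw record is 'd'."""
    return len(record) >= 24 and record[5:6] == b"d"
```

The text returned by `build_delete_file` can be saved uncompressed with a name ending in .del.txt, .del, or .delete for upload through the user interface.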