Harvesting using OAI-PMH

Along with its ResourceSync service, the POD Aggregator also has an OAI-PMH service for harvesting data records. Some portions of the OAI-PMH protocol not strictly necessary for POD-ReShare data consumption are not supported yet.

Consumers can use the OAI-PMH interface to get all the normalized records from a particular provider stream (specified with a set argument), or just to get records that have been added or changed (as well as identifiers of records that have been deleted) since a particular time. Records are returned as MARC XML elements in the XML documents returned by the service. Up to 10,000 records may be returned at a time. (The number of records in a response is subject to change, however.) If the full answer to a request involves more records than fit in an initial response, a resumption token is also returned that a client can use to get more records, as described in the OAI-PMH protocol specification.

API documentation

The base URL for the API server is https://pod.stanford.edu/oai . As with other POD API requests, all POD OAI-PMH requests must include an access token generated by the POD Aggregator. Requests use one of four verbs described in the OAI-PMH protocol specification: Identify, ListSets, ListMetadataFormats, and ListRecords. (The ListIdentifiers and GetRecord verbs are not supported at this time.)
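As a minimal sketch of how a client might assemble such a request, the Python below builds (but does not send) a request for any of the supported verbs. The helper name build_oai_request is hypothetical, and sending the access token as an "Authorization: Bearer" header is an assumption not stated on this page; check the POD API documentation for the exact mechanism.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://pod.stanford.edu/oai"


def build_oai_request(verb, token, **params):
    """Return an unsent urllib Request for one OAI-PMH call.

    `verb` is one of Identify, ListSets, ListMetadataFormats, ListRecords.
    Extra keyword arguments become query parameters (e.g. metadataPrefix, set).
    """
    query = urlencode({"verb": verb, **params})
    request = Request(f"{BASE_URL}?{query}")
    # Assumption: the access token travels in a bearer Authorization header.
    request.add_header("Authorization", f"Bearer {token}")
    return request


# "example-token" is a placeholder, not a real token.
request = build_oai_request("ListSets", "example-token")
print(request.full_url)
```

The request can then be sent with urllib.request.urlopen(request); it is left unsent here so the sketch runs without network access or a valid token.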

Discovering sets for harvesting

The ListSets verb (full URL: https://pod.stanford.edu/oai?verb=ListSets) shows the sets of records available for harvesting.

A set is available for each stream that can be harvested. The ListSets response includes setDescription elements inside the returned set elements. The set description for the current default stream has a dc:type element (inside a Dublin Core element) with the exact value default. (Other streams may have dc:type elements that include the word "default" within a larger string, such as "former default". Those are not current default streams.) The dc:contributor element in the setDescription contains the slug of the institution providing the data in the set.

OAI-PMH clients can use the dc:contributor and dc:type elements to discover the current default set for institutions of interest. For example, to find the current default set for the University of Pennsylvania libraries, a client can parse the ListSets response to find a set whose setDescription includes a dc:contributor element with value penn and a dc:type element with value default.

Once the desired set is discovered, the client can then request records using the identifier of the applicable set, which is given in the set's setSpec element.
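The discovery steps above can be sketched in Python as follows. This is an illustration under assumptions: the function name find_default_set is hypothetical, the sample response is hand-written with invented setSpec values (101, 102, 205), and the exact nesting of the Dublin Core container inside setDescription is assumed to follow the usual oai_dc layout. Only the element names dc:contributor, dc:type, and setSpec come from the description above.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def find_default_set(list_sets_xml, contributor):
    """Return the setSpec of the contributor's current default set, or None."""
    root = ET.fromstring(list_sets_xml)
    for set_el in root.iterfind(".//oai:set", NS):
        contributors = [(el.text or "").strip() for el in set_el.iterfind(".//dc:contributor", NS)]
        types = [(el.text or "").strip() for el in set_el.iterfind(".//dc:type", NS)]
        # Exact match only: "former default" must not count as the default.
        if contributor in contributors and "default" in types:
            return set_el.findtext("oai:setSpec", namespaces=NS)
    return None


def _sample_set(spec, contributor, dc_type):
    # Hand-written sample markup; real responses may carry more elements.
    return (
        f"<set><setSpec>{spec}</setSpec><setName>{contributor} {spec}</setName><setDescription>"
        '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/">'
        f"<dc:contributor>{contributor}</dc:contributor><dc:type>{dc_type}</dc:type>"
        "</oai_dc:dc></setDescription></set>"
    )


SAMPLE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListSets>'
    + _sample_set("101", "penn", "former default")
    + _sample_set("102", "penn", "default")
    + _sample_set("205", "stanford", "default")
    + "</ListSets></OAI-PMH>"
)

print(find_default_set(SAMPLE, "penn"))
```

Matching dc:type by equality rather than by substring is the important design choice here, since superseded streams may still mention "default" in a longer string.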

Harvesting records

The most up to date records for an institution will generally be in the set for its default stream. The records in that stream can be harvested with the ListRecords verb.

A complete harvest from a provider institution

All of the MARC records in a set can be requested with this URL: https://pod.stanford.edu/oai?verb=ListRecords&metadataPrefix=marc21&set=$SET where $SET is the set for the institutional stream to be harvested. (It is also possible to call ListRecords without a set. However, such a request will return records from all sets, which will typically include a large amount of redundant data that will take a long time to deliver. We therefore recommend that clients not issue set-less requests.)

A full harvest may include millions of records. The OAI service will return batches of between 500 and 10,000 records at a time. For each batch, the service will provide a resumption token that clients can use to get more records, until all records are harvested. Resumption requests take the form https://pod.stanford.edu/oai?verb=ListRecords&resumptionToken=$TOKEN where $TOKEN is the resumption token returned by the server. Each subsequent response will include a different resumption token, until all records have been returned in response to a request. Resumption tokens should be used in a timely manner, as their validity may eventually time out.
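The batch-and-resume loop described above might look like the following Python sketch. The names harvest and fake_fetch are hypothetical, and the fetch function is passed in so the example runs against canned responses rather than the live service. A real fetch function would send the URL with the access token and return the response body. The loop stops when the resumptionToken element is empty or absent, which is how the OAI-PMH specification marks the final batch.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://pod.stanford.edu/oai"
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(set_spec, fetch):
    """Yield every OAI record element in a set, following resumption tokens.

    `fetch(url)` must return the XML body of the response for `url`.
    """
    url = BASE_URL + "?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": "marc21", "set": set_spec}
    )
    while url:
        root = ET.fromstring(fetch(url))
        yield from root.iter(OAI + "record")
        token_el = root.find(f".//{OAI}resumptionToken")
        token = (token_el.text or "").strip() if token_el is not None else ""
        # An empty or missing token means the harvest is complete.
        url = (
            BASE_URL + "?" + urlencode({"verb": "ListRecords", "resumptionToken": token})
            if token
            else None
        )


def _page(ids, token):
    # Hand-written stand-in for a ListRecords response (metadata omitted).
    records = "".join(
        f"<record><header><identifier>{i}</identifier></header></record>" for i in ids
    )
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        f"{records}<resumptionToken>{token}</resumptionToken></ListRecords></OAI-PMH>"
    )


canned = [_page(["a", "b"], "tok1"), _page(["c"], "")]
requested = []


def fake_fetch(url):
    requested.append(url)
    return canned[len(requested) - 1]


ids = [r.findtext(f"{OAI}header/{OAI}identifier") for r in harvest("101", fake_fetch)]
print(ids)
print(requested)
```

Because harvest is a generator, records can be processed as each batch arrives, which keeps memory use flat for multi-million-record sets and helps each resumption token get used soon after it is issued.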

An incremental harvest from a provider institution

Once a consumer has completed a full harvest from a particular set, it can get updates from that set without having to re-harvest the entire set, using the incremental harvesting features of OAI-PMH.

A request of the form https://pod.stanford.edu/oai?verb=ListRecords&metadataPrefix=marc21&set=$SET&from=$DATE (where $SET is the set of interest and $DATE is the date a consumer's previous harvest began) will return all of the records that have been added, changed, or deleted since that prior harvest. As with the full harvest above, this response will be partitioned into batches, with resumption tokens where necessary.
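A minimal Python sketch of an incremental request and of separating deletions from additions and changes is shown below. The function names and the sample response are illustrative only. Two assumptions are made: that $DATE is written as YYYY-MM-DD (the day-level datestamp form defined by the OAI-PMH specification), and that deletions are reported the standard OAI-PMH way, as a record header carrying status="deleted".

```python
import xml.etree.ElementTree as ET
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://pod.stanford.edu/oai"
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def incremental_url(set_spec, since):
    """Build a ListRecords URL for changes on or after the date `since`."""
    return BASE_URL + "?" + urlencode(
        {
            "verb": "ListRecords",
            "metadataPrefix": "marc21",
            "set": set_spec,
            "from": since.isoformat(),  # YYYY-MM-DD
        }
    )


def split_changes(xml_text):
    """Return (changed_ids, deleted_ids) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    changed, deleted = [], []
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            changed.append(identifier)
    return changed, deleted


# Hand-written sample response; identifiers are invented.
SAMPLE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<record><header><identifier>oai:example:1</identifier></header><metadata/></record>"
    '<record><header status="deleted"><identifier>oai:example:2</identifier></header></record>'
    "</ListRecords></OAI-PMH>"
)

url = incremental_url("101", date(2024, 1, 15))
changed, deleted = split_changes(SAMPLE)
print(url)
print(changed, deleted)
```

A consumer would typically record the start date of each harvest before issuing the first request, so that the next incremental harvest can pass that date as the from value.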

Consumers may continue to make further incremental harvests as often as desired, though the increments used in POD Aggregator harvests are full days. (Hence, there is no point in making, for example, hourly incremental harvests.) Consumers may also re-do full harvests when desired, and the records returned will reflect the current state of the set's stream.

Changing provider institution sets

Data providers will from time to time change the stream they use as their default; for example, when they want to start a fresh stream with a new full dump of records. (They should not make that stream default until they have fully populated that stream, however.) Data consumers will most likely want to start harvesting from the set for that new stream, since the old set is unlikely to get further updates. Consumers can use ListSets as above to discover when a provider's default set has changed. They can then do a full harvest from that set to get a full, up-to-date set of records, and also do further incremental updates.

There is no provision for incremental harvests between different streams or sets. Rather, consumers will need to do a full harvest to get a complete set of records from a new default set. (Hopefully, no provider will change its default set too often, though. We recommend doing so no more than once a quarter if possible.)

Due to the characteristics of the OAI-PMH protocol, the OAI-PMH IDs of the records in each OAI-PMH set will be different, even when they were derived from the same underlying record in the data provider's catalog. However, it is possible for a harvester to tell when a record from one provider set comes from the same underlying record in a different set from the same provider. For example, the internal record ID, as recorded in the 001 MARC field, will be the same. The OAI-PMH IDs are also formatted to end with the provider's internal identifier, after a colon.
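To illustrate the identifier convention just described, the Python sketch below takes the text after the final colon of an OAI-PMH ID and uses it to match records across two sets. The function name and the sample identifiers are invented for illustration, and the full layout of real POD identifiers is not assumed beyond "ends with the provider's internal identifier, after a colon".

```python
def provider_record_id(oai_identifier):
    """Return the part of an OAI-PMH ID that follows its final colon."""
    _prefix, colon, local_id = oai_identifier.rpartition(":")
    if not colon or not local_id:
        raise ValueError(f"no provider ID found in {oai_identifier!r}")
    return local_id


# Invented identifiers for the same catalog record in an old and a new set.
old_id = "oai:pod.example:penn:101:a12345"
new_id = "oai:pod.example:penn:102:a12345"

same_record = provider_record_id(old_id) == provider_record_id(new_id)
print(provider_record_id(new_id), same_record)
```

If a provider's internal identifiers can themselves contain colons, splitting on the final colon would truncate them; comparing the 001 MARC field is the safer check in that case.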

Other OAI verbs

| Verb | URL | Description |
| --- | --- | --- |
| Identify | https://pod.stanford.edu/oai?verb=Identify | Returns basic information about the server and its capabilities. |
| ListMetadataFormats | https://pod.stanford.edu/oai?verb=ListMetadataFormats | Returns information on the metadata formats the server supports. Currently this is just MARC XML (marc21). Dublin Core, though required by the OAI-PMH specification, is not currently supported. |
| ListIdentifiers | https://pod.stanford.edu/oai?verb=ListIdentifiers&metadataPrefix=marc21&set=$SET | Not currently supported by the POD Aggregator OAI-PMH service. (If implemented, it would work like ListRecords, but return headers instead of full records.) |
| GetRecord | https://pod.stanford.edu/oai?verb=GetRecord&identifier=$ID&metadataPrefix=marc21 | Not currently supported by the POD Aggregator OAI-PMH service. (If implemented, it would return the contents of the record with the identifier $ID, in MARC XML format.) |