Use Case Recipes - UCLALibrary/resourcesync-oai-pmh GitHub Wiki

What follows outlines the intended use of the software.

Content provider (source)

Each set of resources to be made available for discovery via ResourceSync needs to be processed with source.py. Those resources must already be available for discovery in Dublin Core format via OAI-PMH over HTTP.

As an example, assume the following scenario:

  • your OAI-PMH provider's base URL is http://example.com/oai/provider (an example request for which could be http://example.com/oai/provider?verb=ListRecords&set=testcol&metadataPrefix=oai_dc)
  • the collection's setSpec is testcol
  • you will host the ResourceSync documents at http://test.com with the Apache 2 HTTP Server
  • you want the ResourceSync documents for this collection to be created under http://test.com/resourcesync/testcol/ (for example, as http://test.com/resourcesync/testcol/capabilitylist.xml, http://test.com/resourcesync/testcol/resourcelist_0000.xml, http://test.com/resourcesync/testcol/changelist_0000.xml, etc.)
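To make the URL layout in this scenario concrete, here is a minimal sketch that derives the expected document locations from the host and the setSpec. The helper name `resourcesync_urls` is hypothetical and is not part of source.py; only the file names and paths come from the scenario above.

```python
from urllib.parse import urljoin


def resourcesync_urls(host_base: str, set_spec: str) -> dict:
    """Return the expected ResourceSync document URLs for one collection.

    Hypothetical helper for illustration; the paths mirror the scenario above.
    """
    # Ensure the host ends with a slash so urljoin appends instead of replacing.
    collection_base = urljoin(host_base.rstrip("/") + "/", f"resourcesync/{set_spec}/")
    return {
        "capabilitylist": urljoin(collection_base, "capabilitylist.xml"),
        "resourcelist": urljoin(collection_base, "resourcelist_0000.xml"),
        "changelist": urljoin(collection_base, "changelist_0000.xml"),
    }


urls = resourcesync_urls("http://test.com", "testcol")
for name, url in urls.items():
    print(f"{name}: {url}")
```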

Generating a single ResourceList

The very first time you process this collection, you will generate a ResourceList at http://test.com/resourcesync/testcol/resourcelist_0000.xml (or http://test.com/resourcesync/testcol/resourcelist-index.xml, if the collection has more than 50000 records) by running:

sudo python3 source.py \
    single \
    http://test.com \
    apache \
    http://example.com/oai/provider \
    oai_dc \
    resourcelist \
    testcol

ResourceList generation must happen exactly once per collection. If this was the first time that your institution generated a ResourceList for any of its collections, a SourceDescription will be created at http://test.com/.well-known/resourcesync. In any case, a new CapabilityList (with a <url> entry pointing to the new ResourceList) will be created at http://test.com/resourcesync/testcol/capabilitylist.xml, and a <url> entry pointing to the new CapabilityList will be added to the SourceDescription.
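As a rough illustration of the 50000-record threshold mentioned above, the following sketch picks the document a consumer would expect to find for a collection of a given size. The function name and the way the threshold is applied are assumptions made for illustration; source.py's internal logic may differ.

```python
MAX_RECORDS_PER_LIST = 50000  # threshold stated on this page


def resourcelist_entry_point(collection_base: str, record_count: int) -> str:
    """Return the ResourceList URL expected for a collection of this size.

    Hypothetical helper: assumes an index document is used only when the
    record count exceeds the threshold, as described above.
    """
    if record_count > MAX_RECORDS_PER_LIST:
        name = "resourcelist-index.xml"
    else:
        name = "resourcelist_0000.xml"
    return collection_base.rstrip("/") + "/" + name


small = resourcelist_entry_point("http://test.com/resourcesync/testcol/", 1200)
large = resourcelist_entry_point("http://test.com/resourcesync/testcol/", 75000)
print(small)
print(large)
```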

Generating a single ChangeList

Whenever changes are made to resources (records) in this collection, you need to create (or update) a ChangeList at http://test.com/resourcesync/testcol/changelist_0000.xml (or http://test.com/resourcesync/testcol/changelist-index.xml, if more than 50000 changes have occurred) by running:

sudo python3 source.py \
    single \
    http://test.com \
    apache \
    http://example.com/oai/provider \
    oai_dc \
    inc_changelist \
    testcol

The only difference between the two preceding commands is the strategy parameter (resourcelist vs. inc_changelist). ChangeList generation can occur any number of times for each collection. The first time that a ChangeList is generated for a collection, a <url> entry pointing to that ChangeList is added to the collection's CapabilityList.
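Since the two commands differ only in the strategy argument, a small wrapper can assemble either one. This is a sketch, assuming the positional order shown in the commands above; the function `single_mode_argv` is hypothetical, and it only knows the two strategies named on this page (python3 source.py --help may list others).

```python
import shlex

# Only the two strategies named on this page; an assumption, not an exhaustive list.
KNOWN_STRATEGIES = ("resourcelist", "inc_changelist")


def single_mode_argv(
    strategy: str,
    host: str = "http://test.com",
    document_root: str = "apache",
    oai_base_url: str = "http://example.com/oai/provider",
    metadata_prefix: str = "oai_dc",
    set_spec: str = "testcol",
) -> list:
    """Build the argument list for a single-mode run of source.py."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [
        "sudo", "python3", "source.py", "single",
        host, document_root, oai_base_url, metadata_prefix,
        strategy, set_spec,
    ]


# Print the command line for a ChangeList run; pass the list to
# subprocess.run(...) instead if you want to execute it.
changelist_cmd = single_mode_argv("inc_changelist")
print(shlex.join(changelist_cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems, such as a stray space after a line-continuation backslash.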

Generating multiple documents at a time

If you have many collections to generate ResourceSync documents for, you can use the multi subcommand for ResourceList generation, ChangeList generation, or both. Pass it a CSV file with one row per collection, where each row holds that collection's parameters and follows the schema described here:

sudo python3 source.py \
    multi \
    collections.csv

Summary

A few points of clarification:

  • ResourceList generation must happen exactly ONCE per collection. Otherwise, a destination may remain permanently out of sync with the source! If this happens by accident, you must contact the content aggregator institution immediately in order to bring things back into sync.
  • ChangeList generation may be scheduled regularly, with cron for example.
  • If you're hosting with either Tomcat 7 or Apache (httpd) 2 (with document roots set to /usr/share/tomcat/webapps/default and /var/www/html, respectively), you can use tomcat or apache for the resourcesync-server-document-root parameter; otherwise, you must explicitly specify the server's document root directory (e.g., for Tomcat 6, it would be /usr/local/tomcat6/webapps/default).
  • On OAI-PMH base URLs:
    • Some institutions have a different base URL for each collection (e.g., http://example.com/abc/oai/provider, http://example.com/def/oai/provider, etc.), and their ListRecords and ListIdentifiers requests may not require a set query parameter. In that case, when invoking source.py:
      • If using single mode, the --no-set-param flag must be passed on the command line.
      • Otherwise (using multi mode), the no-set-param column for such collections must NOT be empty (put anything you want in the column, as long as it does not contain a comma).
    • Otherwise, your institution has one base URL for all collections. When invoking source.py:
      • For single mode, the --no-set-param flag must NOT be passed on the command line.
      • For multi mode, the no-set-param column for each collection must be empty.
  • For full usage information: python3 source.py --help.
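The two base-URL layouts described above differ only in whether harvesting requests carry a set parameter. The sketch below shows the resulting ListRecords URLs for both cases; `list_records_url` is a hypothetical helper for illustration and is not how source.py is known to build its requests.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords URL, with or without a set parameter."""
    params = {"verb": "ListRecords"}
    if set_spec is not None:
        # One base URL shared by all collections: select the collection by set.
        params["set"] = set_spec
    params["metadataPrefix"] = metadata_prefix
    return f"{base_url}?{urlencode(params)}"


# Shared base URL (no --no-set-param flag): the set parameter is included.
with_set = list_records_url("http://example.com/oai/provider", set_spec="testcol")

# Per-collection base URL (--no-set-param passed): the set parameter is omitted.
without_set = list_records_url("http://example.com/abc/oai/provider")

print(with_set)
print(without_set)
```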

Content aggregator (destination)

I want to populate a Solr index with OAI-PMH resources from one or more content providers. I want to use ResourceSync to do this. I have a local TinyDB instance at /my/tiny/db.json with one row per resource set (according to the schema) and a Solr index at http://example.com/solr/resourcesync. I want to update Solr every Sunday at 2 AM.

Edit the config file to point to the DB and Solr index, and add this line to /etc/crontab:

# /etc/crontab
...
0 2 * * 0 root python3 destination.py