Use Case Recipes - UCLALibrary/resourcesync-oai-pmh GitHub Wiki
What follows outlines the intended use of the software.
Each set of resources to be made available for discovery via ResourceSync needs to be processed with source.py. Those resources must already be available for discovery in Dublin Core format via OAI-PMH over HTTP.
As an example, assume the following scenario:
- your OAI-PMH provider's base URL is http://example.com/oai/provider (an example request for which could be http://example.com/oai/provider?verb=ListRecords&set=testcol&metadataPrefix=oai_dc)
- the collection's setSpec is testcol
- you will host the ResourceSync documents at http://test.com with the Apache 2 HTTP Server
- you want the ResourceSync documents for this collection to be created under http://test.com/resourcesync/testcol/ (for example, as http://test.com/resourcesync/testcol/capabilitylist.xml, http://test.com/resourcesync/testcol/resourcelist_0000.xml, http://test.com/resourcesync/testcol/changelist_0000.xml, etc.)
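As a quick way to check that the OAI-PMH side of this scenario is set up as expected, the example request above can be assembled programmatically. The following is a minimal sketch, not part of source.py; the function name list_records_url is hypothetical, and only the base URL, setSpec, and metadata prefix come from the scenario.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords"}
    if set_spec is not None:
        # The set parameter restricts the request to one collection.
        params["set"] = set_spec
    params["metadataPrefix"] = metadata_prefix
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url("http://example.com/oai/provider", "oai_dc", "testcol"))
```

Opening the printed URL in a browser should return the Dublin Core records for the collection if the provider is configured as described.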
The very first time you process this collection, you will generate a ResourceList at http://test.com/resourcesync/testcol/resourcelist_0000.xml (or http://test.com/resourcesync/testcol/resourcelist-index.xml, if there are more than 50000 records) by running:
sudo python3 source.py \
single \
http://test.com \
apache \
http://example.com/oai/provider \
oai_dc \
resourcelist \
testcol
ResourceList generation must happen exactly once per collection. If this was the first time that your institution generated a ResourceList for any of its collections, a SourceDescription will be created at http://test.com/.well-known/resourcesync. In any case, a new CapabilityList (with a <url> entry pointing to the new ResourceList) will be created at http://test.com/resourcesync/testcol/capabilitylist.xml, and a <url> entry pointing to the new CapabilityList will be added to the SourceDescription.
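To confirm that a CapabilityList contains the expected <url> entry, one option is to parse it and list each capability with its location. The sketch below assumes the document follows the ResourceSync sitemap-based layout; the sample XML is hand-written for illustration and is not actual output of source.py, and the function name list_capabilities is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces used by ResourceSync documents (sitemap format plus rs: terms).
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hand-written sample, assumed to resemble a generated CapabilityList.
SAMPLE_CAPABILITY_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://test.com/resourcesync/testcol/resourcelist_0000.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
</urlset>
"""


def list_capabilities(xml_text):
    """Map each capability name to the URL in its <url> entry."""
    root = ET.fromstring(xml_text)
    capabilities = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        if loc is not None and md is not None:
            capabilities[md.get("capability")] = loc.strip()
    return capabilities


if __name__ == "__main__":
    for name, loc in list_capabilities(SAMPLE_CAPABILITY_LIST).items():
        print(name, loc)
```

The same function could be pointed at the text of http://test.com/resourcesync/testcol/capabilitylist.xml once it has been generated.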
Whenever changes are made to resources (records) in this collection, you need to create (or update) a ChangeList at http://test.com/resourcesync/testcol/changelist_0000.xml (or http://test.com/resourcesync/testcol/changelist-index.xml, if more than 50000 changes have occurred) by running:
sudo python3 source.py \
single \
http://test.com \
apache \
http://example.com/oai/provider \
oai_dc \
inc_changelist \
testcol
The only difference between the two preceding commands is the strategy parameter (resourcelist vs. inc_changelist). ChangeList generation can occur any number of times for each collection. The first time that a ChangeList is generated for a collection, a <url> entry pointing to that ChangeList is added to the collection's CapabilityList.
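If you drive source.py from another script, the two invocations can share one helper that varies only the strategy. This is a minimal sketch: the function build_single_command is hypothetical, and the positional order is copied from the two commands above rather than from the tool's own argument definitions.

```python
import shlex


def build_single_command(resourcesync_url, document_root, oai_pmh_url,
                         metadata_prefix, strategy, set_spec):
    """Return the argv for one `source.py single` run (hypothetical helper)."""
    return [
        "sudo", "python3", "source.py",
        "single",
        resourcesync_url,   # where the ResourceSync documents are hosted
        document_root,      # e.g. "apache", as in the commands above
        oai_pmh_url,        # OAI-PMH provider base URL
        metadata_prefix,    # e.g. "oai_dc"
        strategy,           # "resourcelist" or "inc_changelist"
        set_spec,           # the collection's setSpec
    ]


if __name__ == "__main__":
    common = ("http://test.com", "apache",
              "http://example.com/oai/provider", "oai_dc")
    for strategy in ("resourcelist", "inc_changelist"):
        argv = build_single_command(*common, strategy, "testcol")
        # Print rather than execute; pass argv to subprocess.run to run it.
        print(shlex.join(argv))
```

Printing the command instead of running it makes it easy to review before a ResourceList run, which should happen only once per collection.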
If you have many collections to generate ResourceSync documents for, you can use the multi subcommand for either ResourceList or ChangeList generation (or both) by passing it a CSV file containing the parameters for each collection, constructed according to the schema described here (one row per collection):
sudo python3 source.py \
multi \
collections.csv
A couple points of clarification:
- ResourceList generation must happen exactly ONCE per collection. Otherwise, a destination may remain permanently out of sync with the source! If this happens by accident, you must contact the content aggregator institution immediately in order to bring things back into sync.
- ChangeList generation may be scheduled regularly, with cron for example.
- If you're hosting with either Tomcat 7 or Apache (httpd) 2 (with document roots set to /usr/share/tomcat/webapps/default and /var/www/html, respectively), you can use tomcat or apache for the resourcesync-server-document-root parameter; otherwise, you must explicitly specify the server's document root directory (e.g., for Tomcat 6, it would be /usr/local/tomcat6/webapps/default).
- On OAI-PMH base URLs:
  - Some institutions have a different base URL for different collections (e.g., http://example.com/abc/oai/provider, http://example.com/def/oai/provider, etc.), and ListRecords and ListIdentifiers requests may not require any set query parameter. In that case, when invoking source.py:
    - If using single mode, the --no-set-param flag must be passed on the command line.
    - Otherwise (using multi mode), the no-set-param column for such collections must NOT be empty (put anything you want in the column, except a string containing a comma).
  - Otherwise, your institution has one base URL for all collections. When invoking source.py:
    - For single mode, the --no-set-param flag must NOT be passed on the command line.
    - For multi mode, the no-set-param column for each collection must be empty.
- For full usage information: python3 source.py --help.
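The no-set-param rules above can be summarized in a small check. This is only a sketch of the rule as stated on this page, not the logic inside source.py; both function names are hypothetical, and treating only the exact empty string as "empty" is an assumption.

```python
def no_set_param_from_cell(cell):
    """Interpret a no-set-param CSV cell as described above (multi mode)."""
    if "," in cell:
        # A comma would be read as a column separator in the CSV row.
        raise ValueError("no-set-param value must not contain a comma")
    # Assumption: only the exact empty string counts as empty.
    return cell != ""


def single_mode_flags(per_collection_base_url):
    """Return the extra flag for single mode, if any."""
    return ["--no-set-param"] if per_collection_base_url else []


if __name__ == "__main__":
    print(no_set_param_from_cell(""))     # one base URL for all collections
    print(no_set_param_from_cell("yes"))  # collection has its own base URL
    print(single_mode_flags(True))
```

In other words, a non-empty column in multi mode plays the same role as the --no-set-param flag in single mode.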
I want to populate a Solr index with OAI-PMH resources from one or more content providers. I want to use ResourceSync to do this. I have a local TinyDB instance at /my/tiny/db.json with one row per resource set (according to the schema) and a Solr index at http://example.com/solr/resourcesync. I want to update Solr every Sunday at 2 AM.
Edit the config file to point to the DB and Solr index, and add this line to /etc/crontab:
# /etc/crontab
...
0 2 * * 0 root python3 destination.py
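To double-check that the schedule fields 0 2 * * 0 mean "every Sunday at 2 AM", the fields can be compared against a timestamp. The following is a minimal sketch with a hypothetical function name; it hard-codes this one schedule and is not a general cron parser.

```python
from datetime import datetime


def matches_weekly_schedule(moment):
    """Return True if `moment` matches the crontab fields `0 2 * * 0`."""
    # cron day-of-week 0 is Sunday; isoweekday() returns 7 for Sunday,
    # so taking it modulo 7 maps Sunday to 0.
    return (
        moment.minute == 0
        and moment.hour == 2
        and moment.isoweekday() % 7 == 0
    )


if __name__ == "__main__":
    # 7 January 2024 was a Sunday.
    print(matches_weekly_schedule(datetime(2024, 1, 7, 2, 0)))
```

The month and day-of-month fields are both *, so they place no restriction and are not checked here.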