Further Information and Considerations for Content Providers - UCLALibrary/resourcesync-oai-pmh GitHub Wiki

I. High-level architecture

The software for content providers is structured in the following way:

  • source.py - a single Python script with dependencies, most of which are installed via pip
    • the script may optionally be scheduled to be run via cron, for example
  • source.ini - general configuration options for the script
  • source_logging.ini - logging configuration options

II. High-level description

source.py does the following, in order:

  • Read config (*.ini) files and process command-line arguments
  • Send HTTP GET requests to OAI-PMH data providers
  • Process responses from data providers
  • Write ResourceSync document files (e.g., SourceDescription, CapabilityList, ResourceList, ChangeList) to the local filesystem at the locations specified by command-line arguments
  • Write log file at the location specified by source_logging.ini, and log to STDOUT
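The request and response steps above can be illustrated with a minimal sketch. It is not taken from source.py: the function names, the base URL, and the sample response are hypothetical, and it assumes the ListIdentifiers verb is used to enumerate records (the script may use a different verb). Only the OAI-PMH namespace and parameter names come from the OAI-PMH protocol itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListIdentifiers request URL (hypothetical helper)."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_identifiers(response_xml):
    """Return the record identifiers found in a ListIdentifiers response body."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Made-up response body, used only to exercise parse_identifiers().
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.edu:1</identifier><datestamp>2017-01-01</datestamp></header>
    <header><identifier>oai:example.edu:2</identifier><datestamp>2017-01-02</datestamp></header>
  </ListIdentifiers>
</OAI-PMH>"""
```

In a real run the response body would come from an HTTP GET against the URL returned by list_identifiers_url(); the sample string stands in for that here so the sketch works offline.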

III. Security considerations

A. Recommendations

Although we've done our best to mitigate serious security issues and other bugs, this software is largely experimental, and its components are built and maintained by a relatively small team of developers. Therefore, in production environments with mission-critical services, we highly recommend running the software, and hosting its output files (ResourceSync documents), on a dedicated machine.

B. Software dependencies

The software mostly depends on standard, widely-used Python modules available on PyPI. The only exception is the primary dependency (py-resourcesync), developed primarily by software developers at LANL. There are numerous tests written for py-resourcesync, which is open sourced here: https://github.com/resourcesync/py-resourcesync.

C. User permissions

As the document roots of many popular web servers (e.g., Tomcat, httpd) are owned by users with elevated privileges, the simplest use of this software is to run it as a user with elevated privileges, or as a user with normal privileges except for special write privileges on the server document root.

If the script cannot be run with elevated privileges, it can be invoked so that the output files are written outside the document root, as long as those files are subsequently moved to the correct locations under the document root.
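One way to handle that second step is a small publishing helper, run by a user who can write to the document root. The sketch below is an assumption about how this could be done, not part of the software: publish(), the staging directory, and the document root path are all hypothetical, and it copies files rather than moving them so that the staging copy survives a failed run.

```python
import shutil
from pathlib import Path


def publish(staging_dir, document_root):
    """Copy every file under staging_dir to the same relative path under document_root.

    Hypothetical helper: both directory arguments are chosen by the caller.
    Returns the list of destination paths that were written.
    """
    staging_dir = Path(staging_dir)
    document_root = Path(document_root)
    copied = []
    # rglob("*") also matches dot-directories such as .well-known.
    for src in sorted(staging_dir.rglob("*")):
        if src.is_file():
            dest = document_root / src.relative_to(staging_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            copied.append(dest)
    return copied
```

For example, publish("/home/rsync/staging", "/var/www/html") would mirror the staged directory tree under the document root; both paths are placeholders.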

The software does not accept incoming HTTP requests or any other kind of user input. Because it resides on the back end, it can only be invoked by the system administrator or by a scheduling utility such as cron, both of which are presumably trustworthy.

IV. Target audience and input (Usage)

The target audience (intended user) of the software is the administrator of a system that is to become a ResourceSync hosting server, or the cron scheduling utility. For example usage, see https://github.com/UCLALibrary/resourcesync-oai-pmh/wiki/Use-Case-Recipes; for detailed usage, please download the software, install dependencies, and run python3 source.py --help.

V. Destination and output

The output of the software is:

  • ResourceSync documents (files) created or updated
  • INFO-level logging to STDOUT
  • DEBUG-level logging to a file specified in the configuration
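The split between INFO on STDOUT and DEBUG in a file can be reproduced with Python's standard logging module. The following sketch is only an illustration of that arrangement: it configures handlers in code rather than reading source_logging.ini, and the function name, logger name, and message format are assumptions.

```python
import logging
import sys


def build_logger(name, logfile):
    """Return a logger that sends INFO+ to STDOUT and DEBUG+ to logfile.

    Hypothetical helper; the real software reads its logging setup from
    source_logging.ini instead.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let handlers decide what to keep
    logger.handlers.clear()         # avoid duplicate handlers on repeated calls
    logger.propagate = False

    console = logging.StreamHandler(sys.stdout)
    console.setLevel(logging.INFO)

    to_file = logging.FileHandler(logfile)
    to_file.setLevel(logging.DEBUG)

    for handler in (console, to_file):
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

With this setup, a call such as logger.debug("...") reaches only the file, while logger.info("...") reaches both the file and STDOUT.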

A. Validating ResourceSync documents

After ResourceList generation, you should see the following files on your server if everything went well:

  • SourceDescription at https://example.edu/.well-known/resourcesync.
  • for each collection:
    • CapabilityList at https://example.edu/resourcesync/:collectionID/capabilitylist.xml.
    • ResourceList at https://example.edu/resourcesync/:collectionID/resourcelist_0000.xml.

After ChangeList generation for a collection, you'll also see a ChangeList at https://example.edu/resourcesync/:collectionID/changelist_0000.xml.
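To check these locations systematically, the expected URLs can be generated from the host and the collection IDs. This is a minimal sketch following the URL patterns listed above; expected_urls() and its parameters are hypothetical names, and it only covers the first document of each type (the _0000 suffix).

```python
def expected_urls(base_url, collection_ids, include_changelist=False):
    """List the ResourceSync document URLs expected for the given collections.

    Hypothetical helper built from the URL patterns described in this section.
    """
    base = base_url.rstrip("/")
    urls = [f"{base}/.well-known/resourcesync"]  # SourceDescription
    for cid in collection_ids:
        prefix = f"{base}/resourcesync/{cid}"
        urls.append(f"{prefix}/capabilitylist.xml")
        urls.append(f"{prefix}/resourcelist_0000.xml")
        if include_changelist:
            urls.append(f"{prefix}/changelist_0000.xml")
    return urls
```

Each returned URL can then be requested with any HTTP client to confirm that the document is actually being served.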

1. /.well-known/resourcesync

Per the ResourceSync specification, putting the SourceDescription at the suggested well-known URI /.well-known/resourcesync is not required. However, it is a good idea to put it there for interoperability and future-proofing:

  • The concept of "well-known" URIs was established by RFC 5785 (since obsoleted by RFC 8615) in order to reserve URIs for, and thus facilitate discovery of, site-wide metadata such as a server's ResourceSync capabilities
  • ResourceSync's well-known URI is registered with IANA, and thus /.well-known/resourcesync is effectively a URI reserved for the ResourceSync SourceDescription
  • As ResourceSync becomes more widely adopted and more clients become interested in a server's ResourceSync capabilities, servers with their ResourceSync SourceDescription accessible via /.well-known/resourcesync will be more easily discoverable

2. ResourceLists and ChangeLists

In both of these documents, you should see a list of <url><loc> elements containing OAI-PMH GetRecord URLs. The URLs should be well-formed and, when dereferenced, return an OAI-PMH GetRecord response body.
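The well-formedness part of this check can be automated. The sketch below parses a ResourceList or ChangeList as a Sitemap document and reports any <loc> value that does not look like an OAI-PMH GetRecord URL. The function names are hypothetical, and the check is purely syntactic: it does not dereference the URLs, so the response bodies still need to be inspected separately.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlparse

# ResourceSync documents are based on the Sitemap XML format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def is_get_record_url(url):
    """True if url is an http(s) URL carrying the OAI-PMH GetRecord arguments."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    return (
        parts.scheme in ("http", "https")
        and bool(parts.netloc)
        and query.get("verb") == ["GetRecord"]
        and "identifier" in query
        and "metadataPrefix" in query
    )


def suspicious_locs(sitemap_xml):
    """Return the <loc> values that do not look like GetRecord URLs."""
    root = ET.fromstring(sitemap_xml)
    locs = [(el.text or "").strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [loc for loc in locs if not is_get_record_url(loc)]
```

An empty list from suspicious_locs() means every <loc> passed the syntactic check.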

3. Media Types

SourceDescriptions, CapabilityLists, ResourceLists, and ChangeLists should all be served with Content-Type: text/xml. If you are using Amazon S3 to host these files, you may need to specify this explicitly.
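When checking a hosted document, the Content-Type response header may carry parameters such as a charset, so a plain string comparison can give false negatives. The following sketch, with a hypothetical function name, uses the standard library's header parsing to compare only the media type.

```python
from email.message import Message


def has_expected_media_type(content_type_header, expected="text/xml"):
    """True if the header's media type equals expected, ignoring parameters and case."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() returns the lowercased "maintype/subtype" only.
    return msg.get_content_type() == expected
```

For instance, a header value of "text/xml; charset=utf-8" passes, while "application/octet-stream" does not.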

B. Debug log

By default, the software writes INFO-level messages to standard output and DEBUG-level messages to a logfile at the path specified in source_logging.ini. Each invocation of source.py writes something like this to the logfile:

--- STARTING RUN ---

Logging to /xxx/source.log
...
Configuration directory: /yyy/.config/rspub/core
...
Wrote sitemap zzz.xml
...
---  ENDING RUN  ---

The most useful information for each invocation of the script is as follows:

  • "Logging to ..." - the location of the logfile
  • "Configuration directory: ..." - the location of the persistent configuration file for py-resourcesync (the primary third-party dependency). This file is useful to look at if you can't find the source description file, for example (it will be at ${description_dir}/.well-known/resourcesync).
  • "Wrote sitemap ..." - the location of each ResourceSync XML document that is written.
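These three pieces of information can be pulled out of a logfile with a few regular expressions. The sketch below assumes that log lines contain the phrases shown in the sample above; the exact line format (timestamps, level prefixes) is not specified here, so the patterns search within each line rather than matching it whole. The function name is hypothetical, and for a logfile containing several runs it reports the first "Logging to" and "Configuration directory" entries but every "Wrote sitemap" entry.

```python
import re


def summarize_run(log_text):
    """Extract the logfile path, config directory, and written sitemaps from log text."""
    logfile = re.search(r"Logging to (\S+)", log_text)
    config_dir = re.search(r"Configuration directory: (\S+)", log_text)
    sitemaps = re.findall(r"Wrote sitemap (\S+)", log_text)
    return {
        "logfile": logfile.group(1) if logfile else None,
        "config_dir": config_dir.group(1) if config_dir else None,
        "sitemaps": sitemaps,
    }


# The placeholder paths below are copied from the sample log above.
SAMPLE_LOG = """--- STARTING RUN ---
Logging to /xxx/source.log
Configuration directory: /yyy/.config/rspub/core
Wrote sitemap zzz.xml
---  ENDING RUN  ---
"""
```

Calling summarize_run(SAMPLE_LOG) returns a dictionary with the three values found in the sample.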