Source Server Specs - UCLALibrary/resourcesync-oai-pmh GitHub Wiki

If you are a content provider generating ResourceSync documents, we recommend a server that conforms to the following guidelines. It can be either a new machine dedicated to this project or an existing machine that already serves other purposes.

The generated ResourceSync documents may even be hosted on the same server as your OAI-PMH repository, although this is not a requirement; the choice depends on what is possible and convenient for your institution. Note, however, that such a configuration requires shell access to that server, and may not be possible with pre-rolled SaaS solutions such as Digital Commons.

All this said, we highly recommend using a dedicated server for reasons stated here.

OS

source.py has only been tested on Linux-based operating systems, so use one of those for best results. The particular flavor shouldn't matter; we've tested on Ubuntu 16.04 and RHEL 6.9.

Disk

The amount of persistent storage (disk) required depends on:

  • the number of collections you have,
  • the number of items per collection (affects the size of ResourceLists), and
  • the number of resources that are expected to be created/updated/deleted (affects the size of ChangeLists).

Our first test collection's ResourceList and ChangeList (containing 5000 entries each) came out to no more than 1.5 MiB each (so, roughly 300 B per entry). You can estimate your usage in a couple of different ways:

A Rough Estimate

If you are able to estimate the average number of resources per collection and the average number of anticipated changes per collection, use this formula to calculate your institution's requirements:

<MIN_NUM_OF_BYTES> = (<NUM_RESOURCES_COLLECTION_AVG> + <NUM_ANTICIPATED_CHANGES_COLLECTION_AVG>) * <NUM_COLLECTIONS> * <NUM_BYTES_PER_RESOURCE_AVG>
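As an illustration, the rough formula can be sketched in Python. The function name `rough_min_bytes` and the example figures (20 collections, 5000 resources and 1000 anticipated changes per collection) are hypothetical; the default of 300 bytes per entry is the approximate figure from our test collection and may differ for your metadata.

```python
def rough_min_bytes(avg_resources, avg_changes, num_collections,
                    bytes_per_entry=300):
    """Rough lower bound on disk usage in bytes.

    bytes_per_entry defaults to the ~300 B per entry observed in our
    test collection; this is an assumption, so measure your own if you can.
    """
    return (avg_resources + avg_changes) * num_collections * bytes_per_entry


# Hypothetical institution: 20 collections, each with about 5000 resources
# and about 1000 anticipated changes.
estimate = rough_min_bytes(avg_resources=5000, avg_changes=1000,
                           num_collections=20)
print(f"{estimate} bytes (~{estimate / 2**20:.1f} MiB)")
```

Treat the result as a minimum and leave headroom for the web server, logs, and anything else stored on the same disk.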

A Closer Estimate

If you have more fine-grained information about your collections, use this formula:

<MIN_NUM_OF_BYTES> = (
  (<NUM_RESOURCES_COLLECTION_1> + <NUM_ANTICIPATED_CHANGES_COLLECTION_1>) +
  (<NUM_RESOURCES_COLLECTION_2> + <NUM_ANTICIPATED_CHANGES_COLLECTION_2>) +
  ...
  (<NUM_RESOURCES_COLLECTION_N> + <NUM_ANTICIPATED_CHANGES_COLLECTION_N>)
) * <NUM_BYTES_PER_RESOURCE_AVG>

where:

  • <NUM_BYTES_PER_RESOURCE_AVG> is the average number of bytes that each resource occupies as an entry in a ResourceList or ChangeList,
  • <NUM_RESOURCES_COLLECTION_i> is the number of resources that collection i has when its ResourceList is generated for the first (and only) time, and
  • <NUM_ANTICIPATED_CHANGES_COLLECTION_i> is the number of total anticipated changes (create/update/delete) for this collection from the time the ResourceList is generated until the end of time.

You can expect the size of ResourceLists and ChangeLists to grow linearly with the number of entries in them.
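The closer estimate can be sketched the same way, summing over per-collection figures. The function name `closer_min_bytes` and the three sample collections are hypothetical, and the default of 300 bytes per entry is again the approximate figure from our test collection.

```python
def closer_min_bytes(collections, bytes_per_entry=300):
    """Lower bound on disk usage in bytes from per-collection figures.

    collections is an iterable of
    (num_resources, num_anticipated_changes) pairs, one per collection.
    bytes_per_entry is an assumed average; ~300 B matched our test data.
    """
    total_entries = sum(resources + changes
                        for resources, changes in collections)
    return total_entries * bytes_per_entry


# Hypothetical figures for three collections.
collections = [
    (5000, 500),    # collection 1
    (12000, 3000),  # collection 2
    (800, 50),      # collection 3
]
total = closer_min_bytes(collections)
print(f"{total} bytes (~{total / 2**20:.1f} MiB)")
```

Because list sizes grow linearly with the number of entries, doubling either the resources or the anticipated changes of a collection doubles that collection's contribution to the total.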

Memory

The amount of memory required depends on:

  • the minimum required by your web server, and
  • the requirements of the py-resourcesync library (for which no official metrics have been made publicly available at this time, unfortunately).

We have profiled an invocation of source.py for ResourceList generation on a collection with 50000 resources, and the memory usage peaked at just under 100 MiB. We expect memory usage to grow linearly with the number of resources, but this has not yet been verified.
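For planning purposes, the profiled data point can be extrapolated with a short Python sketch. It assumes the linear growth described above, which has not been verified, and it rounds the measured peak up to 100 MiB; the function name `projected_peak_mib` and the 200000-resource example are hypothetical.

```python
# Single profiled data point: ResourceList generation for a collection of
# 50000 resources peaked at just under 100 MiB (rounded up here).
PROFILED_RESOURCES = 50_000
PROFILED_PEAK_MIB = 100


def projected_peak_mib(num_resources):
    """Project peak memory in MiB, ASSUMING linear growth (unverified)."""
    return PROFILED_PEAK_MIB * num_resources / PROFILED_RESOURCES


# Hypothetical collection with 200000 resources.
print(f"{projected_peak_mib(200_000):.0f} MiB")
```

This covers only the source.py invocation; the memory needed by your web server has to be added on top, and profiling your own largest collection is more reliable than extrapolating from one measurement.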
