Example of Automatic Importing to Hydra at WVU - wvulibraries/mfcs GitHub Wiki

Storage

In the MFCS config, the variable nfsexport allows you to define a shared export path. WVU does this with an NFS file share. Our hydra heads also mount this share, giving us shared storage between MFCS and our hydra head servers.

directory structure

The directory structure for each hydra head is created with the create-directory-structure.sh script. It relies on the HYDRA_PROJECT_NAME variable to create the structure.

WVU uses the consistent naming convention of /home/HYDRA_PROJECT_NAME.lib.wvu.edu/HYDRA_PROJECT_NAME as the rails path for each head. We also have each head in a separate container/server. This allows us to use an ENV variable to control where we find our shared resources.

The shared resources are created as:

mkdir -p /mnt/nfs-exports/mfcs-exports/"$HYDRA_PROJECT_NAME"/control/
mkdir -p /mnt/nfs-exports/mfcs-exports/"$HYDRA_PROJECT_NAME"/control/mfcs
mkdir -p /mnt/nfs-exports/mfcs-exports/"$HYDRA_PROJECT_NAME"/control/hydra/error
mkdir -p /mnt/nfs-exports/mfcs-exports/"$HYDRA_PROJECT_NAME"/control/hydra/finished
mkdir -p /mnt/nfs-exports/mfcs-exports/"$HYDRA_PROJECT_NAME"/control/hydra/in-progress
mkdir -p /mnt/nfs-exports/mfcs-exports/"$HYDRA_PROJECT_NAME"/control/hydra/staged
mkdir -p /mnt/nfs-exports/mfcs-exports/"$HYDRA_PROJECT_NAME"/export
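On the Ruby side, a head could resolve the same paths from the HYDRA_PROJECT_NAME env variable. The following is a minimal sketch, not part of create-directory-structure.sh; the function names shared_root and ensure_shared_dirs and the base_path parameter are hypothetical.

```ruby
require "fileutils"

# Subdirectories taken from the mkdir commands above, relative to the
# per-project root.
SHARED_SUBDIRS = %w[
  control/mfcs
  control/hydra/error
  control/hydra/finished
  control/hydra/in-progress
  control/hydra/staged
  export
].freeze

# Build the per-project root. The default base path matches the mkdir
# commands above; the function name and base_path keyword are hypothetical.
def shared_root(project_name = ENV["HYDRA_PROJECT_NAME"],
                base_path: "/mnt/nfs-exports/mfcs-exports")
  if project_name.nil? || project_name.empty?
    raise ArgumentError, "HYDRA_PROJECT_NAME is not set"
  end
  File.join(base_path, project_name)
end

# Create any missing directories (mkdir_p does nothing for existing ones)
# and return the full paths.
def ensure_shared_dirs(root)
  SHARED_SUBDIRS.map do |sub|
    path = File.join(root, sub)
    FileUtils.mkdir_p(path)
    path
  end
end
```

Calling ensure_shared_dirs(shared_root) on a head would create the same layout as the shell commands, assuming the share is mounted at the same location.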

export script

example export script

Hydra is system agnostic, so it is up to each institution to develop its own export scripts. MFCS ships with Dublin Core export scripts, but anything that does not use Dublin Core will need a custom script. An example export script for exporting the PEC collection to Hydra, using the auto-import scripts, is here:

PEC Exporting as Json

The above script exports the metadata as JSON. If additional examples (such as exporting to XML) are needed, please contact us. We have many export scripts that export to XML, tab-delimited, and CSV formats, as well as examples of saving digital items in gzipped files for convenient downloading.

example control file

The control file is a YAML file. When it is exported from MFCS, the file is named after the Unix timestamp of the export (unix_time_stamp.yaml). When it is moved to the in-progress directory, it is renamed to control_file.yaml.

---
  project_name: pec
  time_stamp: 1479678699
  # Export Type can be
  # 1. update : metadata for all objects, but not all digital items
  # 2. update_full : both metadata and digital items for all objects
  # 3. full : Same as update_full, but we assume that there is no data loaded
  #           This would be for an initial load
  # 4. partial : metadata update for some items and/or some digital objects
  export_type: update
  digital_items_count: 22
  record_count: 33
  # a yaml collection to contact when the import does not succeed
  contact_emails: 
    - [email protected]
    - [email protected]

project_name : must match the HYDRA_PROJECT_NAME env variable defined on the server.

time_stamp : the Unix time when the export occurred. We use this to make sure multiple exports get processed in the correct order.

export_type : informational, for debugging.

digital_items_count : informational, for debugging. How many digital items were exported. This number depends on how the developer populated it in the export script. If there are multiple images per record, it could be a total count of all digital items, or it could be the count of records that have digital items.

record_count : how many records were exported.

contact_emails : YAML list of addresses that are emailed when the import is complete, whether it succeeded or failed. The global system administrators come first, followed by the emails listed in the contacts section of the permissions on a form.
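The field rules above can be checked before an import starts. This is a minimal sketch and not the validation that the WVU scripts perform; the function name control_file_problems and the choice of which keys are required are assumptions made for illustration. The export types are the four listed in the control file comments.

```ruby
require "yaml"

# Assumption: these keys are treated as required for this sketch.
REQUIRED_KEYS = %w[project_name time_stamp export_type record_count].freeze
# Export types listed in the control file comments.
EXPORT_TYPES = %w[update update_full full partial].freeze

# Return a list of problems found in the control file text.
# An empty list means the file passed these checks.
def control_file_problems(yaml_text, expected_project)
  control = YAML.safe_load(yaml_text)
  return ["control file is not a YAML mapping"] unless control.is_a?(Hash)

  problems = REQUIRED_KEYS.reject { |key| control.key?(key) }
                          .map { |key| "missing key: #{key}" }

  if control.key?("project_name") && control["project_name"] != expected_project
    problems << "project_name does not match #{expected_project.inspect}"
  end
  if control.key?("export_type") && !EXPORT_TYPES.include?(control["export_type"])
    problems << "unknown export_type: #{control['export_type'].inspect}"
  end
  if control.key?("time_stamp") && !control["time_stamp"].is_a?(Integer)
    problems << "time_stamp is not an integer"
  end
  problems
end
```

A head would typically pass ENV["HYDRA_PROJECT_NAME"] as expected_project, so that a control file exported for another project is rejected. Malformed YAML raises Psych::SyntaxError from YAML.safe_load, which the caller would need to rescue.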

Importing

For automatic importing we run a series of scripts on each hydra head.

  1. check-for-jobs.rb
  2. process-jobs.rb
  3. import/import.rb

Note: import/import.rb is the script we use to import PEC's JSON into Hydra 7. If needed, we have examples of importing XML into Hydra 7 as well.
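The control file's time_stamp exists so that multiple exports are processed in order, and the exported file name is that same Unix timestamp. The sketch below shows one way a job checker could pick the oldest pending export and rename it to control_file.yaml. It is not the actual check-for-jobs.rb; the function names and the incoming_dir parameter are hypothetical, as is the rule that only one job runs at a time.

```ruby
require "fileutils"

# Return control files named <unix_time_stamp>.yaml, oldest first.
# Sorting is numeric, so 999.yaml comes before 1479678699.yaml.
def pending_control_files(incoming_dir)
  Dir.children(incoming_dir)
     .select { |name| name.match?(/\A\d+\.yaml\z/) }
     .sort_by { |name| name.to_i }
     .map { |name| File.join(incoming_dir, name) }
end

# Move the oldest pending control file into the in-progress directory as
# control_file.yaml. Returns the new path, or nil when there is nothing
# to do or a job is already in progress (an assumption of this sketch).
def claim_next_job(incoming_dir, in_progress_dir)
  target = File.join(in_progress_dir, "control_file.yaml")
  return nil if File.exist?(target)

  oldest = pending_control_files(incoming_dir).first
  return nil if oldest.nil?

  FileUtils.mv(oldest, target)
  target
end
```

Sorting by the integer value rather than by string matters only if timestamps differ in length, but it keeps the ordering correct in every case.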

crontab

To make the process automatic, everything is run via cron. This is our crontab on the PEC server:

PATH=/usr/local/sbin:/usr/sbin:/usr/local/bin:/usr/bin

*/1 * * * * ruby /opt/git_pull/hydra-import-scripts/src/crons/check-for-jobs.rb
*/5 * * * * cd /home/pec.lib.wvu.edu/pec; ruby /opt/git_pull/hydra-import-scripts/src/crons/process-jobs.rb

It is important to set the PATH env variable. If PATH isn't set for cron, rails will fail to run properly when called from the process-jobs.rb script.
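One way to catch a bad cron PATH early is to check, at the top of a cron-run script, that the needed executables can be found. This is a minimal sketch and not something the WVU scripts are known to do; the function names are hypothetical, and which executables to check (for example ruby and rails) is an assumption.

```ruby
# Return the full path of an executable found on the given PATH string,
# or nil when it is not found. Empty PATH entries are skipped.
def find_executable(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).reject(&:empty?).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Return the names that cannot be found on the given PATH string.
def missing_executables(names, path = ENV.fetch("PATH", ""))
  names.reject { |name| find_executable(name, path) }
end
```

A script could call missing_executables(%w[ruby rails]) and abort with a clear message when the result is not empty, instead of failing later inside rails.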