Toolchain: OAI PMH for Islandora repositories - MarcusBarnes/mik GitHub Wiki
Toolchain that creates Islandora import packages consisting of metadata and content files (PDFs, JPEGs, etc.) retrieved from an Islandora instance via the OAI-PMH protocol. The resulting Islandora import packages can then be ingested using the standard Islandora Batch module.
Note the following: This toolchain uses the same fetcher and metadata parser that the other MIK OAI-PMH toolchains use. Only the filegetter is specific to Islandora objects.
The filegetter is hard-coded to:
- parse out the information required to retrieve a file for each OAI-PMH record (in this case, an Islandora PID)
  - unlike the toolchain for repositories that identify resource files in a record element, this filegetter does not parse the Dublin Core metadata; it gets the PID from the OAI-PMH <identifier> element
- construct the direct URL to the file corresponding to the OBJ datastream using that information.
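The two steps above can be sketched in Python. This is an illustration only, not MIK's PHP implementation. It assumes that the OAI-PMH identifier has the form oai:<host>:<namespace>_<id> (as in the SpecificSet example later on this page), that the first underscore in the last segment stands in for the colon of the PID, and that the source site serves datastreams at the usual Islandora 7.x path /islandora/object/<PID>/datastream/OBJ/download. The function names are hypothetical.

```python
from urllib.parse import unquote


def pid_from_oai_identifier(identifier: str) -> str:
    """Recover an Islandora PID from an OAI-PMH identifier (assumed layout)."""
    # Identifiers may arrive URL-escaped, e.g. oai%3Ahost%3Ahbc_12.
    decoded = unquote(identifier)
    # Assumed layout: oai:<repository host>:<namespace>_<object id>
    local_part = decoded.rsplit(":", 1)[-1]
    # Only the first underscore is treated as the namespace separator.
    namespace, separator, object_id = local_part.partition("_")
    if not separator or not namespace or not object_id:
        raise ValueError(f"Cannot derive a PID from {identifier!r}")
    return f"{namespace}:{object_id}"


def obj_datastream_url(base_url: str, pid: str) -> str:
    """Build a download URL for the OBJ datastream (assumed Islandora 7.x path)."""
    return f"{base_url.rstrip('/')}/islandora/object/{pid}/datastream/OBJ/download"


pid = pid_from_oai_identifier("oai:digital.lib.sfu.ca:hbc_12")
print(pid)
print(obj_datastream_url("http://digital.lib.sfu.ca", pid))
```

If your provider formats identifiers differently, adjust the parsing step accordingly; the URL-building step does not depend on it.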
Specifically, this toolchain creates valid Islandora import packages using the OBJ datastream from objects that have any of the single-file Islandora content models with an OBJ datastream, such as basic image and large image. It will also harvest objects whose content models make the OBJ datastream optional (PDF, video, and audio), but only if the OBJ datastream is present.
By default, this toolchain writes the Dublin Core metadata records retrieved from the OAI-PMH provider. These can be loaded into Islandora, but if you want to transform the Dublin Core into MODS for loading into Islandora, you can do so by using a post-write hook script as illustrated below in the [WRITER] section.
All content added to Islandora import packages by this toolchain comes from the remote repository, so there is no need to prepare content.
All MIK configuration files are standard INI files which contain the following sections: [SYSTEM], [CONFIG], [FETCHER], [METADATA_PARSER], [FILE_GETTER], [WRITER], [MANIPULATORS], and [LOGGING]. Entries are required unless indicated otherwise below.
Commented lines begin with a semicolon. Values that contain whitespace or special characters (equals, semicolon, etc.) should be wrapped in double quotation marks. If in doubt, use the quotation marks. The order of the sections and the entries within each section do not matter.
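As a quick sanity check before running MIK, a configuration file can be inspected with a few lines of Python. This is a sketch that is independent of MIK (which is a PHP application): it uses Python's configparser, treats only the semicolon as a comment marker, and strips surrounding quotation marks itself because configparser keeps them. The helper names and the sample text are assumptions made for illustration.

```python
import configparser
from typing import List

REQUIRED_SECTIONS = [
    "SYSTEM", "CONFIG", "FETCHER", "METADATA_PARSER",
    "FILE_GETTER", "WRITER", "MANIPULATORS", "LOGGING",
]


def load_config(ini_text: str) -> configparser.ConfigParser:
    """Parse INI text; strict=False tolerates repeated keys such as postwritehooks[]."""
    parser = configparser.ConfigParser(
        strict=False,
        delimiters=("=",),
        comment_prefixes=(";",),
        interpolation=None,
    )
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(ini_text)
    return parser


def missing_sections(parser: configparser.ConfigParser) -> List[str]:
    """Return the expected section names that are absent, in listed order."""
    return [name for name in REQUIRED_SECTIONS if not parser.has_section(name)]


def unquoted(value: str) -> str:
    """Remove one pair of matching surrounding quotation marks, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


SAMPLE = """
; Minimal, deliberately incomplete configuration.
[SYSTEM]
date_default_timezone = 'America/Vancouver'

[FETCHER]
class = Oaipmh
oai_endpoint = "http://digital.lib.sfu.ca/oai2"

[LOGGING]
path_to_log = "/tmp/oaitest_output/mik.log"
"""

parser = load_config(SAMPLE)
print(missing_sections(parser))
print(unquoted(parser["FETCHER"]["oai_endpoint"]))
```

Note that with strict=False, configparser keeps only the last value of a repeated key, so this check is suitable for spotting missing sections but not for reading array-style entries.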
This section of the configuration file sets or overrides configuration settings for PHP and the various third-party PHP components used by MIK. It can contain the following entries:
- date_default_timezone: Optional. Provide a default timezone if date.timezone is null in the PHP INI. You will know if you need to use this setting because Monolog will throw MIK exceptions and halt MIK. Set to one of the valid PHP timezone values listed at http://php.net/manual/en/timezones.php.
- verify_ca: Optional. OS X's default PHP configuration uses Apple's Secure Transport rather than OpenSSL, which causes issues with Certificate Authority verification in Guzzle requests against websites that use HTTPS. This setting allows Guzzle to skip CA verification. You will know if you need to use this setting because Guzzle will write entries in your mik.log complaining about CA verification. Set to false to ignore CA verification.
Note: if you set verify_ca to false, you are bypassing the certificate check that HTTPS relies on to authenticate the remote website. Use at your own risk.
[SYSTEM]
date_default_timezone = 'America/Vancouver'
verify_ca = false
Key-value pairs of configuration entries in this section are simply written to the top of the log file specified in the [LOGGING] section's path_to_log setting. You can add whatever values you want, but they are static (that is, they can't be dynamically derived at runtime). Therefore, all entries in this section are optional.
[CONFIG]
config_id = oai-test
last_updated_on = "2017-03-24"
last_update_by = "Mark Jordan"
This section of the configuration file must contain the following entries:
- class: Must be 'Oaipmh'.
- oai_endpoint: Full URL to the source Islandora's OAI-PMH endpoint.
- set_spec: Optional; the set spec that limits the OAI harvest to a specific set. Islandora's OAI-PMH provider assigns sets to collections; to use a set spec based on an Islandora collection, replace the colon (:) in the collection's PID with an underscore, e.g., a collection PID of 'my:collection' will have a set spec of 'my_collection'.
- from: Optional; a date in either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ format that defines the start date in a selective harvest. Date-based harvests are described in the OAI-PMH spec.
- until: Optional; a date in either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ format that defines the end date in a selective harvest.
- metadata_prefix: Optional; the metadata prefix to use. Default is 'oai_dc'. Use 'mods' to harvest MODS metadata.
- temp_directory: Full path to the directory where the fetchers write data for use later in the toolchain.
- use_cache: Optional; set to false in automated tests (in other words, you will not need to use this unless you are writing automated tests for this fetcher).
[FETCHER]
class = Oaipmh
oai_endpoint = "http://digital.lib.sfu.ca/oai2"
set_spec = hbc_collection
metadata_prefix = oai_dc
temp_directory = "/tmp/oaitest_temp"
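To see how the fetcher settings above relate to an OAI-PMH request, the following Python sketch derives a set spec from a collection PID and assembles a ListRecords URL. The query parameters (verb, metadataPrefix, set, from, until) come from the OAI-PMH specification; whether MIK's fetcher builds its requests in exactly this way is an assumption, and the function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def set_spec_from_collection_pid(pid: str) -> str:
    """Replace the colon in a collection PID with an underscore."""
    return pid.replace(":", "_")


def list_records_url(
    endpoint: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
    until_date: Optional[str] = None,
) -> str:
    """Assemble a ListRecords request URL; optional arguments are omitted when unset."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{endpoint}?{urlencode(params)}"


print(set_spec_from_collection_pid("my:collection"))
print(
    list_records_url(
        "http://digital.lib.sfu.ca/oai2",
        set_spec="hbc_collection",
        from_date="2017-01-01",
    )
)
```

Opening the resulting URL in a browser is a simple way to confirm that the endpoint, set spec, and dates return the records you expect before configuring MIK.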
This section of the toolchain's configuration file contains the following entries:
- class: Must be 'dc\OaiToDc'. Use 'mods\OaiToMods' to harvest MODS metadata.
[METADATA_PARSER]
class = dc\OaiToDc
This section of the toolchain's configuration file contains the following entries:
- class: Must be 'OaipmhIslandoraObj'.
- temp_directory: Full path to the directory where the file getter will write data for use later in the toolchain. Can be the same as the temp_directory value used in the [FETCHER] section.
[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = "/tmp/oaitest_temp"
This section of the toolchain's configuration file contains the following entries:
- class: Must be 'Oaipmh'.
- output_directory: The full path to the directory where output packages are written.
- postwritehooks[]: Repeated entries for each post-write hook script used in this toolchain. Currently, MIK ships with only one post-write script (shown in the example below), which applies an XSLT stylesheet to the OAI_DC metadata to transform it into MODS. This script overwrites the source Dublin Core XML file but creates a backup before doing so.
[WRITER]
class = Oaipmh
output_directory = "/tmp/oaitest_output"
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/oai_dc_to_mods.php"
This toolchain can use the SpecificSet and RandomSet fetcher manipulators. If you have a use case for additional manipulators, please file an issue.
[MANIPULATORS]
; fetchermanipulators[] = "RandomSet|10"
fetchermanipulators[] = "SpecificSet|hbc_specific_set.txt"
The input file for the SpecificSet manipulator must contain OAI-PMH identifiers in the same form in which the OAI provider supplies them. For instance, colons may be URL-escaped, as in this example:
oai%3Adigital.lib.sfu.ca%3Ahbc_12
oai%3Adigital.lib.sfu.ca%3Ahbc_13
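If you start from a list of unescaped identifiers, a short Python sketch like the one below can produce lines in the escaped form shown above. It assumes your provider escapes colons in this way; check the identifiers your provider actually returns before relying on it. The function names are hypothetical.

```python
from typing import Iterable
from urllib.parse import quote


def escape_identifier(identifier: str) -> str:
    """Percent-encode every reserved character, including colons."""
    return quote(identifier, safe="")


def specific_set_lines(identifiers: Iterable[str]) -> str:
    """Return file content with one escaped identifier per line."""
    return "\n".join(escape_identifier(item) for item in identifiers) + "\n"


content = specific_set_lines(
    ["oai:digital.lib.sfu.ca:hbc_12", "oai:digital.lib.sfu.ca:hbc_13"]
)
print(content, end="")
```

The returned text can be saved to the file named in the fetchermanipulators[] entry.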
This section of the toolchain's configuration file contains the following entries:
- path_to_log: The full path to the standard log generated by MIK.
[LOGGING]
path_to_log = "/tmp/oaitest_output/mik.log"