Usage - NatLibFi/RecordManager GitHub Wiki
Here are some basic instructions for using RecordManager.
Main functionality is accessed using the ./console utility. Run it without parameters for a summary of available functionality. Use the --help parameter with a command to see its options, e.g. ./console solr:update-index --help. You can abbreviate commands as long as they remain unambiguous, e.g. ./console so:up for "solr:update-index".
Note that by default RecordManager does not follow HTTP redirects (status 302). You can enable redirects with the appropriate setting in the HTTP section of recordmanager.ini.
The general workflow is:
- Get data into RecordManager's database
  ./console records:import source_id source_records.xml
  or
  ./console records:harvest
- Deduplicate (Required only if deduplication is used)
  ./console records:deduplicate
- Load records into Solr
  ./console solr:update-index
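The steps above could be chained in a small wrapper script, for example for unattended runs. This is only a sketch: the CONSOLE and DRY_RUN variables and the run_step and run_workflow helpers are hypothetical names, not part of RecordManager. Drop the deduplication step if deduplication is not in use.

```shell
#!/bin/sh
# Sketch of the general workflow as shell functions.
# Assumptions (not from RecordManager itself): CONSOLE, DRY_RUN,
# run_step and run_workflow are hypothetical names.
CONSOLE="${CONSOLE:-./console}"

# Print the command; execute it unless DRY_RUN is set to a non-empty value.
run_step() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# Harvest all sources, deduplicate and update Solr,
# stopping at the first step that fails.
run_workflow() {
    run_step "$CONSOLE" records:harvest &&
    run_step "$CONSOLE" records:deduplicate &&
    run_step "$CONSOLE" solr:update-index
}
```

Calling run_workflow with DRY_RUN=1 set only prints the three commands, which is a cheap way to check the sequence before running it for real.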
It is also possible to test updating Solr with a single record to see what is being indexed:
./console solr:update-index --single=source.record_id --verbose --no-commit
Use the value of the record's "_id" field in the database for "source.record_id". The --verbose flag makes RecordManager display much more information on the screen, and --no-commit skips the final commit so that the script ends quickly, although the record may then not be immediately visible in Solr.
It is also possible to renormalize a record or a data source if there was a problem with the normalization rules:
./console records:renormalize --source=xyz --verbose
or
./console records:renormalize --single=source.record_id --verbose
See Command-Line Reference for all available options.
First of all, the general settings are entered in conf/recordmanager.ini and all data sources are defined in conf/datasources.ini.
The logging facility in RecordManager can send email alerts when fatal errors occur (configured in recordmanager.ini). This helps monitor the system when processes are run in unattended mode (e.g. from crontab).
See Configuration for more information on the configuration files.
Please note that RecordManager's record classes don't currently handle namespaces properly, so take care to get rid of them during the import or harvesting process.
Here is a sample configuration for an OAI-PMH repository providing records in MARC format:
[samplesource]
url = http://oai-pmh-provider/base-path
metadataPrefix = marc21
;verbose = true
;debuglog = oai.log
institution = MyInst
format = marc
dedup = true
componentParts = as_is
Harvesting can be started with the harvest command (if the --source parameter is not provided, all data sources are harvested):
./console records:harvest --source=samplesource
Here is another sample for harvesting Dublin Core records that require some normalization:
[samplesource_subset]
institution = MyInst
url = http://oai-pmh-provider/base-path
metadataPrefix = oai_dc
set = oai-pmh-set
oaipmhTransformation = strip_namespaces.xsl
normalization = helmet.properties
format = dc
dedup = true
idPrefix = samplesource
In this case oaipmhTransformation points to an XSL transformation that is applied to the OAI-PMH responses before any further processing. normalization points to a normalization transformation configuration file. Everything related to transformations resides in the transformations directory. idPrefix can be used if there are multiple data source definitions for the same actual data source (e.g. when a couple of different OAI-PMH sets need to be harvested from a single database).
Here is yet another sample, this time for importing Dublin Core records that require some normalization from files:
[samplesource_subset]
institution = MyInst
url = http://oai-pmh-provider/base-path
metadataPrefix = oai_dc
set = oai-pmh-set
preTransformation = strip_namespaces.xsl
normalization = helmet.properties
format = dc
dedup = true
idPrefix = samplesource
In this case preTransformation points to an XSL transformation that is applied to the file to be imported before any further processing.
When harvesting or importing is done, deduplication can be run:
./console records:deduplicate
Deduplication will go through all data sources with dedup=true in their settings and create deduplication links between the records. Deduplication keys can be used to e.g. group the records during search, or a custom export routine that combines the records could be created.
If normalization rules are changed, normalization can be re-executed without harvesting records anew:
./console records:renormalize --source=<datasource>
Note that if deduplication is enabled for the data source, deduplication should be run after renormalization.
When all is done, let's update the Solr index with any new, changed or deleted records:
./console solr:update-index
By default, RecordManager tracks the last Solr update date and only processes new changes. The --from parameter can be used to force processing from a given date (e.g. --from=2011-01-01). The --all parameter instructs RecordManager to disregard the dates completely and load all records.
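The three modes described above can be wrapped in one helper, sketched below. The update_index function and the CONSOLE and DRY_RUN variables are hypothetical names used for illustration. Only the --from and --all options mentioned above are used.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of RecordManager) that picks how much to index:
#   update_index              only changes since the last Solr update
#   update_index all          every record (--all)
#   update_index 2011-01-01   changes since the given date (--from)
CONSOLE="${CONSOLE:-./console}"

update_index() {
    # Replace the positional parameters with the option to pass on.
    case "${1:-}" in
        "")  set -- ;;
        all) set -- --all ;;
        *)   set -- "--from=$1" ;;
    esac
    echo "+ $CONSOLE solr:update-index${1:+ $*}"
    # Execute unless DRY_RUN is set to a non-empty value.
    if [ -z "${DRY_RUN:-}" ]; then
        "$CONSOLE" solr:update-index "$@"
    fi
}
```

With DRY_RUN=1 set, the function only echoes the command it would run.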
It is important to follow specific steps when deleting a data source, since the records may be indexed in Solr and may also be part of a group of deduplicated records. Because of this, the recommended steps to delete a data source are:
- Mark the data source deleted:
  ./console records:mark-deleted --source=<datasource>
- Run deduplication (if in use):
  ./console records:deduplicate
- Update Solr index:
  ./console solr:update-index
- Delete the data source from RecordManager's database:
  ./console records:delete-source <datasource> --force
- Remove the configuration from datasources.ini
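Because the order of these steps matters, it may help to script them so that a failure stops the sequence before the records are removed from the database. The following is a sketch under assumptions: delete_source, run_step, CONSOLE and DRY_RUN are hypothetical names, and the final step (removing the section from datasources.ini) is still done by hand.

```shell
#!/bin/sh
# Sketch of the recommended deletion sequence.
# Assumptions: CONSOLE, DRY_RUN, run_step and delete_source are
# hypothetical names, not part of RecordManager.
CONSOLE="${CONSOLE:-./console}"

# Print the command; execute it unless DRY_RUN is set to a non-empty value.
run_step() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# Usage: delete_source <datasource>
delete_source() {
    if [ -z "${1:-}" ]; then
        echo "usage: delete_source <datasource>" >&2
        return 2
    fi
    # Each step runs only if the previous one succeeded.
    run_step "$CONSOLE" records:mark-deleted "--source=$1" &&
    run_step "$CONSOLE" records:deduplicate &&
    run_step "$CONSOLE" solr:update-index &&
    run_step "$CONSOLE" records:delete-source "$1" --force
}
```

Running it first with DRY_RUN=1 shows the four commands without touching the database or the index.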
RecordManager stores the records in a MySQL/MariaDB database or a MongoDB database defined in recordmanager.ini (default recman). In MongoDB the records are stored in the record collection. Records from different sources are distinguished by the source prefix in their IDs (_id in MongoDB) and by the source_id field.
You can start the Mongo shell from the command line using the command:
mongo recman
In the shell, the following command, for example, can be used to display records:
db.record.find()
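Since records carry a source_id field, a quick per-source count can also be run without entering the shell. The sketch below assumes the legacy mongo client is on the PATH and that the database uses the default name recman. The count_source_records function and the DB and DRY_RUN variables are hypothetical names.

```shell
#!/bin/sh
# Sketch: count the records of one data source in MongoDB.
# Assumptions: legacy "mongo" shell available, default database name
# "recman"; DB, DRY_RUN and count_source_records are hypothetical names.
DB="${DB:-recman}"

# Usage: count_source_records <source_id>
# The source ID is inserted into the query as-is, so it must not
# contain double quotes.
count_source_records() {
    query="db.record.count({source_id: \"$1\"})"
    echo "+ mongo $DB --quiet --eval '$query'"
    # Execute unless DRY_RUN is set to a non-empty value.
    if [ -z "${DRY_RUN:-}" ]; then
        mongo "$DB" --quiet --eval "$query"
    fi
}
```

For example, count_source_records samplesource would count the records harvested from the samplesource definition used earlier.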