Usage - NatLibFi/RecordManager GitHub Wiki

RecordManager Usage

Here are some basic instructions for using RecordManager.

Main functionality is accessed using the ./console utility. Run it without parameters for a summary of available functionality. Use the --help parameter with a command to see its options, e.g. ./console solr:update-index --help. You can abbreviate commands as long as they remain unambiguous, e.g. ./console so:up for "solr:update-index".

Note that by default RecordManager does not follow HTTP redirects (status 302). You can enable redirects with the appropriate setting in the HTTP section of recordmanager.ini.

General Workflow

The general workflow is:

  1. Get data into RecordManager's database

     ./console records:import source_id source_records.xml

     or

     ./console records:harvest

  2. Deduplicate (required only if deduplication is used)

     ./console records:deduplicate

  3. Load records into Solr

     ./console solr:update-index
    

It is also possible to test updating Solr with a single record to see what's going in:

./console solr:update-index --single=source.record_id --verbose --no-commit

Use the value of the "_id" field in the database for "source.record_id". The --verbose flag makes RecordManager display much more information on the screen, and --no-commit skips the final commit so that the script ends quickly, although the record may then not be immediately visible in Solr.

It is also possible to renormalize a record or a data source if there was a problem with the normalization rules:

./console records:renormalize --source=xyz --verbose

or

./console records:renormalize --single=source.record_id --verbose

See Command-Line Reference for all available options.
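
As a rough sketch, the harvest, deduplicate and index steps of the general workflow could be chained in a small shell function so that a failed step stops the run. The function name run_workflow and the CONSOLE override variable are illustrative assumptions and not part of RecordManager; only the ./console commands themselves are taken from this page.

```shell
#!/bin/sh
# Hypothetical helper: chains the workflow steps described on this page.
# CONSOLE is an assumed override hook; it defaults to ./console.
CONSOLE="${CONSOLE:-./console}"

# Usage: run_workflow [source_id]
# Harvests the given data source (or all sources when no ID is given),
# then deduplicates and updates the Solr index.
# Returns non-zero as soon as one step fails.
run_workflow() {
    if [ -n "${1:-}" ]; then
        "$CONSOLE" records:harvest "--source=$1" || return 1
    else
        "$CONSOLE" records:harvest || return 1
    fi
    # Required only if deduplication is used
    "$CONSOLE" records:deduplicate || return 1
    "$CONSOLE" solr:update-index || return 1
}
```

Calling run_workflow samplesource from the RecordManager directory would then run the three commands in order. Setting CONSOLE=echo prints the commands instead of executing them, which is a cheap way to check the sequence first.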

Examples

First of all, the general settings are entered in conf/recordmanager.ini and all data sources are defined in conf/datasources.ini.

The logging facility in RecordManager can send email alerts when fatal errors occur (configured in recordmanager.ini). This helps monitor the system when processes are run in unattended mode (e.g. from crontab).

See Configuration for more information on the configuration files.

Please note that RecordManager's record classes don't currently handle namespaces properly, so take care to get rid of them during the import or harvesting process.

MARC via OAI-PMH

Here is a sample configuration for an OAI-PMH repository providing records in MARC format:

[samplesource]
url = http://oai-pmh-provider/base-path
metadataPrefix = marc21
;verbose = true
;debuglog = oai.log
institution = MyInst
format = marc
dedup = true
componentParts = as_is

The harvesting can be started with the harvest command (if the source parameter is not provided, all data sources are harvested):

./console records:harvest --source=samplesource

DC via OAI-PMH

Here is another sample for harvesting Dublin Core records that require some normalization:

[samplesource_subset]
institution = MyInst
url = http://oai-pmh-provider/base-path
metadataPrefix = oai_dc
set = oai-pmh-set
oaipmhTransformation = strip_namespaces.xsl
normalization = helmet.properties
format = dc
dedup = true
idPrefix = samplesource

In this case oaipmhTransformation points to an XSL transformation that is applied to the OAI-PMH responses before any further processing, and normalization points to a normalization transformation configuration file. Everything related to transformations resides in the transformations directory. idPrefix can be used when there are multiple data source definitions for the same underlying data source (e.g. when a couple of different OAI-PMH sets need to be harvested from a single database).

DC from a File

Here is yet another sample, this time for importing Dublin Core records from files, where the records again require some normalization:

[samplesource_subset]
institution = MyInst
url = http://oai-pmh-provider/base-path
metadataPrefix = oai_dc
set = oai-pmh-set
preTransformation = strip_namespaces.xsl
normalization = helmet.properties
format = dc
dedup = true
idPrefix = samplesource

In this case preTransformation points to an XSL transformation that is applied to the file to be imported before any further processing.

Deduplication

When harvesting or importing is done, deduplication can be run:

./console records:deduplicate

Deduplication will go through all data sources with dedup=true in their settings and create deduplication links between the records. Deduplication keys can be used to e.g. group the records during search, or a custom export routine that combines the records could be created.

Renormalization

If normalization rules are changed, normalization can be re-executed without harvesting records anew:

./console records:renormalize --source=<datasource>

Note that if deduplication is enabled for the data source, deduplication should be run after renormalization.

Updating Solr

When all is done, let's update the Solr index with any new, changed or deleted records:

./console solr:update-index

By default, RecordManager tracks the date of the last Solr update and only processes changes made since then. The --from parameter can be used to force processing from a given date (e.g. --from=2011-01-01). The --all parameter instructs RecordManager to disregard the dates completely and load all records.
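
The three update modes can be wrapped in one small helper, sketched here under the assumption that it is run from the RecordManager directory. The function name update_index and the CONSOLE override variable are hypothetical; the --from and --all parameters are the ones described above.

```shell
#!/bin/sh
# Hypothetical wrapper around solr:update-index.
# CONSOLE is an assumed override hook; it defaults to ./console.
CONSOLE="${CONSOLE:-./console}"

# update_index             -> process only changes since the last update
# update_index all         -> disregard dates and load all records
# update_index 2011-01-01  -> process changes from the given date
update_index() {
    case "${1:-}" in
        "")
            "$CONSOLE" solr:update-index ;;
        all)
            "$CONSOLE" solr:update-index --all ;;
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
            "$CONSOLE" solr:update-index "--from=$1" ;;
        *)
            echo "usage: update_index [all|YYYY-MM-DD]" >&2
            return 2 ;;
    esac
}
```

The date check only verifies the YYYY-MM-DD shape, not that the date exists, so anything beyond that is left to RecordManager itself.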

Deleting a Data Source

It is important to follow specific steps when deleting a data source, since the records may be indexed in Solr and may also be part of a group of deduplicated records. Because of this, the recommended steps to delete a data source are:

  1. Mark the data source deleted:

     ./console records:mark-deleted --source=<datasource>
    
  2. Run deduplication (if in use):

     ./console records:deduplicate
    
  3. Update Solr index:

     ./console solr:update-index
    
  4. Delete the data source from RecordManager's database:

     ./console records:delete-source <datasource> --force
    
  5. Remove the configuration from datasources.ini
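
The command steps above can be sketched as a single shell function that stops at the first failure, so that a source is never deleted from the database before Solr has been updated. The function name delete_source and the CONSOLE override variable are assumptions made for illustration; the commands are the ones listed in the steps.

```shell
#!/bin/sh
# Hypothetical helper: runs steps 1-4 of the data source deletion procedure.
# CONSOLE is an assumed override hook; it defaults to ./console.
CONSOLE="${CONSOLE:-./console}"

# Usage: delete_source <datasource>
delete_source() {
    if [ -z "${1:-}" ]; then
        echo "usage: delete_source <datasource>" >&2
        return 2
    fi
    # 1. Mark the data source deleted
    "$CONSOLE" records:mark-deleted "--source=$1" || return 1
    # 2. Run deduplication (remove this line if deduplication is not in use)
    "$CONSOLE" records:deduplicate || return 1
    # 3. Update the Solr index
    "$CONSOLE" solr:update-index || return 1
    # 4. Delete the data source from RecordManager's database
    "$CONSOLE" records:delete-source "$1" --force || return 1
    # 5. Editing datasources.ini is left as a manual step
    echo "Remember to remove [$1] from datasources.ini" >&2
}
```

Step 5 is deliberately not automated in this sketch, since editing the configuration file by hand is less risky than rewriting it from a script.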

Records in the Mongo Database

RecordManager stores the records in a MySQL/MariaDB database or a MongoDB database defined in recordmanager.ini (default name: recman). In MongoDB the records are stored in the record collection. Records from different sources are distinguished by the source prefix in their IDs (_id in MongoDB) and by the source_id field.

You can start the Mongo shell from the command line using the command:

mongo recman

In the shell, the following command, for example, can be used to display records:

db.record.find()
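
Since records carry a source_id field, the query can also be narrowed to one data source without opening an interactive shell. The following is a minimal sketch that assumes the legacy mongo shell and the default recman database; the function name show_source_records and the MONGO and DB_NAME variables are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints a few records of one data source.
# MONGO and DB_NAME are assumed override hooks with the defaults used on this page.
MONGO="${MONGO:-mongo}"
DB_NAME="${DB_NAME:-recman}"

# Usage: show_source_records <source_id> [limit]
# The source ID is pasted into the query as-is, so only pass trusted values.
show_source_records() {
    if [ -z "${1:-}" ]; then
        echo "usage: show_source_records <source_id> [limit]" >&2
        return 2
    fi
    "$MONGO" "$DB_NAME" --quiet --eval \
        "db.record.find({source_id: '$1'}).limit(${2:-5}).forEach(printjson)"
}
```

For example, show_source_records samplesource 10 would print up to ten records of the samplesource data source. On installations that ship mongosh instead of mongo, setting MONGO=mongosh is likely to work, but that has not been verified against RecordManager here.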