Migrator Workflow (EN) - VertNet/toolkit GitHub Wiki

Introduction

Data Migrators are customized Microsoft Access databases that link to or import source data. Migrators:

process data into a Darwin Core CSV file and an optional Media Extension CSV file,
add new vocabulary to a Vocabulary Manager,
resolve vetted vocabularies to standard values from the Vocabulary Manager,
remove problematic non-printing characters,
track changes made to the source data, and
create reports that detail potential data problems and recommend changes to be made in the original data source.
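
The migrators do this work inside Access, so the following is only a minimal Python sketch of what the non-printing-character step in the list above might look like. The function name `strip_non_printing` and the choice to treat every Unicode control character (category `Cc`) as problematic are assumptions made for illustration, not the migrator's actual rules.

```python
import unicodedata


def strip_non_printing(value: str) -> str:
    """Replace control characters with spaces, then collapse whitespace runs.

    Assumption: "problematic" means Unicode category Cc (tabs, line breaks,
    vertical tabs, and similar), which can break rows in a CSV file.
    """
    cleaned = "".join(
        " " if unicodedata.category(ch) == "Cc" else ch
        for ch in value
    )
    # split() with no arguments splits on any whitespace run and drops
    # leading and trailing whitespace.
    return " ".join(cleaned.split())


locality = strip_non_printing("Berkeley\r\nCalifornia\x0b")
```

Accented and other legitimate non-ASCII characters are left untouched, since only control characters are replaced.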

Migrator Template

This GitHub repository contains the latest version of the Migrator template. The version of the template can be determined by the date of the most recent entry in the ChangeLog.txt file. Each change to the template is logged in this file, making it relatively easy to upgrade an older migrator by reviewing the changes made to the template since that migrator was created. The template consists of Access databases, scripts, and folders used to process source data into files ready for upload to a resource on an instance of the GBIF Integrated Publishing Toolkit (IPT). The template is used as the basis of a customized migrator for each distinct resource on the IPT. A migrator can join multiple source data sets into an aggregated data set for a single resource on the IPT.

Migrator Customization

Every data source is unique and requires some customized treatment to transform it into Simple Darwin Core (and the optional Media extension, if necessary) and to prepare it for upload as an IPT resource. The majority of the customization occurs in queries and macros within two of the migrator's Access databases contained in the templates folder. The first of these databases is named "DwC2ExtractTemplate-XXXX.mdb", where XXXX is one of "Audio", "Aves", "Eggs", "Ent", "Fish", "Fossils", "Fungi", "Herps", "Inverts", "Mammals", "Plants", or "Verts", depending on which of these groups the migrator is to be run for. This database is used to link the original data source to the migrator and to perform the preliminary steps necessary to transform the original data into Darwin Core fields.

The second database, named AggregatorTemplate.mdb, contains queries and a macro ("Aggregate and Export") to combine distinct data sources and create an aggregate Simple Darwin Core Occurrence CSV file that is ready for upload to a resource on an IPT. The Aggregator must be invoked to create the Darwin Core CSV file even if there is only one data source. The macro must be modified to include the relevant migrated data sources (any combination of "Audio", "Aves", "Eggs", "Ent", "Fish", "Fossils", "Fungi", "Herps", "Inverts", "Mammals", "Plants", "Verts").
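
The aggregation itself is performed by the Access macro, but the idea of combining one or more migrated sources into a single occurrence file can be sketched in Python. Everything named here (the `aggregate` function, the two-column header, the sample rows) is hypothetical and stands in for whatever the "Aggregate and Export" macro actually does.

```python
import csv
import io


def aggregate(sources, fieldnames):
    """Concatenate rows from several CSV texts into one CSV text.

    Every source is written under the same shared header. Columns that a
    source lacks are left empty, and columns not in `fieldnames` are dropped.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for text in sources:
        for row in csv.DictReader(io.StringIO(text)):
            writer.writerow({name: row.get(name, "") for name in fieldnames})
    return out.getvalue()


# Hypothetical migrated sources for two groups.
aves = "catalogNumber,scientificName\n1,Turdus migratorius\n"
fish = "catalogNumber,scientificName\n2,Salmo trutta\n"

combined = aggregate([aves, fish], ["catalogNumber", "scientificName"])
single = aggregate([aves], ["catalogNumber", "scientificName"])
```

The same function handles the single-source case, mirroring the requirement that the Aggregator be run even when there is only one data source.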

The steps for customizing a migrator for a new data set, and instructions for running each step, are given in the file "README_Instructions for use_EN.pdf" on the code page of this repository.

Vocabularies

The Migrator links to and uses a Vocabulary database (VocabulariesMaster.mdb). The Vocabulary database contains lookup tables for various individual and combined Darwin Core terms and provides standardized values for verbatim values of terms found in source data. The Vocabulary lookup tables are populated whenever a migrator is run: any value never before encountered is added to the relevant lookup table. These new values must be vetted by someone with management authority over the vocabularies, who adds a standard value for each new term.

Vocabularies are consulted during the course of the migration process to replace non-standard values or terms with their standard equivalents. In order to make sure that the standard values are included, the migrator has to be run once to populate the Vocabulary database with hitherto unknown values, then again after those values have been resolved to make the substitutions.
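
The two-pass cycle described above can be sketched in Python as follows. This is an illustration only: the real lookup tables live in VocabulariesMaster.mdb, and the function `migrate_values`, the use of `None` to mark an unvetted value, and the choice to pass unresolved values through unchanged are all assumptions.

```python
def migrate_values(verbatim_values, vocabulary):
    """Substitute standard values and record values not seen before.

    `vocabulary` maps a verbatim value to its standard value, or to None
    when the value has been recorded but not yet vetted (an assumption).
    """
    output = []
    for value in verbatim_values:
        if value not in vocabulary:
            vocabulary[value] = None  # new value, awaiting vetting
        standard = vocabulary[value]
        # Assumption: unresolved values pass through unchanged.
        output.append(standard if standard is not None else value)
    return output


vocab = {"USA": "United States"}
source = ["USA", "U.S.A."]

first_run = migrate_values(source, vocab)   # populates vocab with "U.S.A."
vocab["U.S.A."] = "United States"           # vetting by a vocabulary manager
second_run = migrate_values(source, vocab)  # substitution now succeeds
```

After the first run the unknown value is present in the vocabulary but unresolved; only the second run, after vetting, yields standard values for every record.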

To resolve vocabularies, there is a separate Access database, VocabulariesManager.mdb. This database links to the tables in VocabulariesMaster.mdb and has a number of queries and macros to facilitate the management of the vocabularies.

CSV file exports of the latest contents of the Vocabulary lookup tables are kept in the DWCVocabs GitHub repository, as vocabs, within the master vocabularies. These are pushed to GitHub following any new vocabulary resolution, usually after the creation of a migrator for a new data source.

Review

The migrators generate a Simple Darwin Core CSV file (and an optional Media Extension CSV file). It is best practice to share these files with the data publisher for review before they are uploaded to the IPT and made public. This allows the data publisher to see how the data will appear before authorizing their release.

During data processing in the migrator, a number of reports are generated for each data source and placed in a reports folder. These reports show where potential problems with data quality, formatting, or standardization have been found. It is best practice to share these reports with the data publisher prior to the first publication of the data set to the IPT so that the data publisher can determine if they would like to make changes at the source based on these reports before publishing the data. The data are published once the original data are deemed suitable for publication, perhaps after multiple cycles of running the migrator and review.

Among the reports shared with the data publishers are those for:

duplicate or missing catalog numbers,
years, months, and days out of range,
non-standard geographic region names and indeterminate geography,
non-standard taxonomic names at the rank of genus and higher,
changes made in the published data versus the verbatim original, and
non-printing characters in the data content that compromise data integrity when formatted for sharing as CSV files.

A detailed description of the reports can be found within the migrator in the file "Report Explanation" in the folder "reports".
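
As an illustration of the kind of check behind the first report listed above (duplicate or missing catalog numbers), here is a minimal Python sketch. The migrator produces its reports with Access queries. The function name `catalog_number_report`, the record layout, and the rule that a blank or whitespace-only value counts as missing are assumptions.

```python
from collections import Counter


def catalog_number_report(records):
    """Return (indexes of rows with no catalog number, duplicated numbers)."""

    def number(record):
        # Treat an absent key, None, or whitespace-only text as missing.
        return (record.get("catalogNumber") or "").strip()

    missing = [i for i, record in enumerate(records) if not number(record)]
    counts = Counter(number(record) for record in records if number(record))
    duplicates = sorted(n for n, count in counts.items() if count > 1)
    return missing, duplicates


# Hypothetical records for demonstration.
records = [
    {"catalogNumber": "123"},
    {"catalogNumber": ""},
    {"catalogNumber": "123"},
    {"catalogNumber": "456"},
]
missing_rows, duplicate_numbers = catalog_number_report(records)
```

A report of this kind points the data publisher at the rows to correct in the original data source rather than changing them during migration.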

Migrator Maintenance

So far, VertNet has maintained an archive of customized migrators for all of the data sources from organizations that opt to use this service. Innovations are developed with each new migrator customization. These innovations are captured in the template (and in this GitHub repository) and are logged in the ChangeLog.txt file. VertNet has also maintained the vocabulary files. Now that more people are using the migrators, we are exploring ways to make the vocabularies a collaborative effort, taking advantage of the vocabulary resolutions made by all participants and finding new ways to merge the standardized values for everyone to use.