FcigDM_ANFC - AtlasOfLivingAustralia/ala-datamob GitHub Wiki

Australian National Fish Collection: primary collection management system darwincore export

Introduction

This is an implementation of a darwincore export, for one of the FCIG-OZCAM participants.

Artefacts and synopsis


| Item | Short URL | Details (or long URL) |
|---|---|---|
| This wiki page | http://goo.gl/XfgHU | FcigDM_ANFC |
| Source data system | | |
| Collection management software | | bespoke |
| Database backend | | texpress |
| Exporter's execution environment | | ?sh (linux command shell) |
| Adhoc query | | |
| Bulk export method | | |
| Schema reporting | | |
| DwC mapping | | perl script |
| Compression, transmission | | gzip, sftp to upload.ala.org.au |
| Output data | | Darwincore csv (simple-dwc) format with non-standard FCIG extensions |

institutionCode "CSIRO"
dcterms:type "PhysicalObject"
basisOfRecord "PreservedSpecimen"

Data availability:

| Item | Short URL | Details (or long URL) |
|---|---|---|
| ANFC data before export | | |
| ANFC data at export | | not generally available - contact data manager |
| ANFC data after atlas (biocache) ingest | http://goo.gl/Rh3Je | |
| Completeness model | http://goo.gl/Yv6k9 | Google docs -> Data management -> CompletenessDwC -> anfc.dwccm.26 |
| Source code | http://goo.gl/yzV59 | https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc |

Behaviour

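There are three parts to the mapping:

  • anfc_dwc.run.pl: the entry point; reads the input, calls the mapping functions and writes the output, error and log streams
  • anfc_dwc.map.pl: the logic that transforms ANFC source data to simple-dwc
  • anfc_dwc.map-cntryocn.pl: validation and value mappings for country, state and water body

This document does not cover the following (refer to the data manager):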

  • method used to extract data from the texpress database
  • method used to send data to the atlas

anfc_dwc.run.pl

The first export component is a perl script, anfc_dwc.run.pl, which is the entry point for mapping a previously generated report to the simple-dwc format. It takes its input/output options from the command line:

usage: anfc_dwc.run.pl <IN> <OUT> <ERROR> <LOG>
 - all 4 arguments:  input file (or 'STDIN'), output file (or 'STDOUT'), error file, log file
 - 3 arguments:      input file, output file, error file (run log to stdout)
 - 2 arguments:      input file, output file (errors stderr, run log to stdout)
 - 1 argument:       input file (output stdout, errors stderr, no run log)
 - no arguments:     (input stdin, output stdout, errors stderr, no run log)
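
The argument rules above can be sketched as a small lookup. This is a Python illustration only, not the Perl in anfc_dwc.run.pl; the function name resolve_streams and the use of None for "no run log" are assumptions made for the example.

```python
def resolve_streams(argv):
    """Return (input, output, error, log) targets for 0 to 4 arguments.

    Hypothetical helper: mirrors the usage text, not the actual script.
    None in the log slot means that no run log is written.
    """
    if len(argv) > 4:
        raise ValueError("usage: anfc_dwc.run.pl <IN> <OUT> <ERROR> <LOG>")
    defaults = ["STDIN", "STDOUT", "STDERR", None]
    # Arguments given on the command line replace the leading defaults.
    targets = list(argv) + defaults[len(argv):]
    # With 2 or 3 arguments the usage text sends the run log to stdout.
    if len(argv) in (2, 3):
        targets[3] = "STDOUT"
    return tuple(targets)
```

For example, resolve_streams(["in.txt", "out.csv"]) sends errors to stderr and the run log to stdout, matching the "2 arguments" line of the usage text.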

After establishing a valid environment it reads from the input stream and expects to find a header row followed by data rows. The field delimiter is defined in the script variable $reDelim (currently the sequence +| "plus pipe") and is used by the perl split() function to separate rows into fields for processing in the mapping scripts.
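
A minimal sketch of the header-plus-rows split, written in Python rather than the script's Perl. Both characters of the +| delimiter are regular-expression metacharacters, so the pattern escapes them. The field names and values in the sample lines are invented for illustration and are not taken from the ANFC report.

```python
import re

# Literal "+|": both "+" and "|" must be escaped inside a regular expression.
RE_DELIM = re.compile(r"\+\|")

def split_row(line):
    """Strip the line ending and split one report line into its fields."""
    return RE_DELIM.split(line.rstrip("\r\n"))

# Hypothetical header and data row, for illustration only.
header = split_row("reg_no+|family+|genus\n")
row = split_row("X 0001-01+|Labridae+|Coris\n")
record = dict(zip(header, row))
```

Pairing each data row with the header this way lets later mapping code refer to fields by name instead of by position.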

Each row is mapped individually, as it is encountered in the source file, via the map_record() function defined in anfc_dwc.map.pl. Mapping results - stored in a temporary variable @arr_output - are printed to the output stream (if possible) and any warnings or errors encountered during the mapping are stored in the %hErrors buffer. Under successful conditions the script also prints periodically to the log stream, with any error rows being specifically marked as well. At the end of input processing, the contents of the error buffer are flushed to the error stream in a structured format.
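
The row loop and error buffer described above might look roughly like the following Python sketch. It is not a port of anfc_dwc.run.pl: the run() signature, the choice to skip a row whose mapping fails, and the toy mapper are all assumptions, and the periodic log output is left out.

```python
import csv
import io

def run(in_stream, out_stream, map_record, delim="+|"):
    """Map every data row; return the error buffer keyed by row number."""
    errors = {}  # stand-in for the %hErrors buffer
    writer = csv.writer(out_stream, lineterminator="\n")
    header = in_stream.readline().rstrip("\r\n").split(delim)
    for row_no, line in enumerate(in_stream, start=1):
        fields = line.rstrip("\r\n").split(delim)
        try:
            output, warnings = map_record(dict(zip(header, fields)))
        except ValueError as exc:
            # Assumption: a row that cannot be mapped is recorded and skipped.
            errors.setdefault(row_no, []).append(f"error: {exc}")
            continue
        if warnings:
            errors.setdefault(row_no, []).extend(warnings)
        writer.writerow(output)
    return errors

def toy_map_record(rec):
    """Hypothetical mapper: requires field 'a', warns when 'b' is empty."""
    if not rec["a"]:
        raise ValueError("missing a")
    return [rec["a"], rec["b"]], ([] if rec["b"] else ["b is empty"])

source = io.StringIO("a+|b\n1+|2\n+|3\n4+|\n")
output = io.StringIO()
errors = run(source, output, toy_map_record)
```

Collecting messages per row and flushing them after the loop keeps the output stream free of diagnostics, which is the separation the paragraph above describes.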


Activity diagram for https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc/anfc_dwc.run.pl

anfc_dwc.map.pl

This script defines the logic to transform ANFC source data to the target model; map_dwc_header() and map_record() are called by anfc_dwc.run.pl; the other functions - map_coords(), dPrecisionDDM(), isNum() and sConvertDDM() - are helpers.

map_dwc_header()

This function is called by anfc_dwc.run.pl to establish the output darwincore terms and their order in the output file. Results are stored in a hash %hDwcHdrInd (defined in anfc_dwc.run.pl) which is passed to map_record() every time it is called. This allows map_record() to store output values in corresponding indices without knowing the order in which darwincore terms appear in the output file.

Important note: any new dwc term referenced in the mapping functions must also be added here, otherwise output behaviour is undefined. The order in which dwc terms appear in the output can also be changed by adjusting the code in map_dwc_header().
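
The header-index idea can be illustrated with a short Python sketch. The term list is a small illustrative subset, the source field name species is hypothetical, and the functions are Python stand-ins for the Perl ones rather than copies of them.

```python
def map_dwc_header():
    """Return a term -> column index lookup (stand-in for %hDwcHdrInd)."""
    terms = [
        "institutionCode",
        "catalogNumber",
        "scientificName",
        "decimalLatitude",
        "decimalLongitude",
    ]
    return {term: index for index, term in enumerate(terms)}

def map_record(source, hdr_ind):
    """Fill an output row by term name, without knowing the column order."""
    out = [""] * len(hdr_ind)
    out[hdr_ind["institutionCode"]] = "CSIRO"
    out[hdr_ind["scientificName"]] = source.get("species", "")  # hypothetical field
    return out

hdr_ind = map_dwc_header()
header_row = sorted(hdr_ind, key=hdr_ind.get)
data_row = map_record({"species": "Coris sandeyeri"}, hdr_ind)
```

In this Python sketch, a term that is referenced in map_record() but missing from the header lookup raises a KeyError, which is one concrete way to see why new terms have to be registered in the header function first.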

map_record()

This function is called by anfc_dwc.run.pl for each valid data row. All of the system-specific logic for converting ANFC source data to simple-dwc should live here, or in helper functions called by map_record().


Activity diagram for https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc/anfc_dwc.map.pl map_record() function

The steps in the diagram above correspond to the following activity in anfc_dwc.map.pl:

  • 5.1 - map constants and 1:1 fields:
    • dwc.typestatus, (dwc.typename, dwc.typeauthor, dwc.typeyear)
    • dwc.basisOfRecord, dcterms:type
    • dcterms:rightsHolder, dwc.institutionCode, dwc.collectionId
    • dwc.occurrenceStatus
    • dwc.family, dwc.genus, dwc.specificEpithet, dwc.infraspecificName, dwc.scientificName
    • dwc.minimumdepth, dwc.maximumdepth
  • 5.2 - map dwc.country, dwc.stateProvince, dwc.waterBody and dwc.verbatimLocality:
    (see anfc_dwc.map-cntryocn.pl for map_find_*)
    • if country doesn't equal 'yes', 'no' or 'unknown'
      1. warning is logged
      2. dwc.verbatimLocality, dwc.country, dwc.stateProvince and dwc.waterBody unpopulated
    • if country equals 'yes' :
      1. dwc.country set to 'Australia'
      2. test state_territory against map_find_state(): success to dwc.stateProvince
      3. dwc.verbatimLocality populated with source data
      4. dwc.waterBody unpopulated
    • if country equals 'no' :
      1. test state_territory against map_find_country(): success to dwc.country
      2. otherwise, test state_territory against map_find_ocean(): success to dwc.waterBody
      3. otherwise, warning is logged
      4. dwc.verbatimLocality populated with source data
      5. dwc.stateProvince unpopulated
    • if country equals 'unknown' :
      1. test state_territory against map_find_ocean(): success to dwc.waterBody
      2. dwc.verbatimLocality populated with source data
      3. dwc.country and dwc.stateProvince unpopulated
  • 5.3 - map dwc.catalogNumber, dwc.otherCatalogNumbers, dwc.occurrenceId, dcterms:modified and dwc.verbatimModified, dwc.eventDate and dwc.verbatimEventDate
  • 5.4 - see map_coords()
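
The branching in step 5.2 can be sketched as follows. This is a Python illustration under stated assumptions: the lookup tables hold a few invented entries, comparison is case-insensitive, and map_find() and map_locality() are hypothetical names rather than the Perl functions in anfc_dwc.map-cntryocn.pl.

```python
# Hypothetical lookup tables; the real value mappings live in the Perl script.
STATES = {"tas": "Tasmania", "nsw": "New South Wales"}
COUNTRIES = {"new zealand": "New Zealand"}
OCEANS = {"coral sea": "Coral Sea"}

def map_find(table, value):
    """Return the target value for a known source value, else None."""
    return table.get(value.strip().lower())

def map_locality(country_flag, state_territory, locality):
    """Apply the step 5.2 rules; return (dwc fields, warnings)."""
    out = {"country": "", "stateProvince": "", "waterBody": "", "verbatimLocality": ""}
    warnings = []
    flag = country_flag.strip().lower()
    if flag not in ("yes", "no", "unknown"):
        # All four fields stay unpopulated and a warning is logged.
        warnings.append(f"unexpected country value: {country_flag!r}")
        return out, warnings
    out["verbatimLocality"] = locality
    if flag == "yes":
        out["country"] = "Australia"
        out["stateProvince"] = map_find(STATES, state_territory) or ""
    elif flag == "no":
        country = map_find(COUNTRIES, state_territory)
        ocean = None if country else map_find(OCEANS, state_territory)
        if country:
            out["country"] = country
        elif ocean:
            out["waterBody"] = ocean
        else:
            warnings.append(f"no country or ocean match: {state_territory!r}")
    else:  # "unknown"
        out["waterBody"] = map_find(OCEANS, state_territory) or ""
    return out, warnings
```

Returning the warnings alongside the fields keeps the function free of side effects, so the caller decides how they reach the error buffer.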

map_coords()

Called from map_record() to populate:
  • dwc.decimalLatitude, dwc.decimalLongitude
  • dwc.footprintWKT
  • dwc.coordinatePrecision
  • note: dwc.verbatimLatitude and dwc.verbatimLongitude are populated with source data by map_record() regardless of result
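
The helper names sConvertDDM() and dPrecisionDDM() suggest that source coordinates are held as degrees and decimal minutes, although this page does not document the source format. The Python sketch below therefore assumes an input such as "42 30.0 S"; the pattern, the function name ddm_to_decimal and the range checks are illustrative assumptions and do not describe the Perl helpers.

```python
import re

# Assumed input shape: "<degrees> <decimal minutes> <hemisphere>", e.g. "42 30.0 S".
DDM_PATTERN = re.compile(
    r"^\s*(\d{1,3})\s+(\d{1,2}(?:\.\d+)?)\s*([NSEW])\s*$", re.IGNORECASE
)

def ddm_to_decimal(text):
    """Convert degrees-decimal-minutes text to signed decimal degrees.

    Returns None when the text does not match the assumed shape or is
    out of range, so the caller can leave the decimal field unpopulated.
    """
    match = DDM_PATTERN.match(text)
    if not match:
        return None
    degrees = int(match.group(1))
    minutes = float(match.group(2))
    hemisphere = match.group(3).upper()
    limit = 90 if hemisphere in "NS" else 180
    if minutes >= 60 or degrees + minutes / 60 > limit:
        return None
    value = degrees + minutes / 60
    # South and west are negative in decimal degrees.
    return -value if hemisphere in "SW" else value
```

Because the verbatim fields are populated regardless of the result, a None return here loses no source information.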


Activity diagram for https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc/anfc_dwc.map.pl map_coords() function

anfc_dwc.map-cntryocn.pl

The third export component is a perl script that validates potential dwc.country, dwc.stateProvince, dwc.waterBody and dwc.verbatimLocality values, converting them to new values if required. These functions have been separated for ease of update and to help improve readability within anfc_dwc.map.pl.

Functions map_find_country(), map_find_ocean() and map_find_state() are all called from map_record() (defined in anfc_dwc.map.pl) during the broader validation routine described previously.

Value mappings found in this script indicate a valid source value and, if required, dictate a new target value. If new candidate source values are required for these target fields, this script should be updated to reflect them. If new logic is required to determine how values are populated, the reader is referred back to anfc_dwc.map.pl.
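
One way to picture a value mapping that both validates a source value and optionally replaces it is the Python sketch below. The table entries are invented, and this map_find_state() is a stand-in written for the example, not the Perl function of the same name.

```python
# Hypothetical mapping table: a key marks a valid source value;
# None keeps the value as-is, a string dictates the new target value.
STATE_MAP = {
    "Tasmania": None,
    "Tas": "Tasmania",
    "Tas.": "Tasmania",
}

def map_find_state(value):
    """Return the target value for a valid source value, else None."""
    key = value.strip()
    if key not in STATE_MAP:
        return None  # not a recognised source value
    return STATE_MAP[key] or key
```

Adding a new accepted spelling then means adding one table entry, while changes to how the fields are populated belong in the mapping logic instead.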


Pentaho ktr usage documentation

Older pentaho transformation used to prototype the mapping to simple-dwc https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc/anfc%20cms%20doco.20130103.pdf...