FcigDM_ANFC - AtlasOfLivingAustralia/ala-datamob GitHub Wiki
This is an implementation of a darwincore export, for one of the FCIG-OZCAM participants.
Item | Short URL | Details (or long URL)
This wiki page | http://goo.gl/XfgHU | FcigDM_ANFC
Source data system | |
Output data | | Darwincore csv (simple-dwc) format with non-standard FCIG extensions. Data availability:
Completeness model | http://goo.gl/Yv6k9 | Google docs -> Data management -> CompletenessDwC -> anfc.dwccm.26
Source code | http://goo.gl/yzV59 | https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc
There are three parts to the mapping:
- the main script (anfc_dwc.run.pl),
- main mapping script (anfc_dwc.map.pl),
- country/state/ocean values mapping (anfc_dwc.map-cntryocn.pl).
This document doesn't cover (refer to the data manager):
- method used to extract data from the texpress database
- method used to send data to the atlas
The first export component is a Perl script, anfc_dwc.run.pl, which is the entry point for mapping a previously generated report to the simple-dwc format. It takes its input/output options from the command line:
usage: anfc_dwc.run.pl <IN> <OUT> <ERROR> <LOG>
- all 4 arguments: input file (or 'STDIN'), output file (or 'STDOUT'), error file, log file
- 3 arguments: input file, output file, error file (run log to stdout)
- 2 arguments: input file, output file (errors stderr, run log to stdout)
- 1 argument: input file (output stdout, errors stderr, no run log)
- no arguments: (input stdin, output stdout, errors stderr, no run log)
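The argument rules above can be sketched as follows. This is an illustrative Python sketch, not the Perl in anfc_dwc.run.pl; the function name `resolve_streams` and the dictionary keys are hypothetical, and the handling of more than four arguments is an assumption.

```python
def resolve_streams(argv):
    """Map 0-4 positional arguments to stream targets, following the usage rules."""
    if len(argv) > 4:
        # Assumption: extra arguments are rejected; the wiki does not say.
        raise ValueError("usage: anfc_dwc.run.pl <IN> <OUT> <ERROR> <LOG>")
    # Defaults for the no-argument case: no run log is written at all.
    targets = {"in": "STDIN", "out": "STDOUT", "error": "STDERR", "log": None}
    for key, value in zip(("in", "out", "error", "log"), argv):
        targets[key] = value
    # With two or three arguments the run log goes to stdout.
    if len(argv) in (2, 3):
        targets["log"] = "STDOUT"
    return targets
```

For example, `resolve_streams(["report.txt"])` leaves output on stdout, errors on stderr and the run log disabled.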
After establishing a valid environment it reads from the input stream and expects to find a header row followed by data rows. The field delimiter is defined in the script variable $reDelim (currently the sequence +| "plus pipe") and is used by the perl split() function to separate rows into fields for processing in the mapping scripts.
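Both characters in the delimiter are regular-expression metacharacters, so a regex-based split needs them escaped. The following is a minimal Python sketch of the splitting step, not the script's own code; `RE_DELIM`, `split_row` and the sample field names are hypothetical.

```python
import re

# Assumed equivalent of $reDelim: the literal two-character sequence "+|".
RE_DELIM = re.compile(r"\+\|")

def split_row(line):
    """Split one input row into fields on the '+|' delimiter."""
    # Strip the line ending first so it does not end up in the last field.
    return RE_DELIM.split(line.rstrip("\r\n"))

# Hypothetical header and data row, for illustration only.
header = split_row("reg_no+|family+|genus\n")
row = split_row("A 123+|Labridae+|Coris\n")
record = dict(zip(header, row))
```

One difference to keep in mind: Perl's split() discards trailing empty fields unless it is given a negative limit, whereas this Python version keeps them. Whether the script passes a limit is not stated here.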
Each row is mapped individually, as it is encountered in the source file, via the map_record() function defined in anfc_dwc.map.pl. Mapping results - stored in a temporary variable @arr_output - are printed to the output stream (if possible) and any warnings or errors encountered during the mapping are stored in the %hErrors buffer. Under successful conditions the script also prints periodically to the log stream, with any error rows being specifically marked as well. At the end of input processing, the contents of the error buffer are flushed to the error stream in a structured format.
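The row-by-row flow described above could look roughly like this. It is a Python sketch under stated assumptions, not a transcription of anfc_dwc.run.pl: the `run` signature, the log and error line formats, the `log_every` interval and the `demo_map` stand-in are all hypothetical, and header output is left out.

```python
import csv
import io

FIELD_DELIM = "+|"  # the "plus pipe" sequence

def run(in_stream, out_stream, err_stream, log_stream, map_record, log_every=1000):
    """Map each data row as it is read; buffer errors and flush them at the end."""
    errors = {}  # plays the role of %hErrors: row number -> list of messages
    writer = csv.writer(out_stream, lineterminator="\n")
    header = in_stream.readline().rstrip("\r\n").split(FIELD_DELIM)
    for row_no, line in enumerate(in_stream, start=1):
        fields = line.rstrip("\r\n").split(FIELD_DELIM)
        # output plays the role of @arr_output; None means nothing can be printed.
        output, messages = map_record(header, fields)
        if output is not None:
            writer.writerow(output)
        if messages:
            errors[row_no] = messages
        # Log periodically, and always mark rows that produced messages.
        if log_stream is not None and (messages or row_no % log_every == 0):
            log_stream.write(f"row {row_no}: {'ERROR' if messages else 'ok'}\n")
    # Flush the error buffer once all input has been processed.
    for row_no in sorted(errors):
        for message in errors[row_no]:
            err_stream.write(f"row {row_no}\t{message}\n")
    return len(errors)

def demo_map(header, fields):
    # Stand-in for map_record(): rejects rows whose first field is empty.
    if fields[0] == "":
        return None, ["missing first field"]
    return fields, []

source = io.StringIO("id+|name\n1+|a\n+|b\n2+|c\n")
out, err, log = io.StringIO(), io.StringIO(), io.StringIO()
error_count = run(source, out, err, log, demo_map, log_every=1)
```

Buffering errors rather than writing them immediately keeps the error report in one structured block, at the cost of holding every message in memory until the input is exhausted.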
Activity diagram for https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc/anfc_dwc.run.pl
map_dwc_header() and map_record() are called by anfc_dwc.run.pl; the other functions - map_coords(), dPrecisionDDM(), isNum() and sConvertDDM() - are helpers.
map_dwc_header() is called by anfc_dwc.run.pl to establish the output darwincore terms and their order in the output file. Results are stored in a hash, %hDwcHdrInd (defined in anfc_dwc.run.pl), which is passed to map_record() every time it is called. This allows map_record() to store output values in the corresponding indices without knowing the order in which darwincore terms appear in the output file.
Important note: any new dwc term that is added and referenced in the mapping functions must also be added to map_dwc_header(), otherwise output behaviour is undefined. The order in which dwc terms appear in the output can also be changed by adjusting the map_dwc_header() code.
map_record() is called by anfc_dwc.run.pl for each valid data row. All of the system-specific logic for converting ANFC source data to simple-dwc should live here, or in helper functions called by map_record().
Activity diagram for https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc/anfc_dwc.map.pl map_record() function
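The term-to-index hash idea behind %hDwcHdrInd can be illustrated with a short Python sketch. The helper `store` and the three-term header are hypothetical. Raising an error for an unregistered term is a choice made for this sketch; the Perl scripts leave that behaviour undefined.

```python
def map_dwc_header(terms):
    """Return term -> column index, so output order is fixed in one place."""
    return {term: index for index, term in enumerate(terms)}

def store(output, header_index, term, value):
    """Place a value in the column registered for a term."""
    if term not in header_index:
        # An unregistered term has no column; fail loudly rather than guess.
        raise KeyError(f"term not registered in header: {term}")
    output[header_index[term]] = value

# Hypothetical three-term header; the real list is much longer.
header_index = map_dwc_header(["institutionCode", "catalogNumber", "country"])
row = [""] * len(header_index)
store(row, header_index, "country", "Australia")
store(row, header_index, "catalogNumber", "A 123")
```

Because the mapping code refers to terms by name, reordering the list passed to `map_dwc_header` changes the column order without touching any mapping logic.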
Details on the steps in the previous diagram correspond with activity in anfc_dwc.map.pl:
- 5.1 - map constants and 1:1 fields:
  - dwc.typestatus, (dwc.typename, dwc.typeauthor, dwc.typeyear)
  - dwc.basisOfRecord, dcterms:type
  - dcterms:rightsHolder, dwc.institutionCode, dwc.collectionId
  - dwc.occurrenceStatus
  - dwc.family, dwc.genus, dwc.specificEpithet, dwc.infraspecificName, dwc.scientificName
  - dwc.minimumdepth, dwc.maximumdepth
- 5.2 - map dwc.country, dwc.stateProvince, dwc.waterBody and dwc.verbatimLocality (see anfc_dwc.map-cntryocn.pl for map_find_*):
  - if country doesn't equal 'yes', 'no' or 'unknown':
    - warning is logged
    - dwc.verbatimLocality, dwc.country, dwc.stateProvince and dwc.waterBody unpopulated
  - if country equals 'yes':
    - dwc.country set to 'Australia'
    - test state_territory against map_find_state(): success to dwc.stateProvince
    - dwc.verbatimLocality populated with source data
    - dwc.waterBody unpopulated
  - if country equals 'no':
    - test state_territory against map_find_country(): success to dwc.country
    - otherwise, test state_territory against map_find_ocean(): success to dwc.waterBody
    - otherwise, warning is logged
    - dwc.verbatimLocality populated with source data
    - dwc.stateProvince unpopulated
  - if country equals 'unknown':
    - test state_territory against map_find_ocean(): success to dwc.waterBody
    - dwc.verbatimLocality populated with source data
    - dwc.country and dwc.stateProvince unpopulated
- 5.3 - map dwc.catalogNumber, dwc.otherCatalogNumbers, dwc.occurrenceId, dcterms:modified and dwc.verbatimModified, dwc.eventDate and dwc.verbatimEventDate
- 5.4 - see map_coords(), used by map_record() to handle populating:
  - dwc.decimalLatitude, dwc.decimalLongitude
  - dwc.footprintWKT
  - dwc.coordinatePrecision
  - note: dwc.verbatimLatitude and dwc.verbatimLongitude are populated with source data by map_record() regardless of result
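The branching in step 5.2 can be sketched in Python as below. This is an illustration of the listed rules, not the Perl in anfc_dwc.map.pl: the function `map_locality`, its parameters, the warning texts and the sample lookup tables are hypothetical, and the lookups are passed in as callables that return None when a value is not recognised.

```python
def map_locality(country_flag, state_territory, verbatim,
                 find_state, find_country, find_ocean):
    """Apply the step 5.2 rules; returns (output fields, warnings)."""
    out = {"country": "", "stateProvince": "", "waterBody": "", "verbatimLocality": ""}
    warnings = []
    if country_flag not in ("yes", "no", "unknown"):
        # Unrecognised flag: log a warning and leave all four fields empty.
        warnings.append(f"unexpected country value: {country_flag!r}")
        return out, warnings
    out["verbatimLocality"] = verbatim
    if country_flag == "yes":
        out["country"] = "Australia"
        state = find_state(state_territory)
        if state is not None:
            out["stateProvince"] = state
    elif country_flag == "no":
        country = find_country(state_territory)
        if country is not None:
            out["country"] = country
        else:
            ocean = find_ocean(state_territory)
            if ocean is not None:
                out["waterBody"] = ocean
            else:
                warnings.append(f"no country or ocean match: {state_territory!r}")
    else:  # 'unknown'
        ocean = find_ocean(state_territory)
        if ocean is not None:
            out["waterBody"] = ocean
    return out, warnings

# Hypothetical lookup tables standing in for map_find_state() and friends.
states = {"Tas": "Tasmania"}
countries = {"New Zealand": "New Zealand"}
oceans = {"Tasman Sea": "Tasman Sea"}
lookups = (states.get, countries.get, oceans.get)
```

For instance, `map_locality("no", "Tasman Sea", "off Hobart", *lookups)` leaves dwc.country empty and fills dwc.waterBody, because the country lookup fails and the ocean lookup succeeds.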
Activity diagram for https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc/anfc_dwc.map.pl map_coords() function
The third component, anfc_dwc.map-cntryocn.pl, validates dwc.country, dwc.stateProvince, dwc.waterBody and dwc.verbatimLocality values, converting them to new values if required. This functionality has been separated for ease of update and to help improve readability within anfc_dwc.map.pl.
The functions map_find_country(), map_find_ocean() and map_find_state() are all called from map_record() (defined in anfc_dwc.map.pl) during the broader validation routine described previously.
Value mappings found in this script indicate a valid source value and, if required, dictate a new target value. If new candidate source values are required for these target fields, this script should be updated to reflect them. If new logic is required to determine how values are populated, refer back to anfc_dwc.map.pl.
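A value mapping of this kind is essentially a lookup table. The Python sketch below shows one way to express "valid source value, optionally replaced by a target value". The table entries, the generic helper `map_find` and the use of None for "keep the source value" are assumptions made for illustration, not the contents of anfc_dwc.map-cntryocn.pl.

```python
# Hypothetical entries: each key is a valid source value; the value is the
# replacement target, or None when the source value is used unchanged.
STATE_MAP = {
    "Tasmania": None,
    "Tas": "Tasmania",
    "TAS.": "Tasmania",
}

def map_find(table, source_value):
    """Return the target value for a valid source value, else None."""
    if source_value not in table:
        return None  # not a recognised value; the caller decides what to do
    target = table[source_value]
    return source_value if target is None else target

def map_find_state(source_value):
    """State lookup built on the generic helper."""
    return map_find(STATE_MAP, source_value)
```

With this layout, accepting a new source value is a one-line change to the table, while any change to how fields are populated stays in the mapping logic.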
Older pentaho transformation used to prototype the mapping to simple-dwc https://github.com/AtlasOfLivingAustralia/ala-datamob/tree/master/biodomains/fcig-ozcam/anfc/anfc%20cms%20doco.20130103.pdf...