Post Harvest Processing Workflow
The post-harvest processor is a series of steps in a broader data mobilization workflow employed by VertNet (see https://github.com/VertNet/toolkit/wiki/Mobilization-Steps). Following are the steps in that workflow, with the post-processor steps highlighted in bold.
- "Migrate" data from original form to Darwin Core CSV (and extensions) with VertNet toolkit ("migrator"): https://github.com/VertNet/toolkit/wiki
- Publish data set to IPT: https://github.com/gbif/ipt/wiki/IPT2ManualNotes.wiki
- Harvest to Google Cloud Storage with gulo: https://github.com/VertNet/gulo/wiki/Harvest-Workflow
- Add or update data set metadata in VertNet Carto resource_staging table
- Remove any previous harvest files from the GCS directory tree 'data' for the data source. The harvest files will be in a folder for the date on which the resource was last harvested.
- Remove any previous post-harvest files from the GCS directory tree 'processed' for the data source
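Both removal steps above can be scripted with gsutil rather than done in the Cloud Console. A minimal sketch, assuming the bucket layout implied by the harvest paths used later on this page (the date and resource folder names are placeholders, and the location of the 'processed' tree alongside 'data' is an assumption):

```bash
# Remove the previous harvest files for one resource; the folder is named
# by the date on which the resource was last harvested (placeholder values).
gsutil -m rm -r gs://vertnet-harvesting/data/2018-09-21/resource_folder

# Remove the previous post-harvest files for the same resource (path assumed).
gsutil -m rm -r gs://vertnet-harvesting/processed/resource_folder
```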
- Update harvestfolder field in VertNet Carto resource_staging table
- Replace the Carto resource table from the resource_staging master copy using the following:

```sql
DELETE FROM resource
```

then

```sql
INSERT INTO resource
(cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname, description, emlrights, contact, email, icode, ipt, count, citation, networks, collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license, lastindexed, gbifdatasetid, gbifpublisherid, doi)
SELECT
cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname, description, emlrights, contact, email, icode, ipt, count::integer, citation, networks, collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license, lastindexed, gbifdatasetid, gbifpublisherid, doi
FROM resource_staging
WHERE ipt=True AND networks LIKE '%Vert%'
```
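These statements can be run in the Carto SQL console; they can also be scripted against the Carto SQL API. A rough sketch of the DELETE step (the 'vertnet' account name and the API key placeholder are assumptions):

```bash
# Send the DELETE statement to the Carto SQL API; substitute the real
# account name and API key before use.
curl "https://vertnet.carto.com/api/v2/sql" \
  --data-urlencode "q=DELETE FROM resource" \
  --data-urlencode "api_key=YOUR_CARTO_API_KEY"
```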
- Run check_harvest_folder_GCS.py for the parent harvest folder of interest:

```bash
python check_harvest_folder_GCS.py -b vertnet-harvesting/data/2018-09-21/% -c [cartodb_id]
```

where -c refers to the Carto API key and -b is the Google Cloud Storage harvest folder for the harvest date to check.
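For a quick manual spot-check of what the script is comparing against, the same harvest folder can be listed directly (the path reuses the illustrative date above):

```bash
# List the harvest folders present for the given harvest date.
gsutil ls gs://vertnet-harvesting/data/2018-09-21/
```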
If the environment does not have the necessary libraries installed, use a virtual environment with the characteristics shown below. See:
- https://cloud.google.com/python/setup#installing_and_using_virtualenv
- https://cloud.google.com/python/setup
- https://virtualenv.pypa.io/en/stable/userguide/
If the environment has already been built, it can be activated from the ./lib folder with:

```bash
source env/bin/activate
```
Otherwise, build and activate it in order to run check_harvest_folder_GCS.py. See https://stackoverflow.com/questions/4757178/how-do-you-set-your-pythonpath-in-an-already-created-virtualenv for how to set a distinct PYTHONPATH in activate and restore the pre-environment PYTHONPATH in deactivate.
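The approach described in that Stack Overflow answer amounts to editing the environment's activate script. A minimal sketch, assuming the processor code lives in a directory such as the one shown (the path is a placeholder):

```bash
# Added near the end of env/bin/activate: remember the old PYTHONPATH and
# point the environment at the processor's lib directory (placeholder path).
_OLD_PYTHONPATH="$PYTHONPATH"
export PYTHONPATH="/path/to/post-harvest-processor/lib"

# Added inside the deactivate() function in the same file: restore the
# pre-environment PYTHONPATH when the environment is turned off.
export PYTHONPATH="$_OLD_PYTHONPATH"
unset _OLD_PYTHONPATH
```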
Then, in ./lib:

```bash
pip2 install --upgrade virtualenv
virtualenv --python python2.7 env
source env/bin/activate
pip install --upgrade google-cloud-bigquery
pip install httplib2
pip install apiclient
pip install --upgrade google-api-python-client
pip install arrow
pip install unidecode
```
For harvest_resource_processor.py, also install the following in ./lib:

```bash
pip install regex
```

To check where the google package is being loaded from, run the following in the Python interpreter:

```python
import google
print google.__path__
```
Update gcloud components to the latest versions before proceeding:

```bash
gcloud components update
```
- Run harvest_resource_processor.py to process data sets in resource_staging.csv. Examples:

```bash
python harvest_resource_processor.py -b vertnet-harvesting/data/2018-09-21/% -c [cartodb_id]
```

For a specific institution:

```bash
python harvest_resource_processor.py -b vertnet-harvesting/data/2018-11-04/CHAS% -c [cartodb_id]
```

where -c refers to the Carto API key and -b is the Google Cloud Storage parent harvest folder to check.
When finished, turn off the virtual environment with:

```bash
deactivate
```
- Remove data set from index (if necessary, for new occurrenceIDs or records removed): https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow
- Index: https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow
Load BigQuery
- Load files from GCS directory tree 'processed' into BigQuery using bigquery_loader.py
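bigquery_loader.py handles the load itself; for orientation only, loading a single processed resource by hand would look roughly like the bq invocation below. The table name follows the dumps.full_YYYYMMDD pattern used in the checks that follow; the GCS path, tab delimiter, and use of schema autodetection are assumptions.

```bash
# Illustrative manual load of one processed resource into a snapshot table.
# Table name, path, delimiter, and autodetected schema are assumptions;
# bigquery_loader.py is the supported route.
bq load --source_format=CSV --field_delimiter='\t' --autodetect \
  dumps.full_20181115 \
  "gs://vertnet-harvesting/processed/resource_folder/*"
```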
- Check for duplication and omission following the loading process. Find data sets that have duplicates using the following pattern:

```sql
SELECT icode, gbifdatasetid, occurrenceid, count(*) AS reps
FROM dumps.full_20171112
WHERE occurrenceid IS NOT NULL
GROUP BY icode, gbifdatasetid, occurrenceid
HAVING reps>1
```

and remove any sets that have duplicates using the pattern:

```sql
DELETE
FROM dumps.full_20171112
WHERE gbifdatasetid='[]'
```
Check which data sets have been loaded, and which are missing, using the pattern:

```sql
SELECT icode, gbifdatasetid, count(*) AS reps
FROM dumps.full_20171112
GROUP BY icode, gbifdatasetid
ORDER BY icode, gbifdatasetid ASC
```
Reload only missing data sets with bigquery_loader.py until the set is complete.
- Remove any resources that should not be in the BigQuery snapshot by removing those resources from the GCS directory tree 'processed'.
- Create Taxon Subset snapshots: https://github.com/VertNet/post-harvest-processor/wiki/Making-Snapshots