Post Harvest Processing Workflow
The post-harvest processor is a series of steps in a broader data mobilization workflow employed by VertNet (see https://github.com/VertNet/toolkit/wiki/Mobilization-Steps). Following are the steps in that workflow, with the post-processor steps highlighted in bold.
- "Migrate" data from original form to Darwin Core CSV (and extensions) with VertNet toolkit ("migrator"): https://github.com/VertNet/toolkit/wiki
- Publish data set to IPT: https://github.com/gbif/ipt/wiki/IPT2ManualNotes.wiki
- Harvest to Google Cloud Storage with gulo: https://github.com/VertNet/gulo/wiki/Harvest-Workflow
- Add or update data set metadata in VertNet Carto resource_staging table
- Remove any previous harvest files from the GCS directory tree 'data' for the data source. The harvest files will be in a folder for the date on which the resource was last harvested.
- Remove any previous post-harvest files from the GCS directory tree 'processed' for the data source
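Both removal steps above can be scripted with gsutil rather than done in the Cloud Console. A minimal sketch, assuming the bucket layout implied by the harvest paths used later on this page (the date and resource folder names are placeholders, and the location of the 'processed' tree alongside 'data' is an assumption):

```bash
# Remove the previous harvest files for one resource; the folder is named
# by the date on which the resource was last harvested (placeholder values).
gsutil -m rm -r gs://vertnet-harvesting/data/2018-09-21/resource_folder

# Remove the previous post-harvest files for the same resource (path assumed).
gsutil -m rm -r gs://vertnet-harvesting/processed/resource_folder
```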
- Update harvestfolder field in VertNet Carto resource_staging table
- Replace the Carto resource table from the resource_staging master copy using the following:

```sql
DELETE FROM resource
```

then

```sql
INSERT INTO resource
(cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname, description, emlrights, contact, email, icode, ipt, count, citation, networks, collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license, lastindexed, gbifdatasetid, gbifpublisherid, doi)
SELECT
cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname, description, emlrights, contact, email, icode, ipt, count::integer, citation, networks, collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license, lastindexed, gbifdatasetid, gbifpublisherid, doi
FROM resource_staging
WHERE ipt=True AND networks LIKE '%Vert%'
```
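These statements can be run in the Carto SQL console; they can also be scripted against the Carto SQL API. A rough sketch of the DELETE step (the 'vertnet' account name and the API key placeholder are assumptions):

```bash
# Send the DELETE statement to the Carto SQL API; substitute the real
# account name and API key before use.
curl "https://vertnet.carto.com/api/v2/sql" \
  --data-urlencode "q=DELETE FROM resource" \
  --data-urlencode "api_key=YOUR_CARTO_API_KEY"
```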
- Run check_harvest_folder_GCS.py for the parent harvest folder of interest:

```bash
python check_harvest_folder_GCS.py -b vertnet-harvesting/data/2018-09-21/% -c [cartodb_id]
```

where -c refers to the Carto API key and -b is the Google Cloud Storage harvest folder for the harvest date to check.
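For a quick manual spot-check of what the script is comparing against, the same harvest folder can be listed directly (the path reuses the illustrative date above):

```bash
# List the harvest folders present for the given harvest date.
gsutil ls gs://vertnet-harvesting/data/2018-09-21/
```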
If the environment does not have the necessary libraries installed, use a virtual environment with the characteristics shown below. See:
- https://cloud.google.com/python/setup#installing_and_using_virtualenv
- https://cloud.google.com/python/setup
- https://virtualenv.pypa.io/en/stable/userguide/
If the environment has already been built, it can be activated from the ./lib folder with:

```bash
source env/bin/activate
```
Otherwise, build and activate it in order to run check_harvest_folder_GCS.py. See https://stackoverflow.com/questions/4757178/how-do-you-set-your-pythonpath-in-an-already-created-virtualenv for how to set a distinct PYTHONPATH in activate and restore the pre-environment PYTHONPATH in deactivate.
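The approach described in that Stack Overflow answer amounts to editing the environment's activate script. A minimal sketch, assuming the processor code lives in a directory such as the one shown (the path is a placeholder):

```bash
# Added near the end of env/bin/activate: remember the old PYTHONPATH and
# point the environment at the processor's lib directory (placeholder path).
_OLD_PYTHONPATH="$PYTHONPATH"
export PYTHONPATH="/path/to/post-harvest-processor/lib"

# Added inside the deactivate() function in the same file: restore the
# pre-environment PYTHONPATH when the environment is turned off.
export PYTHONPATH="$_OLD_PYTHONPATH"
unset _OLD_PYTHONPATH
```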
Then, in ./lib:

```bash
pip2 install --upgrade virtualenv
virtualenv --python python2.7 env
source env/bin/activate
pip install --upgrade google-cloud-bigquery
pip install httplib2
pip install apiclient
pip install --upgrade google-api-python-client
pip install arrow
pip install unidecode
```
For harvest_resource_processor.py, also install the following in ./lib:

```bash
pip install regex
```

To check where the google package is being loaded from, run the following in the Python interpreter:

```python
import google
print google.__path__
```
Update gcloud components to the latest versions before proceeding:

```bash
gcloud components update
```
- Run harvest_resource_processor.py to process data sets in resource_staging.csv. Examples:

```bash
python harvest_resource_processor.py -b vertnet-harvesting/data/2018-09-21/% -c [cartodb_id]
```

For a specific institution:

```bash
python harvest_resource_processor.py -b vertnet-harvesting/data/2018-11-04/CHAS% -c [cartodb_id]
```

where -c refers to the Carto API key and -b is the Google Cloud Storage parent harvest folder to check.
When finished, turn off the virtual environment with:

```bash
deactivate
```
- Remove data set from index (if necessary, for new occurrenceIDs or records removed): https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow
- Index: https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow
Load BigQuery
- Load files from GCS directory tree 'processed' into BigQuery using bigquery_loader.py
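bigquery_loader.py handles the load itself; for orientation only, loading a single processed resource by hand would look roughly like the bq invocation below. The table name follows the dumps.full_YYYYMMDD pattern used in the checks that follow; the GCS path, tab delimiter, and use of schema autodetection are assumptions.

```bash
# Illustrative manual load of one processed resource into a snapshot table.
# Table name, path, delimiter, and autodetected schema are assumptions;
# bigquery_loader.py is the supported route.
bq load --source_format=CSV --field_delimiter='\t' --autodetect \
  dumps.full_20181115 \
  "gs://vertnet-harvesting/processed/resource_folder/*"
```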
- Check for duplication and omission following the loading process. Find data sets that have duplicates using the following pattern:

```sql
SELECT icode, gbifdatasetid, occurrenceid, count(*) AS reps
FROM dumps.full_20171112
WHERE occurrenceid IS NOT NULL
GROUP BY icode, gbifdatasetid, occurrenceid
HAVING reps>1
```

and remove any sets that have duplicates using the pattern:

```sql
DELETE
FROM dumps.full_20171112
WHERE gbifdatasetid='[]'
```
Check which data sets have been loaded, and which are missing, using the pattern:

```sql
SELECT icode, gbifdatasetid, count(*) AS reps
FROM dumps.full_20171112
GROUP BY icode, gbifdatasetid
ORDER BY icode, gbifdatasetid ASC
```
Reload only missing data sets with bigquery_loader.py until the set is complete.
- Remove any resources that should not be in the BigQuery snapshot by removing those resources from the GCS directory tree 'processed'.
- Create Taxon Subset snapshots: https://github.com/VertNet/post-harvest-processor/wiki/Making-Snapshots