# Mobilization Steps

## Index to Documentation

### VertNet Portal
- Portal Documentation: https://github.com/VertNet/webapp/wiki/VertNet-Portal
### Migrating
- Migrator Workflow: https://github.com/VertNet/toolkit/wiki/Migrator-Workflow
- Migration History: https://vertnet.cartodb.com/tables/resource_staging (migrator column)
### Publishing
- IPT Manual: https://github.com/gbif/ipt/wiki/IPT2ManualNotes.wiki
- VertNet-hosted data sets (VertNet IPT): http://ipt.vertnet.org:8080/ipt/
- VertNet Custom Data Sets: http://ipt.vertnet.org:8080/iptstrays/
- Publishing History: https://github.com/VertNet/tasks/issues
### Harvesting
- Harvest Workflow: https://github.com/VertNet/gulo/wiki/Harvest-Workflow
- Harvest History: https://vertnet.cartodb.com/tables/resource_staging (harvestfolder column)
- Harvest Queue: https://github.com/VertNet/tasks/issues?q=is%3Aissue+is%3Aopen+label%3Aharvest
### Post-Harvest Processing
- Post-Harvest Workflow: https://github.com/VertNet/post-harvest-processor/wiki/Post-Harvest-Processing-Workflow
- Post-Harvest Processing History: https://console.cloud.google.com/storage/browser/vertnet-harvesting/processed/?project=vertnet-portal
- Post-Harvesting BigQuery Snapshot: https://bigquery.cloud.google.com/table/vertnet-portal:dumps.vertnet_latest
- Post-Harvest Taxon Snapshot creation: https://github.com/VertNet/post-harvest-processor/wiki/Making-Snapshots
### Indexing
- Index Workflow: https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow
- Index List: https://github.com/VertNet/dwc-indexer/wiki/Index-List
- Harvest History: https://vertnet.cartodb.com/tables/resource_staging (harvestfolder column)
## Workflow Steps
The data mobilization workflow for VertNet involves the following steps:
- Pre-publication data preparation with the VertNet toolkit ("migrator"): https://github.com/VertNet/toolkit/wiki
- Publish to IPT: http://ipt.vertnet.org:8080/ipt/
- Update data set metadata in the VertNet Carto resource_staging table. When complete, use the SQL window to execute `DELETE FROM resource`, then the following statement (see the Carto SQL API sketch after this list):

  ```sql
  INSERT INTO resource
    (cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname,
     description, emlrights, contact, email, icode, ipt, count, citation, networks,
     collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license,
     lastindexed, gbifdatasetid, gbifpublisherid, doi)
  SELECT
    cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname,
    description, emlrights, contact, email, icode, ipt, count::integer, citation, networks,
    collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license,
    lastindexed, gbifdatasetid, gbifpublisherid, doi
  FROM resource_staging
  WHERE ipt=True AND networks LIKE '%Vert%'
  ```
- Harvest to Google Cloud Storage with gulo: https://github.com/VertNet/gulo/wiki/Harvest-Workflow
- Update the harvestfolder field in the VertNet CartoDB table 'resource_staging' for each newly harvested data set
- Export resource_staging.csv from CartoDB (see the harvest-folder and CSV export sketch after this list)
- Run post-harvest processor check_harvest_folders.py for data sets in resource_staging.csv
- Run post-harvest processor harvest_resource_processor.py to process data sets in resource_staging.csv
- Check the Google Cloud Storage directory tree vertnet-harvesting/processed for duplicates and counts (see the bucket listing sketch after this list)
- Use the dwc-indexer to remove from the index any data sets that have had changes to their identifier scheme
- Index any data sets that need to be updated: https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow
- Load files from the processed folders into BigQuery for the data sets of interest, specified by GCS directory, using the post-harvest processor script bigquery-loader.py (see the BigQuery load sketch after this list)
- Create Taxon Subset snapshots: https://github.com/VertNet/post-harvest-processor/wiki/Making-Snapshots (see the snapshot sketch after this list)
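
The sketches below expand on a few of the steps above; they are not part of the toolkit or the documented workflow, only illustrations of what each step does. First, the resource table rebuild can be submitted to the Carto SQL API instead of the web SQL window. This is a minimal sketch assuming the vertnet account endpoint, an API key in a CARTO_API_KEY environment variable, and the requests library; adjust all three for the actual setup.

```python
# Minimal sketch: rebuild the resource table from resource_staging through the
# Carto SQL API rather than the web SQL window. The endpoint, the CARTO_API_KEY
# environment variable, and sending the statements separately are assumptions.
import os
import requests

SQL_API = "https://vertnet.cartodb.com/api/v2/sql"  # assumed endpoint for the vertnet account
API_KEY = os.environ["CARTO_API_KEY"]

DELETE_SQL = "DELETE FROM resource"

INSERT_SQL = """
INSERT INTO resource
  (cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname,
   description, emlrights, contact, email, icode, ipt, count, citation, networks,
   collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license,
   lastindexed, gbifdatasetid, gbifpublisherid, doi)
SELECT
  cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname,
  description, emlrights, contact, email, icode, ipt, count::integer, citation, networks,
  collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license,
  lastindexed, gbifdatasetid, gbifpublisherid, doi
FROM resource_staging
WHERE ipt=True AND networks LIKE '%Vert%'
"""

for statement in (DELETE_SQL, INSERT_SQL):
    response = requests.post(SQL_API, data={"q": statement, "api_key": API_KEY})
    response.raise_for_status()
    print(response.json())
```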
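The harvest-folder update and the resource_staging.csv export can go through the same SQL API. Another sketch, assuming the same endpoint and API key; the harvestfolder value, the WHERE clause values, and the csv format parameter are placeholders rather than the documented procedure.

```python
# Sketch: record the harvest folder for one resource, then export the staging
# table as CSV for the post-harvest processor scripts. The endpoint, API key
# handling, and the example values in the UPDATE are assumptions.
import os
import requests

SQL_API = "https://vertnet.cartodb.com/api/v2/sql"
API_KEY = os.environ["CARTO_API_KEY"]

# 1) Point the staging row at the folder gulo wrote to (placeholder values).
update_sql = """
UPDATE resource_staging
SET harvestfolder = 'vertnet-harvesting/data/2024-01-01/example_resource'
WHERE icode = 'EXAMPLE' AND url = 'http://ipt.vertnet.org:8080/ipt/resource?r=example'
"""
requests.post(SQL_API, data={"q": update_sql, "api_key": API_KEY}).raise_for_status()

# 2) Pull the whole staging table down as resource_staging.csv.
export = requests.get(
    SQL_API,
    params={"q": "SELECT * FROM resource_staging", "format": "csv", "api_key": API_KEY},
)
export.raise_for_status()
with open("resource_staging.csv", "wb") as handle:
    handle.write(export.content)
```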
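For the check of vertnet-harvesting/processed, listing the bucket with the google-cloud-storage client is an alternative to browsing the console. A sketch; it assumes files sit in per-data-set folders directly under processed/ and treats a file name that appears in more than one folder as the kind of duplicate worth flagging.

```python
# Sketch: report per-folder file counts under processed/ and flag file names
# that show up in more than one folder. Assumes application default credentials
# with read access and a processed/<data set folder>/<files> layout.
from collections import Counter, defaultdict

from google.cloud import storage

BUCKET = "vertnet-harvesting"
PREFIX = "processed/"

client = storage.Client(project="vertnet-portal")  # project name taken from the links above
counts = Counter()
folders_by_name = defaultdict(set)

for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    relative = blob.name[len(PREFIX):]
    if "/" not in relative:
        continue  # skip placeholder objects at the top level
    folder, filename = relative.split("/", 1)
    counts[folder] += 1
    folders_by_name[filename].add(folder)

for folder, count in sorted(counts.items()):
    print(f"{folder}: {count} files")

for filename, folders in sorted(folders_by_name.items()):
    if len(folders) > 1:
        print(f"possible duplicate: {filename} appears in {sorted(folders)}")
```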
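The BigQuery loads are handled by bigquery-loader.py; the sketch below only illustrates the underlying load-from-GCS call with the google-cloud-bigquery client. The destination table comes from the snapshot link above, but the wildcard URI, the tab-delimited CSV format, and schema autodetection are assumptions about the processed files.

```python
# Sketch: load one processed folder from GCS into the BigQuery dump table.
# The source URI is a placeholder, and the format/schema settings are guesses
# about the processed files; the table name comes from the snapshot link above.
from google.cloud import bigquery

client = bigquery.Client(project="vertnet-portal")
table_id = "vertnet-portal.dumps.vertnet_latest"
source_uri = "gs://vertnet-harvesting/processed/example_resource/*"  # placeholder folder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,  # adjust if the processed files are not CSV
    field_delimiter="\t",                     # assumption: tab-delimited processing output
    skip_leading_rows=1,
    autodetect=True,                          # a real load may supply an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # block until the load job finishes
print(f"{client.get_table(table_id).num_rows} rows now in {table_id}")
```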
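Finally, for the taxon subset snapshots, the authoritative procedure is the Making-Snapshots page linked above. The sketch below only shows the general shape with the BigQuery client: query the latest dump into a subset table, then extract that table to GCS. The filter column and value, the subset table name, and the export path are all placeholders.

```python
# Sketch: materialize a taxon subset of the latest dump, then export it to GCS.
# The taxon filter, subset table name, and export location are placeholders;
# follow the Making-Snapshots wiki page for the real procedure.
from google.cloud import bigquery

client = bigquery.Client(project="vertnet-portal")

subset_ref = bigquery.TableReference.from_string("vertnet-portal.dumps.vertnet_latest_mammals")
query_config = bigquery.QueryJobConfig(
    destination=subset_ref,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
query = """
SELECT *
FROM `vertnet-portal.dumps.vertnet_latest`
WHERE LOWER(class) = 'mammalia'  -- placeholder taxon filter
"""
client.query(query, job_config=query_config).result()

# Export the subset as sharded, gzipped CSV for distribution (placeholder path).
extract_config = bigquery.ExtractJobConfig(compression="GZIP", destination_format="CSV")
client.extract_table(
    subset_ref,
    "gs://vertnet-harvesting/snapshots/vertnet_latest_mammals_*.csv.gz",
    job_config=extract_config,
).result()
```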