Make File for AmphibiaWeb - VertNet/post-harvest-processor GitHub Wiki

[Adapted from https://github.com/VertNet/bigquery/wiki/Make-File-for-AmphibiaWeb]

This page describes the process for creating a subset snapshot file from the latest VertNet harvest for download from Google Cloud Storage. This example is to create a file for use by AmphibiaWeb.

Prerequisites: Make sure the table vertnet_latest_amphibians reflects the latest data from the most recent BigQuery fullDump (e.g., table full_20160901). See https://github.com/VertNet/post-harvest-processor/wiki/Making-Snapshots.

To create a current copy of the data set for AmphibiaWeb, delete the table vertnet_latest_amphibiaweb.

Select the table vertnet_latest_amphibians.

Click on the Show Options button.

In the Select Destination Table dialog:

  • Project: vertnet-portal
  • Dataset: dumps
  • Table ID: vertnet_latest_amphibiaweb

Select the following options:

  • Write Preference: Overwrite table
  • Results Size: Allow Large Results
  • Results Schema: Flatten Results
  • SQL dialect: Use Legacy SQL

Enter the following in the New Query text box:

SELECT 
dctype,
institutioncode,
collectioncode,
catalognumber,
occurrenceid,
associatedmedia,
basisofrecord,
eventdate,
year,
month,
day,
highergeography,
continent,
country,
countrycode,
stateprovince,
county,
locality,
verbatimelevation,
decimallatitude,
decimallongitude,
geodeticdatum,
coordinateuncertaintyinmeters,
georeferencedby,
georeferenceddate,
georeferenceprotocol,
georeferencesources,
scientificname,
order,
family,
genus,
specificepithet,
scientificnameauthorship,
dc_references
FROM [dumps.vertnet_latest_amphibians]

Click on the Run Query button.

When the table vertnet_latest_amphibians is complete, it can be exported to Google Cloud Storage for public access. To do so, move the cursor to the vertnet_latest_amphibians table in the dumps folder. click on the drop-down button and select 'Export table'. Select CSV as the export format, GZIP as the compression format, and gs://vertnet-byclass/vertnet_latest_amphibiaweb.csv.gz* as the Google Cloud Storage URI.

This will generate multiple shards of the latest VertNet amphibians for AmphibiaWeb on Google Cloud Storage. The job can be monitored in the BigQuery console by selecting Job History. When it is finished the shards on Google Cloud Storage will have to be processed to strip the header row from all but the first shard and concatenate them all into a single file. To do so, check the script process_vertnet_latest_amphibianweb-GCS.sh for the number of files to process and run the script on a machine that is configured with gsutil and authorized to access the VertNet Google Cloud Storage.

When it is finished, if there are no errors, go to the Google Cloud Storage console at https://console.developers.google.com/project/vertnet-portal/storage/browser/vertnet-byclass/. Click on the Shared Publicly check box to make the file openly accessible at http://storage.googleapis.com/vertnet-byclass/vertnet_latest_amphibiaweb.csv.gz.