Using Snapshots to Create Amazonian Biodiversity Index - VertNet/post-harvest-processor GitHub Wiki

[Adapted from https://github.com/VertNet/bigquery/wiki/Using-Snapshots-to-Create-Amazonian-Biodiversity-Index]

This page describes the process for using a BigQuery snapshot to create a thematic subset that can be indexed for a specialty portal. The example here is for the Dimensions of Amazonian Biodiversity Project (http://amazoniabiodiversity.vertnet-portal.appspot.com/).

Prerequisites: A full dump table with name of the form full_yyyymmdd has been made in BigQuery using the fullDump.py script (see https://github.com/VertNet/bigquery/tree/master/fullDump).
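As a quick check of this prerequisite, the available dump tables can be listed programmatically. The following is a minimal sketch using the google-cloud-bigquery Python client library; the project id vertnet-portal and the assumption that the dump tables live in the dumps dataset used by the queries below are illustrative only.

from google.cloud import bigquery

# Sketch: list candidate full dump tables (full_yyyymmdd) in the dumps dataset.
# The project id "vertnet-portal" is an assumption for illustration only.
client = bigquery.Client(project="vertnet-portal")
for table in client.list_tables(client.dataset("dumps")):
    if table.table_id.startswith("full_"):
        print(table.table_id)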

Create vertnet_latest_amazonia Table

This step is unnecessary if the table vertnet_latest_amazonia is already populated with the latest full dump from the VertNet harvest.

In the BigQuery console, select the full dump table to use as a source and click on Query Table. The SQL statement will start out as something like SELECT * FROM [dumps.vertnet_latest] LIMIT 1000. Remove the clause LIMIT 1000 and add whatever filter creates the subset of interest. For example, for the bounding-box version of the Amazonian Biodiversity subset, use

SELECT * 
FROM [dumps.vertnet_latest]
WHERE
float(decimalLatitude)<13 and 
float(decimalLatitude)>-23 and 
float(decimalLongitude)<-34 and 
float(decimalLongitude)>-82

For the version of the Amazonian Biodiversity portal based on first-level administrative areas that contain the Amazon Biome, use

SELECT * 
FROM [dumps.vertnet_latest]
WHERE
(
lower(country)='bolivia' and
(
lower(stateprovince) like '%beni%' or 
lower(stateprovince) like '%cochabamba%' or
lower(stateprovince) like '%la paz%' or 
lower(stateprovince) like '%pando%' or
lower(stateprovince) like '%santa cruz%'
)
)
or
(
(lower(country)='brasil' or lower(country)='brazil') and
(
lower(stateprovince) like '%acre%' or
lower(stateprovince) like 'amap%' or
lower(stateprovince) like 'amazon%' or
lower(stateprovince) like 'maranh%' or
(lower(stateprovince) like '%mato grosso%' and lower(stateprovince) not like '%sul%') or 
(lower(stateprovince) like '%par%' and lower(stateprovince) not like '%paran%' and lower(stateprovince) not like '%ba%' and lower(stateprovince) not like '%ica%') or  
lower(stateprovince) like '%rond%' or 
lower(stateprovince) like '%roraima%' or
lower(stateprovince) like '%santa cat%' or
lower(stateprovince) like '%tocantins%'
)
)
or
(
lower(country)='colombia' and
(
lower(stateprovince) like '%amazon%' or
lower(stateprovince) like '%caquet%' or
(lower(stateprovince) like '%cauca%' and lower(stateprovince) not like '%valle%') or
lower(stateprovince) like '%guain%' or
lower(stateprovince) like '%guaviare%' or
lower(stateprovince) like '%meta%' or
lower(stateprovince) like '%nari%' or 
lower(stateprovince) like '%putumayo%' or 
lower(stateprovince) like '%vaup%' or 
lower(stateprovince) like '%vichada%'
)
)
or
(
lower(country)='ecuador' and
(
lower(stateprovince) like '%morona%' or 
lower(stateprovince) like '%napo%' or 
lower(stateprovince) like '%orellana%' or 
lower(stateprovince) like '%pastaza%' or 
lower(stateprovince) like '%sucumb%'
)
)
or 
lower(country) like '%guiana%' or 
lower(country) like '%guyana%' or
(
lower(country) like 'per%' and
(
lower(stateprovince) like 'amazon%' or
lower(stateprovince) like 'ayacucho%' or
lower(stateprovince) like '%cu%co%' or
lower(stateprovince) like 'huancavelica%' or
lower(stateprovince) like '%jun%' or
lower(stateprovince) like '%lor%' or
lower(stateprovince) like '%madre%' or
lower(stateprovince) like '%pasco%' or
lower(stateprovince) like '%puno%' or
lower(stateprovince) like '%san mart%' or
lower(stateprovince) like '%ucayal%'
)
)
or
lower(country) like 'surinam%' or 
(
lower(country) like 'venez%' and
( 
lower(stateprovince) like '%amazon%' or
lower(stateprovince) like '%anzo%' or
lower(stateprovince) like '%bol%v%' or
lower(stateprovince) like '%amac%' or
lower(stateprovince) like '%mon%'
)
)

Click on the Show Options button. Click on the Select Table button. Enter vertnet_latest_amazonia as the Table ID. Click on the OK button.

Select Overwrite Table. Check Allow Large Results. Check Flatten Results. Select Interactive.

Click on the Run Query button. When the job finishes, the table vertnet_latest_amazonia will contain the filtered contents of the source dump table.
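The same filtered table can also be created without the console. The following is a minimal sketch using the google-cloud-bigquery Python client library with the bounding-box filter from above; the project id vertnet-portal is an assumption, and the job options mirror the console settings (Overwrite Table, Allow Large Results, Flatten Results, legacy SQL).

from google.cloud import bigquery

# Sketch: run the bounding-box filter query into dumps.vertnet_latest_amazonia.
# The project id "vertnet-portal" is an assumption for illustration only.
client = bigquery.Client(project="vertnet-portal")

query = """
SELECT *
FROM [dumps.vertnet_latest]
WHERE
float(decimalLatitude)<13 and
float(decimalLatitude)>-23 and
float(decimalLongitude)<-34 and
float(decimalLongitude)>-82
"""

job_config = bigquery.QueryJobConfig(
    destination=client.dataset("dumps").table("vertnet_latest_amazonia"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # Overwrite Table
    allow_large_results=True,                                     # Allow Large Results
    flatten_results=True,                                         # Flatten Results
    use_legacy_sql=True,                                          # [dataset.table] syntax is legacy SQL
)
client.query(query, job_config=job_config).result()  # wait for the job to finish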

Create Indexable Subset Table

Prerequisites: The table vertnet_latest_amazonia is populated with the desired full dump from the VertNet harvest. If not, follow the step "Create vertnet_latest_amazonia Table".

The purpose of this step is to create a file structure matching the one created by harvesting with gulo. For now, the Amazonia Portal uses the gulo harvest structure used in VertNet in 2014 (see the gulo branch dwc2013, https://github.com/VertNet/gulo/tree/dwc2013).

In the BigQuery console, select the vertnet_latest_amazonia table to use as a source and click on Query Table. The SQL statement will start out as something like SELECT * FROM [dumps.vertnet_latest_amazonia] LIMIT 1000. To create the table to use for Amazonian Biodiversity indexing in the dwc2013 style, use

SELECT 
pubdate,
url,
eml,
dwca,
title,
icode,
"" as description,
contact,
orgname,
email,
emlrights,
"1" as count,
citation,
networks,
"" as harvestid,
id,
associatedmedia,
associatedoccurrences,
associatedreferences,
associatedsequences,
associatedtaxa,
basisofrecord,
bed,
behavior,
catalognumber,
collectioncode,
collectionid,
continent,
coordinateprecision,
coordinateuncertaintyinmeters,
country,
countrycode,
county,
datageneralizations,
dateidentified,
day,
decimallatitude,
decimallongitude,
disposition,
earliestageorloweststage,
earliesteonorlowesteonothem,
earliestepochorlowestseries,
earliesteraorlowesterathem,
earliestperiodorlowestsystem,
enddayofyear,
establishmentmeans,
"" as eventattributes,
eventdate,
eventid,
eventremarks,
eventtime,
fieldnotes,
fieldnumber,
footprintspatialfit,
footprintwkt,
formation,
geodeticdatum,
geologicalcontextid,
georeferenceprotocol,
georeferenceremarks,
georeferencesources,
georeferenceverificationstatus,
georeferencedby,
group,
habitat,
highergeography,
highergeographyid,
highestbiostratigraphiczone,
"" as identificationattributes,
identificationid,
identificationqualifier,
identificationreferences,
identificationremarks,
identifiedby,
individualcount,
organismid,
informationwithheld,
institutioncode,
island,
islandgroup,
latestageorhigheststage,
latesteonorhighesteonothem,
latestepochorhighestseries,
latesteraorhighesterathem,
latestperiodorhighestsystem,
lifestage,
lithostratigraphicterms,
locality,
"" as locationattributes,
locationid,
locationremarks,
lowestbiostratigraphiczone,
maximumdepthinmeters,
maximumdistanceabovesurfaceinmeters,
maximumelevationinmeters,
"" as measurementaccuracy,
"" as measurementdeterminedby,
"" as measurementdetermineddate,
"" as measurementid,
"" as measurementmethod,
"" as measurementremarks,
"" as measurementtype,
"" as measurementunit,
"" as measurementvalue,
member,
minimumdepthinmeters,
minimumdistanceabovesurfaceinmeters,
minimumelevationinmeters,
month,
"" as occurrenceattributes,
"" as occurrencedetails,
occurrenceid,
occurrenceremarks,
othercatalognumbers,
pointradiusspatialfit,
preparations,
previousidentifications,
recordnumber,
recordedby,
"" as relatedresourceid,
"" as relationshipaccordingto,
"" as relationshipestablisheddate,
"" as relationshipofresource,
"" as relationshipremarks,
reproductivecondition,
"" as resourceid,
"" as resourcerelationshipid,
samplingprotocol,
sex,
startdayofyear,
stateprovince,
"" as taxonattributes,
typestatus,
verbatimcoordinatesystem,
verbatimcoordinates,
verbatimdepth,
verbatimelevation,
verbatimeventdate,
verbatimlatitude,
verbatimlocality,
verbatimlongitude,
waterbody,
year,
footprintsrs,
georeferenceddate,
identificationverificationstatus,
institutionid,
locationaccordingto,
municipality,
occurrencestatus,
ownerinstitutioncode,
samplingeffort,
verbatimsrs,
"" as locationaccordingto7,
taxonid,
taxonconceptid,
datasetid,
datasetname,
"" as source,
modified,
accessrights,
license,
rightsholder, 
language,
higherclassification,
kingdom,
phylum,
class,
order,
family,
genus,
subgenus,
specificepithet,
scientificname,
scientificnameid,
vernacularname,
taxonrank,
verbatimtaxonrank,
"" as infraspecificmarker,
scientificnameauthorship,
nomenclaturalcode,
namepublishedin,
namepublishedinid,
taxonomicstatus,
nomenclaturalstatus,
nameaccordingto,
nameaccordingtoid,
parentnameusageid,
parentnameusage,
originalnameusageid,
originalnameusage,
acceptednameusageid,
acceptednameusage,
taxonremarks,
dynamicproperties,
namepublishedinyear
FROM [dumps.vertnet_latest_amazonia]

Click on the Show Options button. Click on the Select Table button. Enter vertnet_latest_amazonia_index2013 as the Table ID. Click on the OK button.

Select Overwrite Table. Check Allow Large Results. Check Flatten Results. Select Interactive.

Click on the Run Query button. When the job finishes, the table vertnet_latest_amazonia_index2013 will contain the filtered contents with the fields needed for gulo indexing.

In the BigQuery console, hover over the vertnet_latest_amazonia_index2013 table. An icon will appear that opens a menu. Open the menu and select Export table. Select CSV. Select GZIP. Enter exactly the following for the Google Cloud Storage URI:

gs://vertnet-bigquery/vertnet_latest_amazonia_index2013.csv.gz
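The export can also be scripted rather than run from the console. The following is a minimal sketch using the google-cloud-bigquery Python client library; the project id vertnet-portal is an assumption, and the options mirror the console choices (CSV, GZIP).

from google.cloud import bigquery

# Sketch: export the index table to Google Cloud Storage as gzipped CSV.
# The project id "vertnet-portal" is an assumption for illustration only.
client = bigquery.Client(project="vertnet-portal")

extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)
client.extract_table(
    client.dataset("dumps").table("vertnet_latest_amazonia_index2013"),
    "gs://vertnet-bigquery/vertnet_latest_amazonia_index2013.csv.gz",
    job_config=extract_config,
).result()  # wait for the export to finish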

This will generate a file in Google Cloud Storage that must be processed to strip the header row and split it into files of 10,000 records each before indexing. To do so, download the file to a machine that is configured with gsutil and authorized to access the VertNet Google Cloud Storage, then run the following commands.

gsutil cp gs://vertnet-bigquery/vertnet_latest_amazonia_index2013.csv.gz .
gunzip vertnet_latest_amazonia_index2013.csv.gz
sed '1d' vertnet_latest_amazonia_index2013.csv > amazonia.csv

At this point we have a CSV file with no header row. Now convert the CSV file into a tab-delimited file for indexing with the following Python script.

# Python 2 script: convert the comma-delimited export into a tab-delimited file.
# QUOTE_NONE writes fields without quoting; if a field contains a tab or a
# newline the writer raises an error, flagging a record that needs cleaning.
import csv

with open('amazonia.tab', 'wb') as tabfile:
    mywriter = csv.writer(tabfile, delimiter='\t', quotechar=None, quoting=csv.QUOTE_NONE)
    with open('amazonia.csv', 'rb') as csvfile:
        myreader = csv.reader(csvfile, dialect='excel')
        for row in myreader:
            mywriter.writerow(row)

The output is the tab-delimited file amazonia.tab. Now split that into shards for indexing and put them on Google Cloud Storage.

split -l 10000 amazonia.tab amazonia_shard
gsutil -m cp amazonia_shard* gs://vertnet-bigquery/vertnet-latest-amazonia-index2013

When finished, there should be a set of files on Google Cloud Storage at vertnet-bigquery/vertnet-latest-amazonia-index2013 ready for indexing. The Amazonian Biodiversity portal uses the namespace index-2014-02-05a. It is best to clear the index before loading the new data, because changes in contributing data sets do not necessarily result in record replacement in the index as they should (this happens when the constructed key changes as a result of changes in the underlying data). Since there are so many contributing resources, it is easiest to clear the whole index.

To clear the index for the Amazonian Biodiversity data store, make sure the index-clean task queue is enabled (see https://appengine.google.com/queues?&app_id=s~vertnet-portal&version_id=indexer.381446997207858979). Normally it is disabled because it is so dangerous. To enable it, the application has to be re-deployed from the branch dwc2013 with the index-clean task enabled in the queue.yaml file, where it is otherwise commented out:

# index-clean is dangerous - turn it on only if you really need to
#- name: index-clean
#  rate: 35/s
#  retry_parameters:
#    task_retry_limit: 7
#    task_age_limit: 60m
#    min_backoff_seconds: 30
#    max_backoff_seconds: 960
#    max_doublings: 7

Redeploy to the default application using

appcfg.py update .

Once enabled, the index can be cleared by requesting the following URL:

http://indexer.vertnet-portal.appspot.com/index-clean?namespace=index-2014-02-05a&index_name=dwc

Alternatively, if you know which resources must be cleared, you can delete them one at a time using the following syntax:

http://indexer.vertnet-portal.appspot.com/index-delete-resource?resource=cas/cas-herpetology-herp&index_name=dwc&namespace=index-2014-02-05a

You can monitor progress of index cleaning or index resource deletion at the following URL:

https://appengine.google.com/logs?&app_id=s~vertnet-portal&version_id=indexer.381446997207858979

Once it has been cleared, load the index using the VertNet dwc-indexer (https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow). For the Amazonian Biodiversity portal, use the old-style indexer as captured in branch dwc2013 (https://github.com/VertNet/dwc-indexer/tree/dwc2013). This is the version of the application deployed at http://indexer.vertnet-portal.appspot.com/ as of 2015-02-10.

http://indexer.vertnet-portal.appspot.com/index-gcs-path?gcs_path=vertnet-bigquery/vertnet-latest-amazonia-index2013/*&shard_count=50&namespace=index-2014-02-05a&index_name=dwc

Make sure the shard count is at least as large as the number of files in the folder vertnet-bigquery/vertnet-latest-amazonia-index2013. This one command loads all of the records for Amazonia that were in the VertNet BigQuery snapshot. There may be other harvested resources that are not in the snapshot, such as NYBG, that need to be indexed as well. This can be done in the usual way, using the same URL pattern with the location of the NYBG harvest files.
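To double-check the number of shard files before submitting the indexing request, they can be counted with gsutil ls, or programmatically. The following is a minimal sketch using the google-cloud-storage Python client library; the project id vertnet-portal is an assumption.

from google.cloud import storage

# Sketch: count the shard files uploaded for indexing.
# The project id "vertnet-portal" is an assumption for illustration only.
client = storage.Client(project="vertnet-portal")
bucket = client.bucket("vertnet-bigquery")
shards = list(bucket.list_blobs(prefix="vertnet-latest-amazonia-index2013"))
print(len(shards))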

You can monitor the progress of indexing at

http://indexer.vertnet-portal.appspot.com/mapreduce/status