Making Snapshots - VertNet/post-harvest-processor GitHub Wiki
[Page derived from https://github.com/VertNet/bigquery/wiki/Making-Snapshots]
This page describes the process for creating snapshots of VertNet. New snapshots should be made when a new full dump is created.
Note: If the schema of the full dump changes and the change is to be propagated to the vertnet_latest table in BigQuery, the SELECT statements in this document must be amended accordingly. If they are, notify AmphibiaWeb of the change, as their automated database update relies on this schema.
Prerequisites: A full dump table with name of the form full_yyyymmdd has been made in BigQuery using the bigquery_loader.py script (see https://github.com/VertNet/post-harvest-processor/blob/master/lib/bigquery_loader.py).
Create vertnet_latest Table
This step is unnecessary if the table vertnet_latest is already populated with the latest full dump from the VertNet harvest.
In the BigQuery console, select the full dump table to use as the source. A dropdown button appears next to the table name. Select Copy Table, enter vertnet_latest as the Table ID, and click OK. When the job finishes, the table vertnet_latest will contain the contents of the source dump table.
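For scripted snapshots, the console Copy Table step has a command-line equivalent via the bq tool. The sketch below is illustrative only; the example dump date and the assumption of a configured `bq` client are not part of the procedure above.

```shell
# Sketch: copy the latest full dump into vertnet_latest with the bq CLI.
# DUMP_DATE is a hypothetical example; substitute the actual full dump date.
DUMP_DATE=20160101
SRC="dumps.full_${DUMP_DATE}"
if command -v bq >/dev/null 2>&1; then
  # -f overwrites vertnet_latest if it already exists, matching the
  # intent of refreshing the table contents from the new dump.
  bq cp -f "${SRC}" dumps.vertnet_latest
fi
```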
Create VertNet Complete Snapshot File
Note: We don't currently make a full VertNet data set.
Prerequisites: The table vertnet_latest is populated with the desired full dump from the VertNet harvest. If not, follow the step "Create vertnet_latest Table".
In the BigQuery console, put the mouse cursor over the vertnet_latest table. An icon will appear that allows you to open a menu. Open the menu and select Export table. Select CSV. Select GZIP. Enter exactly the following for the Google Cloud Storage URI:
gs://vertnet-byclass/vertnet_latest.csv.gz*
This will generate numerous shards of the complete latest VertNet dump on Google Cloud Storage. The job can be monitored in the BigQuery console under Job History. When it is finished, the shards on Google Cloud Storage must be processed to strip the header row from all but the first shard and to concatenate them all into a single file. To do so, run the script process_vertnet_latest.sh on a machine that is configured with gsutil and authorized to access the VertNet Google Cloud Storage.
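The header-stripping step that process_vertnet_latest.sh performs can be sketched as a small shell function. This is a simplified illustration operating on already-downloaded, gunzipped shards; the function name and file names are hypothetical, not taken from the actual script.

```shell
# Sketch: keep the header row from the first shard only, strip it from
# every subsequent shard, and concatenate everything into one CSV.
merge_shards() {
  out=$1
  shift
  cp "$1" "$out"                    # first shard: keep its header row
  shift
  for shard in "$@"; do
    tail -n +2 "$shard" >> "$out"   # later shards: drop the header row
  done
}

# Usage (hypothetical local file names):
# merge_shards vertnet_latest.csv shard-000.csv shard-001.csv shard-002.csv
```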
Create Amphibian Subset Table
This step is unnecessary if the table vertnet_latest_amphibians is already populated from the latest full dump from the VertNet harvest.
In the BigQuery console, select the table vertnet_latest as a source. Click on Query Table. The generated SQL statement will be SELECT * FROM [dumps.vertnet_latest] LIMIT 1000
. Replace it with the statement below. Two unusual things are done here: first, a column is added to mark the beginning of each record, which facilitates the removal of non-printing characters that can wreak havoc on the record structure; second, the 'references' field is renamed to dc_references to avoid an SQL syntax error in BigQuery.
SELECT
"begin" as beginrecord,
icode,
title,
citation,
contact,
email,
emlrights,
gbifdatasetid,
gbifpublisherid,
doi,
migrator,
networks,
orgcountry,
orgname,
orgstateprovince,
pubdate,
source_url,
iptrecordid,
associatedmedia,
associatedoccurrences,
associatedorganisms,
associatedreferences,
associatedsequences,
associatedtaxa,
bed,
behavior,
catalognumber,
continent,
coordinateprecision,
coordinateuncertaintyinmeters,
country,
countrycode,
county,
dateidentified,
day,
decimallatitude,
decimallongitude,
disposition,
earliestageorloweststage,
earliesteonorlowesteonothem,
earliestepochorlowestseries,
earliesteraorlowesterathem,
earliestperiodorlowestsystem,
enddayofyear,
establishmentmeans,
eventdate,
eventid,
eventremarks,
eventtime,
fieldnotes,
fieldnumber,
footprintspatialfit,
footprintsrs,
footprintwkt,
formation,
geodeticdatum,
geologicalcontextid,
georeferencedby,
georeferenceddate,
georeferenceprotocol,
georeferenceremarks,
georeferencesources,
georeferenceverificationstatus,
group,
habitat,
highergeography,
highergeographyid,
highestbiostratigraphiczone,
identificationid,
identificationqualifier,
identificationreferences,
identificationremarks,
identificationverificationstatus,
identifiedby,
individualcount,
island,
islandgroup,
latestageorhigheststage,
latesteonorhighesteonothem,
latestepochorhighestseries,
latesteraorhighesterathem,
latestperiodorhighestsystem,
lifestage,
lithostratigraphicterms,
locality,
locationaccordingto,
locationid,
locationremarks,
lowestbiostratigraphiczone,
materialsampleid,
maximumdepthinmeters,
maximumdistanceabovesurfaceinmeters,
maximumelevationinmeters,
member,
minimumdepthinmeters,
minimumdistanceabovesurfaceinmeters,
minimumelevationinmeters,
month,
municipality,
occurrenceid,
occurrenceremarks,
occurrencestatus,
organismid,
organismname,
organismremarks,
organismscope,
othercatalognumbers,
pointradiusspatialfit,
preparations,
previousidentifications,
recordedby,
recordnumber,
reproductivecondition,
samplingeffort,
samplingprotocol,
sex,
startdayofyear,
stateprovince,
typestatus,
verbatimcoordinates,
verbatimcoordinatesystem,
verbatimdepth,
verbatimelevation,
verbatimeventdate,
verbatimlatitude,
verbatimlocality,
verbatimlongitude,
verbatimsrs,
waterbody,
year,
dctype,
modified,
language,
license,
rightsholder,
accessrights,
bibliographiccitation,
dumps.vertnet_latest.references as dc_references,
institutionid,
collectionid,
datasetid,
institutioncode,
collectioncode,
datasetname,
ownerinstitutioncode,
basisofrecord,
informationwithheld,
datageneralizations,
dynamicproperties,
scientificnameid,
namepublishedinid,
scientificname,
acceptednameusage,
originalnameusage,
namepublishedin,
namepublishedinyear,
higherclassification,
kingdom,
phylum,
class,
order,
family,
genus,
subgenus,
specificepithet,
infraspecificepithet,
taxonrank,
verbatimtaxonrank,
scientificnameauthorship,
vernacularname,
nomenclaturalcode,
taxonomicstatus,
keyname,
haslicense,
vntype,
rank,
mappable,
hashid,
hastypestatus,
wascaptive,
wasinvasive,
hastissue,
hasmedia,
isfossil,
haslength,
haslifestage,
hasmass,
hassex,
lengthinmm,
massing,
lengthunitsinferred,
massunitsinferred,
underivedlifestage,
underivedsex,
FROM [dumps.vertnet_latest]
WHERE lower(class) like '%amph%'
Click on the Show Options button. Click on the Select Table button. Enter vertnet_latest_amphibians as the Table ID. Click on the OK button.
Select Overwrite Table. Check Allow Large Results. Check Flatten Results. Select Interactive.
Click on the Run Query button. When the job finishes, the table vertnet_latest_amphibians will contain the complete records for amphibians and related taxa. To see which classes are included, run the following query:
SELECT class FROM [dumps.vertnet_latest_amphibians] GROUP BY class
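The console options above (destination table, overwrite, large results, flattening) map onto bq query flags. A hedged sketch, assuming the long SELECT statement has been saved locally as amphibians.sql (a hypothetical file name) and that `bq` is configured for the VertNet project:

```shell
# Sketch: scripted equivalent of the console query-to-table steps.
FLAGS="--destination_table=dumps.vertnet_latest_amphibians --replace --allow_large_results --flatten_results"
if command -v bq >/dev/null 2>&1; then
  # FLAGS is intentionally left unquoted so it word-splits into options.
  bq query $FLAGS "$(cat amphibians.sql)"
fi
```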
Create VertNet Amphibian Snapshot
Prerequisites: The table vertnet_latest_amphibians is populated from the desired full dump from the VertNet harvest. If not, follow the step "Create Amphibian Subset Table".
In the BigQuery console, put the mouse cursor over the vertnet_latest_amphibians table. An icon will appear that allows you to open a menu. Open the menu and select Export table. Select CSV. Select GZIP. Enter exactly the following for the Google Cloud Storage URI:
gs://vertnet-byclass/vertnet_latest_amphibians.csv.gz*
This will generate numerous shards of the latest VertNet amphibians on Google Cloud Storage. The job can be monitored in the BigQuery console under Job History. When it is finished, the shards on Google Cloud Storage must be processed to strip the header row from all but the first shard and to concatenate them all into a single file. To do so, check the script process_vertnet_latest_amphibians-GCS.sh for the number of files to process, then run it on a machine that is configured with gsutil and authorized to access the VertNet Google Cloud Storage.
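The export itself can likewise be scripted with bq extract; a sketch under the same assumptions (configured `bq` tool with write access to the bucket):

```shell
# Sketch: scripted equivalent of the console Export table step
# (CSV format, GZIP compression, wildcard URI for sharded output).
URI='gs://vertnet-byclass/vertnet_latest_amphibians.csv.gz*'
if command -v bq >/dev/null 2>&1; then
  bq extract --destination_format=CSV --compression=GZIP \
    dumps.vertnet_latest_amphibians "$URI"
fi
```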
Create Fishes Subset Table
This step is unnecessary if the table vertnet_latest_fishes is already populated from the latest full dump from the VertNet harvest.
In the BigQuery console, select the table vertnet_latest as a source. Click on Query Table. The generated SQL statement will be SELECT * FROM [dumps.vertnet_latest] LIMIT 1000
. Replace it with
SELECT
"begin" as beginrecord,
icode,
title,
citation,
contact,
email,
emlrights,
gbifdatasetid,
gbifpublisherid,
doi,
migrator,
networks,
orgcountry,
orgname,
orgstateprovince,
pubdate,
source_url,
iptrecordid,
associatedmedia,
associatedoccurrences,
associatedorganisms,
associatedreferences,
associatedsequences,
associatedtaxa,
bed,
behavior,
catalognumber,
continent,
coordinateprecision,
coordinateuncertaintyinmeters,
country,
countrycode,
county,
dateidentified,
day,
decimallatitude,
decimallongitude,
disposition,
earliestageorloweststage,
earliesteonorlowesteonothem,
earliestepochorlowestseries,
earliesteraorlowesterathem,
earliestperiodorlowestsystem,
enddayofyear,
establishmentmeans,
eventdate,
eventid,
eventremarks,
eventtime,
fieldnotes,
fieldnumber,
footprintspatialfit,
footprintsrs,
footprintwkt,
formation,
geodeticdatum,
geologicalcontextid,
georeferencedby,
georeferenceddate,
georeferenceprotocol,
georeferenceremarks,
georeferencesources,
georeferenceverificationstatus,
group,
habitat,
highergeography,
highergeographyid,
highestbiostratigraphiczone,
identificationid,
identificationqualifier,
identificationreferences,
identificationremarks,
identificationverificationstatus,
identifiedby,
individualcount,
island,
islandgroup,
latestageorhigheststage,
latesteonorhighesteonothem,
latestepochorhighestseries,
latesteraorhighesterathem,
latestperiodorhighestsystem,
lifestage,
lithostratigraphicterms,
locality,
locationaccordingto,
locationid,
locationremarks,
lowestbiostratigraphiczone,
materialsampleid,
maximumdepthinmeters,
maximumdistanceabovesurfaceinmeters,
maximumelevationinmeters,
member,
minimumdepthinmeters,
minimumdistanceabovesurfaceinmeters,
minimumelevationinmeters,
month,
municipality,
occurrenceid,
occurrenceremarks,
occurrencestatus,
organismid,
organismname,
organismremarks,
organismscope,
othercatalognumbers,
pointradiusspatialfit,
preparations,
previousidentifications,
recordedby,
recordnumber,
reproductivecondition,
samplingeffort,
samplingprotocol,
sex,
startdayofyear,
stateprovince,
typestatus,
verbatimcoordinates,
verbatimcoordinatesystem,
verbatimdepth,
verbatimelevation,
verbatimeventdate,
verbatimlatitude,
verbatimlocality,
verbatimlongitude,
verbatimsrs,
waterbody,
year,
dctype,
modified,
language,
license,
rightsholder,
accessrights,
bibliographiccitation,
dumps.vertnet_latest.references as dc_references,
institutionid,
collectionid,
datasetid,
institutioncode,
collectioncode,
datasetname,
ownerinstitutioncode,
basisofrecord,
informationwithheld,
datageneralizations,
dynamicproperties,
scientificnameid,
namepublishedinid,
scientificname,
acceptednameusage,
originalnameusage,
namepublishedin,
namepublishedinyear,
higherclassification,
kingdom,
phylum,
class,
order,
family,
genus,
subgenus,
specificepithet,
infraspecificepithet,
taxonrank,
verbatimtaxonrank,
scientificnameauthorship,
vernacularname,
nomenclaturalcode,
taxonomicstatus,
keyname,
haslicense,
vntype,
rank,
mappable,
hashid,
hastypestatus,
wascaptive,
wasinvasive,
hastissue,
hasmedia,
isfossil,
haslength,
haslifestage,
hasmass,
hassex,
lengthinmm,
massing,
lengthunitsinferred,
massunitsinferred,
underivedlifestage,
underivedsex,
FROM [dumps.vertnet_latest]
WHERE
lower(class)='acanthodii' OR
lower(class) LIKE '%pteryg%' OR
lower(class)='agnatha' OR
lower(class) LIKE '%cephal%' OR
lower(class) LIKE '%ichthy%' OR
lower(class)='dipnoi' OR
lower(class) LIKE '%elasmobr%' OR
lower(class)='leptocardii' OR
lower(class)='myxini' OR
lower(class)='pisces' OR
lower(class)='galeaspida' OR
lower(class)='hyperoartia' OR
lower(class)='osteostraci' OR
lower(class)='petromyzontidae' OR
lower(class)='pteraspidomorphi' OR
lower(class)='placodermi'
Click on the Show Options button. Click on the Select Table button. Enter vertnet_latest_fishes as the Table ID. Click on the OK button.
Select Overwrite Table. Check Allow Large Results. Check Flatten Results. Select Interactive.
Click on the Run Query button. When the job finishes, the table vertnet_latest_fishes will contain the complete records for fishes and related taxa.
Create VertNet Fishes Snapshot
Prerequisites: The table vertnet_latest_fishes is populated from the desired full dump from the VertNet harvest. If not, follow the step "Create Fishes Subset Table".
In the BigQuery console, put the mouse cursor over the vertnet_latest_fishes table. An icon will appear that allows you to open a menu. Open the menu and select Export table. Select CSV. Select GZIP. Enter exactly the following for the Google Cloud Storage URI:
gs://vertnet-byclass/vertnet_latest_fishes.csv.gz*
This will generate numerous shards of the latest VertNet fishes on Google Cloud Storage. The job can be monitored in the BigQuery console under Job History. When it is finished, the shards on Google Cloud Storage must be processed to strip the header row from all but the first shard and to concatenate them all into a single file. To do so, check the script process_vertnet_latest_fishes-GCS.sh for the number of files to process, then run it on a machine that is configured with gsutil and authorized to access the VertNet Google Cloud Storage.
Create Birds Subset Table
This step is unnecessary if the table vertnet_latest_birds is already populated from the latest full dump from the VertNet harvest.
In the BigQuery console, select the table vertnet_latest as a source. Click on Query Table. The generated SQL statement will be SELECT * FROM [dumps.vertnet_latest] LIMIT 1000
. Replace it with
SELECT
"begin" as beginrecord,
icode,
title,
citation,
contact,
email,
emlrights,
gbifdatasetid,
gbifpublisherid,
doi,
migrator,
networks,
orgcountry,
orgname,
orgstateprovince,
pubdate,
source_url,
iptrecordid,
associatedmedia,
associatedoccurrences,
associatedorganisms,
associatedreferences,
associatedsequences,
associatedtaxa,
bed,
behavior,
catalognumber,
continent,
coordinateprecision,
coordinateuncertaintyinmeters,
country,
countrycode,
county,
dateidentified,
day,
decimallatitude,
decimallongitude,
disposition,
earliestageorloweststage,
earliesteonorlowesteonothem,
earliestepochorlowestseries,
earliesteraorlowesterathem,
earliestperiodorlowestsystem,
enddayofyear,
establishmentmeans,
eventdate,
eventid,
eventremarks,
eventtime,
fieldnotes,
fieldnumber,
footprintspatialfit,
footprintsrs,
footprintwkt,
formation,
geodeticdatum,
geologicalcontextid,
georeferencedby,
georeferenceddate,
georeferenceprotocol,
georeferenceremarks,
georeferencesources,
georeferenceverificationstatus,
group,
habitat,
highergeography,
highergeographyid,
highestbiostratigraphiczone,
identificationid,
identificationqualifier,
identificationreferences,
identificationremarks,
identificationverificationstatus,
identifiedby,
individualcount,
island,
islandgroup,
latestageorhigheststage,
latesteonorhighesteonothem,
latestepochorhighestseries,
latesteraorhighesterathem,
latestperiodorhighestsystem,
lifestage,
lithostratigraphicterms,
locality,
locationaccordingto,
locationid,
locationremarks,
lowestbiostratigraphiczone,
materialsampleid,
maximumdepthinmeters,
maximumdistanceabovesurfaceinmeters,
maximumelevationinmeters,
member,
minimumdepthinmeters,
minimumdistanceabovesurfaceinmeters,
minimumelevationinmeters,
month,
municipality,
occurrenceid,
occurrenceremarks,
occurrencestatus,
organismid,
organismname,
organismremarks,
organismscope,
othercatalognumbers,
pointradiusspatialfit,
preparations,
previousidentifications,
recordedby,
recordnumber,
reproductivecondition,
samplingeffort,
samplingprotocol,
sex,
startdayofyear,
stateprovince,
typestatus,
verbatimcoordinates,
verbatimcoordinatesystem,
verbatimdepth,
verbatimelevation,
verbatimeventdate,
verbatimlatitude,
verbatimlocality,
verbatimlongitude,
verbatimsrs,
waterbody,
year,
dctype,
modified,
language,
license,
rightsholder,
accessrights,
bibliographiccitation,
dumps.vertnet_latest.references as dc_references,
institutionid,
collectionid,
datasetid,
institutioncode,
collectioncode,
datasetname,
ownerinstitutioncode,
basisofrecord,
informationwithheld,
datageneralizations,
dynamicproperties,
scientificnameid,
namepublishedinid,
scientificname,
acceptednameusage,
originalnameusage,
namepublishedin,
namepublishedinyear,
higherclassification,
kingdom,
phylum,
class,
order,
family,
genus,
subgenus,
specificepithet,
infraspecificepithet,
taxonrank,
verbatimtaxonrank,
scientificnameauthorship,
vernacularname,
nomenclaturalcode,
taxonomicstatus,
keyname,
haslicense,
vntype,
rank,
mappable,
hashid,
hastypestatus,
wascaptive,
wasinvasive,
hastissue,
hasmedia,
isfossil,
haslength,
haslifestage,
hasmass,
hassex,
lengthinmm,
massing,
lengthunitsinferred,
massunitsinferred,
underivedlifestage,
underivedsex,
FROM [dumps.vertnet_latest]
WHERE lower(class) like 'av%' OR
lower(class) like '%bird%'
Click on the Show Options button. Click on the Select Table button. Enter vertnet_latest_birds as the Table ID. Click on the OK button.
Select Overwrite Table. Check Allow Large Results. Check Flatten Results. Select Interactive.
Click on the Run Query button. When the job finishes, the table vertnet_latest_birds will contain the complete records for birds.
Create VertNet Birds Snapshot
Prerequisites: The table vertnet_latest_birds is populated from the desired full dump from the VertNet harvest. If not, follow the step "Create Birds Subset Table".
In the BigQuery console, put the mouse cursor over the vertnet_latest_birds table. An icon will appear that allows you to open a menu. Open the menu and select Export table. Select CSV. Select GZIP. Enter exactly the following for the Google Cloud Storage URI:
gs://vertnet-byclass/vertnet_latest_birds.csv.gz*
This will generate numerous shards of the latest VertNet birds on Google Cloud Storage. The job can be monitored in the BigQuery console under Job History. When it is finished, the shards on Google Cloud Storage must be processed to strip the header row from all but the first shard and to concatenate them all into a single file. To do so, check the script process_vertnet_latest_birds-GCS.sh for the number of files to process, then run it on a machine that is configured with gsutil and authorized to access the VertNet Google Cloud Storage.
Create Mammals Subset Table
This step is unnecessary if the table vertnet_latest_mammals is already populated from the latest full dump from the VertNet harvest.
In the BigQuery console, select the table vertnet_latest as a source. Click on Query Table. The generated SQL statement will be SELECT * FROM [dumps.vertnet_latest] LIMIT 1000
. Replace it with
SELECT
"begin" as beginrecord,
icode,
title,
citation,
contact,
email,
emlrights,
gbifdatasetid,
gbifpublisherid,
doi,
migrator,
networks,
orgcountry,
orgname,
orgstateprovince,
pubdate,
source_url,
iptrecordid,
associatedmedia,
associatedoccurrences,
associatedorganisms,
associatedreferences,
associatedsequences,
associatedtaxa,
bed,
behavior,
catalognumber,
continent,
coordinateprecision,
coordinateuncertaintyinmeters,
country,
countrycode,
county,
dateidentified,
day,
decimallatitude,
decimallongitude,
disposition,
earliestageorloweststage,
earliesteonorlowesteonothem,
earliestepochorlowestseries,
earliesteraorlowesterathem,
earliestperiodorlowestsystem,
enddayofyear,
establishmentmeans,
eventdate,
eventid,
eventremarks,
eventtime,
fieldnotes,
fieldnumber,
footprintspatialfit,
footprintsrs,
footprintwkt,
formation,
geodeticdatum,
geologicalcontextid,
georeferencedby,
georeferenceddate,
georeferenceprotocol,
georeferenceremarks,
georeferencesources,
georeferenceverificationstatus,
group,
habitat,
highergeography,
highergeographyid,
highestbiostratigraphiczone,
identificationid,
identificationqualifier,
identificationreferences,
identificationremarks,
identificationverificationstatus,
identifiedby,
individualcount,
island,
islandgroup,
latestageorhigheststage,
latesteonorhighesteonothem,
latestepochorhighestseries,
latesteraorhighesterathem,
latestperiodorhighestsystem,
lifestage,
lithostratigraphicterms,
locality,
locationaccordingto,
locationid,
locationremarks,
lowestbiostratigraphiczone,
materialsampleid,
maximumdepthinmeters,
maximumdistanceabovesurfaceinmeters,
maximumelevationinmeters,
member,
minimumdepthinmeters,
minimumdistanceabovesurfaceinmeters,
minimumelevationinmeters,
month,
municipality,
occurrenceid,
occurrenceremarks,
occurrencestatus,
organismid,
organismname,
organismremarks,
organismscope,
othercatalognumbers,
pointradiusspatialfit,
preparations,
previousidentifications,
recordedby,
recordnumber,
reproductivecondition,
samplingeffort,
samplingprotocol,
sex,
startdayofyear,
stateprovince,
typestatus,
verbatimcoordinates,
verbatimcoordinatesystem,
verbatimdepth,
verbatimelevation,
verbatimeventdate,
verbatimlatitude,
verbatimlocality,
verbatimlongitude,
verbatimsrs,
waterbody,
year,
dctype,
modified,
language,
license,
rightsholder,
accessrights,
bibliographiccitation,
dumps.vertnet_latest.references as dc_references,
institutionid,
collectionid,
datasetid,
institutioncode,
collectioncode,
datasetname,
ownerinstitutioncode,
basisofrecord,
informationwithheld,
datageneralizations,
dynamicproperties,
scientificnameid,
namepublishedinid,
scientificname,
acceptednameusage,
originalnameusage,
namepublishedin,
namepublishedinyear,
higherclassification,
kingdom,
phylum,
class,
order,
family,
genus,
subgenus,
specificepithet,
infraspecificepithet,
taxonrank,
verbatimtaxonrank,
scientificnameauthorship,
vernacularname,
nomenclaturalcode,
taxonomicstatus,
keyname,
haslicense,
vntype,
rank,
mappable,
hashid,
hastypestatus,
wascaptive,
wasinvasive,
hastissue,
hasmedia,
isfossil,
haslength,
haslifestage,
hasmass,
hassex,
lengthinmm,
massing,
lengthunitsinferred,
massunitsinferred,
underivedlifestage,
underivedsex,
FROM [dumps.vertnet_latest]
WHERE lower(class) like '%mam%'
Click on the Show Options button. Click on the Select Table button. Enter vertnet_latest_mammals as the Table ID. Click on the OK button.
Select Overwrite Table. Check Allow Large Results. Check Flatten Results. Select Interactive.
Click on the Run Query button. When the job finishes, the table vertnet_latest_mammals will contain the complete records for mammals.
Create VertNet Mammals Snapshot
Prerequisites: The table vertnet_latest_mammals is populated from the desired full dump from the VertNet harvest. If not, follow the step "Create Mammals Subset Table".
In the BigQuery console, put the mouse cursor over the vertnet_latest_mammals table. An icon will appear that allows you to open a menu. Open the menu and select Export table. Select CSV. Select GZIP. Enter exactly the following for the Google Cloud Storage URI:
gs://vertnet-byclass/vertnet_latest_mammals.csv.gz*
This will generate numerous shards of the latest VertNet mammals on Google Cloud Storage. The job can be monitored in the BigQuery console under Job History. When it is finished, the shards on Google Cloud Storage must be processed to strip the header row from all but the first shard and to concatenate them all into a single file. To do so, check the script process_vertnet_latest_mammals-GCS.sh for the number of files to process, then run it on a machine that is configured with gsutil and authorized to access the VertNet Google Cloud Storage.
Create Reptiles Subset Table
This step is unnecessary if the table vertnet_latest_reptiles is already populated from the latest full dump from the VertNet harvest.
In the BigQuery console, select the table vertnet_latest as a source. Click on Query Table. The generated SQL statement will be SELECT * FROM [dumps.vertnet_latest] LIMIT 1000
. Replace it with
SELECT
"begin" as beginrecord,
icode,
title,
citation,
contact,
email,
emlrights,
gbifdatasetid,
gbifpublisherid,
doi,
migrator,
networks,
orgcountry,
orgname,
orgstateprovince,
pubdate,
source_url,
iptrecordid,
associatedmedia,
associatedoccurrences,
associatedorganisms,
associatedreferences,
associatedsequences,
associatedtaxa,
bed,
behavior,
catalognumber,
continent,
coordinateprecision,
coordinateuncertaintyinmeters,
country,
countrycode,
county,
dateidentified,
day,
decimallatitude,
decimallongitude,
disposition,
earliestageorloweststage,
earliesteonorlowesteonothem,
earliestepochorlowestseries,
earliesteraorlowesterathem,
earliestperiodorlowestsystem,
enddayofyear,
establishmentmeans,
eventdate,
eventid,
eventremarks,
eventtime,
fieldnotes,
fieldnumber,
footprintspatialfit,
footprintsrs,
footprintwkt,
formation,
geodeticdatum,
geologicalcontextid,
georeferencedby,
georeferenceddate,
georeferenceprotocol,
georeferenceremarks,
georeferencesources,
georeferenceverificationstatus,
group,
habitat,
highergeography,
highergeographyid,
highestbiostratigraphiczone,
identificationid,
identificationqualifier,
identificationreferences,
identificationremarks,
identificationverificationstatus,
identifiedby,
individualcount,
island,
islandgroup,
latestageorhigheststage,
latesteonorhighesteonothem,
latestepochorhighestseries,
latesteraorhighesterathem,
latestperiodorhighestsystem,
lifestage,
lithostratigraphicterms,
locality,
locationaccordingto,
locationid,
locationremarks,
lowestbiostratigraphiczone,
materialsampleid,
maximumdepthinmeters,
maximumdistanceabovesurfaceinmeters,
maximumelevationinmeters,
member,
minimumdepthinmeters,
minimumdistanceabovesurfaceinmeters,
minimumelevationinmeters,
month,
municipality,
occurrenceid,
occurrenceremarks,
occurrencestatus,
organismid,
organismname,
organismremarks,
organismscope,
othercatalognumbers,
pointradiusspatialfit,
preparations,
previousidentifications,
recordedby,
recordnumber,
reproductivecondition,
samplingeffort,
samplingprotocol,
sex,
startdayofyear,
stateprovince,
typestatus,
verbatimcoordinates,
verbatimcoordinatesystem,
verbatimdepth,
verbatimelevation,
verbatimeventdate,
verbatimlatitude,
verbatimlocality,
verbatimlongitude,
verbatimsrs,
waterbody,
year,
dctype,
modified,
language,
license,
rightsholder,
accessrights,
bibliographiccitation,
dumps.vertnet_latest.references as dc_references,
institutionid,
collectionid,
datasetid,
institutioncode,
collectioncode,
datasetname,
ownerinstitutioncode,
basisofrecord,
informationwithheld,
datageneralizations,
dynamicproperties,
scientificnameid,
namepublishedinid,
scientificname,
acceptednameusage,
originalnameusage,
namepublishedin,
namepublishedinyear,
higherclassification,
kingdom,
phylum,
class,
order,
family,
genus,
subgenus,
specificepithet,
infraspecificepithet,
taxonrank,
verbatimtaxonrank,
scientificnameauthorship,
vernacularname,
nomenclaturalcode,
taxonomicstatus,
keyname,
haslicense,
vntype,
rank,
mappable,
hashid,
hastypestatus,
wascaptive,
wasinvasive,
hastissue,
hasmedia,
isfossil,
haslength,
haslifestage,
hasmass,
hassex,
lengthinmm,
massing,
lengthunitsinferred,
massunitsinferred,
underivedlifestage,
underivedsex,
FROM [dumps.vertnet_latest]
WHERE lower(class) like '%rept%'
Click on the Show Options button. Click on the Select Table button. Enter vertnet_latest_reptiles as the Table ID. Click on the OK button.
Select Overwrite Table. Check Allow Large Results. Check Flatten Results. Select Interactive.
Click on the Run Query button. When the job finishes, the table vertnet_latest_reptiles will contain the complete records for reptiles.
Create VertNet Reptiles Snapshot
Prerequisites: The table vertnet_latest_reptiles is populated from the desired full dump from the VertNet harvest. If not, follow the step "Create Reptiles Subset Table".
In the BigQuery console, put the mouse cursor over the vertnet_latest_reptiles table. An icon will appear that allows you to open a menu. Open the menu and select Export table. Select CSV. Select GZIP. Enter exactly the following for the Google Cloud Storage URI:
gs://vertnet-byclass/vertnet_latest_reptiles.csv.gz*
This will generate numerous shards of the latest VertNet reptiles on Google Cloud Storage. The job can be monitored in the BigQuery console under Job History. When it is finished, the shards on Google Cloud Storage must be processed to strip the header row from all but the first shard and to concatenate them all into a single file. To do so, check the script process_vertnet_latest_reptiles-GCS.sh for the number of files to process, then run it on a machine that is configured with gsutil and authorized to access the VertNet Google Cloud Storage.