Index List - VertNet/dwc-indexer GitHub Wiki

Index Workflow Wiki: https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow

Up-to-date information about a given index can be found at:

http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=[index namespace]

For example:

http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-02-11a
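As a convenience, the URL pattern above can be built programmatically. The following is a minimal sketch, assuming Python 3; the helper name `list_indexes_url` is hypothetical and not part of dwc-indexer. It only constructs the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Base endpoint taken from the URL pattern shown above.
LIST_INDEXES_URL = "http://indexer.vertnet-portal.appspot.com/list-indexes"


def list_indexes_url(namespace):
    """Return the list-indexes URL for one index namespace (hypothetical helper)."""
    return LIST_INDEXES_URL + "?" + urlencode({"namespace": namespace})


print(list_indexes_url("index-2014-02-11a"))
```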

Namespace: index-2013-08-08 (ACTIVE: ATOMized version of index with traits since 2016-07-26)

As a result, the schema for the index is now:

Schema: { u'basisofrecord': ['ATOM'], u'bed': ['TEXT'], u'catalognumber': ['TEXT'], u'class': ['TEXT', 'ATOM'], u'collectioncode': ['TEXT'], u'collectorname': ['TEXT'], u'continent': ['TEXT'], u'coordinateuncertaintyinmeters': ['NUMBER'], u'country': ['TEXT'], u'county': ['TEXT'], u'day': ['NUMBER'], u'dctype': ['ATOM'], u'enddayofyear': ['NUMBER'], u'establishmentmeans': ['TEXT'], u'eventdate': ['DATE', 'TEXT'], u'family': ['ATOM', 'TEXT'], u'fieldnumber': ['TEXT'], u'formation': ['TEXT'], u'gbifdatasetid': ['ATOM'], u'gbifpublisherid': ['ATOM'], u'genus': ['TEXT', 'ATOM'], u'geodeticdatum': ['TEXT'], u'georeferencedby': ['TEXT'], u'georeferenceverificationstatus': ['TEXT'], u'group': ['TEXT'], u'hashid': ['NUMBER'], u'haslength': ['ATOM'], u'haslicense': ['ATOM'], u'haslifestage': ['ATOM'], u'hasmass': ['ATOM'], u'hasmedia': ['ATOM'], u'hassex': ['ATOM'], u'hastissue': ['ATOM'], u'hastypestatus': ['ATOM'], u'infraspecificepithet': ['TEXT'], u'institutioncode': ['TEXT'], u'iptrecordid': ['ATOM'], u'isfossil': ['ATOM'], u'island': ['TEXT'], u'islandgroup': ['TEXT'], u'kingdom': ['ATOM', 'TEXT'], u'lastindexed': ['TEXT'], u'lengthinmm': ['NUMBER'], u'license': ['TEXT'], u'lifestage': ['TEXT'], u'locality': ['TEXT'], u'location': ['GEO_POINT'], u'mappable': ['NUMBER', 'ATOM'], u'massing': ['NUMBER'], u'media': ['NUMBER'], u'member': ['TEXT'], u'migrator': ['TEXT'], u'month': ['NUMBER'], u'municipality': ['TEXT'], u'networks': ['TEXT'], u'order': ['ATOM', 'TEXT'], u'orgcountry': ['TEXT'], u'orgstateprovince': ['TEXT'], u'phylum': ['ATOM', 'TEXT'], u'preparations': ['TEXT'], u'rank': ['NUMBER'], u'record': ['TEXT'], u'recordedby': ['TEXT'], u'recordnumber': ['TEXT'], u'reproductivecondition': ['TEXT'], u'scientificname': ['TEXT'], u'sex': ['TEXT'], u'specificepithet': ['TEXT'], u'startdayofyear': ['NUMBER'], u'stateprovince': ['TEXT'], u'tissue': ['NUMBER'], u'type': ['TEXT'], u'typestatus': ['TEXT'], u'url': ['TEXT'], u'verbatim_record': ['TEXT'], u'vernacularname': ['TEXT'], u'vntype': ['ATOM'], u'wascaptive': ['ATOM'], u'wasinvasive': ['ATOM'], u'waterbody': ['TEXT'], u'year': ['TEXT', 'NUMBER'], }
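The schema listings on this page have the form of Python dict literals, so they can be inspected with the standard library. The following is a minimal sketch, assuming Python 3 and using only a short excerpt of the index-2013-08-08 schema; the helper name `fields_of_type` is hypothetical. A full listing should parse the same way as long as every entry is separated by a comma.

```python
import ast

# Excerpt of the index-2013-08-08 schema, copied as text from this page.
schema_text = (
    "{ u'basisofrecord': ['ATOM'], u'class': ['TEXT', 'ATOM'], "
    "u'country': ['TEXT'], u'year': ['TEXT', 'NUMBER'], }"
)

# literal_eval evaluates literals only, so no code in the text is executed.
schema = ast.literal_eval(schema_text)


def fields_of_type(schema, field_type):
    """Return the sorted field names whose type list includes field_type."""
    return sorted(name for name, types in schema.items() if field_type in types)


print(fields_of_type(schema, "ATOM"))
```

Running the same helper with "TEXT" or "NUMBER" lists the fields indexed under those types.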

Namespace: index-2014-02-05a (ACTIVE: http://amazoniabiodiversity.vertnet-portal.appspot.com/ as of 2014-12-22)

Namespace: index-2014-02-11a (INACTIVE: Production index for VN portal from 2014-12-22 to 2016-07-26)

Comments: Was originally a 10G index. Quota increased by Google. Was index-cleaned, then records loaded for testing and found responsive.

Schema: {u'family': ['TEXT'], u'stateprovince': ['ATOM', 'TEXT'], u'hastypestatus': ['NUMBER'], u'rank': ['NUMBER'], u'county': ['TEXT'], u'tissue': ['NUMBER'], u'year': ['TEXT'], u'specificepithet': ['TEXT'], u'media': ['NUMBER'], u'institutioncode': ['TEXT'], u'class': ['TEXT'], u'location': ['GEO_POINT'], u'collectorname': ['TEXT'], u'type': ['TEXT'], u'recordedby': ['TEXT'], u'verbatim_record': ['TEXT'], u'catalognumber': ['TEXT'], u'url': ['TEXT'], u'country': ['ATOM', 'TEXT'], u'mappable': ['NUMBER'], u'record': ['TEXT'], u'genus': ['TEXT'], u'eventdate': ['DATE']}
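To see how this schema differs from the ATOMized index-2013-08-08 schema listed earlier, the two dicts can be compared field by field. The following is a minimal sketch, assuming Python 3; it uses a four-field excerpt of each schema, and `changed_fields` is a hypothetical helper name.

```python
def changed_fields(old, new):
    """Map each field found in both schemas to (old_types, new_types) when they differ."""
    return {
        name: (sorted(old[name]), sorted(new[name]))
        for name in sorted(old.keys() & new.keys())
        if set(old[name]) != set(new[name])
    }


# Excerpt of index-2014-02-11a (old) and index-2013-08-08 (new).
old = {"family": ["TEXT"], "hastypestatus": ["NUMBER"], "rank": ["NUMBER"], "year": ["TEXT"]}
new = {"family": ["ATOM", "TEXT"], "hastypestatus": ["ATOM"], "rank": ["NUMBER"], "year": ["TEXT", "NUMBER"]}

for name, (before, after) in changed_fields(old, new).items():
    print(name, before, "->", after)
```

Fields that exist in only one of the two schemas are ignored by this comparison; only shared fields with different type lists are reported.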

Namespace: index-2014-02-11 (INACTIVE: Shows as empty. Test before using.)

  • http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-02-11
  • https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2014-02-11
  • Name: dwc
  • Date: 2014-03-26 11:26
  • Storage usage: 32023049013
  • Storage limit: 268435456000
  • Original limit: 268435456000
  • Status: responsive
  • 2014-03-26 15:39:02.603, started index-clean to clean out the index for re-use since index-2014-03-12 failed with quota errors at around 3M records again.
  • 2014-03-26 22:44:40.281 Finished index-clean on index index-2014-02-11.dwc. Removed 6254379 documents.
  • 2015-05-22 Search shows no records in index. "No documents meet these criteria."
  • Date: 2016-04-15 Have clean index before load.
  • Storage usage: 0
  • Storage limit: 268435456000
  • Status: responsive
  • 2016-07-26 Retains residual records (~8.5M) of an aborted index, one that was replaced by the Atomized version (index-2013-08-08) to improve performance.
  • 2016-08-07 Removed 4728811 documents to clear the index. Unclear if it is functional now. Check before using.

Namespace: index-2014-03-12 (INACTIVE: Shows as empty. Test before using.)

Comments: First attempt to load resulted in quota overrun at 100% capacity of the 10G originally granted. Quota was increased to 250G, but loading still hit quota overruns for a couple of days. Once records could be loaded again without quota overrun, cleaned the 3038934 records. Redesigned index, then started loading again 25 Mar 2014 with largest data sets first. Loaded somewhere in the neighborhood of 3M records before emitting quota errors again, but these were errors based on the document inserts per minute, not the storage_quota for the index. Continued to load the index more conservatively, with no more than a couple of indexer jobs running simultaneously.
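The notes above describe staying under the inserts-per-minute quota by limiting the number of simultaneous indexer jobs. The sketch below illustrates a different, simpler technique for the same goal: pacing batches inside a single loader. It assumes Python 3; `paced_batches`, the batch size, and the per-minute limit are all hypothetical and are not taken from dwc-indexer or from any documented quota value.

```python
import time


def paced_batches(documents, batch_size, max_docs_per_minute, sleep=time.sleep):
    """Yield batches, pausing between them to stay under max_docs_per_minute."""
    # Time one batch is allowed to "occupy" at the target rate.
    seconds_per_batch = 60.0 * batch_size / max_docs_per_minute
    for start in range(0, len(documents), batch_size):
        if start > 0:
            sleep(seconds_per_batch)  # no pause before the first batch
        yield documents[start:start + batch_size]


# Example with a recording stand-in for sleep, so nothing actually waits.
pauses = []
for batch in paced_batches(list(range(5)), batch_size=2,
                           max_docs_per_minute=120, sleep=pauses.append):
    print(batch)
```

Passing the sleep function as a parameter keeps the pacing logic testable without real delays; a real loader would leave the default `time.sleep` in place.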

  • Schema: { u'catalognumber': ['TEXT'], u'class': ['TEXT'], u'continent': ['TEXT'], u'coordinateuncertaintyinmeters': ['NUMBER'], u'country': ['TEXT'], u'county': ['TEXT'], u'eventdate': ['DATE'], u'family': ['TEXT'], u'fossil': ['NUMBER'], u'genus': ['TEXT'], u'hashid': ['NUMBER'], u'hastypestatus': ['NUMBER'], u'institutioncode': ['TEXT'], u'island': ['TEXT'], u'islandgroup': ['TEXT'], u'location': ['GEO_POINT'], u'mappable': ['NUMBER'], u'media': ['NUMBER'], u'occurrenceid': ['TEXT'], u'order': ['TEXT'], u'pubdate': ['TEXT'], u'rank': ['NUMBER'], u'record': ['TEXT'], u'recordedby': ['TEXT'], u'resource': ['TEXT'], u'specificepithet': ['TEXT'], u'stateprovince': ['TEXT'], u'tissue': ['NUMBER'], u'type': ['TEXT'], u'url': ['TEXT'], u'verbatim_record': ['TEXT'], u'year': ['TEXT', 'NUMBER'], }

  • Date: 2016-07-19 13:16T+02:00

  • Storage usage: 80416361337

  • Storage limit: 268435456000

  • Original limit: 10737418240

  • Emptying contents on 2016-07-19 to make it reusable. Old index is definitively no longer needed.

  • 2016-07-26 Index empty.
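The storage figures in these entries are easier to read as binary gigabytes and as a percentage of the limit. The following is a minimal sketch, assuming Python 3 and assuming the reported usage and limit values are byte counts (the page does not state the unit); the helper names are hypothetical.

```python
GIB = 2 ** 30  # bytes per binary gigabyte


def to_gib(n_bytes):
    """Convert a byte count to binary gigabytes."""
    return n_bytes / GIB


def percent_used(usage, limit):
    """Return storage usage as a percentage of the storage limit."""
    return 100.0 * usage / limit


# Values copied from the index-2014-03-12 entry above.
print(to_gib(10737418240))    # original limit
print(to_gib(268435456000))   # current limit
print(round(percent_used(80416361337, 268435456000), 1))
```

Under that byte assumption, the two limits correspond to the 10G and 250G quotas mentioned in the comments.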

Namespace: index-2014-02-06 (INACTIVE: Shows as empty. Test before using.)

Comments: Was 5.3% full with 14324192556L usage. index-cleaned but not responsive. Here is the final output from the cleaning run:

  • 2014-03-21 08:07:16.902 /index-clean 200 736ms 27kb AppEngine-Google; (+http://code.google.com/appengine) module=default version=indexer 0.1.0.2 - - [21/Mar/2014:04:07:16 -0700] "POST /index-clean HTTP/1.1" 200 28408 "http://indexer.vertnet-portal.appspot.com/index-clean" "AppEngine-Google; (+http://code.google.com/appengine)" "indexer.vertnet-portal.appspot.com" ms=736 cpu_ms=86 cpm_usd=4.013175 queue_name=index-clean task_name=2572243230419474568 pending_ms=20 app_engine_release=1.9.1 instance=00c61b117cb0ac3e476edb20e488397bef46c4
  • I 2014-03-21 08:07:16.898 Queuing index-clean task with params {'ndeleted': 8335200, 'max_delete': u'', 'namespace': u'index-2014-02-06', 'index_name': u'dwc', 'id': u'university-of-texas-at-arlington-amphibian-and-reptile-diversity-research-center/uta-herpetology/ffefa851-4c5f-4322-a8ce-6eaa23bd7e04', 'batch_size': u''}
  • 2014-03-21 08:07:18.031 /index-clean 200 1083ms 4kb AppEngine-Google; (+http://code.google.com/appengine) module=default version=indexer 0.1.0.2 - - [21/Mar/2014:04:07:18 -0700] "POST /index-clean HTTP/1.1" 200 4155 "http://indexer.vertnet-portal.appspot.com/index-clean" "AppEngine-Google; (+http://code.google.com/appengine)" "indexer.vertnet-portal.appspot.com" ms=1084 cpu_ms=21 cpm_usd=0.560464 queue_name=index-clean task_name=10355228373950732507 app_engine_release=1.9.1 instance=00c61b117cb0ac3e476edb20e488397bef46c4
  • I 2014-03-21 08:07:18.030 Finished index-clean on index index-2014-02-06.dwc. Removed 8335228 documents.
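When checking many cleaning runs, the final "Finished index-clean" log line can be parsed to pull out the index name and the number of documents removed. The following is a minimal sketch, assuming Python 3 and assuming the line format matches the examples on this page; `parse_finished` is a hypothetical helper, not part of the indexer.

```python
import re

# Pattern based on the "Finished index-clean" lines quoted on this page.
FINISHED = re.compile(
    r"Finished index-clean on index (?P<index>\S+)\. "
    r"Removed (?P<removed>\d+) documents\."
)


def parse_finished(line):
    """Return (index_name, documents_removed), or None if the line does not match."""
    match = FINISHED.search(line)
    if match is None:
        return None
    return match.group("index"), int(match.group("removed"))


line = ("I 2014-03-21 08:07:18.030 Finished index-clean on index "
        "index-2014-02-06.dwc. Removed 8335228 documents.")
print(parse_finished(line))
```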

Namespace: index-2014-01-10 (INACTIVE: Shows as having one record of 8352 bytes. Test before using.)

Namespace: index-2013-08-11 (INACTIVE: small, empty)

Schema: {u'haslifestage': ['ATOM'], u'family': ['TEXT', 'ATOM'], u'wasinvasive': ['ATOM'], u'stateprovince': ['TEXT'], u'hastypestatus': ['ATOM'], u'municipality': ['TEXT'], u'lifestage': ['TEXT'], u'rank': ['NUMBER'], u'infraspecificepithet': ['TEXT'], u'county': ['TEXT'], u'networks': ['TEXT'], u'phylum': ['TEXT', 'ATOM'], u'lastindexed': ['TEXT'], u'haslength': ['ATOM'], u'georeferencedby': ['TEXT'], u'catalognumber': ['TEXT'], u'startdayofyear': ['NUMBER'], u'wascaptive': ['ATOM'], u'coordinateuncertaintyinmeters': ['NUMBER'], u'continent': ['TEXT'], u'recordedby': ['TEXT'], u'hasmass': ['ATOM'], u'specificepithet': ['TEXT'], u'vntype': ['ATOM'], u'group': ['TEXT'], u'preparations': ['TEXT'], u'basisofrecord': ['ATOM'], u'year': ['NUMBER'], u'hasmedia': ['ATOM'], u'geodeticdatum': ['TEXT'], u'orgstateprovince': ['TEXT'], u'day': ['NUMBER'], u'member': ['TEXT'], u'mappable': ['ATOM'], u'typestatus': ['TEXT'], u'location': ['GEO_POINT'], u'hastissue': ['ATOM'], u'order': ['TEXT', 'ATOM'], u'recordnumber': ['TEXT'], u'migrator': ['TEXT'], u'kingdom': ['TEXT', 'ATOM'], u'islandgroup': ['TEXT'], u'reproductivecondition': ['TEXT'], u'orgcountry': ['TEXT'], u'institutioncode': ['TEXT'], u'dctype': ['ATOM'], u'iptrecordid': ['ATOM'], u'formation': ['TEXT'], u'locality': ['TEXT'], u'gbifpublisherid': ['ATOM'], u'waterbody': ['TEXT'], u'hashid': ['NUMBER'], u'month': ['NUMBER'], u'verbatim_record': ['TEXT'], u'class': ['TEXT', 'ATOM'], u'vernacularname': ['TEXT'], u'isfossil': ['ATOM'], u'hassex': ['ATOM'], u'license': ['TEXT'], u'country': ['TEXT'], u'lengthinmm': ['NUMBER'], u'georeferenceverificationstatus': ['TEXT'], u'sex': ['TEXT'], u'collectioncode': ['TEXT'], u'bed': ['TEXT'], u'establishmentmeans': ['TEXT'], u'haslicense': ['ATOM'], u'fieldnumber': ['TEXT'], u'island': ['TEXT'], u'scientificname': ['TEXT'], u'genus': ['TEXT', 'ATOM'], u'gbifdatasetid': ['ATOM'], u'eventdate': ['TEXT'], u'enddayofyear': ['NUMBER']}

Namespace: index-2014-02-06t (INACTIVE: small, empty)

Namespace: index-2014-02-06t2 (INACTIVE: small, empty)

Namespace: index000001 (INACTIVE: Shows as having one record of 8352 bytes. Test before using.)

Namespace: (None) dwc_search (INACTIVE: small, empty)