Index List - VertNet/dwc-indexer GitHub Wiki
Index Workflow Wiki: https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow
Up-to-date information about a given index can be found at:
http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=[index namespace]
For example:
http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-02-11a
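The URL pattern above can be generated for any namespace on this page. A minimal sketch, assuming only the base URL shown in the examples; the helper name `list_indexes_url` is hypothetical and not part of the indexer code:

```python
from urllib.parse import urlencode

# Base URL taken from the examples above.
LIST_INDEXES_URL = "http://indexer.vertnet-portal.appspot.com/list-indexes"


def list_indexes_url(namespace=None):
    """Return the list-indexes URL for a namespace (hypothetical helper).

    With no namespace, the bare URL is returned, as used for the
    default-namespace dwc_search index.
    """
    if not namespace:
        return LIST_INDEXES_URL
    return LIST_INDEXES_URL + "?" + urlencode({"namespace": namespace})


print(list_indexes_url("index-2014-02-11a"))
```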
Namespace: index-2013-08-08 (ACTIVE: ATOMized version of index with traits since 2016-07-26)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2013-08-08
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2013-08-08
- Name: dwc
- Date: 2014-03-26 11:26
- Storage usage: 32455111726 (~8.5M documents)
- Storage usage: 75782216627 (~19.6M documents 2016-07-26)
- Storage limit: 268435456000
- Original limit: 268435456000
- Usage: 12.1%
- Status: responsive
- 2015-05-22 Found to have documents. Ran index-clean. Removed 8196723 documents. Search after does not complete.
- 2016-07-19 No documents in index. Started adding documents to test performance of fields that could be made Atom fields to address this issue: https://github.com/VertNet/dwc-indexer/issues/20. See https://github.com/VertNet/dwc-indexer/blob/traiter-atoms/index_utils.py#L141 for the structure of the document.
- 2016-07-22 Index loading continues. Performance on "boolean" fields is consistently under about 10s. Space taken by index appears to be about 30% less than the same records without the ATOM fields.
- 2016-07-26 Indexing complete. Performance holds up well. ~13s to retrieve first 10k records matching haslifestage:1 hassex:1 hasmass:1 haslength:1 hastissue:1.
As a result, the schema for the index is now:
Schema: { u'basisofrecord': ['ATOM'], u'bed': ['TEXT'], u'catalognumber': ['TEXT'], u'class': ['TEXT', 'ATOM'], u'collectioncode': ['TEXT'], u'collectorname': ['TEXT'], u'continent': ['TEXT'], u'coordinateuncertaintyinmeters': ['NUMBER'], u'country': ['TEXT'], u'county': ['TEXT'], u'day': ['NUMBER'], u'dctype': ['ATOM'], u'enddayofyear': ['NUMBER'], u'establishmentmeans': ['TEXT'], u'eventdate': ['DATE', 'TEXT'], u'family': ['ATOM', 'TEXT'], u'fieldnumber': ['TEXT'], u'formation': ['TEXT'], u'gbifdatasetid': ['ATOM'], u'gbifpublisherid': ['ATOM'], u'genus': ['TEXT', 'ATOM'], u'geodeticdatum': ['TEXT'], u'georeferencedby': ['TEXT'], u'georeferenceverificationstatus': ['TEXT'], u'group': ['TEXT'], u'hashid': ['NUMBER'], u'haslength': ['ATOM'], u'haslicense': ['ATOM'], u'haslifestage': ['ATOM'], u'hasmass': ['ATOM'], u'hasmedia': ['ATOM'], u'hassex': ['ATOM'], u'hastissue': ['ATOM'], u'hastypestatus': ['ATOM'], u'infraspecificepithet': ['TEXT'], u'institutioncode': ['TEXT'], u'iptrecordid': ['ATOM'], u'isfossil': ['ATOM'], u'island': ['TEXT'], u'islandgroup': ['TEXT'], u'kingdom': ['ATOM', 'TEXT'], u'lastindexed': ['TEXT'], u'lengthinmm': ['NUMBER'], u'license': ['TEXT'], u'lifestage': ['TEXT'], u'locality': ['TEXT'], u'location': ['GEO_POINT'], u'mappable': ['NUMBER', 'ATOM'], u'massing': ['NUMBER'], u'media': ['NUMBER'], u'member': ['TEXT'], u'migrator': ['TEXT'], u'month': ['NUMBER'], u'municipality': ['TEXT'], u'networks': ['TEXT'], u'order': ['ATOM', 'TEXT'], u'orgcountry': ['TEXT'], u'orgstateprovince': ['TEXT'], u'phylum': ['ATOM', 'TEXT'], u'preparations': ['TEXT'], u'rank': ['NUMBER'], u'record': ['TEXT'], u'recordedby': ['TEXT'], u'recordnumber': ['TEXT'], u'reproductivecondition': ['TEXT'], u'scientificname': ['TEXT'], u'sex': ['TEXT'], u'specificepithet': ['TEXT'], u'startdayofyear': ['NUMBER'], u'stateprovince': ['TEXT'], u'tissue': ['NUMBER'], u'type': ['TEXT'], u'typestatus': ['TEXT'], u'url': ['TEXT'], u'verbatim_record': ['TEXT'], u'vernacularname': ['TEXT'], u'vntype': ['ATOM'], u'wascaptive': ['ATOM'], u'wasinvasive': ['ATOM'], u'waterbody': ['TEXT'], u'year': ['TEXT', 'NUMBER'], }
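Schema listings in this form are Python dict literals, so they can be inspected programmatically. A minimal sketch using a short excerpt of the schema above; the helper `fields_of_type` is hypothetical, and a full listing would parse the same way provided the pasted text is a valid Python literal:

```python
import ast

# Short excerpt of the schema listing above, in the same repr form.
schema_text = (
    "{u'basisofrecord': ['ATOM'], u'class': ['TEXT', 'ATOM'], "
    "u'country': ['TEXT'], u'hasmass': ['ATOM'], u'year': ['TEXT', 'NUMBER']}"
)

# literal_eval parses the literal safely without executing any code.
schema = ast.literal_eval(schema_text)


def fields_of_type(schema, field_type):
    """Return the sorted field names that include the given type (hypothetical helper)."""
    return sorted(name for name, types in schema.items() if field_type in types)


atom_fields = fields_of_type(schema, "ATOM")
# Fields that were indexed with more than one type.
multi_typed = sorted(name for name, types in schema.items() if len(types) > 1)

print(atom_fields)   # ['basisofrecord', 'class', 'hasmass']
print(multi_typed)   # ['class', 'year']
```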
Namespace: index-2014-02-05a (ACTIVE: http://amazoniabiodiversity.vertnet-portal.appspot.com/ as of 2014-12-22)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-02-05a
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2014-02-05a
- Name: dwc
- Date: 2014-03-26 11:26
- Storage usage: 25560245
- Storage limit: 10737418240
- Original limit: 10737418240
- Usage: 0.2%
- Status: responsive
- 2015-05-22. In use for Amazonian Biodiversity Portal. Removed resources up through NYBG in error. Needs repopulating.
- 2016-10-21. Storage usage 1505452858.
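The "Usage" percentages on this page appear to be storage usage divided by storage limit, both in bytes. A minimal sketch of that arithmetic, assuming this interpretation; it also shows that the two limits recorded here correspond to 10 GiB and 250 GiB:

```python
GIB = 1024 ** 3  # bytes per GiB


def usage_percent(storage_usage, storage_limit):
    """Return storage usage as a percentage of the storage limit."""
    return 100.0 * storage_usage / storage_limit


# The two storage limits that appear on this page.
print(10737418240 // GIB)    # 10
print(268435456000 // GIB)   # 250

# index-2014-02-05a as of 2014-03-26, and index-2013-08-08 as of 2014-03-26.
print(round(usage_percent(25560245, 10737418240), 1))       # 0.2
print(round(usage_percent(32455111726, 268435456000), 1))   # 12.1

# index-2014-02-05a as of 2016-10-21.
print(round(usage_percent(1505452858, 10737418240), 1))
```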
Namespace: index-2014-02-11a (INACTIVE: Production index for VN portal from 2014-12-22 to 2016-07-26)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-02-11a
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2014-02-11a
- Name: dwc
- Text search page: http://goo.gl/99mhgL
- Date: 2014-03-26 11:26
- Storage usage: 107118
- Storage limit: 268435456000
- Original limit: 10737418240
- Usage: 0% (17 records)
- Status: responsive
- 2014-10-16
- 11:20 107118L (17 records)
- 11:30 107118L (0 records)
- 2014-12-22 Search shows no records in index.
- 2015-05-22 Search shows no records in index. "No documents meet these criteria."
- 2015-05-28 Loaded dwc2015 index schema with http://indexer.vertnet-portal.appspot.com/index-gcs-path?namespace=index-2014-02-06t2&index_name=dwc&gcs_path=vertnet-harvesting/data/2015-05-22/mvz_hild-1627c464-1106-4d3c-bf3e-033b3f9d0fcc/*&shard_count=10. Index shows record. Commencing indexing for dwc2015 schema using this namespace.
Comments: Was originally a 10G index. Quota increased by Google. Was index-cleaned, then records loaded for testing and found responsive.
Schema: {u'family': ['TEXT'], u'stateprovince': ['ATOM', 'TEXT'], u'hastypestatus': ['NUMBER'], u'rank': ['NUMBER'], u'county': ['TEXT'], u'tissue': ['NUMBER'], u'year': ['TEXT'], u'specificepithet': ['TEXT'], u'media': ['NUMBER'], u'institutioncode': ['TEXT'], u'class': ['TEXT'], u'location': ['GEO_POINT'], u'collectorname': ['TEXT'], u'type': ['TEXT'], u'recordedby': ['TEXT'], u'verbatim_record': ['TEXT'], u'catalognumber': ['TEXT'], u'url': ['TEXT'], u'country': ['ATOM', 'TEXT'], u'mappable': ['NUMBER'], u'record': ['TEXT'], u'genus': ['TEXT'], u'eventdate': ['DATE']}
Namespace: index-2014-02-11 (INACTIVE: Shows as empty. Test before using.)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-02-11
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2014-02-11
- Name: dwc
- Date: 2014-03-26 11:26
- Storage usage: 32023049013
- Storage limit: 268435456000
- Original limit: 268435456000
- Status: responsive
- 2014-03-26 15:39:02.603, started index-clean to clean out the index for re-use since index-2014-03-12 failed with quota errors at around 3M records again.
- 2014-03-26 22:44:40.281 Finished index-clean on index index-2014-02-11.dwc. Removed 6254379 documents.
- 2015-05-22 Search shows no records in index. "No documents meet these criteria."
- Date: 2016-04-15 Have clean index before load.
- Storage usage: 0
- Storage limit: 268435456000
- Status: responsive
- 2016-07-26 Retains residual records (~8.5M) of an aborted index, one that was replaced by the Atomized version (index-2013-08-08) to improve performance.
- 2016-08-07 Removed 4728811 documents to clear the index. Unclear if it is functional now. Check before using.
Namespace: index-2014-03-12 (INACTIVE: Shows as empty. Test before using.)
- Was used by http://portal.vertnet.org/ up to 2014-12-22, now obsolete
- Emptied 2016-07-19 to make it reusable
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-03-12
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2014-03-12
- Name: dwc
- Text search page: http://goo.gl/w0MOK7
- Date: 2014-03-26 11:26
- Storage usage: 11639596921
- Storage limit: 268435456000
- Original limit: 10737418240
- Usage:
- Status: responsive
- 2014-03-26
- 11:00 11639596921L
- 11:26 12761579749L
- 11:48 13646012508L
- 12:59 15612212751L
- 14:20 19762280254L (document put rate quota errors)
- 15:26 21527682718L (still increasing, but indexing has failed)
- 13:56 45579236910L (nearing the end of loading index with 12M records from 2014-03-12, 2014-03-13, and 2014-03-27 harvests)
- 2014-10-17 11:29 62457510392L
- 2014-12-22 67333252006L (14,569,231 records)
- 2015-05-25 81353850720L (17,723,735 records)
Comments: First attempt to load resulted in quota overrun at 100% capacity of the 10G originally granted. Quota increased to 250G, but loading still had quota overrun for a couple of days. Once records could be loaded again without quota overrun, cleaned the 3038934 records. Redesigned index, then started loading again 25 Mar 2014 with largest data sets first. Loaded somewhere in the neighborhood of 3M records before emitting quota errors again, but these were errors based on the document inserts per minute, not the storage_quota for the index. Continued to load the index more conservatively, with no more than a couple of indexer jobs running simultaneously.
Schema: { u'catalognumber': ['TEXT'], u'class': ['TEXT'], u'continent': ['TEXT'], u'coordinateuncertaintyinmeters': ['NUMBER'], u'country': ['TEXT'], u'county': ['TEXT'], u'eventdate': ['DATE'], u'family': ['TEXT'], u'fossil': ['NUMBER'], u'genus': ['TEXT'], u'hashid': ['NUMBER'], u'hastypestatus': ['NUMBER'], u'institutioncode': ['TEXT'], u'island': ['TEXT'], u'islandgroup': ['TEXT'], u'location': ['GEO_POINT'], u'mappable': ['NUMBER'], u'media': ['NUMBER'], u'occurrenceid': ['TEXT'], u'order': ['TEXT'], u'pubdate': ['TEXT'], u'rank': ['NUMBER'], u'record': ['TEXT'], u'recordedby': ['TEXT'], u'resource': ['TEXT'], u'specificepithet': ['TEXT'], u'stateprovince': ['TEXT'], u'tissue': ['NUMBER'], u'type': ['TEXT'], u'url': ['TEXT'], u'verbatim_record': ['TEXT'], u'year': ['TEXT', 'NUMBER'], }
- Date: 2016-07-19 13:16T+02:00
- Storage usage: 80416361337
- Storage limit: 268435456000
- Original limit: 10737418240
- 2016-07-19 Emptying contents to make it reusable. Old index is definitively no longer needed.
- 2016-07-26 Index empty.
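The two readings for index-2014-03-12 that record both storage usage and record count give a rough average document size. A minimal sketch of that arithmetic; the capacity estimate at the end is an assumption that the average size stays the same, which a schema change would invalidate:

```python
# (date, storage usage in bytes, record count) as logged for index-2014-03-12.
readings = [
    ("2014-12-22", 67333252006, 14569231),
    ("2015-05-25", 81353850720, 17723735),
]


def bytes_per_record(storage_bytes, records):
    """Return the average storage used per indexed record, in bytes."""
    return storage_bytes / records


for date, storage_bytes, records in readings:
    print(date, round(bytes_per_record(storage_bytes, records)), "bytes/record")

# Rough number of records that would fit under the 250 GiB storage limit,
# assuming the most recent average size holds (an assumption, not a measurement).
STORAGE_LIMIT = 268435456000
latest_average = bytes_per_record(readings[-1][1], readings[-1][2])
estimated_capacity = int(STORAGE_LIMIT // latest_average)
print(estimated_capacity)
```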
Namespace: index-2014-02-06 (INACTIVE: Shows as empty. Test before using.)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-02-06
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2014-02-06
- Name: dwc
- Date: 2014-03-26 11:26
- Storage usage: 7791101 (emptied of documents)
- Storage limit: 268435456000
- Original limit: 268435456000
- Usage: 0%
- Status: not responsive
- 2015-05-22 Index found to have documents. Removed 1948 records from ccber. No records found after. "No documents meet these criteria."
Comments: Was 5.3% full with 14324192556L usage. index-cleaned but not responsive. Here is the final output from the cleaning run:
2014-03-21 08:07:16.902 /index-clean 200 736ms 27kb AppEngine-Google; (+http://code.google.com/appengine) module=default version=indexer 0.1.0.2 - - [21/Mar/2014:04:07:16 -0700] "POST /index-clean HTTP/1.1" 200 28408 "http://indexer.vertnet-portal.appspot.com/index-clean" "AppEngine-Google; (+http://code.google.com/appengine)" "indexer.vertnet-portal.appspot.com" ms=736 cpu_ms=86 cpm_usd=4.013175 queue_name=index-clean task_name=2572243230419474568 pending_ms=20 app_engine_release=1.9.1 instance=00c61b117cb0ac3e476edb20e488397bef46c4
I 2014-03-21 08:07:16.898 Queuing index-clean task with params {'ndeleted': 8335200, 'max_delete': u'', 'namespace': u'index-2014-02-06', 'index_name': u'dwc', 'id': u'university-of-texas-at-arlington-amphibian-and-reptile-diversity-research-center/uta-herpetology/ffefa851-4c5f-4322-a8ce-6eaa23bd7e04', 'batch_size': u''}
2014-03-21 08:07:18.031 /index-clean 200 1083ms 4kb AppEngine-Google; (+http://code.google.com/appengine) module=default version=indexer 0.1.0.2 - - [21/Mar/2014:04:07:18 -0700] "POST /index-clean HTTP/1.1" 200 4155 "http://indexer.vertnet-portal.appspot.com/index-clean" "AppEngine-Google; (+http://code.google.com/appengine)" "indexer.vertnet-portal.appspot.com" ms=1084 cpu_ms=21 cpm_usd=0.560464 queue_name=index-clean task_name=10355228373950732507 app_engine_release=1.9.1 instance=00c61b117cb0ac3e476edb20e488397bef46c4
I 2014-03-21 08:07:18.030 Finished index-clean on index index-2014-02-06.dwc. Removed 8335228 documents.
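The "Finished index-clean" message in the log output above carries the namespace, index name, and number of documents removed. A minimal sketch for pulling those values out of saved log text; the regular expression is inferred from the one message quoted here and is an assumption about the general format, and `parse_index_clean` is a hypothetical helper:

```python
import re

# Pattern inferred from the single "Finished index-clean" message quoted above.
FINISHED_RE = re.compile(
    r"Finished index-clean on index (?P<namespace>\S+)\.(?P<index_name>\w+)\. "
    r"Removed (?P<removed>\d+) documents\."
)


def parse_index_clean(log_text):
    """Return (namespace, index_name, removed_count), or None if no match."""
    match = FINISHED_RE.search(log_text)
    if match is None:
        return None
    return (
        match.group("namespace"),
        match.group("index_name"),
        int(match.group("removed")),
    )


line = (
    "I 2014-03-21 08:07:18.030 Finished index-clean on index "
    "index-2014-02-06.dwc. Removed 8335228 documents."
)
print(parse_index_clean(line))  # ('index-2014-02-06', 'dwc', 8335228)
```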
Namespace: index-2014-01-10 (INACTIVE: Shows as having one record 8352 bytes. Test before using.)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-01-10
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2014-01-10
- Name: dwc
- Date: 2014-03-26 11:26
- Storage usage: 26139374438
- Storage limit: 268435456000
- Original limit: 268435456000
- Usage: 9.7%
- Status: not responsive
- 2015-05-22 Found to have documents. Ran index-clean. Removed 17204527 documents. Search after does not complete.
- 2015-05-28 Search showed: "No documents meet these criteria." Loaded dwc2015 index schema with http://indexer.vertnet-portal.appspot.com/index-gcs-path?namespace=index000001&index_name=dwc&gcs_path=vertnet-harvesting/data/2015-05-22/harvesttest-9fbe6712-cf12-4c0f-9a73-f60967ebb485/*&shard_count=10 and the index loaded with the one record and is apparently functional.
- 2016-10-21 listing shows 8352 used for one test record.
Namespace: index-2013-08-11 (INACTIVE: small, empty)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2013-08-11
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2013-08-11
- Name: dwc
- Date: 2016-07-20 12:35
- Storage usage: 114486115
- Storage limit: 10737418240
- Original limit: 10737418240
- Usage: %
- Status: responsive
- 2016-07-20. Created by mistake when attempting to load index-2013-08-08. Has records from just two data sets. Can be cleaned and re-used. Has schema exactly as defined by the indexer when first using Atom fields for booleans and some other fields.
- 2016-07-22. Cleaned of all records.
Schema: {u'haslifestage': ['ATOM'], u'family': ['TEXT', 'ATOM'], u'wasinvasive': ['ATOM'], u'stateprovince': ['TEXT'], u'hastypestatus': ['ATOM'], u'municipality': ['TEXT'], u'lifestage': ['TEXT'], u'rank': ['NUMBER'], u'infraspecificepithet': ['TEXT'], u'county': ['TEXT'], u'networks': ['TEXT'], u'phylum': ['TEXT', 'ATOM'], u'lastindexed': ['TEXT'], u'haslength': ['ATOM'], u'georeferencedby': ['TEXT'], u'catalognumber': ['TEXT'], u'startdayofyear': ['NUMBER'], u'wascaptive': ['ATOM'], u'coordinateuncertaintyinmeters': ['NUMBER'], u'continent': ['TEXT'], u'recordedby': ['TEXT'], u'hasmass': ['ATOM'], u'specificepithet': ['TEXT'], u'vntype': ['ATOM'], u'group': ['TEXT'], u'preparations': ['TEXT'], u'basisofrecord': ['ATOM'], u'year': ['NUMBER'], u'hasmedia': ['ATOM'], u'geodeticdatum': ['TEXT'], u'orgstateprovince': ['TEXT'], u'day': ['NUMBER'], u'member': ['TEXT'], u'mappable': ['ATOM'], u'typestatus': ['TEXT'], u'location': ['GEO_POINT'], u'hastissue': ['ATOM'], u'order': ['TEXT', 'ATOM'], u'recordnumber': ['TEXT'], u'migrator': ['TEXT'], u'kingdom': ['TEXT', 'ATOM'], u'islandgroup': ['TEXT'], u'reproductivecondition': ['TEXT'], u'orgcountry': ['TEXT'], u'institutioncode': ['TEXT'], u'dctype': ['ATOM'], u'iptrecordid': ['ATOM'], u'formation': ['TEXT'], u'locality': ['TEXT'], u'gbifpublisherid': ['ATOM'], u'waterbody': ['TEXT'], u'hashid': ['NUMBER'], u'month': ['NUMBER'], u'verbatim_record': ['TEXT'], u'class': ['TEXT', 'ATOM'], u'vernacularname': ['TEXT'], u'isfossil': ['ATOM'], u'hassex': ['ATOM'], u'license': ['TEXT'], u'country': ['TEXT'], u'lengthinmm': ['NUMBER'], u'georeferenceverificationstatus': ['TEXT'], u'sex': ['TEXT'], u'collectioncode': ['TEXT'], u'bed': ['TEXT'], u'establishmentmeans': ['TEXT'], u'haslicense': ['ATOM'], u'fieldnumber': ['TEXT'], u'island': ['TEXT'], u'scientificname': ['TEXT'], u'genus': ['TEXT', 'ATOM'], u'gbifdatasetid': ['ATOM'], u'eventdate': ['TEXT'], u'enddayofyear': ['NUMBER']}
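This schema differs in a few field types from the active index-2013-08-08 schema listed earlier on this page. A minimal sketch of how two schema dicts can be compared, using short excerpts from both listings; `schema_diff` is a hypothetical helper, not part of the indexer:

```python
def schema_diff(old, new):
    """Return {field: (old_types, new_types)} for fields whose types differ."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        before = sorted(old.get(name, []))
        after = sorted(new.get(name, []))
        if before != after:
            changed[name] = (before, after)
    return changed


# Excerpt of the index-2013-08-11 schema above.
first_atom_schema = {
    "country": ["TEXT"],
    "eventdate": ["TEXT"],
    "mappable": ["ATOM"],
    "year": ["NUMBER"],
}

# Excerpt of the index-2013-08-08 schema, for the same fields.
active_schema = {
    "country": ["TEXT"],
    "eventdate": ["DATE", "TEXT"],
    "mappable": ["NUMBER", "ATOM"],
    "year": ["TEXT", "NUMBER"],
}

differences = schema_diff(first_atom_schema, active_schema)
for name, (before, after) in differences.items():
    print(name, before, "->", after)
```

Type lists are sorted before comparison so that `['TEXT', 'ATOM']` and `['ATOM', 'TEXT']` count as the same schema.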
Namespace: index-2014-02-06t (INACTIVE: small, empty)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-02-06t
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2014-02-06t
- Name: dwc
- Date: 2014-03-26 11:26
- Storage usage: 400726362 (emptied of documents)
- Storage limit: 10737418240
- Original limit: 10737418240
- Usage: 0%
- Status: responsive
- 2015-05-22 Has documents, including NYBG. Ran index-clean. Removed 148565 documents. Search after shows "No documents meet these criteria."
Namespace: index-2014-02-06t2 (INACTIVE: small, empty)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index-2014-02-06t2
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index-2014-02-06t2
- Name: dwc
- Date: 2014-03-26 11:26
- Storage usage: 22070956 (emptied of documents)
- Storage limit: 10737418240
- Original limit: 10737418240
- Usage: 0%
- Status: responsive
- 2015-05-22 Has documents. Ran index-clean. Removed 10200 documents. Search after shows "No documents meet these criteria."
- 2015-05-28 Loaded dwc2015 index schema with http://indexer.vertnet-portal.appspot.com/index-gcs-path?namespace=index-2014-02-06t2&index_name=dwc&gcs_path=vertnet-harvesting/data/2015-05-22/mvz_hild-1627c464-1106-4d3c-bf3e-033b3f9d0fcc/*&shard_count=10. Tested queries on the index and in the test portal. All functional.
- 2016-10-21 Storage usage: 3368138691. Ran index-clean.
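The index-gcs-path URL recorded in the 2015-05-28 entry above takes four query parameters: namespace, index_name, gcs_path, and shard_count. A minimal sketch that rebuilds such a URL from its parts; the helper name and its defaults are hypothetical, chosen to match the values logged on this page:

```python
from urllib.parse import urlencode

# Endpoint taken from the URLs recorded on this page.
INDEX_GCS_PATH_URL = "http://indexer.vertnet-portal.appspot.com/index-gcs-path"


def index_gcs_path_url(namespace, gcs_path, index_name="dwc", shard_count=10):
    """Build an index-gcs-path URL (hypothetical helper)."""
    params = [
        ("namespace", namespace),
        ("index_name", index_name),
        ("gcs_path", gcs_path),
        ("shard_count", shard_count),
    ]
    # Keep "/" and "*" unescaped so the result matches the URLs logged here.
    return INDEX_GCS_PATH_URL + "?" + urlencode(params, safe="/*")


url = index_gcs_path_url(
    "index-2014-02-06t2",
    "vertnet-harvesting/data/2015-05-22/"
    "mvz_hild-1627c464-1106-4d3c-bf3e-033b3f9d0fcc/*",
)
print(url)
```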
Namespace: index000001 (INACTIVE: Shows as having one record 8352 bytes. Test before using.)
- http://indexer.vertnet-portal.appspot.com/list-indexes?namespace=index000001
- https://console.developers.google.com/project/vertnet-portal/appengine/search/index/dwc?namespace=index000001
- Name: dwc
- Date: 2014-03-26 11:26
- Storage usage: 32058766453
- Storage limit: 268435456000
- Original limit: 268435456000
- Usage: 11.9%
- Status: responsive
- 2015-05-22 Has documents. Ran index-clean. Removed 8166827 documents. Search after does not complete.
- 2015-05-28 Search does not complete. Loaded dwc2015 index schema with http://indexer.vertnet-portal.appspot.com/index-gcs-path?namespace=index000001&index_name=dwc&gcs_path=vertnet-harvesting/data/2015-05-22/harvesttest-9fbe6712-cf12-4c0f-9a73-f60967ebb485/*&shard_count=10 and the index loaded with the one record and is apparently functional.
Namespace: (None) dwc_search (INACTIVE: small, empty)
- http://indexer.vertnet-portal.appspot.com/list-indexes
- https://console.cloud.google.com/appengine/search/index/dwc_search?project=vertnet-portal
- Name: dwc_search
- Date: 2014-03-26 11:26
- Storage usage: 684574297
- Storage limit: 10737418240
- Original limit: 10737418240
- Usage: 6.4%
- Status: responsive