Database Elastic Search Guideline - OpenData-tu/documentation GitHub Wiki

Version 0.2

Version | Date       | Modified by   | Summary of changes
0.1     | 2017-05-16 | Nico Tasche   | Initial version
0.2     | 2017-05-24 | Nico Tasche   | Elastic for scale
0.2a    | 2017-06-06 | Andres Ardila | Formatting
0.3     | 2017-05-24 | Nico Tasche   | Time in elasticsearch

Elastic Search Good-To-Know

This is a collection of good-to-know stuff which is specialized for the Open Data Database structure used in this project.

Search Basics

Elastic Search organizes data into indices and doctypes (document types).

To get all data of a specific doctype from one index, send:

GET /index/doctype/_search

Omit the doctype to get all data from that index:

GET /index/_search

Use _all as the index to get all data of one doctype across every index:

GET /_all/doctype/_search

Use a wildcard pattern such as index* as the index to get all data of one doctype from a subset of indices. For example, with the two indices de_dwd and de_dlr:

GET /de_*/rain/_search

This returns all the rain data from the indices de_dwd and de_dlr.
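
The four request forms above follow one pattern, which can be sketched as a small path builder. This is only an illustration: the helper name search_path is hypothetical and not part of this project, and the sketch assumes the standard _search endpoint.

```python
def search_path(index=None, doctype=None):
    """Build the URL path of a search request.

    index:   index name, wildcard pattern such as "de_*", or None for every index
    doctype: doctype name, or None for every doctype
    """
    parts = []
    if index or doctype:
        # "_all" stands in for the index when only a doctype is given.
        parts.append(index or "_all")
    if doctype:
        parts.append(doctype)
    parts.append("_search")
    return "/" + "/".join(parts)


if __name__ == "__main__":
    print(search_path("de_dwd", "rain"))   # one index, one doctype
    print(search_path("de_dwd"))           # one index, every doctype
    print(search_path(doctype="rain"))     # every index, one doctype
    print(search_path("de_*", "rain"))     # subset of indices, one doctype
```

The returned path is meant to be appended to the cluster URL and sent as a GET request.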

Mappings

Elastic Search accepts almost anything as data. If data is POSTed to an unknown index and doctype, Elastic Search creates the new index and doctype and tries to infer a mapping for the data. The inferred mapping is not always the intended one (a geo_point, for example, is never detected automatically), so it is useful to pre-create an index that defines all important datatypes.

For example, an index for dwd with the rain doctype can be created as follows:

PUT /de_dwd1
{
  "mappings": {
    "rain": {
      "properties": {
        "location": {
          "type": "geo_point"
        },
        "timestamp": {
          "type": "date"
        },
        "timestamp_data": {
          "type": "date"
        }
      }
    }
  }
}

This mapping ensures that timestamp and timestamp_data are recognized as dates and that location is a geolocation point (geo_point).
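
A document that fits this mapping could be assembled as in the following sketch. The helper name rain_document, the sample coordinates, and the amount field are assumptions made for illustration; only location, timestamp and timestamp_data come from the mapping above. The location is written in the object form with lat and lon keys, which is one of the formats a geo_point accepts.

```python
import json


def rain_document(lat, lon, timestamp, timestamp_data, **measurements):
    """Return a dict shaped like the mapping: a geo_point plus two date fields.

    Any extra keyword arguments are hypothetical measurement fields that the
    mapping does not define; Elastic Search would infer their types.
    """
    doc = {
        "location": {"lat": lat, "lon": lon},  # geo_point as lat/lon object
        "timestamp": timestamp,                # date, given as an ISO 8601 string
        "timestamp_data": timestamp_data,      # date, given as an ISO 8601 string
    }
    doc.update(measurements)
    return doc


if __name__ == "__main__":
    doc = rain_document(
        52.52, 13.40,
        "2017-07-11T12:29:35+02:00",
        "2017-07-11T12:00:00+02:00",
        amount=1.5,  # hypothetical field
    )
    # This JSON string would be the body of a POST to /de_dwd1/rain
    print(json.dumps(doc, indent=2))
```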

Time

Elastic Search stores all timestamps internally as milliseconds since the epoch, interpreted as UTC. In addition, the original string that was used to add the time is saved as well.

To work with global data, every timestamp should be added together with its timezone offset. For example, a measurement taken during German summer time has to be added as 2017-07-11T12:29:35+02:00.

Queries can use date math. For example, now-10m means the current time minus 10 minutes.
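
As a rough sketch, a search body that uses date math in a range query might be built like this. The helper name recent_query is hypothetical, and the default field name timestamp is taken from the mapping example rather than from any confirmed query used in this project.

```python
import json


def recent_query(field="timestamp", window="10m"):
    """Build a search body matching documents whose `field` lies within the
    last `window` (a date math duration such as "10m", "1h" or "7d")."""
    return {
        "query": {
            "range": {
                field: {
                    "gte": "now-" + window,  # date math: now minus the window
                    "lte": "now",
                }
            }
        }
    }


if __name__ == "__main__":
    # Would be sent as the body of e.g. GET /de_*/rain/_search
    print(json.dumps(recent_query(), indent=2))
```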

"now" is always UTC.

Pure dates like 2017-07-11 are internally saved with the midnight timestamp in UTC, 2017-07-11T00:00:00.000.
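
The conversion described in this section can be reproduced outside Elastic Search to check what a given input string will be stored as. The sketch below uses only the Python standard library and mirrors the rules stated above (offsets are honored, inputs without an offset are treated as UTC). It is not Elastic Search's own parser, and the helper name to_epoch_millis is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text):
    """Convert an ISO 8601 date or date-time string to milliseconds since
    the epoch, treating input without an offset as UTC."""
    dt = datetime.fromisoformat(text)      # a pure date parses as midnight
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


if __name__ == "__main__":
    # The same instant, written with and without the summer-time offset.
    print(to_epoch_millis("2017-07-11T12:29:35+02:00"))
    print(to_epoch_millis("2017-07-11T10:29:35+00:00"))
    # A pure date is stored as midnight UTC of that day.
    print(to_epoch_millis("2017-07-11"))
```

The first two calls print the same number, which shows why the offset has to be part of the input: without it, 12:29:35 local time would be stored two hours late.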

Elastic Architecture to Scale

To make Elastic Search scale without a fixed upper limit, there are several approaches:

  1. Do not allow indexes of unbounded size, because they do not scale
  2. Adjust the number of shards per data source; this may require a way to set the number, for example via a user interface
  3. Use replicas to improve search/read performance
  4. Create multiple clusters if necessary to avoid network problems
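
Points 2 and 3 map onto two index settings, number_of_shards and number_of_replicas, which can be passed when an index is created. The following is a minimal sketch of such a request body. The helper name index_settings and the example values are assumptions, not recommendations from this project.

```python
import json


def index_settings(primary_shards, replicas):
    """Build the settings part of an index-creation body.

    primary_shards: number of primary shards, chosen per data source
    replicas:       number of replica copies of each primary shard
    """
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("the number of replicas cannot be negative")
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }


if __name__ == "__main__":
    # Example values only; would be sent as the body of PUT /<index name>
    print(json.dumps(index_settings(3, 1), indent=2))
```

The number of primary shards is set when the index is created, while the number of replicas can still be changed afterwards, which is one reason to decide the shard count per data source up front.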

Allow No Infinity-sized Indexes

For more on why it is a bad idea to use big indexes see: Elastic Search: Design for Scale

Basically, you have to make sure the index size is finite. The best way of doing that is to limit the timeframe for which an index holds data. To hold data from the German Weather Service (DWD), it might be a good idea to separate the indices by month, or even by day. For example:

index name: dwd_2017_03

This index would contain all the data for March 2017. If the number of sensors for one data provider increases, it is still possible to increase the number of primary shards for the next month's index, or to make the indices more fine-grained by separating by day as well:

index name: dwd_2017_03_21
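
The naming scheme above can be derived from the date of each measurement, so that an importer always writes to the right index. A minimal sketch, assuming a hypothetical helper index_name and the underscore-separated format shown in the two examples:

```python
from datetime import date


def index_name(provider, day, by_day=False):
    """Return the time-based index name for a data provider.

    provider: short provider prefix, e.g. "dwd"
    day:      date of the measurement
    by_day:   False for one index per month, True for one index per day
    """
    suffix = day.strftime("%Y_%m_%d" if by_day else "%Y_%m")
    return provider + "_" + suffix


if __name__ == "__main__":
    measured = date(2017, 3, 21)
    print(index_name("dwd", measured))               # monthly index
    print(index_name("dwd", measured, by_day=True))  # daily index
```

Because the names share a common prefix, a wildcard such as dwd_2017_* can still be used as the index in a search to query several of these indices at once.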