Elasticsearch - kamialie/knowledge_corner GitHub Wiki

Contents

Overview

Elasticsearch is a Java application that started off as a scalable Lucene engine. Each shard is an inverted index of documents (a self-contained Lucene index).

Configuration settings can be found:

  • deb - /etc/default/elasticsearch
  • rpm - /etc/sysconfig/elasticsearch

Now used not only for searching data, but also for:

  • metrics (average, stats, min/max, percentile)
  • buckets (histograms, ranges, distances, significant terms)
  • pipelines (moving average, average bucket, cumulative sum)
  • matrix (matrix stats)

Terminology

A document is the thing the user is searching for. It can be plain text or any structured JSON data. Every document has an ID and a type.

| DB | Elastic |
|---|---|
| Tables | Indexes |
| Rows | Documents |
| Columns | Fields |

| Term | Meaning |
|---|---|
| Index (noun) | Similar to a database in a relational database |
| Index (verb) | To store a document in an index (noun); similar to INSERT, except repeating the action replaces the old document |
| Inverted index | Similar to an index added by a relational database (f.e. a B-tree index); Elasticsearch and Lucene use an inverted index to improve the speed of data retrieval |

Mapping

Mapping is a schema definition (ES has reasonable defaults, but sometimes they need to be customized). It defines how JSON documents are stored and how the metadata is structured. Mapping can be explicit (fields and types are predefined) or dynamic (automatically inferred by ES).

Mapping has a "safety zone" that can automatically cast data to the explicitly stated type, f.e. string to integer. If ES can't cast mismatching field types, an exception is thrown. The index.mapping.ignore_malformed index setting can be set to true to ignore this kind of error; mismatched fields are then marked as ignored. It can't handle a JSON object (an error is still thrown).

  • field types - string, byte, short, integer, float, long, double, boolean, date

     "properties": {
     	"user_id": {
     		"type": "long"
     	}
     }
  • field index - whether the field should be indexed for full-text search - analyzed / not_analyzed / no

     "properties": {
     	"genre": {
     		"index": "not_analyzed"
     	}
     }
  • field analyzer - defines the tokenizer and token filters - standard / whitespace / simple / english / etc

     "properties": {
     	"description": {
     		"analyzer": "english"
     	}
     }

Mapping explosion happens when there is a large or unknown number of inner fields. By default, ES creates a mapping (type) for each field. The flattened data type combines an object with inner fields into one field of type flattened. Adding a new field (f.e. through an update) then doesn't change the mapping. The limitation is that inner fields are treated as keywords (no analyzers or tokenizers can be applied), thus they can be searched only via direct match. To search for exact inner fields use dot notation.

Index can be closed and opened (PUT request with _close or _open suffix) to allow some mapping changes.

Default mapping maximum number of fields is 1000. Can be changed with the index.mapping.total_fields.limit setting.

Elasticsearch 7

  • concept of document type is deprecated
  • Elasticsearch SQL is production ready
  • changes in configuration defaults
  • Lucene 8
  • several X-Pack plugins now included in ES
  • bundled Java runtime
  • cross-cluster replication
  • Index Lifecycle Management (ILM)
  • High-level REST client in Java (HLRC)
  • and more

Search engine

  • efficient data indexing
  • analyzing data

Data analysis

  • character filtering
  • tokenization
  • token filtering

Understanding patterns, such as URLs, emails, hashtags, currencies, etc.

F.e. <h1>The quick brown fox jumped over the lazy dog</h1> would be reformatted to [quick] [brown] [fox] [jump] [lazy] [dog]. The applied steps would be: HTML char filter, standard tokenizer, lowercase token filter, stop-word token filter and snowball token filter.
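A minimal sketch of such an analysis chain in Python (the regexes, stop list and "ed"-stripping below are illustrative stand-ins, not Elasticsearch's actual filters):

```python
import re

STOPWORDS = {"the", "a", "an", "over"}

def analyze(text):
    # character filter: strip HTML tags
    text = re.sub(r"<[^>]+>", "", text)
    # tokenizer: split on word boundaries (letters only here)
    tokens = re.findall(r"[A-Za-z]+", text)
    # token filters: lowercase, drop stopwords, naive stemming
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [t[:-2] if t.endswith("ed") else t for t in tokens]
    return tokens

print(analyze("<h1>The quick brown fox jumped over the lazy dog</h1>"))
```

Running this reproduces the token stream from the example above.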

Character filtering

Character filters clean up text before tokenization, f.e. remove HTML tags or replace & with and.

Tokenization

Splitting text into words (understanding word boundaries) or individual terms. Simple version is splitting based on white spaces and punctuation.

Token Filtering

Token filters can change terms (f.e. lowercase or remove plurals), remove terms (f.e. stopwords such as a and the) or add terms (f.e. add synonym jump to leap).

  • stemming
  • casting to lowercase
  • adding synonyms
  • stopwords

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, f.e. removing conjugations, plurals, etc. Ways to find the stem:

  • lookup tables
  • suffix-stripping
  • lemmatization
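A toy suffix-stripping stemmer (purely illustrative; real analyzers use Porter/Snowball rule sets or lemmatization, which would turn "ponies" into "pony" rather than "pon"):

```python
# naive suffix-stripping stemmer -- a toy illustration, not Porter/Snowball
SUFFIXES = ["ing", "ed", "ies", "es", "s"]

def stem(word):
    # strip the first matching suffix, keeping at least 3 characters of stem
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["jumping", "jumped", "jumps", "ponies"]])
```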

Search

The text type allows partial matching and the use of analyzers, while the keyword type allows only exact matches.

Examples

  • partial match

    GET /${index name}/_search

     {
     	"query": {
     		"match": {
     			"title": "Star Trek"
     		}
     	}
     }

Data storage

Inverted indexing

Data structure that stands at the core of any search engine, similar to the physical index of a book.

2 main components:

  • term dictionary - sorted list of terms that occur in a given field across a set of documents
  • postings list - the list of documents that contain the given term
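Both components can be sketched in a few lines of Python (toy in-memory data, not Lucene's on-disk format):

```python
from collections import defaultdict

docs = {
    1: "star wars",
    2: "star trek",
    3: "the dark knight",
}

# term dictionary -> postings list (doc ids that contain the term)
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        index[term].append(doc_id)

print(sorted(index))   # the sorted term dictionary
print(index["star"])   # the postings list for "star"
```

A search for "star" then becomes a single dictionary lookup instead of a scan over all documents.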

Parent child relationship

To allow such relationship an index should have appropriate mappings. The following mapping sets franchise as parent and film as child.

{
	"mappings": {
		"properties": {
			"film_to_franchise": {
				"type": "join",
				"relations": {"franchise": "film"}
			}
		}
	}
}

Each document now has to set corresponding properties. Examples below set parent and child.

{ "create" : { "_index" : "series", "_id" : "1", "routing" : 1} }
{ "id": "1", "film_to_franchise": {"name": "franchise"}, "title" : "Star Wars" }
{ "create" : { "_index" : "series", "_id" : "260", "routing" : 1} }
{ "id": "260", "film_to_franchise": {"name": "film", "parent": "1"}, "title" : "Star    Wars: Episode IV - A New Hope", "year":"1977" , "genre":["Action", "Adventure", "Sci-Fi"] }

Example search query for retrieving all children of specified parent

{
	"query": {
		"has_parent": {
			"parent_type": "franchise",
			"query": {
				"match": {
					"title": "Star Wars"
				}
			}
		}
	}
}

Search query to find the parent of specified child

{
	"query": {
		"has_child": {
			"type": "film",
			"query": {
				"match": {
					"title": "the force awakenes"
				}
			}
		}
	}
}

Relevance scoring

TF/IDF, term frequency - inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Used in Elasticsearch prior to 5.0.

  • the more often a term appears in a field (given document), the more relevant it is (term frequency)
  • the more often a term appears in the inverted index (all documents), the less relevant it is (inverse document frequency)
  • the longer the field, the less relevant each term is (field length norm)
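The first two factors can be sketched with the classic textbook TF-IDF formula (Lucene's practical scoring adds field-length normalization and boosts on top of this):

```python
import math

docs = [
    ["star", "wars", "star"],
    ["star", "trek"],
    ["dark", "knight"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in docs if term in d)    # number of documents with the term
    idf = math.log(len(docs) / df)            # inverse document frequency
    return tf * idf

# "star" appears twice in doc 0 but also in doc 1, so idf dampens its score;
# the rarer "wars" ends up more relevant despite a lower term frequency
print(tf_idf("star", docs[0], docs))
print(tf_idf("wars", docs[0], docs))
```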

BM25 is the default since Elasticsearch 5.0.

Routing

All data lives in primary shards. Any particular document is located in exactly one of the shards. Routing is the process of determining which shard a document will reside in: shard = hash(routing) % number_of_primary_shards. All document APIs (get, index, delete, bulk, update) accept a routing parameter. A particular field can also be set as the routing parameter.
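A sketch of the routing formula (Elasticsearch actually hashes the routing value with murmur3; crc32 below is only a stand-in to keep the example self-contained):

```python
import zlib

def shard_for(routing, number_of_primary_shards):
    # shard = hash(routing) % number_of_primary_shards
    return zlib.crc32(routing.encode()) % number_of_primary_shards

# the same routing value always maps to the same shard
for doc_id in ["1", "2", "42"]:
    print(doc_id, "->", shard_for(doc_id, 3))
```

Because the shard count is part of the modulo, changing it would remap every document, which is why it is fixed at index creation.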

This is the reason why the number of shards cannot be updated dynamically. However, replica shards can be added later on.

Write requests are routed to the primary shard, then replicated. Read requests are routed to primary or replica shards.

Replication

In a multi-node environment the client sends a request to, f.e., node1, which determines from the document's id which shard the document belongs to. It then forwards the request to that primary shard (which may be located on another node) and the shard executes the request. On success the primary shard sends the same request to its replicas (which may be located on other nodes) and waits for their success status. In the end success is reported to the client.

  • sync (default) - the primary shard waits for successful responses from the replica shards before returning its response
  • async - the primary shard returns success after its own execution; replica requests are still sent, but result will not be known

Requests

The ES server interacts with clients via REST APIs. Append ?pretty to any request to get a nicely formatted response.

# VERB - HTTP method, GET, POST, PUT, etc
# PROTOCOL - http or https
# HOST - the hostname of any node in the cluster or localhost for a node on
# local machine
# PORT - default 9200, can be set to different
# QUERY_STRING - any optional query-string parameters, f.e. ?pretty will
# pretty print the JSON
# BODY - a JSON-encoded request body (if needed)
$ curl -H "Content-Type: application/json" -X VERB 'PROTOCOL://HOST:PORT/PATH?QUERY_STRING' -d 'BODY'

Filters ask a yes/no question of your data; they are fast and cacheable. Queries return data ranked by relevance. Filters can be combined inside queries and vice versa.

Filters

Wrapped in a "filter": {} block.

Some filter types:

| Description | Example |
|---|---|
| filter by exact values | `{"term": {"year": 2014}}` |
| match if any values in a list match | `{"terms": {"genre": ["Sci-Fi", "Adventure"]}}` |
| find numbers or dates in a given range | `{"range": {"year": {"gte": 2010}}}` |
| find documents where field exists | `{"exists": {"field": "tags"}}` |
| find documents where field is missing | `{"missing": {"field": "tags"}}` |

Combine filters with Boolean logic (must, must_not, should, where should is the equivalent of or).

{
	"query": {
		"bool": {
			"must": {"match": {"genre": "Sci-Fi"}},
			"must_not": {"match": {"title": "trek"}},
			"filter": {"range": {"year": {"gte": 2015, "lt": 2020"}}}
		}
	}
}

Queries

Wrapped in a "query": {} block.

Some types of queries:

| Description | Example |
|---|---|
| returns all documents; the default, usually used with a filter | `{"match_all": {}}` |
| searches analyzed results, f.e. full-text search; multiple terms are treated independently, either can hit | `{"match": {"title": "star"}}` |
| runs the same query on multiple fields | `{"multi_match": {"query": "star", "fields": ["title", "synopsis"]}}` |
| finds all terms in the given order (phrase matching) | `{"match_phrase": {"title": "star wars"}}` |
| slop value represents how far terms can be apart from each other, in either direction; f.e. the query "quick fox" with a slop of 1 matches "quick brown fox" | `{"match_phrase": {"title": {"query": "star wars", "slop": 1}}}` |

Boolean queries work similarly to filters, but results are scored by relevance.

Pagination

Pagination uses the from (0-based offset) and size parameters.

{
	"from": 2,
	"size": 2,
	"query": {...}
}

Deep pagination kills performance (the ES server still has to retrieve all preceding results to find the requested ones). Also enforce an upper bound on how many results can be returned to users.

Skipping from defaults it to 0; size defaults to 10.
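The from/size mechanics reduce to a slice over the collected hits (assuming the default page size of 10):

```python
def paginate(hits, from_=0, size=10):
    # the server must still collect from_ + size hits before slicing,
    # which is why deep pagination is expensive
    return hits[from_: from_ + size]

results = list(range(25))   # stand-in for sorted search hits
print(paginate(results, from_=2, size=2))
print(paginate(results))    # first page with defaults
```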

Sorting

  • URL

    /{index_name}/_search?sort={field_name}

  • json data

     {
     	"sort": "year"
     }

Sorting numerical fields is straightforward - just specify the name of the field in the sort parameter. However, a string field used for full-text search can't be used for sorting (because it is saved as individual terms). One way to get around this is to create a subfield that is not analyzed (keyword). A full copy of the text is saved as a whole in this subfield. Since a mapping cannot be changed, this is something to consider before importing data.

{
	"mappings": {
		"properties": {
			"title": {
				"type": "text",
				"fields": {
					"raw": {
						"type": "keyword"
					}
				}
			}
		}
	}
}
{
	"sort": "title.raw"
}

Fuzzy matches

Levenshtein edit distance (the examples below are an edit distance of 1):

  • substitution of characters (f.e. interstellar -> intersteller)
  • insertions of characters (f.e. interstellar -> insterstellar)
  • deletion of characters (f.e. interstellar -> interstelar)
{
	"query": {
		"fuzzy": {
			"title": {
				"value": "intresteller", "fuzziness": 2
			}
		}
	}
}

There is also an auto setting for fuzziness:

  • 0 for 1-2 character strings
  • 1 for 3-5 character strings
  • 2 for the rest
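The edit distance behind fuzzy queries is the classic dynamic-programming Levenshtein algorithm:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("interstellar", "intersteller"))   # substitution
print(levenshtein("interstellar", "insterstellar"))  # insertion
print(levenshtein("interstellar", "interstelar"))    # deletion
```

Each of the three example typos above is one edit away, so all match with fuzziness 1.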

Partial matches

Pretty efficient because of the inverted index structure.

{
	"query": {
		"prefix": {
			"year": "201"
		}
	}
}
{
	"query": {
		"wildcard": {
			"year": "1*"
		}
	}
}

It is possible to simulate query-time search as you type using match_phrase_prefix and slop:

{
	"query": {
		"match_phrase_prefix": {
			"title": {
				"query": "star", "slop": 10
			}
		}
	}
}

N-grams

star:

  • unigram: [s, t, a, r]
  • bigram: [st, ta, ar]
  • trigram: [sta, tar]
  • 4-gram: [star]

Edge N-grams are built only from the beginning of each term.
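Both kinds of grams can be generated with simple slicing (min_gram/max_gram mirror the filter settings used below):

```python
def ngrams(term, n):
    # all contiguous substrings of length n
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def edge_ngrams(term, min_gram=1, max_gram=20):
    # prefixes only, from min_gram up to max_gram characters
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

print(ngrams("star", 2))    # bigrams
print(edge_ngrams("star"))  # edge n-grams
```

Indexing the edge n-grams of "star" is what lets the partial query "sta" hit it.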

First, create a custom analyzer: PUT /{index_name}

{
	"settings": {
		"analysis": {
			"filter": {
				"autocomplete_filter": {
					"type: "edge_ngram",
					"min_gram": 1,
					"max_gram": 20
				}
			},
			"analizer": {
				"autocomplete": {
					"type": "custom",
					"tokenizer": "standard",
					"filter": [
						"lowercase",
						"autocomplete_filter"
					]
				}
			}
		}
	}
}

To test new analyzer run the following: GET /{index_name}/_analyze

{
	"analyzer": "autocomplete",
	"test": "Sta"
}

Second, set a mapping for an index (before putting data): PUT /{index_name}/_mapping

{
	"properties": {
		"title": {
			"type": "text",
			"analyzer": "autocomplete"
		}
	}
}

Third, use ngrams only on the index side (the query is handled by the standard analyzer so that it doesn't get split into multiple terms, while it can still match the multiple index-side terms produced by the ngram analyzer): GET /{index}/_search

{
	"query": {
		"match": {
			"title": {
				"query": "sta",
				"analyzer": "standard"
			}
		}
	}
}

Autocomplete

search_as_you_type data type

Aggregations

The following request returns each star rating value (0.0, 3.5, etc) and the number of documents that contain that value. The block name inside aggs (ratings in the request below) is set by the user and can be anything.

GET /ratings/_search?size=0

{
	"aggs": {
		"ratings": {
			"terms": {
				"field": "rating"
			}
		}
	}
}

It is also possible to narrow down the aggregation by combining it with a query block.

GET /ratings/_search?size=0

{
	"query": {
		"match_phrase": {
			"title": "Star Wars Episode IV"
		}
	},
	"aggs": {
		"avg_rating": {
			"avg": {
				"field": "rating"
			}
		}
	}
}

Aggregations can also be nested. They don't work on text fields (since those are analyzed into individual terms).

GET /ratings/_search?size=0

{
	"query": {
		"match_phrase": {
			"title": "Star Wars"
		}
	},
	"aggs": {
		"titles": {
			"terms": {
				"field": "title.raw"
			},
			"aggs": {
				"avg_rating": {
					"avg": {
						"field": "rating"
					}
				}
			}
		}
	}
}

Histogram

GET /ratings/_search?size=0

{
	"aggs": {
		"whole_ratings": {
			"histogram": {
				"field": "rating",
				"interval": 1.0
			}
		}
	}
}

Time series

ES works well with aggregations based on time, automatically handles calendar rules, like leap years, and so on.

GET /logs/_search?size=0

{
	"aggs": {
		"timestamp": {
			"date_histogram": {
				"field": "@timestamp",
				"interval": "hour"
			}
		}
	}
}

Examples

Index

  • create an index

    PUT /${index name}

  • get mapping of a given index

    GET /${index name}/_mapping

  • set a mapping

    PUT /${index name}/_mapping

     {
     	"mappings": {
     		"properties": {
     			"id": {"type": "integer"},
     			"year": {"type": "date"},
     			"genre": {"type": "keyword"},
     			"title": {"type": "text", "analyzer": "english"}
     		}
     	}
     }
  • bulk upload

    PUT /_bulk

     { "create" : { "_index" : "movies", "_id" : "135569" } }
     { "id": "135569", "title" : "Star Trek Beyond", "year":2016 , "genre":["Action",        "Adventure", "Sci-Fi"] }
     { "create" : { "_index" : "movies", "_id" : "122886" } }
     { "id": "122886", "title" : "Star Wars: Episode VII - The Force Awakens", "year":2015 , "genre":["Action", "Adventure", "Fantasy", "Sci-Fi", "IMAX"] }
     { "create" : { "_index" : "movies", "_id" : "109487" } }
     { "id": "109487", "title" : "Interstellar", "year":2014 , "genre":["Sci-Fi", "IMAX"] }
     { "create" : { "_index" : "movies", "_id" : "58559" } }
     { "id": "58559", "title" : "Dark Knight, The", "year":2008 , "genre":["Action",         "Crime", "Drama", "IMAX"] }
     { "create" : { "_index" : "movies", "_id" : "1924" } }
     { "id": "1924", "title" : "Plan 9 from Outer Space", "year":1959 , "genre":["Horror",   "Sci-Fi"] }
  • list indices (not a json response)

    GET /_cat/indices

  • delete an index

    DELETE /${index name}
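The _bulk body used above is newline-delimited JSON: an action line followed by a document line, terminated by a final newline. A sketch of building such a body (sample data only):

```python
import json

movies = [
    {"id": "135569", "title": "Star Trek Beyond", "year": 2016},
    {"id": "109487", "title": "Interstellar", "year": 2014},
]

# the _bulk body alternates an action line and a document line,
# one JSON object per line, and must end with a newline
lines = []
for movie in movies:
    lines.append(json.dumps({"create": {"_index": "movies", "_id": movie["id"]}}))
    lines.append(json.dumps(movie))
body = "\n".join(lines) + "\n"
print(body)
```

This body would then be sent as-is to PUT /_bulk with the Content-Type header set to application/x-ndjson.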

Document

  • retrieve a document

    GET /${index name}/_doc/${document id}

  • upload a document

    POST /${index name}/_doc/${document id}

    POST /megacorp/_doc/1

     {
     	"first_name": "John",
     	"last_name": "Smith",
     	"age": "25",
     	"about": "I like to go rock climbing",
     	"interests": ["sports", "music"]
     }

    The index operation automatically creates an index if one did not exist before, and also creates a dynamic type mapping for the specific type if one has not yet been created. Good for the development stage (probably not for production). The id is autogenerated if it is not specified. Related settings in conf/elasticsearch.yml:

     # disable dynamic creation of mapping for unmapped types
     index.mapper.dynamic: false
     # disable index autocreation
     action.auto_create_index: false

    Repeating the action actually updates the document (read below).

  • update a document

    An ES document is immutable, thus every document has a _version field. When updating an existing document, a new document is created with an incremented _version field and the old document is marked for deletion. Indexing a whole document again actually updates the existing document, while the example below is used for partial updates (one or many fields). If a partial update doesn't actually change any field values, the version isn't incremented, while posting the whole document always increments the version.

    POST /${index name}/_doc/${document id}/_update

    POST /megacorp/_doc/1/_update

     {
     	"doc": {
     		"title": "Interstellar"
     	}
     }
  • delete a document

    DELETE /${index name}/_doc/${document id}

  • search

    GET /${index name}/_search

    If no body is specified with the request, all documents of the index are returned.

    • simple example

       {
       	"query": {
       		"match": {
       			"last_name": "Smith"
       		}
       	}
       }
    • more complicated example

       {
       	"query": {
       		"bool":{
       			"must": {
       				"match_phrase": {
       					"title": "star wars"
       				}
       			},
       			"filter": {
       				"range": {
       					"year":{
       						"gte": 1980
       					}
       				}
       			}
       		}
       	}
       }
    • full text

      • match one of or both words in any order
         {
         	"query": {
         		"match": {
         			"about": "rock climbing"
         		}
         	}
         }
      • match phrase as a whole
         {
         	"query": {
         		"match_phrase": {
         			"about": "rock climbing"
         		}
         	}
         }
    • only show specified fields (or exclude specified)

       {
       	"_source": [
       		"field1",
       		"field2",
       		"field3"
       	]
       }
       {
       	"_source": {
       		"excludes": [
       			"field1",
       			"field2",
       			"field3"
       		]
       	}
       }
    • show info on total number of items (will be on the same level as _source); to suppress showing the actual items also pass "size": "0"

       {
       	"track_total_hits": "true"
       }

Cluster

Cluster APIs documentation

  • health

    GET /_cluster/health

    Status is one of the 3:

    • green - all primary and replica shards are active
    • yellow - all primary shards are active, but not all replica shards are active
    • red - not all primary shards are active
  • Add an index; the request below sets 3 primary shards, each with 1 replica - that is 3 replica shards and 6 shards in total. Replica shards are not assigned to the same node as their primary shard.

    PUT /blogs

     {
     	"settings": {
     		"number_of_shards": 3,
     		"number_of_replicas": 1
     	}
     }

    To add a node to an existing cluster set cluster.name setting to the same value.

  • change settings

    PUT /blogs/_settings

     {
     	"number_of_replicas": 2
     }
  • check allocation details

    GET /_cluster/allocation/explain

  • check shard allocation details

    GET /_cat/shards?v

Query Lite

A way to run a query without a body, embedding the information inside the URL. However, all logical operators have to be URL-encoded for use in a browser. Good for quick experimenting (curl tests). Formally called URI Search.

F.e. to show movies in movies index that have star in the title: /movies/_search?q=title:star

SQL

Send regular SQL queries to /_xpack/sql with the following body (by default it returns JSON; add the format parameter to switch to a regular SQL interface response - /_xpack/sql?format=txt):

{
	"query": "SQL query"
}

SQL queries are first translated to DSL (ES native language). To see the translated query, send request to /_xpack/sql/translate.

ES also offers SQL cli at bin/elasticsearch-sql-cli (at /usr/share/elasticsearch in docker container).

There are some limitations, for example, list data type can not be handled by SQL.

Maintenance

Import

Script

  • read data from distributed system
  • convert to JSON bulk format
  • submit via REST API

Client libraries

  • Python elasticsearch package
  • Elasticsearch-ruby
  • Elasticsearch.pm for perl
  • so on

Logstash

  • parses, transforms and filters data as it passes through
  • can derive structure from unstructured data
  • can anonymize personal data or exclude it entirely
  • can do geo-location lookups
  • can scale across many nodes
  • absorbs throughput from load spikes

Backup

Snapshots can be stored on NAS, S3, HDFS or Azure. A snapshot stores only the changes since the last snapshot.

Local index backup:

  • add path.repo setting in elasticsearch.yml

  • set up backup location

    PUT /_snapshot/backup_name

     {
     	"type": "fs",
     	"settings": {
     		"location": "/path.repo/location"
     	}
     }
  • check info of backup setup

    GET /_snapshot

  • start back up process (all open indices)

    PUT /_snapshot/backup_repo_name/snapshot_name

  • check info or status of snapshot

    GET /_snapshot/backup_repo_name/snapshot_name
    GET /_snapshot/backup_repo_name/snapshot_name/_status

Check available snapshots

`GET /_cat/snapshots/backup_repo_name`

To restore data from backup:

  • close all indices:

    POST /_all/_close

  • restore from snapshot:

    POST /_snapshot/backup_repo_name/snapshot_name/_restore

SLM (Snapshot Lifecycle Management) policy:

  • schedule - frequency and time to capture the snapshot data
  • name - adds a prefix to each snapshot
  • repository - location to store
  • config.indices - which indices to backup
  • retention - deletes snapshots based on min_count, max_count and expire_after

Policy creation example: PUT /_slm/policy/name_of_policy

{
	"schedule": "0 03 3 * * ?",
	"name": "<backup-{now/d}>",
	"repository": "repo_name",
	"config": {
		"indicies": ["*"]
	},
	"retention": {
		"expire_after": "60d"
	}
}

Check policy contents (also number of snapshots, next run and so on): GET /_slm/policy/name_of_policy?human

To execute the policy immediately: POST /_slm/policy/name_of_policy/_execute

S3

  1. install the plugin, then restart the service
    $ elasticsearch-plugin install repository-s3
  2. add user credentials to ES keystore
    $ elasticsearch-keystore add s3.client.default.access_key
    $ elasticsearch-keystore add s3.client.default.secret_key
  3. reload security settings POST /_nodes/reload_secure_settings
  4. register bucket PUT /_snapshot/snapshot_name
    {
    	"type": "s3",
    	"settings": {
    		"bucket": "bucket_name"
    	}
    }

Data frame

Aggregate data by specific entities and create new summarized secondary indices.

Define:

  • identifier (need to comply with URL rules)
  • source (index)
  • pivot (transformation mechanism)
    • group_by - aggregate based on terms, histograms, date histogram
    • aggregations - calculate the average, min, max, sum, value count, etc
  • destination (index)

Restart

For fast and smooth restart (in case of OS, ES updates) disable index reallocation.

Rolling restart procedure:

  1. stop indexing new data if possible
  2. disable shard allocation
  3. shut down one node
  4. perform maintenance on a node, restart and confirm it joins the cluster
  5. re-enable shard allocation
  6. wait for cluster green status
  7. repeat steps 2-6 for all nodes
  8. resume indexing new data

Disable shard allocation: PUT /_cluster/settings

{
	"transient": {
		"cluster.routing.allocation.enable": "none"
	}
}

Enable shard allocation: PUT /_cluster/settings

{
	"transient": {
		"cluster.routing.allocation.enable": "all"
	}
}

Configuration

TuneELK

The number of primary shards cannot be changed later. The number_of_replicas setting applies to each primary shard, thus the following configuration results in 6 shards in total:

PUT /new_index

{
	"settings": {
		"number_of_shards": 3,
		"number_of_replicas": 1
	}
}

Configuration files are located in ES_HOME/config directory:

  • elasticsearch.yml - configure various Elasticsearch modules
  • logging.yml - configure Elasticsearch logging

Configuration settings can also be set via JAVA_OPTS (options with the -D prefix) or as getopt-style parameters to the elasticsearch command (higher priority than the configuration files mentioned above):

$ elasticsearch -Des.network.host=10.0.0.4 --node.name=my-node
| Option name | Description |
|---|---|
| cluster.name | default is elasticsearch; used to identify the cluster, f.e. when adding new nodes |
| node.name | can be skipped (will be autogenerated), but better to set |
| node.master, node.data | default is true for both - eligible to be master or eligible to store data; good idea to separate responsibilities and let one node be the coordinator (master) and the others store data; setting neither is good if the intended use is a "search load balancer" (fetching results from nodes, aggregating results, etc) |

Dynamic settings

  • transient - changes are in effect until the cluster restarts; after restart these settings are erased
  • persistent - change the static configuration and survive cluster restart

PUT /_cluster/settings

{
	"persistent": {
		"discovery.zen.minimum_master_nodes": 2
	},
	"transient": {
		"indices.store.throttle.max_bytes_per_sec": "50mb"
	}
}

Setting a value to null resets the setting to its default.

Hardware

  • fast disk (preferably SSD with the deadline or noop i/o scheduler)
  • RAID0 since ES is already redundant
  • can save money by choosing a cheaper CPU (less critical than disk and RAM)
  • not recommended to use NAS, Network Attached Storage
  • ideally 64 GB of RAM (32 for ES, 32 for OS and disk cache); less than 8 GB is not recommended

By default ES heap size is set to 1GB. Set ES_HEAP_SIZE environment variable to the desired size, f.e. 10g.

Index

Index-level settings can be set per index and are either static (can be set at index creation time or on a closed index) or dynamic. They can be updated with a PUT /{index_name}/_settings request.

| Option name | Description |
|---|---|
| index.number_of_shards | default is 5 (static); number of shards (splits) of an index |
| index.number_of_replicas | default is 1 (dynamic); number of copies of each primary shard; on a single node the cluster status will always be yellow, as there is no other node to assign replicas to |
| index.refresh_interval | default is 1s (dynamic); how often to perform refresh operations (making recent changes to the index visible to search); -1 disables refresh; increasing the refresh interval increases overall ES throughput |

Multiple indices can be used to aggregate huge amounts of data. Index alias rotation can be used to refer to different indices at different points in time. An alias can contain more than one index. The following example rotates log indices as more are added.

POST /_aliases

{
	"actions": [
		{ "add": { "alias": "logs_current", "index": "logs_2017_06" }},
		{ "remove": { "alias": "logs_current", "index": "logs_2017_05" }},
		{ "add": { "alias": "logs_last_3_months", "index": "logs_2017_06" }},
		{ "remove": { "alias": "logs_last_3_months", "index": "logs_2017_03" }},
	]
}

ES 7 has a notion of index lifecycle management, which identifies 4 stages of index lifecycle, and can apply separate policies at each stage:

  • hot - index is actively updated and queried
  • warm - index is not updated anymore, but actively queried
  • cold - index is neither queried nor updated
  • delete - can be deleted

First the policy needs to be created, then applied to an index.

PUT _ilm/policy/datastream_policy

{
	"policy": {
		"phases": {
			"hot": {
				"actions": {
					"rollover": {
						"max_size": "50GB",
						"max_age": "30d"
					}
				}
			},
			"delete": {
				"min_age": "90d",
				"actions": {
					"delete": {}
				}
			}
		}
	}
}

PUT _template/datastream_template

{
	"index_pattern": ["datastream-*"],
	"settings": {
		"number_of_shards": 1,
		"number_of_replicas": 1,
		"index.lifecycle.name": "datastream_policy",
		"index.lifecycle.rollover_alias": "datastream"
	}
}

Transaction log

Each shard has a transaction log (write-ahead log) associated with it. Requests are buffered in that log based on the configurations below and then flushed (committed) to disk whenever any of the conditions is met.

| Option name | Description |
|---|---|
| index.translog.flush_threshold_ops | default is unlimited; after how many operations to perform a flush |
| index.translog.flush_threshold_size | default is 200mb; once the translog hits this size, perform a flush |
| index.translog.flush_threshold_period | default is 30m; period with no flush happening that forces a flush |
| index.translog.interval | default is 5s; how often to check whether a flush is needed, randomized between the interval value and 2x that value |

Routing

Can be used when a node needs maintenance - set allocation to none, turn off the node, make changes, start the node, then set allocation back to the default, all.

| Option name | Description |
|---|---|
| index.routing.allocation.enable | default is all; other options are primaries, new_primaries, none; enables shard allocation for a specific index |

Slowlog

Slowlog configuration sets time limits that determine whether a request should be logged as a slow request. These logs go to a dedicated log file in the /var/log/elasticsearch directory. All settings are dynamic.

| Option name | Description |
|---|---|
| index.search.slowlog.level | one of warn, info, debug or trace |
| index.search.slowlog.threshold.query.warn | default is 10s |
| index.search.slowlog.threshold.query.info | default is 5s |
| index.search.slowlog.threshold.query.debug | default is 2s |
| index.search.slowlog.threshold.query.trace | default is 500ms |
| index.search.slowlog.threshold.fetch.warn | default is 1s |
| index.search.slowlog.threshold.fetch.info | default is 800ms |
| index.search.slowlog.threshold.fetch.debug | default is 500ms |
| index.search.slowlog.threshold.fetch.trace | default is 200ms |

Path to components

| Option name | Description |
|---|---|
| path.conf | path to the directory that contains the elasticsearch.yml and logging.yml files |
| path.data | path to the directory where the node stores index data; can optionally specify more than one location (comma-separated), causing data to be striped across the locations, favouring those with the most free space |
| path.work | path to temporary files |
| path.logs | path to log files |
| path.plugins | path to where plugins are stored |

Network

By default, Elasticsearch binds to the 0.0.0.0 address and listens on ports 9200-9300 for HTTP traffic and on 9300-9400 for node-to-node communication. A range is specified because, if the default port is found to be already in use, Elasticsearch increments it by one and tries the next port until a free one is found.

Elasticsearch can make use of two host addresses - bind_host is used for communicating with the outside world, while publish_host is used by other nodes for internal communication. F.e. a cloud setup can use 2 network interfaces for internal and external communication.

| Option name | Description |
|---|---|
| network.bind_host | f.e. 192.168.0.1 |
| network.publish_host | f.e. 192.168.0.1 |
| network.host | sets both settings above |
| transport.tcp.port | bind port for node-to-node communication; defaults to 9300-9400 |
| http.port | defaults to 9200 |
| http.enabled | set to false to disable HTTP completely |

Recovery throttling

[TBD]

Discovery

To add nodes to a cluster, just set the same cluster name. By default, multicast discovery is used. Unicast settings allow explicit control over which hosts are used to discover the cluster.

| Option name | Description |
|---|---|
| discovery.zen.minimum_master_nodes | default is 1; how many master-eligible nodes a new node must discover to consider the cluster operational; set higher for bigger clusters: (number of master-eligible nodes / 2) + 1 |
| discovery.zen.ping.multicast.enabled | default is true; turn off when setting up unicast |
| discovery.zen.ping.unicast.hosts | list of hosts that can be used to discover the cluster, f.e. ["host1", "host2:port"] |

Quorum means majority.
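The quorum formula above can be sketched as:

```python
def minimum_master_nodes(master_eligible):
    # quorum: a strict majority of master-eligible nodes,
    # (number of master-eligible nodes / 2) + 1
    return master_eligible // 2 + 1

for n in [1, 2, 3, 5, 10]:
    print(n, "->", minimum_master_nodes(n))
```

Requiring a strict majority is what prevents two halves of a partitioned cluster from both electing a master (split brain).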

Kubernetes setup

k8s deployment from elastic

Concepts

Concurrency

Each document holds _seq_no (sequence number) and _primary_term (identifying the primary shard term that owns that sequence) fields instead of a single _version field. On an update request a client can specify the sequence number and primary term it intends to update. The sequence number is then incremented, and any other request carrying the old value is denied. Use retry_on_conflict=N to retry automatically.

  • manually set fields

    PUT /${index name}/_doc/${document id}?if_seq_no=${sequence number}&if_primary_term=${primary term}

  • set automatic options

    POST /${index name}/_doc/${document id}/_update?retry_on_conflict=5
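A toy model of this optimistic concurrency scheme (not the ES client API, just the control flow):

```python
class Conflict(Exception):
    pass

class Document:
    """Toy document with a sequence number for optimistic concurrency."""
    def __init__(self, body):
        self.body = body
        self.seq_no = 0

    def update(self, body, if_seq_no):
        # reject writes based on a stale sequence number
        if if_seq_no != self.seq_no:
            raise Conflict("document was modified since it was read")
        self.body = body
        self.seq_no += 1

doc = Document({"views": 0})
read_seq = doc.seq_no                          # client reads the document
doc.update({"views": 1}, if_seq_no=read_seq)   # first writer succeeds
try:
    doc.update({"views": 2}, if_seq_no=read_seq)   # stale seq_no is denied
except Conflict as err:
    print("conflict:", err)
```

A real client would re-read the document and retry on conflict, which is what retry_on_conflict automates.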

Learning

Questions

  • match phrase with target value "star trek beyond", query "beyond star" and slop of 3 - why does it make a hit?