Elasticsearch - kamialie/knowledge_corner GitHub Wiki

Contents

Overview

Elasticsearch is a Java application that started off as a scalable Lucene engine. Each shard is an inverted index of documents (a self-contained Lucene index).

Configuration settings can be found:

  • deb - /etc/default/elasticsearch
  • rpm - /etc/sysconfig/elasticsearch

Now used not only for searching data, but also for:

  • metrics (average, stats, min/max, percentile)
  • buckets (histograms, ranges, distances, significant terms)
  • pipelines (moving average, average bucket, cumulative sum)
  • matrix (matrix stats)

Terminology

A document is the thing the user is searching for. It can be plain text or any structured JSON data. Every document has an ID and a type.

| DB | Elastic |
|---|---|
| Tables | Indexes |
| Rows | Documents |
| Columns | Fields |

| Term | Meaning |
|---|---|
| Index (noun) | Similar to a database in a relational database |
| Index (verb) | To store a document in an index (noun); similar to INSERT, except repeating the action replaces the old document |
| Inverted index | Similar to an index added by a relational database (f.e. a B-tree index); Elasticsearch and Lucene use an inverted index to improve the speed of data retrieval |

Mapping

Mapping is a schema definition (ES has reasonable defaults, but sometimes they need to be customized). It defines how JSON documents are stored and how the metadata is structured. Mapping can be explicit (fields and types are predefined) or dynamic (automatically inferred by ES).

Mapping has a "safety zone" that can automatically cast data to the explicitly stated type, f.e. string to integer. If ES can't cast mismatching field types, an exception is thrown. The index.mapping.ignore_malformed index setting can be set to true to ignore this kind of error; mismatched fields are then marked as ignored. It can't handle a JSON object (an error is still thrown).

  • field types - string, byte, short, integer, float, long, double, boolean, date

     "properties": {
     	"user_id": {
     		"type": "long"
     	}
     }
  • field index - whether the field should be indexed for full-text search - analyzed / not_analyzed / no

     "properties": {
     	"genre": {
     		"index": "not_analyzed"
     	}
     }
  • field analyzer - defines the tokenizer and token filters - standard / whitespace / simple / english / etc

     "properties": {
     	"description": {
     		"analyzer": "english"
     	}
     }

Mapping explosion happens when there is a large or unknown number of inner fields. By default, ES creates a mapping (type) for each field. The flattened data type combines an object with inner fields into one field of type flattened. Adding a new field (f.e. through an update) then doesn't change the mapping. The limitation is that inner fields are treated as keywords (no analyzers or tokenizers can be applied), thus they can be searched only via direct match. To search for exact inner fields use dot notation.

Index can be closed and opened (PUT request with _close or _open suffix) to allow some mapping changes.

Default mapping maximum number of fields is 1000. Can be changed with the index.mapping.total_fields.limit setting.

Elasticsearch 7

  • concept of document type is deprecated
  • Elasticsearch SQL is production ready
  • changes in configuration defaults
  • Lucene 8
  • several X-Pack plugins now included in ES
  • bundled Java runtime
  • cross-cluster replication
  • Index Lifecycle Management (ILM)
  • High-level REST client in Java (HLRC)
  • and more

Search engine

  • efficient data indexing
  • analyzing data

Data analysis

  • character filtering
  • tokenization
  • token filtering

Understanding patterns, such as URLs, emails, hashtags, currencies, etc.

F.e. <h1>The quick brown fox jumped over the lazy dog</h1> would be reformatted to [quick] [brown] [fox] [jump] [lazy] [dog]. The applied steps would be: HTML char filter, standard tokenizer, lowercase token filter, stop-word token filter and snowball token filter.
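A minimal sketch of such an analysis chain in Python (the regexes, stop list and "ed"-stripping below are illustrative stand-ins, not Elasticsearch's actual filters):

```python
import re

STOPWORDS = {"the", "a", "an", "over"}

def analyze(text):
    # character filter: strip HTML tags
    text = re.sub(r"<[^>]+>", "", text)
    # tokenizer: split on word boundaries (letters only here)
    tokens = re.findall(r"[A-Za-z]+", text)
    # token filters: lowercase, drop stopwords, naive stemming
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [t[:-2] if t.endswith("ed") else t for t in tokens]
    return tokens

print(analyze("<h1>The quick brown fox jumped over the lazy dog</h1>"))
```

Running this reproduces the token stream from the example above.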

Character filtering

Character filters clean up text before tokenization, f.e. remove HTML tags or replace & with and.

Tokenization

Splitting text into words (understanding word boundaries) or individual terms. Simple version is splitting based on white spaces and punctuation.

Token Filtering

Token filters can change terms (f.e. lowercase or remove plurals), remove terms (f.e. stopwords such as a and the) or add terms (f.e. add synonym jump to leap).

  • stemming
  • casting to lowercase
  • adding synonyms
  • stopwords

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, f.e. removing conjugations, plurals, etc. Ways to find the stem:

  • lookup tables
  • suffix-stripping
  • lemmatization
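A toy suffix-stripping stemmer (purely illustrative; real analyzers use Porter/Snowball rule sets or lemmatization, which would turn "ponies" into "pony" rather than "pon"):

```python
# naive suffix-stripping stemmer -- a toy illustration, not Porter/Snowball
SUFFIXES = ["ing", "ed", "ies", "es", "s"]

def stem(word):
    # strip the first matching suffix, keeping at least 3 characters of stem
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["jumping", "jumped", "jumps", "ponies"]])
```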

Search

The text type allows partial matching and the use of analyzers, while the keyword type allows only exact matches.

Examples

  • partial match

    GET /${index name}/_search

     {
     	"query": {
     		"match": {
     			"title": "Star Trek"
     		}
     	}
     }

Data storage

Inverted indexing

Data structure that stands at the core of any search engine, similar to the physical index of a book.

2 main components:

  • term dictionary - sorted list of terms that occur in a given field across a set of documents
  • postings list - the list of documents that contain the given term
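Both components can be sketched in a few lines of Python (toy in-memory data, not Lucene's on-disk format):

```python
from collections import defaultdict

docs = {
    1: "star wars",
    2: "star trek",
    3: "the dark knight",
}

# term dictionary -> postings list (doc ids that contain the term)
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        index[term].append(doc_id)

print(sorted(index))   # the sorted term dictionary
print(index["star"])   # the postings list for "star"
```

A search for "star" then becomes a single dictionary lookup instead of a scan over all documents.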

Parent child relationship

To allow such relationship an index should have appropriate mappings. The following mapping sets franchise as parent and film as child.

{
	"mappings": {
		"properties": {
			"film_to_franchise": {
				"type": "join",
				"relations": {"franchise": "film"}
			}
		}
	}
}

Each document now has to set corresponding properties. Examples below set parent and child.

{ "create" : { "_index" : "series", "_id" : "1", "routing" : 1} }
{ "id": "1", "film_to_franchise": {"name": "franchise"}, "title" : "Star Wars" }
{ "create" : { "_index" : "series", "_id" : "260", "routing" : 1} }
{ "id": "260", "film_to_franchise": {"name": "film", "parent": "1"}, "title" : "Star    Wars: Episode IV - A New Hope", "year":"1977" , "genre":["Action", "Adventure", "Sci-Fi"] }

Example search query for retrieving all children of specified parent

{
	"query": {
		"has_parent": {
			"parent_type": "franchise",
			"query": {
				"match": {
					"title": "Star Wars"
				}
			}
		}
	}
}

Search query to find the parent of specified child

{
	"query": {
		"has_child": {
			"type": "film",
			"query": {
				"match": {
					"title": "the force awakenes"
				}
			}
		}
	}
}

Relevance scoring

TF/IDF, term frequency - inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Used in Elasticsearch prior to 5.0.

  • the more often a term appears in a field (given document), the more relevant it is (term frequency)
  • the more often a term appears in the inverted index (all documents), the less relevant it is (inverse document frequency)
  • the longer the field, the less relevant each term is (field length norm)
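The first two factors can be sketched with the classic textbook TF-IDF formula (Lucene's practical scoring adds field-length normalization and boosts on top of this):

```python
import math

docs = [
    ["star", "wars", "star"],
    ["star", "trek"],
    ["dark", "knight"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in docs if term in d)    # number of documents with the term
    idf = math.log(len(docs) / df)            # inverse document frequency
    return tf * idf

# "star" appears twice in doc 0 but also in doc 1, so idf dampens its score;
# the rarer "wars" ends up more relevant despite a lower term frequency
print(tf_idf("star", docs[0], docs))
print(tf_idf("wars", docs[0], docs))
```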

BM25 is the default since Elasticsearch 5.0.

Routing

All data lives in primary shards. Any particular document is located in exactly one of the shards. Routing is the process of determining which shard a document will reside in: shard = hash(routing) % number_of_primary_shards. All document APIs (get, index, delete, bulk, update) accept a routing parameter. A particular field can also be set as the routing parameter.
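A sketch of the routing formula (Elasticsearch actually hashes the routing value with murmur3; crc32 below is only a stand-in to keep the example self-contained):

```python
import zlib

def shard_for(routing, number_of_primary_shards):
    # shard = hash(routing) % number_of_primary_shards
    return zlib.crc32(routing.encode()) % number_of_primary_shards

# the same routing value always maps to the same shard
for doc_id in ["1", "2", "42"]:
    print(doc_id, "->", shard_for(doc_id, 3))
```

Because the shard count is part of the modulo, changing it would remap every document, which is why it is fixed at index creation.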

This is the reason why the number of shards cannot be updated dynamically. However, replica shards can be added later on.

Write requests are routed to the primary shard, then replicated. Read requests are routed to primary or replica shards.

Replication

In a multi-node environment the client sends a request to, f.e., node1, which determines from the document's id which shard the document belongs to. It then forwards the request to that primary shard (which may be located on another node) and the shard executes the request. On success the primary shard sends the same request to its replicas (which may be located on other nodes) and waits for their success status. In the end success is reported to the client.

  • sync (default) - the primary shard waits for successful responses from the replica shards before returning its response
  • async - the primary shard returns success after its own execution; replica requests are still sent, but result will not be known

Requests

The ES server interacts with clients via REST APIs. Append ?pretty to any request to get a nicely formatted response.

# VERB - HTTP method, GET, POST, PUT, etc
# PROTOCOL - http or https
# HOST - the hostname of any node in the cluster or localhost for a node on
# local machine
# PORT - default 9200, can be set to different
# QUERY_STRING - any optional query-string parameters, f.e. ?pretty will
# pretty print the JSON
# BODY - a JSON-encoded request body (if needed)
$ curl -H "Content-Type: application/json" -X VERB 'PROTOCOL://HOST:PORT/PATH?QUERY_STRING' -d 'BODY'

Filters ask a yes/no question of your data; they are fast and cacheable. Queries return data ranked by relevance. Filters can be combined inside queries and vice versa.

Filters

Wrapped in a "filter": {} block.

Some filter types:

| Description | Example |
|---|---|
| filter by exact values | `{"term": {"year": 2014}}` |
| match if any values in a list match | `{"terms": {"genre": ["Sci-Fi", "Adventure"]}}` |
| find numbers or dates in a given range | `{"range": {"year": {"gte": 2010}}}` |
| find documents where field exists | `{"exists": {"field": "tags"}}` |
| find documents where field is missing | `{"missing": {"field": "tags"}}` |

Combine filters with Boolean logic (must, must_not, should, where should is the equivalent of or).

{
	"query": {
		"bool": {
			"must": {"match": {"genre": "Sci-Fi"}},
			"must_not": {"match": {"title": "trek"}},
			"filter": {"range": {"year": {"gte": 2015, "lt": 2020"}}}
		}
	}
}

Queries

Wrapped in a "query": {} block.

Some types of queries:

| Description | Example |
|---|---|
| returns all documents; the default, usually used with a filter | `{"match_all": {}}` |
| searches analyzed results, f.e. full-text search; multiple terms are treated independently, either can hit | `{"match": {"title": "star"}}` |
| runs the same query on multiple fields | `{"multi_match": {"query": "star", "fields": ["title", "synopsis"]}}` |
| finds all terms in the given order (phrase matching) | `{"match_phrase": {"title": "star wars"}}` |
| slop value represents how far terms can be apart from each other, in either direction; f.e. the query "quick fox" with a slop of 1 matches "quick brown fox" | `{"match_phrase": {"title": {"query": "star wars", "slop": 1}}}` |

Boolean queries work similarly to filters, but results are scored by relevance.

Pagination

Pagination uses the from (0-based offset) and size parameters.

{
	"from": 2,
	"size": 2,
	"query": {...}
}

Deep pagination kills performance (the ES server still has to retrieve all preceding results to find the requested ones). Also enforce an upper bound on how many results can be returned to users.

Skipping from defaults it to 0; size defaults to 10.
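The from/size mechanics reduce to a slice over the collected hits (assuming the default page size of 10):

```python
def paginate(hits, from_=0, size=10):
    # the server must still collect from_ + size hits before slicing,
    # which is why deep pagination is expensive
    return hits[from_: from_ + size]

results = list(range(25))   # stand-in for sorted search hits
print(paginate(results, from_=2, size=2))
print(paginate(results))    # first page with defaults
```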

Sorting

  • URL

    /{index_name}/_search?sort={field_name}

  • json data

     {
     	"sort": "year"
     }

Sorting numerical fields is straightforward - just specify the name of the field in the sort parameter. However, a string field used for full-text search can't be used for sorting (because it is saved as individual terms). One way to get around this is to create a subfield that is not analyzed (keyword). A full copy of the text is saved as a whole in this subfield. Since a mapping cannot be changed, this is something to consider before importing data.

{
	"mappings": {
		"properties": {
			"title": {
				"type": "text",
				"fields": {
					"raw": {
						"type": "keyword"
					}
				}
			}
		}
	}
}
{
	"sort": "title.raw"
}

Fuzzy matches

Levenshtein edit distance (the examples below are an edit distance of 1):

  • substitution of characters (f.e. interstellar -> intersteller)
  • insertions of characters (f.e. interstellar -> insterstellar)
  • deletion of characters (f.e. interstellar -> interstelar)
{
	"query": {
		"fuzzy": {
			"title": {
				"value": "intresteller", "fuzziness": 2
			}
		}
	}
}

There is also an auto setting for fuzziness:

  • 0 for 1-2 character strings
  • 1 for 3-5 character strings
  • 2 for the rest
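The edit distance behind fuzzy queries is the classic dynamic-programming Levenshtein algorithm:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("interstellar", "intersteller"))   # substitution
print(levenshtein("interstellar", "insterstellar"))  # insertion
print(levenshtein("interstellar", "interstelar"))    # deletion
```

Each of the three example typos above is one edit away, so all match with fuzziness 1.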

Partial matches

Pretty efficient because of the inverted index structure.

{
	"query": {
		"prefix": {
			"year": "201"
		}
	}
}
{
	"query": {
		"wildcard": {
			"year": "1*"
		}
	}
}

It is possible to simulate query-time search as you type using match_phrase_prefix and slop:

{
	"query": {
		"match_phrase_prefix": {
			"title": {
				"query": "star", "slop": 10
			}
		}
	}
}

N-grams

star:

  • unigram: [s, t, a, r]
  • bigram: [st, ta, ar]
  • trigram: [sta, tar]
  • 4-gram: [star]

Edge N-grams are built only from the beginning of each term.
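Both kinds of grams can be generated with simple slicing (min_gram/max_gram mirror the filter settings used below):

```python
def ngrams(term, n):
    # all contiguous substrings of length n
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def edge_ngrams(term, min_gram=1, max_gram=20):
    # prefixes only, from min_gram up to max_gram characters
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

print(ngrams("star", 2))    # bigrams
print(edge_ngrams("star"))  # edge n-grams
```

Indexing the edge n-grams of "star" is what lets the partial query "sta" hit it.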

First, create a custom analyzer: PUT /{index_name}

{
	"settings": {
		"analysis": {
			"filter": {
				"autocomplete_filter": {
					"type: "edge_ngram",
					"min_gram": 1,
					"max_gram": 20
				}
			},
			"analizer": {
				"autocomplete": {
					"type": "custom",
					"tokenizer": "standard",
					"filter": [
						"lowercase",
						"autocomplete_filter"
					]
				}
			}
		}
	}
}

To test new analyzer run the following: GET /{index_name}/_analyze

{
	"analyzer": "autocomplete",
	"test": "Sta"
}

Second, set a mapping for an index (before putting data): PUT /{index_name}/_mapping

{
	"properties": {
		"title": {
			"type": "text",
			"analyzer": "autocomplete"
		}
	}
}

Third, use ngrams only on the index side (the query is handled by the standard analyzer so that it doesn't get split into multiple terms, while it can still match the multiple index-side terms produced by the ngram analyzer): GET /{index}/_search

{
	"query": {
		"match": {
			"title": {
				"query": "sta",
				"analyzer": "standard"
			}
		}
	}
}

Autocomplete

search_as_you_type data type

Aggregations

The following request returns each star rating value (0.0, 3.5, etc) and the number of documents that contain that value. The block name inside aggs (ratings in the request below) is set by the user and can be anything.

GET /ratings/_search?size=0

{
	"aggs": {
		"ratings": {
			"terms": {
				"field": "rating"
			}
		}
	}
}

It is also possible to narrow down the aggregation by combining it with a query block.

GET /ratings/_search?size=0

{
	"query": {
		"match_phrase": {
			"title": "Star Wars Episode IV"
		}
	},
	"aggs": {
		"avg_rating": {
			"avg": {
				"field": "rating"
			}
		}
	}
}

Aggregations can also be nested. They don't work on text fields (since those are analyzed into individual terms).

GET /ratings/_search?size=0

{
	"query": {
		"match_phrase": {
			"title": "Star Wars"
		}
	},
	"aggs": {
		"titles": {
			"terms": {
				"field": "title.raw"
			},
			"aggs": {
				"avg_rating": {
					"avg": {
						"field": "rating"
					}
				}
			}
		}
	}
}

Histogram

GET /ratings/_search?size=0

{
	"aggs": {
		"whole_ratings": {
			"histogram": {
				"field": "rating",
				"interval": 1.0
			}
		}
	}
}

Time series

ES works well with aggregations based on time, automatically handles calendar rules, like leap years, and so on.

GET /logs/_search?size=0

{
	"aggs": {
		"timestamp": {
			"date_histogram": {
				"field": "@timestamp",
				"interval": "hour"
			}
		}
	}
}

Examples

Index

  • create an index

    PUT /${index name}

  • get mapping of a given index

    GET /${index name}/_mapping

  • set a mapping

    PUT /${index name}/_mapping

     {
     	"mappings": {
     		"properties": {
     			"id": {"type": "integer"},
     			"year": {"type": "date"},
     			"genre": {"type": "keyword"},
     			"title": {"type": "text", "analyzer": "english"}
     		}
     	}
     }
  • bulk upload

    PUT /_bulk

     { "create" : { "_index" : "movies", "_id" : "135569" } }
     { "id": "135569", "title" : "Star Trek Beyond", "year":2016 , "genre":["Action",        "Adventure", "Sci-Fi"] }
     { "create" : { "_index" : "movies", "_id" : "122886" } }
     { "id": "122886", "title" : "Star Wars: Episode VII - The Force Awakens", "year":2015 , "genre":["Action", "Adventure", "Fantasy", "Sci-Fi", "IMAX"] }
     { "create" : { "_index" : "movies", "_id" : "109487" } }
     { "id": "109487", "title" : "Interstellar", "year":2014 , "genre":["Sci-Fi", "IMAX"] }
     { "create" : { "_index" : "movies", "_id" : "58559" } }
     { "id": "58559", "title" : "Dark Knight, The", "year":2008 , "genre":["Action",         "Crime", "Drama", "IMAX"] }
     { "create" : { "_index" : "movies", "_id" : "1924" } }
     { "id": "1924", "title" : "Plan 9 from Outer Space", "year":1959 , "genre":["Horror",   "Sci-Fi"] }
  • list indices (not a json response)

    GET /_cat/indices

  • delete an index

    DELETE /${index name}
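The _bulk body used above is newline-delimited JSON: an action line followed by a document line, terminated by a final newline. A sketch of building such a body (sample data only):

```python
import json

movies = [
    {"id": "135569", "title": "Star Trek Beyond", "year": 2016},
    {"id": "109487", "title": "Interstellar", "year": 2014},
]

# the _bulk body alternates an action line and a document line,
# one JSON object per line, and must end with a newline
lines = []
for movie in movies:
    lines.append(json.dumps({"create": {"_index": "movies", "_id": movie["id"]}}))
    lines.append(json.dumps(movie))
body = "\n".join(lines) + "\n"
print(body)
```

This body would then be sent as-is to PUT /_bulk with the Content-Type header set to application/x-ndjson.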

Document

  • retrieve a document

    GET /${index name}/_doc/${document id}

  • upload a document

    POST /${index name}/_doc/${document id}

    POST /megacorp/_doc/1

     {
     	"first_name": "John",
     	"last_name": "Smith",
     	"age": "25",
     	"about": "I like to go rock climbing",
     	"interests": ["sports", "music"]
     }

    The index operation automatically creates an index if one did not exist before, and also creates a dynamic type mapping for the specific type if one has not yet been created. Good for the development stage (probably not for production). The id is autogenerated if it is not specified. Related settings in conf/elasticsearch.yml:

     # disable dynamic creation of mapping for unmapped types
     index.mapper.dynamic: false
     # disable index autocreation
     action.auto_create_index: false

    Repeating the action actually updates the document (read below).

  • update a document

    An ES document is immutable, thus every document has a _version field. When updating an existing document, a new document is created with an incremented _version field and the old document is marked for deletion. Indexing a whole document again actually updates the existing document, while the example below is used for partial updates (one or many fields). If a partial update doesn't actually change any field values, the version isn't incremented, while posting the whole document always increments the version.

    POST /${index name}/_doc/${document id}/_update

    POST /megacorp/_doc/1/_update

     {
     	"doc": {
     		"title": "Interstellar"
     	}
     }
  • delete a document

    DELETE /${index name}/_doc/${document id}

  • search

    GET /${index name}/_search

    If no body is specified with the request, all documents of the index are returned.

    • simple example

       {
       	"query": {
       		"match": {
       			"last_name": "Smith"
       		}
       	}
       }
    • more complicated example

       {
       	"query": {
       		"bool":{
       			"must": {
       				"match_phrase": {
       					"title": "star wars"
       				}
       			},
       			"filter": {
       				"range": {
       					"year":{
       						"gte": 1980
       					}
       				}
       			}
       		}
       	}
       }
    • full text

      • match one of or both words in any order
         {
         	"query": {
         		"match": {
         			"about": "rock climbing"
         		}
         	}
         }
      • match phrase as a whole
         {
         	"query": {
         		"match_phrase": {
         			"about": "rock climbing"
         		}
         	}
         }
    • only show specified fields (or exclude specified)

       {
       	"_source": [
       		"field1",
       		"field2",
       		"field3"
       	]
       }
       {
       	"_source": {
       		"excludes": [
       			"field1",
       			"field2",
       			"field3"
       		]
       	}
       }
    • show info on total number of items (will be on the same level as _source); to suppress showing the actual items also pass "size": "0"

       {
       	"track_total_hits": "true"
       }

Cluster

Cluster APIs documentation

  • health

    GET /_cluster/health

    Status is one of the 3:

    • green - all primary and replica shards are active
    • yellow - all primary shards are active, but not all replica shards are active
    • red - not all primary shards are active
  • Add an index; the request below sets 3 primary shards, each with 1 replica - that is 3 replica shards and 6 shards in total. Replica shards are not assigned to the same node as their primary shard.

    PUT /blogs

     {
     	"settings": {
     		"number_of_shards": 3,
     		"number_of_replicas": 1
     	}
     }

    To add a node to an existing cluster set cluster.name setting to the same value.

  • change settings

    PUT /blogs/_settings

     {
     	"number_of_replicas": 2
     }
  • check allocation details

    GET /_cluster/allocation/explain

  • check shard allocation details

    GET /_cat/shards?v

Query Lite

A way to run a query without a body, embedding the information inside the URL. However, all logical operators have to be URL-encoded for use in a browser. Good for quick experimenting (curl tests). Formally called URI Search.

F.e. to show movies in movies index that have star in the title: /movies/_search?q=title:star

SQL

Send regular SQL queries to /_xpack/sql with the following body (by default it returns JSON; add the format parameter to switch to a regular SQL interface response - /_xpack/sql?format=txt):

{
	"query": "SQL query"
}

SQL queries are first translated to DSL (ES native language). To see the translated query, send request to /_xpack/sql/translate.

ES also offers SQL cli at bin/elasticsearch-sql-cli (at /usr/share/elasticsearch in docker container).

There are some limitations, for example, list data type can not be handled by SQL.

Maintenance

Import

Script

  • read data from distributed system
  • convert to JSON bulk format
  • submit via REST API

Client libraries

  • Python elasticsearch package
  • Elasticsearch-ruby
  • Elasticsearch.pm for perl
  • so on

Logstash

  • parses, transforms and filters data as it passes through
  • can derive structure from unstructured data
  • can anonymize personal data or exclude it entirely
  • can do geo-location lookups
  • can scale across many nodes
  • absorbs throughput from load spikes

Backup

Snapshots can be stored on NAS, S3, HDFS or Azure. A snapshot stores only the changes since the last snapshot.

Local index backup:

  • add path.repo setting in elasticsearch.yml

  • set up backup location

    PUT /_snapshot/backup_name

     {
     	"type": "fs",
     	"settings": {
     		"location": "/path.repo/location"
     	}
     }
  • check info of backup setup

    GET /_snapshot

  • start back up process (all open indices)

    PUT /_snapshot/backup_repo_name/snapshot_name

  • check info or status of snapshot

    GET /_snapshot/backup_repo_name/snapshot_name
    GET /_snapshot/backup_repo_name/snapshot_name/_status

Check available snapshots

`GET /_cat/snapshots/backup_repo_name`

To restore data from backup:

  • close all indices:

    POST /_all/_close

  • restore from snapshot:

    POST /_snapshot/backup_repo_name/snapshot_name/_restore

SLM (Snapshot Lifecycle Management) policy:

  • schedule - frequency and time to capture the snapshot data
  • name - adds a prefix to each snapshot
  • repository - location to store
  • config.indices - which indices to backup
  • retention - deletes snapshots based on min_count, max_count and expire_after

Policy creation example: PUT /_slm/policy/name_of_policy

{
	"schedule": "0 03 3 * * ?",
	"name": "<backup-{now/d}>",
	"repository": "repo_name",
	"config": {
		"indicies": ["*"]
	},
	"retention": {
		"expire_after": "60d"
	}
}

Check policy contents (also number of snapshots, next run and so on): GET /_slm/policy/name_of_policy?human

To execute the policy immediately: POST /_slm/policy/name_of_policy/_execute

S3

  1. install the plugin, then restart the service
    $ elasticsearch-plugin install repository-s3
  2. add user credentials to ES keystore
    $ elasticsearch-keystore add s3.client.default.access_key
    $ elasticsearch-keystore add s3.client.default.secret_key
  3. reload security settings POST /_nodes/reload_secure_settings
  4. register bucket PUT /_snapshot/snapshot_name
    {
    	"type": "s3",
    	"settings": {
    		"bucket": "bucket_name"
    	}
    }

Data frame

Aggregate data by specific entities and create new summarized secondary indices.

Define:

  • identifier (need to comply with URL rules)
  • source (index)
  • pivot (transformation mechanism)
    • group_by - aggregate based on terms, histograms, date histogram
    • aggregations - calculate the average, min, max, sum, value count, etc
  • destination (index)

Restart

For fast and smooth restart (in case of OS, ES updates) disable index reallocation.

Rolling restart procedure:

  1. stop indexing new data if possible
  2. disable shard allocation
  3. shut down one node
  4. perform maintenance on a node, restart and confirm it joins the cluster
  5. re-enable shard allocation
  6. wait for cluster green status
  7. repeat steps 2-6 for all nodes
  8. resume indexing new data

Disable shard allocation: PUT /_cluster/settings

{
	"transient": {
		"cluster.routing.allocation.enable": "none"
	}
}

Enable shard allocation: PUT /_cluster/settings

{
	"transient": {
		"cluster.routing.allocation.enable": "all"
	}
}

Configuration

TuneELK

The number of primary shards cannot be changed later. The number_of_replicas setting applies to each primary shard, thus the following configuration results in 6 shards in total:

PUT /new_index

{
	"settings": {
		"number_of_shards": 3,
		"number_of_replicas": 1
	}
}

Configuration files are located in ES_HOME/config directory:

  • elasticsearch.yml - configure various Elasticsearch modules
  • logging.yml - configure Elasticsearch logging

Configuration settings can also be set via JAVA_OPTS (options with the -D prefix) or as getopt-style parameters to the elasticsearch command (higher priority than the configuration files mentioned above):

$ elasticsearch -Des.network.host=10.0.0.4 --node.name=my-node
| Option name | Description |
|---|---|
| cluster.name | default is elasticsearch; used to identify the cluster, f.e. when adding new nodes |
| node.name | can be skipped (will be autogenerated), but better to set |
| node.master, node.data | default is true for both - eligible to be master or eligible to store data; good idea to separate responsibilities and let one node be the coordinator (master) and the others store data; setting neither is good if the intended use is a "search load balancer" (fetching results from nodes, aggregating results, etc) |

Dynamic settings

  • transient - changes are in effect until the cluster restarts; after restart these settings are erased
  • persistent - change the static configuration and survive cluster restart

PUT /_cluster/settings

{
	"persistent": {
		"discovery.zen.minimum_master_nodes": 2
	},
	"transient": {
		"indices.store.throttle.max_bytes_per_sec": "50mb"
	}
}

Setting a value to null resets the setting to its default.

Hardware

  • fast disk (preferably SSD with the deadline or noop i/o scheduler)
  • RAID0 since ES is already redundant
  • can save money by choosing a cheaper CPU (less critical than disk and RAM)
  • not recommended to use NAS, Network Attached Storage
  • ideally 64 GB of RAM (32 for ES, 32 for OS and disk cache); less than 8 GB is not recommended

By default ES heap size is set to 1GB. Set ES_HEAP_SIZE environment variable to the desired size, f.e. 10g.

Index

Index-level settings can be set per index and are either static (can be set at index creation time or on a closed index) or dynamic. They can be updated with a PUT /{index_name}/_settings request.

| Option name | Description |
|---|---|
| index.number_of_shards | default is 5 (static); number of shards (splits) of an index |
| index.number_of_replicas | default is 1 (dynamic); number of copies of each primary shard; on a single node the cluster status will always be yellow, as there is no other node to assign replicas to |
| index.refresh_interval | default is 1s (dynamic); how often to perform refresh operations (making recent changes to the index visible to search); -1 disables refresh; increasing the refresh interval increases overall ES throughput |

Multiple indices can be used to aggregate huge amounts of data. Index alias rotation can be used to refer to different indices at different points in time. An alias can contain more than one index. The following example rotates log indices as more are added.

POST /_aliases

{
	"actions": [
		{ "add": { "alias": "logs_current", "index": "logs_2017_06" }},
		{ "remove": { "alias": "logs_current", "index": "logs_2017_05" }},
		{ "add": { "alias": "logs_last_3_months", "index": "logs_2017_06" }},
		{ "remove": { "alias": "logs_last_3_months", "index": "logs_2017_03" }},
	]
}

ES 7 has a notion of index lifecycle management, which identifies 4 stages of index lifecycle, and can apply separate policies at each stage:

  • hot - index is actively updated and queried
  • warm - index is not updated anymore, but actively queried
  • cold - index is neither queried nor updated
  • delete - can be deleted

First the policy needs to be created, then applied to an index.

PUT _ilm/policy/datastream_policy

{
	"policy": {
		"phases": {
			"hot": {
				"actions": {
					"rollover": {
						"max_size": "50GB",
						"max_age": "30d"
					}
				}
			},
			"delete": {
				"min_age": "90d",
				"actions": {
					"delete": {}
				}
			}
		}
	}
}

PUT _template/datastream_template

{
	"index_pattern": ["datastream-*"],
	"settings": {
		"number_of_shards": 1,
		"number_of_replicas": 1,
		"index.lifecycle.name": "datastream_policy",
		"index.lifecycle.rollover_alias": "datastream"
	}
}

Transaction log

Each shard has a transaction log (write-ahead log) associated with it. Requests are buffered in that log based on the configurations below and then flushed (committed) to disk whenever any of the conditions is met.

| Option name | Description |
|---|---|
| index.translog.flush_threshold_ops | default is unlimited; after how many operations to perform a flush |
| index.translog.flush_threshold_size | default is 200mb; once the translog hits this size, perform a flush |
| index.translog.flush_threshold_period | default is 30m; period with no flush happening that forces a flush |
| index.translog.interval | default is 5s; how often to check whether a flush is needed, randomized between the interval value and 2x that value |

Routing

Can be used when a node needs maintenance - set allocation to none, turn off the node, make changes, start the node, then set allocation back to the default, all.

| Option name | Description |
|---|---|
| index.routing.allocation.enable | default is all; other options are primaries, new_primaries, none; enables shard allocation for a specific index |

Slowlog

Slowlog configuration sets time limits that determine whether a request should be logged as a slow request. These logs go to a dedicated log file in the /var/log/elasticsearch directory. All settings are dynamic.

| Option name | Description |
|---|---|
| index.search.slowlog.level | one of warn, info, debug or trace |
| index.search.slowlog.threshold.query.warn | default is 10s |
| index.search.slowlog.threshold.query.info | default is 5s |
| index.search.slowlog.threshold.query.debug | default is 2s |
| index.search.slowlog.threshold.query.trace | default is 500ms |
| index.search.slowlog.threshold.fetch.warn | default is 1s |
| index.search.slowlog.threshold.fetch.info | default is 800ms |
| index.search.slowlog.threshold.fetch.debug | default is 500ms |
| index.search.slowlog.threshold.fetch.trace | default is 200ms |

Path to components

| Option name | Description |
|---|---|
| path.conf | path to the directory that contains the elasticsearch.yml and logging.yml files |
| path.data | path to the directory where the node stores index data; can optionally specify more than one location (comma-separated), causing data to be striped across the locations, favouring those with the most free space |
| path.work | path to temporary files |
| path.logs | path to log files |
| path.plugins | path to where plugins are stored |

Network

By default, Elasticsearch binds to the 0.0.0.0 address and listens on ports 9200-9300 for HTTP traffic and on 9300-9400 for node-to-node communication. A range is specified because, if the default port is found to be already in use, Elasticsearch increments it by one and tries the next port until a free one is found.

Elasticsearch can make use of two host addresses - bind_host is used for communicating with the outside world, while publish_host is used by other nodes for internal communication. F.e. a cloud setup can use 2 network interfaces for internal and external communication.

| Option name | Description |
|---|---|
| network.bind_host | f.e. 192.168.0.1 |
| network.publish_host | f.e. 192.168.0.1 |
| network.host | sets both settings above |
| transport.tcp.port | bind port for node-to-node communication; defaults to 9300-9400 |
| http.port | defaults to 9200 |
| http.enabled | set to false to disable HTTP completely |

Recovery throttling

[TBD]

Discovery

To add nodes to a cluster, just set the same cluster name. By default, multicast discovery is used. Unicast settings allow explicit control over which hosts are used to discover the cluster.

| Option name | Description |
|---|---|
| discovery.zen.minimum_master_nodes | default is 1; how many master-eligible nodes a new node must discover to consider the cluster operational; set higher for bigger clusters: (number of master-eligible nodes / 2) + 1 |
| discovery.zen.ping.multicast.enabled | default is true; turn off when setting up unicast |
| discovery.zen.ping.unicast.hosts | list of hosts that can be used to discover the cluster, f.e. ["host1", "host2:port"] |

Quorum means majority.
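The quorum formula above can be sketched as:

```python
def minimum_master_nodes(master_eligible):
    # quorum: a strict majority of master-eligible nodes,
    # (number of master-eligible nodes / 2) + 1
    return master_eligible // 2 + 1

for n in [1, 2, 3, 5, 10]:
    print(n, "->", minimum_master_nodes(n))
```

Requiring a strict majority is what prevents two halves of a partitioned cluster from both electing a master (split brain).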

Kubernetes setup

k8s deployment from elastic

Concepts

Concurrency

Each document holds _seq_no (sequence number) and _primary_term (identifying the primary shard term that owns that sequence) fields instead of a single _version field. On an update request a client can specify the sequence number and primary term it intends to update. The sequence number is then incremented, and any other request carrying the old value is denied. Use retry_on_conflict=N to retry automatically.

  • manually set fields

    PUT /${index name}/_doc/${document id}?if_seq_no=${sequence number}&if_primary_term=${primary term}

  • set automatic options

    POST /${index name}/_doc/${document id}/_update?retry_on_conflict=5
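A toy model of this optimistic concurrency scheme (not the ES client API, just the control flow):

```python
class Conflict(Exception):
    pass

class Document:
    """Toy document with a sequence number for optimistic concurrency."""
    def __init__(self, body):
        self.body = body
        self.seq_no = 0

    def update(self, body, if_seq_no):
        # reject writes based on a stale sequence number
        if if_seq_no != self.seq_no:
            raise Conflict("document was modified since it was read")
        self.body = body
        self.seq_no += 1

doc = Document({"views": 0})
read_seq = doc.seq_no                          # client reads the document
doc.update({"views": 1}, if_seq_no=read_seq)   # first writer succeeds
try:
    doc.update({"views": 2}, if_seq_no=read_seq)   # stale seq_no is denied
except Conflict as err:
    print("conflict:", err)
```

A real client would re-read the document and retry on conflict, which is what retry_on_conflict automates.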

Learning

Questions

  • match phrase with target value "star trek beyond", query "beyond star" and slop of 3 - why does it make a hit?