Elasticsearch
Elasticsearch is a Java application that started off as a scalable Lucene engine. Each shard is a self-contained Lucene instance holding an inverted index of documents.
Configuration settings can be found at:
- deb - /etc/default/elasticsearch
- rpm - /etc/sysconfig/elasticsearch
It is now used not only for searching data, but also for aggregations:
- metrics (average, stats, min/max, percentile)
- buckets (histograms, ranges, distances, significant terms)
- pipelines (moving average, average bucket, cumulative sum)
- matrix (matrix stats)
A document is the thing a user searches for. It could be simple text or any structured JSON data. Every document has an ID and a type.
DB | Elastic |
---|---|
Tables | Indexes |
Rows | Documents |
Columns | Fields |
Term | Meaning |
---|---|
Index (noun) | Similar to a database in a relational database system |
Index (verb) | To store a document in an index (noun), similar to INSERT, except repeating the action replaces the old document |
Inverted index | Similar to an index added by a relational database (e.g. a B-tree index); Elasticsearch and Lucene use an inverted index to improve the speed of data retrieval |
Mapping is a schema definition (ES has reasonable defaults, but sometimes they need to be customized). It defines how JSON documents are stored and how the metadata is structured. Mapping can be explicit (fields and types are predefined) or dynamic (automatically derived by ES).
Mapping has "safety zone" that can automatically cast data to the explicitly
stated type, f.e. string to integer. If ES can't cast mismatching types of
fields, then exception is thrown. index.mapping.ignore_malformed
index
setting can be set to true
to ignore these kind of errors. Mismatched fields
are going to be set as ignored. Can't handle JSON object (will still throw
error).
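A minimal sketch of enabling it at index creation time (my_index is a hypothetical name):
PUT /my_index
{
  "settings": {
    "index.mapping.ignore_malformed": true
  }
}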
- field types - string, byte, short, integer, float, long, double, boolean, date
  "properties": { "user_id": { "type": "long" } }
- field index - whether the field should be indexed for full-text search - analyzed / not_analyzed / no
  "properties": { "genre": { "index": "not_analyzed" } }
- field analyzer - defines the tokenizer and token filters - standard / whitespace / simple / english / etc
  "properties": { "description": { "analyzer": "english" } }
Mapping explosion happens when there is a large or unknown number of inner fields. By default, ES creates a mapping (type) for each field. The flattened data type is used to combine an object with inner fields into one field of type flattened.
Adding a new field (e.g. through an update) doesn't change the mapping. The limitation is that inner fields are treated as keywords (no analyzers or tokenizers can be applied), thus they can only be searched via direct match. To search an exact inner field, use dot notation.
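A minimal sketch, assuming a hypothetical labels object with unpredictable inner fields (the flattened type is available in recent versions):
PUT /my_index
{
  "mappings": {
    "properties": {
      "labels": { "type": "flattened" }
    }
  }
}
GET /my_index/_search
{
  "query": { "term": { "labels.env": "prod" } }
}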
An index can be closed and opened (POST request with `_close` or `_open` suffix) to allow some mapping changes.
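For example, analysis settings can only be changed on a closed index; a sketch with a hypothetical index and analyzer name:
POST /my_index/_close
PUT /my_index/_settings
{ "analysis": { "analyzer": { "my_analyzer": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase"] } } } }
POST /my_index/_open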
The default maximum number of fields in a mapping is 1000. It can be changed with the `index.mapping.total_fields.limit` setting.
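It is a dynamic setting; a sketch with a hypothetical index and value:
PUT /my_index/_settings
{ "index.mapping.total_fields.limit": 2000 }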
Notable changes in Elasticsearch 7:
- concept of document type is deprecated
- Elasticsearch SQL is production ready
- changes in configuration defaults
- Lucene 8
- several X-Pack plugins now included in ES
- bundled Java runtime
- cross-cluster replication
- Index Lifecycle Management (ILM)
- High-level REST client in Java (HLRC)
- and more
- efficient data indexing
- analyzing data
  - character filtering
  - tokenization
  - token filtering
Analyzers also understand patterns, such as URLs, emails, hashtags, currencies, etc. E.g. `<h1>The quick brown fox jumped over the lazy dog</h1>` would be reformatted to `[quick] [brown] [fox] [jump] [lazy] [dog]`. The applied steps would be the HTML char filter, standard tokenizer, lowercase token filter, stop-word token filter and snowball token filter.
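The same chain can be reproduced with the built-in _analyze API; a sketch using the built-in html_strip character filter and lowercase/stop/snowball token filters (exact tokens depend on the stop list and stemmer):
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop", "snowball"],
  "text": "<h1>The quick brown fox jumped over the lazy dog</h1>"
}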
Character filters clean up text before tokenization, e.g. removing HTML tags or replacing `&` with `and`.
Tokenization splits text into words (understanding word boundaries) or individual terms. A simple version splits on white spaces and punctuation.
Token filters can change terms (e.g. lowercase them or remove plurals), remove terms (e.g. stopwords such as "a" and "the") or add terms (e.g. add the synonym "leap" to "jump").
- stemming
- casting to lowercase
- adding synonyms
- stopwords
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, e.g. conjugated or plural forms. Ways to find the stem:
- lookup tables
- suffix-stripping
- lemmatization
The `text` type allows partial matching and use of analyzers, while the `keyword` type allows only exact matches.
- partial match
  GET /${index name}/_search
  { "query": { "match": { "title": "Star Trek" } } }
Data structure that stands at the core of any search engine, similar to the physical index of a book.
2 main components:
- term dictionary - sorted list of terms that occur in a given field across a set of documents
- postings list - the list of documents that contain the given term
To allow such a relationship, an index should have the appropriate mappings. The following mapping sets `franchise` as parent and `film` as child.
{
"mappings": {
"properties": {
"film_to_franchise": {
"type": "join",
"relations": {"franchise": "film"}
}
}
}
}
Each document now has to set the corresponding properties. The examples below set a parent and a child.
{ "create" : { "_index" : "series", "_id" : "1", "routing" : 1} }
{ "id": "1", "film_to_franchise": {"name": "franchise"}, "title" : "Star Wars" }
{ "create" : { "_index" : "series", "_id" : "260", "routing" : 1} }
{ "id": "260", "film_to_franchise": {"name": "film", "parent": "1"}, "title" : "Star Wars: Episode IV - A New Hope", "year":"1977" , "genre":["Action", "Adventure", "Sci-Fi"] }
Example search query for retrieving all children of a specified parent:
{
"query": {
"has_parent": {
"parent_type": "franchise",
"query": {
"match": {
"title": "Star Wars"
}
}
}
}
}
Search query to find the parent of a specified child:
{
"query": {
"has_child": {
"type": "film",
"query": {
"match": {
"title": "the force awakenes"
}
}
}
}
}
TF/IDF (term frequency - inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Used in Elasticsearch prior to 5.0.
- the more often a term appears in a field (given document), the more relevant it is (term frequency)
- the more often a term appears in the inverted index (all documents), the less relevant it is (inverse document frequency)
- the longer the field, the less relevant each term is (field length norm)
BM25 is the default relevance algorithm since Elasticsearch 5.0.
All data lives in primary shards. Any particular document is located in exactly one of them. Routing is the process of determining which shard a document will reside in: shard = hash(routing) % number_of_primary_shards. All document APIs (get, index, delete, bulk, update) accept a routing parameter. A particular field can also be set as the routing parameter.
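A sketch of supplying a custom routing value on indexing (index name and routing value are hypothetical; reads must pass the same routing):
PUT /my_index/_doc/1?routing=user123
{ "user": "user123", "message": "stored on the shard derived from user123" }
GET /my_index/_doc/1?routing=user123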
Because the hash formula depends on it, the number of primary shards can not be changed dynamically. However, replica shards can be added later on.
Write requests are routed to the primary shard, then replicated. Read requests are routed to primary or replica shards.
In a multi-node environment the client sends a request to node1 (for example), which determines from the document's ID which shard the document belongs to. It then forwards the request to that primary shard (which may be located on another node), and the shard executes the request. On success the primary shard sends the same request to its replicas (which may be located on other nodes) and waits for their success status. In the end, success is reported to the client.
- sync (default) - the primary shard waits for successful responses from the replica shards before returning its response
- async - the primary shard returns success after its own execution; replica requests are still sent, but result will not be known
The ES server interacts with clients via REST APIs. Append `?pretty` to any request to get a nicely formatted response.
# VERB - HTTP method: GET, POST, PUT, etc
# PROTOCOL - http or https
# HOST - the hostname of any node in the cluster, or localhost for a node on
# the local machine
# PORT - default 9200, can be set to a different value
# QUERY_STRING - any optional query-string parameters, e.g. ?pretty will
# pretty-print the JSON
# BODY - a JSON-encoded request body (if needed)
$ curl -H "Content-Type: application/json" -X VERB 'PROTOCOL://HOST:PORT/PATH?QUERY_STRING' -d 'BODY'
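A concrete sketch, assuming a local node and an existing movies index:
$ curl -H "Content-Type: application/json" -X GET 'http://localhost:9200/movies/_search?pretty' -d '{ "query": { "match_all": {} } }'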
Filters ask a yes/no question of your data; they are fast and cacheable. Queries return data ranked by relevance. Filters can be combined inside queries and vice versa.
Wrapped in a "filter": {}
block.
Some filter types:
Description | Example |
---|---|
filter by exact values | {"term": {"year": 2014}} |
match if any values in a list match | {"terms": {"genre": ["Sci-Fi", "Adventure"]}} |
find numbers or dates in a given range | {"range": {"year": {"gte": 2010}}} |
find documents where field exists | {"exists": {"field": "tags"}} |
find documents where field is missing | {"missing": {"field": "tags"}} |
Combine filters with Boolean logic: must, must_not, should (which is the equivalent of OR).
{
"query": {
"bool": {
"must": {"match": {"genre": "Sci-Fi"}},
"must_not": {"match": {"title": "trek"}},
"filter": {"range": {"year": {"gte": 2015, "lt": 2020"}}}
}
}
}
Wrapped in a "query": {}
block.
Some types of queries:
Description | Example |
---|---|
returns all documents; the default, usually used with a filter | {"match_all": {}} |
searches analyzed results, e.g. full-text search; multiple terms are treated independently, either can hit | {"match": {"title": "star"}} |
runs the same query on multiple fields | {"multi_match": {"query": "star", "fields": ["title", "synopsis"]}} |
finds all terms in the right order (phrase matching) | {"match_phrase": {"title": "star wars"}} |
the slop value represents how far apart terms can be, in either direction; e.g. the query "quick fox" matches "quick brown fox" with a slop of 1 | {"match_phrase": {"title": {"query": "star wars", "slop": 1}}} |
A boolean query works similarly to a boolean filter, but results are scored by relevance.
Pagination requires the from (starting from 0) and size parameters.
{
"from": 2,
"size": 2,
"query": {...}
}
Deep pagination kills performance (the ES server still has to retrieve all preceding results to find the ones requested), so enforce an upper bound on how many results can be returned to users.
Omitting from defaults to 0; omitting size returns the default number of hits (10).
- URL
  /{index_name}/_search?sort={field_name}
- JSON data
  { "sort": "year" }
Sorting numerical fields is straightforward - just pass the name of the field to the sort parameter. However, a string field used for full-text search can't be used for sorting (because it is stored as individual terms). One way to get around this is to create a subfield that is not analyzed (keyword). A full copy of the text will be saved as a whole in this subfield. Since a mapping can not be changed, this is something to consider before importing data.
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
{
"sort": "title.raw"
}
Levenshtein edit distance (the examples below have an edit distance of 1):
- substitution of characters (e.g. interstellar -> intersteller)
- insertion of characters (e.g. interstellar -> insterstellar)
- deletion of characters (e.g. interstellar -> interstelar)
{
"query": {
"fuzzy": {
"title": {
"value": "intresteller", "fuzziness": 2
}
}
}
}
There is also an auto setting for fuzziness:
- 0 for 1-2 character strings
- 1 for 3-5 character strings
- 2 for the rest
Prefix queries are pretty efficient because of the inverted index structure.
{
"query": {
"prefix": {
"year": "201"
}
}
}
{
"query": {
"wildcard": {
"year": "1*"
}
}
}
It is possible to simulate query-time search-as-you-type using match_phrase_prefix and slop:
{
"query": {
"match_phrase_prefix": {
"title": {
"query": "star", "slop": 10
}
}
}
}
star:
- unigram: [s, t, a, r]
- bigram: [st, ta, ar]
- trigram: [sta, tar]
- 4-gram: [star]
Edge N-grams are built only from the beginning of each term.
First, create a custom analyzer:
PUT /{index_name}
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type: "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analizer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
To test the new analyzer, run the following:
GET /{index_name}/_analyze
{
"analyzer": "autocomplete",
"test": "Sta"
}
Second, set a mapping for an index (before putting data):
PUT /{index_name}/_mapping
{
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete"
}
}
}
Third, use n-grams only on the index side (the query is handled by the standard analyzer so that the search input isn't split into multiple n-gram terms, while it can still match the multiple terms produced by the n-gram analyzer on the index side):
GET /{index}/_search
{
"query": {
"match": {
"title": {
"query": "sta",
"analyzer": "standard"
}
}
}
}
Newer versions also offer a dedicated search_as_you_type data type.
The following request returns star rating values (0.0, 3.5, etc.) and the number of documents that contain each value. The block name inside `aggs` (it is `ratings` in the request below) is set by the user and can be anything.
GET /ratings/_search?size=0
{
"aggs": {
"ratings": {
"terms": {
"field": "rating"
}
}
}
}
It is also possible to narrow down the aggregation by combining the search with a `query` block.
GET /ratings/_search?size=0
{
"query": {
"match_phrase": {
"title": "Star Wars Episode IV"
}
},
"aggs": {
"avg_rating": {
"avg": {
"field": "rating"
}
}
}
}
Aggregations can also be nested. Aggregation doesn't work on text fields (since they are analyzed into individual terms).
GET /ratings/_search?size=0
{
"query": {
"match_phrase": {
"title": "Star Wars"
}
},
"aggs": {
"titles": {
"terms": {
"field": "title.raw"
},
"aggs": {
"avg_rating": {
"avg": {
"field": "rating"
}
}
}
}
}
}
GET /ratings/_search?size=0
{
"aggs": {
"whole_ratings": {
"histogram": {
"field": "rating",
"interval": 1.0
}
}
}
}
ES works well with aggregations based on time, automatically handles calendar rules, like leap years, and so on.
GET /logs/_search?size=0
{
"aggs": {
"timestamp": {
"date_histogram": {
"field": "@timestamp",
"interval": "hour"
}
}
}
}
- create an index
  PUT /${index name}
- get mapping of a given index
  GET /${index name}/_mapping
- set a mapping
  PUT /${index name}/_mapping
  { "properties": { "id": {"type": "integer"}, "year": {"type": "date"}, "genre": {"type": "keyword"}, "title": {"type": "text", "analyzer": "english"} } }
- bulk upload
  PUT /_bulk
  { "create" : { "_index" : "movies", "_id" : "135569" } }
  { "id": "135569", "title" : "Star Trek Beyond", "year": 2016, "genre": ["Action", "Adventure", "Sci-Fi"] }
  { "create" : { "_index" : "movies", "_id" : "122886" } }
  { "id": "122886", "title" : "Star Wars: Episode VII - The Force Awakens", "year": 2015, "genre": ["Action", "Adventure", "Fantasy", "Sci-Fi", "IMAX"] }
  { "create" : { "_index" : "movies", "_id" : "109487" } }
  { "id": "109487", "title" : "Interstellar", "year": 2014, "genre": ["Sci-Fi", "IMAX"] }
  { "create" : { "_index" : "movies", "_id" : "58559" } }
  { "id": "58559", "title" : "Dark Knight, The", "year": 2008, "genre": ["Action", "Crime", "Drama", "IMAX"] }
  { "create" : { "_index" : "movies", "_id" : "1924" } }
  { "id": "1924", "title" : "Plan 9 from Outer Space", "year": 1959, "genre": ["Horror", "Sci-Fi"] }
- list indices (not a JSON response)
  GET /_cat/indices
- delete an index
  DELETE /${index name}
- retrieve a document
  GET /${index name}/_doc/${document id}
- upload a document
  POST /${index name}/_doc/${document id}
  POST /megacorp/_doc/1
  { "first_name": "John", "last_name": "Smith", "age": "25", "about": "I like to go rock climbing", "interests": ["sports", "music"] }
The index operation automatically creates an index if one did not exist before, and also creates a dynamic type mapping for the specific type if one has not yet been created. Good for the development stage (probably not for production). The ID is autogenerated if not specified. Related settings in conf/elasticsearch.yml:
# disable dynamic creation of mapping for unmapped types
index.mapper.dynamic: false
# disable index autocreation
action.auto_create_index: false
Repeating the action actually updates the document (read below).
- update a document
  An ES document is immutable, thus every document has a _version field. When updating an existing document, a new document is created with an incremented _version field; the old document is marked for deletion. Attempting to index a whole document actually updates the existing document, while the example below is used for partial updates (one or many fields). If a partial update doesn't actually change any field values, the version isn't incremented, while posting the whole document always increments the version.
  POST /${index name}/_doc/${document id}/_update
  POST /megacorp/_doc/1/_update
  { "doc": { "title": "Interstellar" } }
- delete a document
  DELETE /${index name}/_doc/${document id}
- search
  GET /${index name}/_search
  If no body is specified with the request, all documents of the index are returned.
  - simple example
    { "query": { "match": { "last_name": "Smith" } } }
  - more complicated example
    { "query": { "bool": { "must": { "match_phrase": { "title": "star wars" } }, "filter": { "range": { "year": { "gte": 1980 } } } } } }
  - full text
    - match one of or both words in any order
      { "query": { "match": { "about": "rock climbing" } } }
    - match the phrase as a whole
      { "query": { "match_phrase": { "about": "rock climbing" } } }
  - only show specified fields (or exclude specified)
    { "_source": [ "field1", "field2", "field3" ] }
    { "_source": { "excludes": [ "field1", "field2", "field3" ] } }
  - show info on the total number of items (will be on the same level as _source); to suppress showing actual items also pass "size": "0"
    { "track_total_hits": "true" }
- health
  GET /_cluster/health
Status is one of three:
- green - all primary and replica shards are active
- yellow - all primary shards are active, but not all replica shards are active
- red - not all primary shards are active
- add an index; the request below sets 3 primary shards, each with 1 replica, for a total of 3 replica shards. Replica shards are not assigned to the same node as their primary shard.
  PUT /blogs
  { "settings": { "number_of_shards": 3, "number_of_replicas": 1 } }
To add a node to an existing cluster, set the cluster.name setting to the same value.
- change settings
  PUT /blogs/_settings
  { "number_of_replicas": 2 }
- check allocation details
  GET /_cluster/allocation/explain
- check shard allocation details
  GET /_cat/shards?v
A way to run a query without a body, embedding the information inside the URL. However, all logical operators have to be URL-encoded for use in a browser. Good for quick experimenting (curl tests). Formally called URI Search.
E.g. to show movies in the movies index that have star in the title:
/movies/_search?q=title:star
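A sketch via curl, with a second query showing URL-encoded logical operators (+year:>2010 +title:trek); a local node is assumed:
$ curl 'http://localhost:9200/movies/_search?q=title:star&pretty'
$ curl 'http://localhost:9200/movies/_search?q=%2Byear%3A%3E2010%20%2Btitle%3Atrek&pretty'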
Send regular SQL queries to `/_xpack/sql` with the following body (by default returns JSON; add the format parameter to switch to a regular SQL interface response - `/_xpack/sql?format=txt`):
{
"query": "SQL query"
}
SQL queries are first translated to the DSL (ES native query language). To see the translated query, send the request to `/_xpack/sql/translate`.
ES also offers a SQL CLI at `bin/elasticsearch-sql-cli` (under `/usr/share/elasticsearch` in the docker container).
There are some limitations; for example, the list data type can not be handled by SQL.
- read data from distributed system
- convert to JSON bulk format
- submit via REST API
- Python elasticsearch package
- Elasticsearch-ruby
- Elasticsearch.pm for perl
- so on
- parses, transforms and filters data as it passes through
- can derive structure from unstructured data
- can anonymize personal data or exclude it entirely
- can do geo-location lookups
- can scale across many nodes
- absorbs throughput from load spikes
Snapshots can be stored on NAS, S3, HDFS or Azure. Snapshots are incremental - only changes since the last snapshot are stored.
Local index backup:
- add the path.repo setting in elasticsearch.yml
- set up the backup location
  PUT /_snapshot/backup_repo_name
  { "type": "fs", "settings": { "location": "/path.repo/location" } }
- check info of the backup setup
  GET /_snapshot
- start the backup process (all open indices)
  PUT /_snapshot/backup_repo_name/snapshot_name
- check info or status of a snapshot
  GET /_snapshot/backup_repo_name/snapshot_name
  GET /_snapshot/backup_repo_name/snapshot_name/_status
Check available snapshots:
GET /_cat/snapshots/backup_repo_name
To restore data from a backup:
- close all indices
  POST /_all/_close
- restore from the snapshot
  POST /_snapshot/backup_repo_name/snapshot_name/_restore
SLM (Snapshot Lifecycle Management) policy:
- schedule - frequency and time to capture the snapshot data
- name - adds a prefix to each snapshot
- repository - location to store
- config.indices - which indices to backup
- retention - deletes snapshots based on min_count, max_count and expire_after
Policy creation example:
PUT /_slm/policy/name_of_policy
{
"schedule": "0 03 3 * * ?",
"name": "<backup-{now/d}>",
"repository": "repo_name",
"config": {
"indicies": ["*"]
},
"retention": {
"expire_after": "60d"
}
}
Check policy contents (also number of snapshots, next run and so on):
GET /_slm/policy/name_of_policy?human
To execute the policy immediately:
POST /_slm/policy/name_of_policy/_execute
- install the plugin, then restart the service
  $ elasticsearch-plugin install repository-s3
- add user credentials to the ES keystore
  $ elasticsearch-keystore add s3.client.default.access_key
  $ elasticsearch-keystore add s3.client.default.secret_key
- reload security settings
  POST /_nodes/reload_secure_settings
- register the bucket as a snapshot repository
  PUT /_snapshot/backup_repo_name
  { "type": "s3", "settings": { "bucket": "bucket_name" } }
Transforms aggregate data by specific entities and create new summarized secondary indices.
Define:
- identifier (needs to comply with URL rules)
- source (index)
- pivot (transformation mechanism)
  - group_by - aggregate based on terms, histograms, date histograms
  - aggregations - calculate the average, min, max, sum, value count, etc
- destination (index)
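A sketch of a transform definition covering these parts (all names are hypothetical; the endpoint is _transform in recent versions, _data_frame/transforms in earlier 7.x releases):
PUT /_transform/avg_rating_per_movie
{
  "source": { "index": "ratings" },
  "pivot": {
    "group_by": { "movie_id": { "terms": { "field": "movie_id" } } },
    "aggregations": { "avg_rating": { "avg": { "field": "rating" } } }
  },
  "dest": { "index": "ratings_summary" }
}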
For a fast and smooth restart (in case of OS or ES updates), disable shard allocation.
Rolling restart procedure:
- stop indexing new data if possible
- disable shard allocation
- shut down one node
- perform maintenance on a node, restart and confirm it joins the cluster
- re-enable shard allocation
- wait for cluster green status
- repeat steps 2-6 for all nodes
- resume indexing new data
Disable shard allocation:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "none"
}
}
Enable shard allocation:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
The number of primary shards can not be changed later. The `number_of_replicas` setting applies per primary shard, thus the following configuration results in 6 shards:
PUT /new_index
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
Configuration files are located in the ES_HOME/config directory:
- elasticsearch.yml - configure various Elasticsearch modules
- logging.yml - configure Elasticsearch logging
Configuration settings can also be set via JAVA_OPTS (options with the -D prefix) or as getopt-style parameters to the elasticsearch command (these take higher priority than the configuration files mentioned above):
$ elasticsearch -Des.network.host=10.0.0.4 --node.name=my-node
Option name | Description |
---|---|
cluster.name | default is elasticsearch; used to identify a cluster, e.g. when adding new nodes |
node.name | can be skipped (will be autogenerated), but better to set |
node.master, node.data | default is true for both - eligible to be master or eligible to store data; good idea to separate responsibilities and let one node be the coordinator (master) and others store data; set neither if the intended use is a "search load balancer" (fetching results from nodes, aggregating results, etc) |
Dynamic settings
- transient - changes are in effect until the cluster restarts; after restart these settings are erased
- persistent - change the static configuration and survive cluster restart
PUT /_cluster/settings
{
"persistent": {
"discovery.zen.minimum_master_nodes": 2
},
"transient": {
"indices.store.throttle.max_bytes_per_sec": "50mb"
}
}
Setting a value to `null` resets the setting to its default value.
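E.g. to reset the transient setting from the example above back to its default:
PUT /_cluster/settings
{
  "transient": { "indices.store.throttle.max_bytes_per_sec": null }
}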
- fast disk (preferably SSD with the deadline or noop i/o scheduler)
- RAID0, since ES is already redundant
- money can be saved by choosing a less powerful CPU, as it is rarely the bottleneck
- not recommended to use NAS (Network Attached Storage)
- RAM: ideally 64 GB (32 for ES, 32 for the OS and disk cache); less than 8 GB is not recommended
By default the ES heap size is set to 1 GB. Set the ES_HEAP_SIZE environment variable to the desired size, e.g. 10g.
Index-level settings can be set per index and are either static (set at index creation time or on a closed index) or dynamic. They can be updated with a PUT /{index_name}/_settings request.
Option name | Description |
---|---|
index.number_of_shards | default is 5 (static); number of shards (splits) of an index |
index.number_of_replicas | default is 1 (dynamic); number of copies of an index; on a single node the cluster status will always be yellow, as there is no other node to assign replicas to |
index.refresh_interval | default is 1s (dynamic); how often to perform refresh operations (making recent changes to the index visible to search); -1 is to disable refresh; increasing refresh intervals increases ES overall throughput |
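A sketch of updating the dynamic refresh setting on an existing index (index name is hypothetical):
PUT /my_index/_settings
{ "index.refresh_interval": "30s" }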
Multiple indices can be used to aggregate huge amounts of data. Index alias rotation can be used to refer to different indices at different points in time. An alias can contain more than one index. The following example rotates log indices as more are added.
POST /_aliases
{
"actions": [
{ "add": { "alias": "logs_current", "index": "logs_2017_06" }},
{ "remove": { "alias": "logs_current", "index": "logs_2017_05" }},
{ "add": { "alias": "logs_last_3_months", "index": "logs_2017_06" }},
{ "remove": { "alias": "logs_last_3_months", "index": "logs_2017_03" }},
]
}
ES 7 has a notion of index lifecycle management, which identifies 4 stages of index lifecycle, and can apply separate policies at each stage:
- hot - index is actively updated and queried
- warm - index is not updated anymore, but actively queried
- cold - index is neither queried nor updated
- delete - can be deleted
First a policy needs to be created, then applied to an index (via an index template):
PUT _ilm/policy/datastream_policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50GB",
"max_age": "30d"
}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
PUT _template/datastream_template
{
"index_pattern": ["datastream-*"],
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1,
"index.lifecycle.name": "datastream_policy",
"index.lifecycle.rollover_alias": "datastream"
}
}
Each shard has a transaction log (write-ahead log) associated with it. Requests are buffered to that log based on the configurations below and then flushed (committed) to disk whenever any of the conditions is met.
Option name | Description |
---|---|
index.translog.flush_threshold_ops | default is unlimited; after how many operations to perform a flush |
index.translog.flush_threshold_size | default is 200mb; once the translog hits this size, perform a flush |
index.translog.flush_threshold_period | default is 30m; a period with no flush after which to force a flush |
index.translog.interval | default is 5s; how often to check if a flush is needed, randomized between the interval value and 2x that value |
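A sketch of tuning one of these settings per index (index name and value are hypothetical):
PUT /my_index/_settings
{ "index.translog.flush_threshold_size": "512mb" }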
Could be used when a node needs maintenance - set allocation to `none`, turn off the node, make changes, start the node, set allocation back to the default, `all`.
Option name | Description |
---|---|
index.routing.allocation.enable | default is all; other options are primaries, new_primaries, none; enables shard allocation for a specific index |
Slow log configuration sets time limits that determine whether a request should be logged as a slow request. These logs are placed into a dedicated log file in the /var/log/elasticsearch directory. All settings are dynamic.
Option name | Description |
---|---|
index.search.slowlog.level | one of warn, info, debug or trace |
index.search.slowlog.threshold.query.warn | default is 10s |
index.search.slowlog.threshold.query.info | default is 5s |
index.search.slowlog.threshold.query.debug | default is 2s |
index.search.slowlog.threshold.query.trace | default is 500ms |
index.search.slowlog.threshold.fetch.warn | default is 1s |
index.search.slowlog.threshold.fetch.info | default is 800ms |
index.search.slowlog.threshold.fetch.debug | default is 500ms |
index.search.slowlog.threshold.fetch.trace | default is 200ms |
Option name | Description |
---|---|
path.conf | path to the directory that contains the elasticsearch.yml and logging.yml files |
path.data | path to the directory where index data for this node is stored; can optionally specify more than one location (separated by commas), causing data to be striped across the locations, favouring those with the most free space |
path.work | path to temporary files |
path.logs | path to log files |
path.plugins | path to where plugins are stored |
By default, Elasticsearch is bound to the 0.0.0.0 address and listens on ports 9200-9300 for HTTP traffic and on 9300-9400 for node-to-node communication. A range is specified because, if the default port is already in use, Elasticsearch increments by one and tries the next port until a free one is found.
Elasticsearch can make use of two host addresses - `bind_host` is used for communicating with the outside world, while `publish_host` is used by other nodes for internal communication. E.g. a cloud setup can make use of two network interfaces for internal and external communications.
Option name | Description |
---|---|
network.bind_host | 192.168.0.1 |
network.publish_host | 192.168.0.1 |
network.host | sets both settings above |
transport.tcp.port | bind port, defaults to 9300-9400 |
http.port | defaults to 9200 |
http.enabled | false to disable HTTP completely |
[TBD]
To add nodes to a cluster, just set the same cluster name. By default, multicast discovery is used. Unicast settings allow explicit control over which nodes will be used to discover the cluster.
Option name | Description |
---|---|
discovery.zen.minimum_master_nodes | default is 1; how many master-eligible nodes a new node must discover to consider the cluster operational; set higher for bigger clusters: (number of master-eligible nodes / 2) + 1 |
discovery.zen.ping.multicast.enabled | default is true; turn off when setting up unicast |
discovery.zen.ping.unicast.hosts | list of hosts that can be used to discover the cluster, e.g. ["host1", "host2:port"] |
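A minimal elasticsearch.yml sketch for unicast discovery using the settings above (host names are hypothetical):
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-node1", "es-node2:9300"]
discovery.zen.minimum_master_nodes: 2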
Quorum means majority.
Each document holds `_seq_no` (sequence number) and `_primary_term` (identifying the primary shard that owns that sequence) fields instead of a single `_version` field. On an update request, a client can specify the sequence number and primary term it intends to update. The sequence number is then incremented and any other requests carrying the old value are denied. Use `retry_on_conflict=N` to automatically retry.
- manually set fields
  PUT /${index name}/_doc/${document id}?if_seq_no=${sequence number}&if_primary_term=${primary term}
- set automatic retry
  POST /${index name}/_doc/${document id}/_update?retry_on_conflict=5
- match phrase with target value "star trek beyond", query "beyond star" and slop of 3 - why does it make a hit?