Elasticsearch - PSJoshi/Notes GitHub Wiki
General structure of the Elasticsearch API:
http://localhost:9200/index/type/document
An index is the parent structure and can be thought of as a database that houses many types. An index can represent any concept, but it often represents a whole system of components, such as a shop or a bookstore. Types are contained in an index and are similar to database tables, with each type representing a collection of similar objects (like shirt or book). A document is a single instance of an object of the parent type. For example, the book “My Experiments with Truth” may exist as a document of type 'book' in the index named 'bookstore'. Note that mapping types were deprecated in Elasticsearch 6.x and removed in later releases, so this three-level URL applies to older versions.
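A minimal sketch of how the index/type/document path is assembled. The helper name `document_url` and its `base` parameter are made up for illustration; only the URL layout comes from the notes above.

```python
from urllib.parse import quote


def document_url(index, doc_type=None, doc_id=None, base="http://localhost:9200"):
    """Build an index/type/document URL, leaving out trailing parts that are not given."""
    parts = [index]
    if doc_type is not None:
        parts.append(doc_type)
        if doc_id is not None:
            parts.append(str(doc_id))
    # Percent-encode each path segment so unusual names cannot break the URL.
    return base + "/" + "/".join(quote(part, safe="") for part in parts)


print(document_url("bookstore"))             # the whole index
print(document_url("bookstore", "book"))     # one type inside the index
print(document_url("bookstore", "book", 1))  # a single document
```

The same three levels are what the DELETE commands in these notes address.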
DELETE INDEX:
$ curl -XDELETE 'http://localhost:9200/bookstore'
DELETE TYPE (older releases only; deleting a type was removed in Elasticsearch 2.0):
$ curl -XDELETE 'http://localhost:9200/bookstore/book'
DELETE SINGLE DOCUMENT:
$ curl -XDELETE 'http://localhost:9200/bookstore/book/1'
Get statistics for an index, for example threats_db
$ curl -XGET 'http://localhost:9200/threats_db/_stats?pretty'
Node stats
$ curl -XGET 'http://localhost:9200/_nodes/esnode0/stats?pretty'
$ curl -XGET 'http://localhost:9200/_cat/nodes?v&h=n'
Cluster health
$ curl -XGET http://localhost:9200/_cluster/health?format=json
Indices health
$ curl -XGET 'http://localhost:9200/_cluster/health?level=indices&format=json'
Shards health
$ curl -XGET 'http://localhost:9200/_cluster/health?level=shards&format=json'
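The health endpoints return JSON with a `status` field of green, yellow, or red. A small sketch of reading that response follows; the `sample` payload is a made-up, trimmed illustration rather than real cluster output, and `summarize_health` is a hypothetical helper.

```python
import json

# What each status value means for shard allocation.
MEANINGS = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primaries are allocated, but some replicas are not",
    "red": "at least one primary shard is not allocated",
}


def summarize_health(payload):
    """Turn a _cluster/health JSON string into a one-line summary."""
    health = json.loads(payload)
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    meaning = MEANINGS.get(status, "unknown status")
    return f"{status}: {meaning} ({unassigned} unassigned shards)"


# Made-up sample, trimmed to the fields used here.
sample = '{"cluster_name": "elk-prod", "status": "yellow", "unassigned_shards": 2}'
print(summarize_health(sample))
```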
Disk allocation
$ curl -XGET http://localhost:9200/_cat/allocation?format=json
List all indices
$ curl -XGET "http://localhost:9200/_cat/indices?v"
Count the number of documents in an index (vulnerabilities, type ssllabs)
$ curl -XGET "http://localhost:9200/vulnerabilities/ssllabs/_count"
Get mapping
$ curl -XGET http://localhost:9200/http-2017.12.31/_mapping
Create index using curl
$ curl -H 'Content-Type: application/json' -XPUT "http://localhost:9200/vulnerabilities" -d \
'{
  "settings": {
    "index": {
      "refresh_interval": "15s",
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}'
Pass JSON mappings from a file to an Elasticsearch index
$ curl -H "Content-Type: application/json" -XPUT --data @elastic-mappings.json http://localhost:9200/vulnerabilities
Return all the records
$ curl -XGET 'http://localhost:9200/nodes_stats/_search?pretty=true&q=*:*'
Information about nodes in cluster
$ curl -XGET http://localhost:9200/_nodes
Elasticsearch node statistics
$ curl -XGET http://localhost:9200/_nodes/stats?pretty
- Naming Elasticsearch nodes - By default, older Elasticsearch releases give nodes autogenerated names drawn from a list of Marvel characters. Set an explicit name instead.
# elasticsearch.yml
node.name: "data-node-1"
- Elasticsearch automatically discovers other nodes on the network and joins them to your cluster. Relying on this is not recommended: you could easily end up with important data on random test nodes that other team members or developers have started. Instead, configure the cluster to use only explicitly listed nodes.
# elasticsearch.yml
cluster.name: elk-prod
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["data-node-1","data-node-2"]
- Elasticsearch nodes vote among themselves to elect a master node. Make sure that minimum_master_nodes is greater than half the number of master-eligible nodes, so you never get a split brain (where multiple nodes think they are the master).
If you have two data nodes, set minimum_master_nodes as follows:
# elasticsearch.yml
discovery.zen.minimum_master_nodes: 2
With this setting, if even a single node goes down, no master can be elected and the cluster won't accept writes. This is why you should choose the number of master-eligible nodes carefully: with three such nodes and a quorum of two, the cluster keeps working when one node fails.
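The quorum rule can be checked with a few lines of arithmetic. This is only a sketch of the "more than half" rule; the function names are invented for illustration.

```python
def minimum_master_nodes(master_eligible):
    """Smallest node count that is strictly greater than half of the total."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while a quorum remains."""
    return master_eligible - minimum_master_nodes(master_eligible)


for nodes in (2, 3, 5):
    print(nodes, "nodes -> quorum", minimum_master_nodes(nodes),
          "-> tolerates", tolerated_failures(nodes), "failure(s)")
```

Two nodes give a quorum of 2 and tolerate no failures, which is the situation described above; three nodes keep the same quorum but survive the loss of one node.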
- Be Generous with Memory
If you are dealing with large datasets, give Elasticsearch as much memory as you can. The right amount depends on your workload, so measure memory usage on the cluster.
Tuning GC is a very important part of maximising performance, so you will definitely want GC logging enabled.
#
export ES_USE_GC_LOGGING=yes
The other important memory setting is ES_HEAP_SIZE. Set it to about half the available memory, but keep it below roughly 32GB, the point at which the JVM stops using compressed object pointers.
# Our data nodes have 96GB RAM, so half of that would exceed the max
# recommended heap size for a single node; use the cap instead.
export ES_HEAP_SIZE=31G
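A rough sketch of the "half the RAM, with a cap" rule of thumb. The 31 GB cap mirrors the export line above; the function names are hypothetical, and the result is only a starting point to adjust after measuring.

```python
def recommended_heap_gb(total_ram_gb, cap_gb=31):
    """Half of the RAM in whole GB, capped at cap_gb."""
    if total_ram_gb < 2:
        raise ValueError("need at least 2 GB of RAM for this rule of thumb")
    return min(total_ram_gb // 2, cap_gb)


def heap_export_line(total_ram_gb):
    """Format the result as the shell export used in these notes."""
    return f"export ES_HEAP_SIZE={recommended_heap_gb(total_ram_gb)}G"


print(heap_export_line(96))  # half would be 48G, so the cap applies
print(heap_export_line(16))  # half is 8G, below the cap
```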
- Choose the Right Number of Shards
If you use too few shards, you are not making full use of the cluster. If you use too many, there is a performance overhead (benchmark link - http://blog.trifork.com/2014/01/07/elasticsearch-how-many-shards/).
To make full use of the cluster, you should have:
number of shards * (number of replicas + 1) >= number of data nodes
# elasticsearch.yml
# The default number of shards is 5.
index.number_of_shards: 2
index.number_of_replicas: 1
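The shard formula above is easy to check for a planned layout. A minimal sketch, with invented function names, using the values from the configuration snippet:

```python
def total_shard_copies(shards, replicas):
    """Primaries plus all of their replica copies."""
    return shards * (replicas + 1)


def uses_all_data_nodes(shards, replicas, data_nodes):
    """True when there are at least as many shard copies as data nodes."""
    return total_shard_copies(shards, replicas) >= data_nodes


# 2 shards with 1 replica each give 4 shard copies.
print(total_shard_copies(2, 1))
print(uses_all_data_nodes(2, 1, data_nodes=4))  # every node can hold a copy
print(uses_all_data_nodes(2, 1, data_nodes=5))  # one node would sit idle
```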
- Rolling upgrades
Cluster restarts are fairly common, so you will want a rolling restart to minimize downtime. The correct way to restart a node is to disable rebalancing, restart one node, enable rebalancing, then repeat on the other nodes. More details - https://www.elastic.co/guide/en/elasticsearch/guide/current/_rolling_restarts.html
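The restart order can be sketched as a plan of steps. This assumes that "disable rebalancing" is done through the `cluster.routing.allocation.enable` cluster setting; check the linked guide for your version. The `RESTART` step is a placeholder for however you restart a node, and nothing here contacts a cluster.

```python
import json

SETTINGS_PATH = "/_cluster/settings"


def allocation_body(enabled):
    """JSON body that switches shard allocation on or off (transient setting)."""
    value = "all" if enabled else "none"
    return json.dumps({"transient": {"cluster.routing.allocation.enable": value}})


def rolling_restart_plan(nodes):
    """List the steps for restarting the given nodes one at a time."""
    plan = []
    for node in nodes:
        plan.append(("PUT", SETTINGS_PATH, allocation_body(False)))
        plan.append(("RESTART", node, None))  # placeholder: wait for the node to rejoin
        plan.append(("PUT", SETTINGS_PATH, allocation_body(True)))
    return plan


for step in rolling_restart_plan(["data-node-1", "data-node-2"]):
    print(step)
```

Each PUT step would be sent to the cluster with curl or an HTTP client, and in practice you would wait for the cluster health to recover before moving to the next node.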
Elasticsearch index status is yellow
This happens on a single-node cluster (usually a development machine) when the number of replicas is set to 1 or more.
If Elasticsearch has replicas that it cannot allocate to a node, the status is shown as yellow.
To change the status, update the following setting on the index:
number_of_replicas : 0
and the status will change to green.
Ref - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html
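A sketch of the settings update described above, built with Python's standard library. The helper name and the default `base` URL are assumptions for illustration, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def zero_replicas_request(index, base="http://localhost:9200"):
    """Build a PUT /<index>/_settings request that sets number_of_replicas to 0."""
    body = json.dumps({"index": {"number_of_replicas": 0}}).encode("utf-8")
    return Request(
        f"{base}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = zero_replicas_request("vulnerabilities")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it against a running cluster: urllib.request.urlopen(req)
```

Dropping replicas is reasonable on a single development node; on a multi-node cluster, replicas provide redundancy and are usually worth keeping.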
Good links:
- http://www.wilfred.me.uk/blog/2015/01/31/taming-a-wild-elasticsearch-cluster/
- http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext
- https://www.loggly.com/blog/nine-tips-configuring-elasticsearch-for-high-performance/
- http://bluesock.org/~willkg/blog/dev/elasticsearch_part1_index.html
- Elasticsearch cheatsheet - http://elasticsearch-cheatsheet.jolicode.com/
- Awesome elasticsearch - https://github.com/dzharii/awesome-elasticsearch
Elasticsearch monitoring
- https://www.opsview.com/resources/elasticsearch/blog/elasticsearch-monitoring-tools
- https://medium.com/@dionnis/elasticsearch-monitoring-and-maintenance-tools-research-18c5fb45a747
- https://stackify.com/monitoring-elasticsearch-getting-right/
- Elasticsearch performance monitoring - https://www.datadoghq.com/blog/monitor-elasticsearch-performance-metrics
- Extract Elasticsearch metrics and send them to Grafana - https://github.com/trevorndodds/elasticsearch-metrics
- OpenStack monitoring - https://logz.io/blog/openstack-monitoring/
- Monitor Elasticsearch using Graphite/Grafana - https://logz.io/blog/monitor-elasticsearch-graphite-grafana/
- Send Elasticsearch statistics to Graphite - https://github.com/mattweber/es2graphite
- Heartbeat monitoring using ELK - https://dzone.com/articles/monitor-service-uptime-with-heartbeat-and-the-elk
- Elasticsearch as a time series data store - https://www.elastic.co/blog/elasticsearch-as-a-time-series-data-store
Elasticsearch tuning
- Choosing a size for elastic nodes - https://qbox.io/support/article/choosing-a-size-for-nodes
- 10 elastic metrics to watch for - https://www.oreilly.com/ideas/10-elasticsearch-metrics-to-watch
- Configuring elastic for high performance - https://www.loggly.com/blog/nine-tips-configuring-elasticsearch-for-high-performance/
- Elasticsearch configuration and performance tuning - http://ozzimpact.github.io/development/elasticsearch-configuration-tuning
- Get the most out of your Elasticsearch logs - https://logmatic.io/blog/get-the-most-of-your-elasticsearch-logs/
Usage examples of Elasticsearch
- Verizon’s logging platform (built with the Elastic Stack) collects and processes over 4 TB of logs per day - https://www.elastic.co/use-cases/verizon-wireless
- GitHub uses Elasticsearch to query 130 billion lines of code - https://www.elastic.co/guide/en/elasticsearch/guide/2.x/getting-started.html
- Wikipedia uses Elasticsearch to provide full-text search with highlighted search snippets, and search-as-you-type and did-you-mean suggestions.
- The Guardian uses Elasticsearch to combine visitor logs with social-network data to provide real-time feedback to its editors about the public’s response to new articles.
- Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.