Elasticsearch - PSJoshi/Notes GitHub Wiki

General structure of elasticsearch API:

http://localhost:9200/index/type/document

An index is the parent structure and can be thought of as a database that houses many types. An index can represent any concept, but often represents a whole system of components such as a shop or a bookstore. Types are contained in an index and are similar to database tables, with each type representing a collection of similar objects (like shirts or books). A document is a single instance of an object of the parent type. For example, the book “My Experiments with Truth” may exist as a document of the 'book' type in the index named 'bookstore'.
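As a sketch of the URL structure above, the example book could be indexed like this (the document id and the payload fields are assumptions for illustration):

```shell
# Index an example document with id 1 into the 'bookstore' index, type 'book'.
# The payload fields are illustrative, not a required schema.
payload='{"title": "My Experiments with Truth", "author": "M. K. Gandhi"}'
curl -H 'Content-Type: application/json' \
     -XPUT 'http://localhost:9200/bookstore/book/1' -d "$payload" || true  # ignore if no cluster is running
```

Fetching it back is the mirror image: `curl -XGET 'http://localhost:9200/bookstore/book/1'`.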

DELETE INDEX:

$ curl -XDELETE 'http://localhost:9200/bookstore'

DELETE TYPE:

$ curl -XDELETE 'http://localhost:9200/bookstore/book'

DELETE Single document:

$ curl -XDELETE 'http://localhost:9200/bookstore/book/1'

Get statistics of an elastic index - say, threats_db

$ curl -XGET 'http://localhost:9200/threats_db/_stats?pretty'

Node stats

$ curl -XGET 'http://localhost:9200/_nodes/esnode0/stats?pretty'
$ curl -XGET 'http://localhost:9200/_cat/nodes?v&h=n'

Cluster health

$ curl -XGET http://localhost:9200/_cluster/health?format=json

Indices health

$ curl -XGET 'http://localhost:9200/_cluster/health?level=indices&format=json'

Shards health

$ curl -XGET 'http://localhost:9200/_cluster/health?level=shards&format=json'

Disk allocation

$ curl -XGET http://localhost:9200/_cat/allocation?format=json

List all indices

$ curl -XGET "http://localhost:9200/_cat/indices?v"

Count the number of documents in an index (vulnerabilities)

$ curl -XGET "http://localhost:9200/vulnerabilities/ssllabs/_count"

Get mapping

$ curl -XGET http://localhost:9200/http-2017.12.31/_mapping

Create index using curl

$ curl -H 'Content-Type: application/json' -XPUT "http://localhost:9200/vulnerabilities" -d \
'{
	"settings": {
		"index": {
			"refresh_interval": "15s",
			"number_of_shards": 2,
			"number_of_replicas": 1
		}
	}
}'

Pass JSON mappings to an elastic index

$ curl -H "Content-Type: application/json" -XPUT --data @elastic-mappings.json http://localhost:9200/vulnerabilities
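The mappings file itself is ordinary JSON. A minimal sketch of what elastic-mappings.json might contain on a pre-7.x cluster (the field names and types here are assumptions, not the actual file):

```shell
# Write an illustrative mappings file: one 'ssllabs' type with two example fields.
cat > elastic-mappings.json <<'EOF'
{
  "mappings": {
    "ssllabs": {
      "properties": {
        "host":  { "type": "keyword" },
        "grade": { "type": "keyword" }
      }
    }
  }
}
EOF
# Create the index with those mappings (ignore the error if no cluster is up).
curl -H 'Content-Type: application/json' -XPUT \
     --data @elastic-mappings.json 'http://localhost:9200/vulnerabilities' || true
```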

Return all the records

$ curl -XGET 'http://localhost:9200/nodes_stats/_search?pretty=true' -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'

Information about nodes in cluster

$ curl -XGET http://localhost:9200/_nodes

Elasticsearch node statistics

$ curl -XGET http://localhost:9200/_nodes/stats?pretty

  • Naming in Elasticsearch nodes - By default, Elasticsearch nodes get autogenerated names drawn from a corpus of Marvel comic characters.
# elasticsearch.yml
node.name: "data-node-1"
  • Elasticsearch automatically discovers other nodes on the network and joins them to your cluster. This is often not recommended: it is highly likely that you could end up with important data on random test nodes that other team members or developers have started up. Instead, configure the cluster to use only explicitly listed nodes.
# elasticsearch.yml
cluster.name: elk-prod

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["data-node-1","data-node-2"]
  • Elasticsearch nodes vote among themselves to elect a master node. Make sure that minimum_master_nodes is greater than half the number of master-eligible nodes (a quorum), so you never get a split brain (wherein multiple nodes think they are the master).

If you have two data nodes, set minimum_master_nodes as follows:

# elasticsearch.yml
discovery.zen.minimum_master_nodes: 2

Now, if even a single node is down, master election cannot take place and the cluster will not accept writes. This is why you should choose the number of master-eligible nodes accordingly.
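The quorum arithmetic behind this rule can be sketched in shell (the node counts here are just examples):

```shell
# minimum_master_nodes should be (master-eligible nodes / 2) + 1.
for nodes in 2 3 5; do
  quorum=$(( nodes / 2 + 1 ))
  echo "nodes=$nodes -> minimum_master_nodes=$quorum"
done
```

With three master-eligible nodes the quorum is 2, so one node can fail and the remaining two can still elect a master; with two nodes the quorum is also 2, which is why a single failure stops writes.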

  • Be Generous with Memory

If you are dealing with large datasets, give Elasticsearch as much memory as you can. The correct amount varies with your workload, so measure memory usage on the cluster rather than guessing.

Tuning GC is a very important part of maximising performance, so you will definitely want GC logging enabled.

# enable GC logging at startup
export ES_USE_GC_LOGGING=yes

The other most important memory setting is ES_HEAP_SIZE. Set it to half the available memory, but make sure you don't set it above ~30GB.

# If your data nodes have 96GB RAM, cap the heap at the max
# recommended size for a single node.
export ES_HEAP_SIZE=31G
  • Choose the Right Number of Shards

If you use too few shards, you are not making full use of the cluster; if you use too many, there are performance overheads (benchmark link - http://blog.trifork.com/2014/01/07/elasticsearch-how-many-shards/).

To make full use of the cluster, you should have:

number of shards * (number of replicas + 1) >= number of data nodes

# elasticsearch.yml
# The default number of shards is 5.
index.number_of_shards: 2
index.number_of_replicas: 1
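As a quick sanity check of the formula with the settings above (the data-node count is an assumption for the example):

```shell
# 2 shards with 1 replica gives 2 * (1 + 1) = 4 shard copies in total.
shards=2
replicas=1
data_nodes=3
copies=$(( shards * (replicas + 1) ))
if [ "$copies" -ge "$data_nodes" ]; then
  echo "OK: $copies shard copies are enough to occupy $data_nodes data nodes"
fi
```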

Elasticsearch index status is yellow

This happens on a single-node cluster (usually a development machine) when the replica count is set to 1 or more.

If Elasticsearch has replicas that it cannot allocate to any node, the index status is shown as yellow.

To change the status, update the following parameter for the elastic index:

number_of_replicas : 0

and the status will change to green.
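The replica count can be dropped on a live index through the update-settings API; a sketch, assuming the index is named 'bookstore':

```shell
# Set number_of_replicas to 0 for the 'bookstore' index.
payload='{"index": {"number_of_replicas": 0}}'
curl -H 'Content-Type: application/json' \
     -XPUT 'http://localhost:9200/bookstore/_settings' -d "$payload" || true  # ignore if no cluster is running
```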

Ref - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html

Good links:

Elasticsearch monitoring

Elasticsearch tuning

Usage examples of elasticsearch

  • Verizon’s logging platform (built with the Elastic Stack) collects and processes over 4 TB of logs per day - https://www.elastic.co/use-cases/verizon-wireless
  • GitHub uses Elasticsearch to query 130 billion lines of code - https://www.elastic.co/guide/en/elasticsearch/guide/2.x/getting-started.html
  • Wikipedia uses Elasticsearch to provide full-text search with highlighted search snippets, and search-as-you-type and did-you-mean suggestions.
  • The Guardian uses Elasticsearch to combine visitor logs with social-network data to provide real-time feedback to its editors about the public’s response to new articles.
  • Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.