elk - RicoJia/notes GitHub Wiki
Basic architecture:
- A shard is a piece of an index. Shards of multiple indices can be stored on a single server.
- see link
Push/pull architecture: in a pull architecture, clients write to a broker (a message queue) and the server just pulls from it. In a push architecture, the client sends its requests directly to the server.
Roles of nodes
- Master node: responsible for creating & deleting indices
- A node with the master role is only master-eligible; it automatically becomes the master only if there are no other master-eligible nodes
- Data node:
- performing queries
- ingest node (absorb):
- ingest pipeline is a series of steps (processors) performed when indexing docs
- A simplified logstash pipeline. You can change fields, etc.
node.ingest: true
- Machine learning node: runs machine learning jobs
- Coordination node: distributes queries and aggregates results
- See nodes:
GET /_cat/nodes?v
- In the `node.role` column, `dim` means data, ingest, master. The first node launched is chosen as the master
Sharding is done at the index level, not at the node/cluster level, because an index can hold an arbitrary number of documents while each node has a disk space limit.
- each shard is an Apache Lucene index
- sharding allows parallel search across shards
- see sharding:
GET /_cat/indices?v
- `pri` means primary shards.
- Since Elastic 7, an index gets 1 primary shard by default. Before that the default was 5, which caused oversharding on small indices.
- The number of shards can be changed later with the split and shrink APIs
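- routing: the routing value (the document `_id` by default) is hashed to pick the shard, roughly hash(_routing) % number_of_primary_shards
GET /INDEX/_doc/ID -> routing(ID) -> shard number -> primary shard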
- Potential issues: everything is done async, so things can go wrong, e.g. shard A is updated but shard B is not because of a network error
- "Primary term": counter of how many times the primary shard of a replica group has changed
- Sequence number: counter incremented for every write operation on a shard
- Versioning: a number that is incremented on every modification to a doc. Primary terms + sequence numbers are now the better way to do concurrency control
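A minimal sketch of how a primary term plus a sequence number can reject a stale write. This is an in-memory stand-in, not Elasticsearch code: `ToyShard`, `VersionConflict` and the `index` method are hypothetical names, and the rule "both values must match" is modeled on the `if_seq_no` / `if_primary_term` parameters used later in these notes.

```python
class VersionConflict(Exception):
    """Raised when the caller's view of the document is stale."""


class ToyShard:
    """In-memory stand-in for one primary shard (hypothetical, not an Elasticsearch API)."""

    def __init__(self):
        self.primary_term = 1   # would be bumped whenever a new primary is promoted
        self.next_seq_no = 0    # one sequence number per write on this shard
        self.docs = {}          # doc id -> (source, seq_no, primary_term)

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None or if_primary_term is not None:
            # Conditional write: both values must match what is stored.
            if current is None or (current[1], current[2]) != (if_seq_no, if_primary_term):
                raise VersionConflict(doc_id)
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        self.docs[doc_id] = (dict(source), seq_no, self.primary_term)
        return seq_no, self.primary_term


shard = ToyShard()
seq_no, term = shard.index("sample_id", {"msg": "hello"})

# Two writers read the same (seq_no, term); only the first conditional write wins.
shard.index("sample_id", {"msg": "first writer"}, if_seq_no=seq_no, if_primary_term=term)
stale_write_rejected = False
try:
    shard.index("sample_id", {"msg": "second writer"}, if_seq_no=seq_no, if_primary_term=term)
except VersionConflict:
    stale_write_rejected = True
```

The second writer has to re-read the document to get the new sequence number before retrying, which is the point of the check.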
Replicas: for fault tolerance.
- Replicas are copies of shards. The number is specified at index creation (it can be changed later).
- Primary shard + its replicas = 1 replica group
- So store each replica on a different node (machine) from its primary shard; nodes can share a machine when replicas are only wanted for higher throughput
- Snapshots of indices can be taken as well
Set up
bin/elasticsearch-create-enrollment-token --scope kibana
- the token is valid for 30 minutes
- Install kibana https://www.elastic.co/guide/en/kibana/current/targz.html
- The elastic server is on port 9200; kibana is on its own port (5601 by default). Go to the kibana one
- Log into the elastic server and generate a new password:
bin/elasticsearch-reset-password -u elastic
- Then log in with user elastic and the new PASSWORD
Curl queries
- Bash query
curl --cacert config/certs/http_ca.crt -u elastic:TZ_dpNUD+goTGKzQ68h_ -X GET https://localhost:9200
- `--cacert`: you need to provide the CA certificate
- `-u`: user + password
- `-X GET https://localhost:9200` gets the basic info of the cluster
- Query to list all products:
curl --cacert config/certs/http_ca.crt -u elastic:TZ_dpNUD+goTGKzQ68h_ -X GET -H "Content-Type:application/json" https://localhost:9200/products/_search -d '{ "query": { "match_all": {} } }'
- `-H "Content-Type:application/json"`: how the data is serialized. `application/json` is a common one
- `/products/_search` searches in the `products` index
- `-d` means data, which must be submitted in exactly the expected format
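For comparison, a sketch of the same search request built with Python's standard library, to show where each curl flag ends up. The function names `build_search_request` and `send` are made up for this example, `PASSWORD` is a placeholder, and `send` is only defined here, not called, since it needs a running cluster.

```python
import base64
import json
import ssl
import urllib.request


def build_search_request(base_url, index, user, password):
    """Build (but do not send) the equivalent of the curl match_all search."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,                                # curl -d
        method="GET",                             # curl -X GET
        headers={
            "Content-Type": "application/json",   # curl -H
            "Authorization": f"Basic {token}",    # curl -u user:password
        },
    )


def send(request, ca_cert_path):
    """Send the request, trusting the cluster's CA certificate (curl --cacert)."""
    context = ssl.create_default_context(cafile=ca_cert_path)
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)


request = build_search_request("https://localhost:9200", "products", "elastic", "PASSWORD")
# To actually run it: send(request, "config/certs/http_ca.crt")
```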
Spin up a second node:
- extract the elastic tar.gz file
- add the JVM heap options to the JVM options file on ALL NODES, including the main one: -Xms1g -Xmx1g
- run the main node
- generate a node-level token in the main node's shell:
./bin/elasticsearch-create-enrollment-token --scope node
- go to the second node and change the node name in its configuration file
- launch the node
./bin/elasticsearch --enrollment-token <TOKEN>
- Then with GET /_cluster/health you should see the status as green
- Check GET /_cat/shards?v, then you can see replicas being assigned to the 2 nodes
# Basics: node, index, shards info
# _ means the cluster name, _ is optional. But you can see which nodes are running there
GET /_cluster/health

# cat is the CAT (compact and aligned text) format, human readable. nodes lists all nodes running inside the cluster, v makes the output descriptive
GET /_cat/nodes?v

# See all indices, including system indices shared with kibana. The ones starting with ```.``` are hidden
# kibana stores configuration such as dashboards in an index. So when you launch an instance of kibana, that dashboard gets loaded
GET /_cat/indices?v&expand_wildcards=all
GET /_cat/indices?v
GET /_cat/shards?v

# Create a new index, which is optional
# This automatically creates 1 primary, 1 replica.
# The replica has not been assigned a node yet, so status is yellow (in ```GET /_cat/indices?v```)
# Can verify that by ```GET /_cat/shards?v```. The primary one has started
# The kibana shards will automatically add 1 replica once we have > 1 node.
PUT /pages

# Create a new document, with an auto-generated id
POST /prod/_doc
{
  "date": "12-02",
  "timestamp": "12:00",
  "msg": "hello"
}

# Add or replace the existing _id='sample_id' document
PUT /prod/_doc/sample_id
{
  "date": "12-05",
  "timestamp": "3:00",
  "msg": "helloll"
}

# Check the document
GET /prod/_doc/sample_id

#######################################################################
# Change a document by POSTing to the same id. Note this replaces the whole document with the new body.
# Under the hood, **each document is immutable**; elastic only makes it look like a field has been updated/added
POST /prod/_doc/sample_id
{
  "msg": "herrff"
}

# Post the update in a scripted way
# Note you need ```_update```, and you need ```ctx._source``` to update the field
POST /prod/_update/sample_id
{
  "script": {
    "source": "ctx._source.msg = 'balalba'"
  }
}

# You can also do "source": "ctx._source.field += 1",
# or pass the value in through params:
POST /prod/_update/sample_id
{
  "script": {
    "source": "ctx._source.field += params.foo",
    "params": {
      "foo": 2
    }
  }
}

# Post the change only if the primary term and sequence number match. This is for concurrency control
# Note you need _update
POST /prod/_update/sample_id?if_primary_term=1&if_seq_no=19
{
  "doc": {
    "msg": "baz2"
  }
}

# Delete all documents matching the query
POST /prod/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

################################################
# List all documents under the same index
# Without size, the default is 10
GET /prod/_search?size=100
{
  "query": {
    "match_all": {}
  }
}

# Update all matching documents. Note we need to use the script
# Internally, a snapshot is taken, and all replica groups are searched simultaneously
# "conflicts" says what to do on a version conflict. "proceed" means skip the docs in conflict instead of aborting the whole request
POST /prod/_update_by_query
{
  "conflicts": "proceed",
  "script": {
    "source": "ctx._source.msg='sdf'"
  }
}

#######################################################################
# Delete the doc
DELETE /prod/_doc/sample_id

################################################
# bulk, which uses ndjson, not json. MUCH MORE EFFICIENT THAN single POSTs
# "create" will fail with a version error if the document already exists
# "index" will replace
POST /_bulk
{ "index": { "_index": "prod", "_id": "sample_id" } }
{ "name": "espresso", "price": 499 }
{ "index": { "_index": "prod", "_id": "sample_id" } }
{ "name": "espresso3", "price": 499 }

# Update one field, using update + doc
POST /_bulk
{ "update": { "_index": "prod", "_id": "sample_id" } }
{ "doc": { "name": "espresso44" } }

# Can specify the index name in the path instead
POST /prod/_bulk
{ "update": { "_id": "sample_id" } }
{ "doc": { "name": "espresso55" } }
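A small sketch of how a bulk body like the ones above can be generated, since the NDJSON layout (one JSON object per line, newline after the last line) is easy to get wrong by hand. `to_bulk_body` is a hypothetical helper, not part of any client library.

```python
import json


def to_bulk_body(actions):
    """Serialize (action_line, payload) pairs as NDJSON for the _bulk endpoint.

    payload is None for actions that take no second line, such as delete.
    """
    lines = []
    for action_line, payload in actions:
        lines.append(json.dumps(action_line))
        if payload is not None:
            lines.append(json.dumps(payload))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


body = to_bulk_body([
    ({"index": {"_index": "prod", "_id": "sample_id"}}, {"name": "espresso", "price": 499}),
    ({"update": {"_index": "prod", "_id": "sample_id"}}, {"doc": {"name": "espresso44"}}),
    ({"delete": {"_index": "prod", "_id": "sample_id"}}, None),
])
print(body, end="")
```

The resulting string is what would be sent with `Content-Type: application/x-ndjson`.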
Bulk API, text Analysis & mapping
################################################
# Send bulk API
# Each line has a newline (\n or \r\n) at the end
# Last line of the file should be empty
# First part of the command is cacert. Note we are using ndjson.
# @ tells curl to read the data from a file (here, one in the current directory)
curl --cacert ~/third_party_pkgs/elastic_stack/elasticsearch/config/certs/http_ca.crt -u elastic:TZ_dpNUD+goTGKzQ68h_ -H "Content-Type: application/x-ndjson" -XPOST https://localhost:9200/prod/_bulk --data-binary "@products-bulk.json"

################################################
# Text Analysis
# Text processing = character filter + tokenizer (break down a sentence to words in a list) + token filter (e.g. to lower case)
# Outside of the analyzer, a token is called a "term"
POST /_analyze
{
  "text": "I love BEer",
  "analyzer": "standard"
}

# An inverted index maps each term to the documents that contain it. One is created for each text field. Maintained by Apache Lucene
# Terms are sorted alphabetically; the index also holds what is needed for relevance scoring
# Numbers are stored in BKD trees, which are also used for geospatial search
https://drive.google.com/file/d/1SG7vPlKAqwuQjGVhhmbm0tDwFproSxKV/view?usp=sharing

# Mapping is the structure of the documents: fields, data types
# Datatypes.
# Object is a JSON object, and objects can be nested as well.
# But if there's an array of objects, then internally these objects are stored like field_1 [], field_2 []..., so which value belonged to which object is lost. So we use nested below
# Properties are used for storing data types. Then the object is transformed to a flat form for Apache Lucene, e.g., "properties.name"
DELETE /pages
GET /pages
PUT /pages
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "page_numer": { "type": "double" },
      "manufacturer": {
        "properties": {
          "name": { "type": "text" }
        }
      }
    }
  }
}

# nested
# Each nested object is stored in a separate hidden document of its own
PUT /pages
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "page_numer": { "type": "double" },
      "manufacturer": { "type": "nested" }
    }
  }
}

# date - there's a date type as well
# keyword
# Search exact values. The "text" type breaks a sentence down to words in the tokenizer, but keyword does not do anything (i.e., the keyword analyzer is a no-op analyzer).
# Useful in aggregation, filtering, sorting. E.g., email addresses
POST /_analyze
{
  "text": "[email protected]",
  "analyzer": "keyword"
}

# ECS: Elastic Common Schema, uniform fields for common tasks such as logging
# Type coercion - the first time we post the field, if you put 7.4, then later all inputs will be "coerced" into floats.
# i.e., if you put "7.4", it may still be converted to 7.4 (float). But if you
# have "7.4m", then there'd be trouble
PUT /pages/_doc/1
{
  "price_f": "600"
}
GET /pages

# Arrays
# There is no array type. Arrays are just flattened
# Values in an array should be of the same data type. Else, type coercion will come in.
POST /pages/_doc
{
  "tags": ["a", "b"]
}
GET /pages/_search
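A rough illustration of the array-of-objects problem mentioned above (objects stored as field_1 [], field_2 []) and why `nested` exists. This only mimics the effect in plain Python; the `reviews` field and the helper names are invented for the example and are not how Lucene stores anything.

```python
from collections import defaultdict


def flatten_object_array(field, objects):
    """Mimic how an array of objects is flattened for a plain object field."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{field}.{key}"].append(value)
    return dict(flat)


def flattened_match(flat, field, **wanted):
    """Match on the flattened form: each value only has to appear somewhere."""
    return all(value in flat.get(f"{field}.{key}", []) for key, value in wanted.items())


def nested_match(objects, **wanted):
    """Match the way nested does: all conditions must hold inside one object."""
    return any(all(obj.get(k) == v for k, v in wanted.items()) for obj in objects)


reviews = [
    {"author": "alice", "rating": 5},
    {"author": "bob", "rating": 1},
]
flat = flatten_object_array("reviews", reviews)
# flat == {"reviews.author": ["alice", "bob"], "reviews.rating": [5, 1]}

false_positive = flattened_match(flat, "reviews", author="bob", rating=5)  # True
correct_answer = nested_match(reviews, author="bob", rating=5)             # False
```

Bob never gave a 5, but the flattened form cannot tell, because the link between `author` and `rating` is gone.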
Term-level Querying
################################################
# Term level Query
# Fetch by id
GET /prod/_search
{
  "query": {
    "ids": {
      "values": [1, 2]
    }
  }
}

# How to search for a partial date? Search by range
# default date format is 2001/12/31
GET /prod/_search
{
  "query": {
    "range": {
      "created": {
        "gte": "01-01-2001",
        "lte": "01-01-2005",
        "format": "dd-MM-yyyy"
      }
    }
  }
}

# Match documents where the field has a non-null value
# "" in Elasticsearch is NOT null
# So here, if tags is null or [], the document will be filtered out
GET /prod/_search
{
  "query": {
    "exists": {
      "field": "tags"
    }
  }
}

# prefix: only for text/keyword/wildcard types
GET /prod/_search
{
  "query": {
    "prefix": {
      "tags": "win"
    }
  }
}

# wildcard
GET /prod/_search
{
  "query": {
    "wildcard": {
      "tags": "*lco*"
    }
  }
}
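A sketch of the `exists` rule noted above, applied to a document's source as a plain dict. `field_exists` is a hypothetical helper that only covers top-level fields and ignores mapping settings; it is meant to show which values count as "missing", nothing more.

```python
def field_exists(source, field):
    """Approximate the exists query against a document's _source (top-level fields only)."""
    if field not in source:
        return False
    value = source[field]
    if value is None:
        return False
    if isinstance(value, list):
        # [] and [None] hold no value; [None, "foo"] still holds "foo".
        return any(item is not None for item in value)
    return True  # "" is a value, not null


docs = [
    {"tags": ["wine"]},
    {"tags": ""},
    {"tags": None},
    {"tags": []},
    {"name": "no tags at all"},
]
matching = [doc for doc in docs if field_exists(doc, "tags")]
```

Only the first two documents end up in `matching`.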
Date math
################################################
# Date math: anchor date +/- an offset (e.g. 1y), separated by ||
GET /prod/_search
{
  "query": {
    "range": {
      "created": {
        "gte": "2001/01/01||-1y"
      }
    }
  }
}

# You can use now alone, or now with relative date math
GET /prod/_search
{
  "query": {
    "range": {
      "created": {
        "gte": "now-1y"
      }
    }
  }
}
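To make the two expressions above concrete, here is a simplified resolver for the `anchor||±Nunit` and `now±Nunit` forms. It is a sketch, not the Elasticsearch parser: `resolve_date_math` is a made-up name, the anchor format is assumed to be `yyyy/MM/dd` as in the note, and rounding (such as `/d`) is not handled.

```python
import calendar
import re
from datetime import datetime, timedelta

_STEP = re.compile(r"([+-])(\d+)([yMwdhms])")
_FIXED = {"w": "weeks", "d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def _shift_months(moment, months):
    """Move by whole calendar months, clamping the day (e.g. Feb 29 -> Feb 28)."""
    index = moment.year * 12 + (moment.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)


def resolve_date_math(expression, now=None, anchor_format="%Y/%m/%d"):
    """Resolve now-1y or 2001/01/01||-1y style expressions (no rounding support)."""
    if expression.startswith("now"):
        moment = now or datetime.now()
        steps = expression[len("now"):]
    else:
        anchor, _, steps = expression.partition("||")
        moment = datetime.strptime(anchor, anchor_format)
    consumed = 0
    for match in _STEP.finditer(steps):
        if match.start() != consumed:
            break  # something between the steps was not understood
        consumed = match.end()
        sign, amount, unit = match.groups()
        amount = int(amount) if sign == "+" else -int(amount)
        if unit == "y":
            moment = _shift_months(moment, 12 * amount)
        elif unit == "M":
            moment = _shift_months(moment, amount)
        else:
            moment += timedelta(**{_FIXED[unit]: amount})
    if consumed != len(steps):
        raise ValueError(f"unsupported date math: {expression!r}")
    return moment


anchored = resolve_date_math("2001/01/01||-1y")                          # 2000-01-01 00:00
relative = resolve_date_math("now-1y", now=datetime(2024, 2, 29, 12, 0))  # 2023-02-28 12:00
```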
Full text Queries
################################################
# Full text Queries
# See fields & datatypes of an index
GET /prod/_mapping

# See results that partially contain the keywords, because by default the tokens of the query are ORed
GET /prod/_search
{
  "query": {
    "match": {
      "description": "Pellentesque asdfa"
    }
  }
}

# Now require all the terms, by changing the operator to boolean and
# Note that the order of "at" and "Pellentesque" still doesn't affect the result
GET /prod/_search
{
  "query": {
    "match": {
      "description": {
        "query": "at Pellentesque",
        "operator": "and"
      }
    }
  }
}

# Now let's match the exact phrase (terms in this order, next to each other)
GET /prod/_search
{
  "query": {
    "match_phrase": {
      "description": "Pellentesque at"
    }
  }
}

# Search the same term in two fields
GET /prod/_search
{
  "query": {
    "multi_match": {
      "query": "at",
      "fields": ["description", "tags"]
    }
  }
}
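The difference between the three queries above, as a toy model on a single string. The tokenizer here is just lowercase + whitespace split, a stand-in for the standard analyzer, and the function names mirror the query names only for readability; there is no scoring.

```python
def tokenize(text):
    """Very rough stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match(field_text, query, operator="or"):
    """match query: any term is enough with "or", every term is needed with "and"."""
    doc_terms = set(tokenize(field_text))
    hits = [term in doc_terms for term in tokenize(query)]
    return all(hits) if operator == "and" else any(hits)


def match_phrase(field_text, query):
    """match_phrase query: the terms must appear consecutively and in order."""
    doc_terms = tokenize(field_text)
    wanted = tokenize(query)
    n = len(wanted)
    return any(doc_terms[i:i + n] == wanted for i in range(len(doc_terms) - n + 1))


description = "Pellentesque at nulla suspendisse"

or_hit = match(description, "Pellentesque asdfa")                    # True: one term is enough
and_miss = match(description, "Pellentesque asdfa", operator="and")  # False: "asdfa" is missing
and_hit = match(description, "at Pellentesque", operator="and")      # True: order is ignored
phrase_miss = match_phrase(description, "at Pellentesque")           # False: wrong order
phrase_hit = match_phrase(description, "Pellentesque at")            # True
```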
- In Logstash, if [field] evaluates to false if the field itself is false, or if the field doesn't exist. So a hack is:
mutate {
  add_field => { "[@metadata][some_field]" => "NULL" }
  copy => { "[MSG_FIELD]" => "[@metadata][some_field]" }
}
if [@metadata][some_field] != "NULL" {
  aggregate {
    task_id => "%{host}"
    code => "map['some_field'] = event.get('[MSG_FIELD]')"
    map_action => "create_or_update"
  }
}
- Metricbeat is much easier than Filebeat
def foo(self):
=> becomes field "statsd.foo.bar", metricbeat-<beatversion>-<date>, deleted manually
Logs: (logstash)
- System module logs (preprocessed by Filebeat), like ssh history, etc. The bash pipeline cleans up the system module formatting with grok
- Input logs (moxi log): the logging pipeline. It aggregates logs together (most of the processing)
- Both pipelines run at the same time in a single process
Logstash -> Elasticsearch
- a transform in elasticsearch is a constant need for getting stats about
* heartbeat
- Heartbeat sends directly to Elasticsearch
- SSL?
TLS - overhead?
- ELK has an SSL
- 3 month policy?
- Lifecycle management
elastic can see command keywords bash.command.keyword : * and agent.hostname.keyword : moxi32