elk
========================================================================
========================================================================
- Basic architecture:
- a shard is a piece of an index; multiple shards (possibly from different indices) can be stored on a single server
- see link
- push/pull architecture: a pull architecture uses a broker (a message queue) and the server pulls work from it; in a push architecture the client pushes requests to the server
- Roles of nodes
- Master node: responsible for creating & deleting indices
- a node with the master role is master-eligible, not automatically the master, unless there are no other master-eligible nodes
- Data node: stores data and performs queries
- Ingest node ("absorb"):
- an ingest pipeline is a series of steps (processors) performed when indexing docs
- a simplified Logstash pipeline; you can change fields, etc. (see the sketch after this list)
- enabled with node.ingest: true
- Machine learning node (node.ml): runs machine learning jobs
- Coordinating node: distributes queries and aggregates results
- See nodes:
GET /_cat/nodes?v
- dim means data, ingest, master. The first node launched is chosen as the master
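A minimal sketch of an ingest pipeline, assuming a made-up pipeline name (my_pipeline) and field (msg); the prod index is the one used in the examples further down:

```
# define a pipeline with one processor
PUT _ingest/pipeline/my_pipeline
{
  "description": "lowercase the msg field",
  "processors": [
    { "lowercase": { "field": "msg" } }
  ]
}

# apply it while indexing a document
POST /prod/_doc?pipeline=my_pipeline
{ "msg": "HELLO" }
```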
- sharding is done at the index level, not the node/cluster level, because an index can hold an arbitrary number of documents while each node has a disk space limit
- each shard is an Apache Lucene index
- sharding allows parallel search across shards
- see sharding:
GET /_cat/indices?v
- pri means primary shards
- after Elasticsearch 7, the default is 1 shard per index; before it was 5, which caused oversharding on small indices
- there are split and shrink APIs to change the shard count
- routing:
GET /INDEX/_doc/id -> routing(INDEX) -> shard -> primary shard
- potential issue: everything is done asynchronously, so things can go wrong, e.g. shard A gets updated but shard B does not because of a network error
- "primary terms": counter how many times each shard has been changed
- sequence number: a counter of the total write operations on the index
- versioning: a number incremented on every modification of a doc; primary terms and sequence numbers are now the better way (sketch below)
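A minimal sketch of where these numbers appear, using the prod index and sample_id document from the examples further down; the routing formula is the documented default:

```
# default routing: shard = hash(_routing) % num_primary_shards, where _routing defaults to the document _id
GET /prod/_doc/sample_id
# the response carries the concurrency-control metadata, roughly:
# { "_index": "prod", "_id": "sample_id", "_version": 3, "_seq_no": 21, "_primary_term": 1, "found": true, "_source": { ... } }
```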
- Replicas: for fault tolerance
- replicas are copies of shards; the number is specified at index creation (see the sketch below)
- a primary shard plus its replicas form a replication group
- a replica is stored on a different node (machine) than its primary shard for fault tolerance; replicas also raise search throughput, since queries can be served by any copy
- snapshots can be taken of indices as well
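A minimal sketch of setting shard and replica counts at index creation (the index name demo_index is made up):

```
PUT /demo_index
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}

# verify where primaries and replicas were allocated
GET /_cat/shards/demo_index?v
```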
========================================================================
========================================================================
- Set up
- generate a Kibana enrollment token: bin/elasticsearch-create-enrollment-token --scope kibana
- the token is valid for 30 minutes
- Install kibana https://www.elastic.co/guide/en/kibana/current/targz.html
- the Elasticsearch server is on port 9200, Kibana on localhost:5601; go to the Kibana one
- log into the Elasticsearch server and generate a new password:
bin/elasticsearch-reset-password -u elastic
- log in with user elastic and the generated PASSWORD
- Curl queries
- basic bash query:
curl --cacert config/certs/http_ca.crt -u elastic:TZ_dpNUD+goTGKzQ68h_ -X GET https://localhost:9200
- you need to provide the CA certificate via --cacert
- -u: user + password
- -X GET gets the basic info of the cluster
- Query to list all products:
curl --cacert config/certs/http_ca.crt -u elastic:TZ_dpNUD+goTGKzQ68h_ -X GET -H "Content-Type:application/json" https://localhost:9200/products/_search -d '{ "query": { "match_all": {} } }'
- Content-Type: how the data is serialized; application/json is a common one
- products/_search searches in the products index
- -d means data; it expects the request body (here, the query JSON)
- Spin up a second node:
- extract the Elasticsearch tar.gz file
- add -Xms1g -Xmx1g to config/jvm.options.d/custom.options on ALL nodes, including the main one
- run the main node
- generate a node-level enrollment token in the main node's shell:
./bin/elasticsearch-create-enrollment-token --scope node
- go to the second node and change the node name (node.name) in config/elasticsearch.yml
- launch the node
./bin/elasticsearch --enrollment-token <TOKEN>
- then GET /_cluster/health should show status green
- check GET /_cat/shards?v to see replicas assigned across the 2 nodes
- Operations
# Basics: node, index, shards info
# _ means the cluster name; _ is optional. But you can see which nodes are running there
GET /_cluster/health

# cat is CAT format, human readable. nodes lists all nodes running inside the cluster; v makes it descriptive
GET /_cat/nodes?v

# see all indices, including system indices shared with kibana. The ones starting with . are hidden
# kibana stores configuration such as dashboards into an index, so when you launch an instance of kibana, that dashboard gets loaded
GET /_cat/indices?v&expand_wildcards=all
GET /_cat/indices?v
GET /_cat/shards?v

# create a new index, which is optional
# automatically creates 1 primary, 1 replica
# the replica has not been assigned a node yet, so status is yellow (in GET /_cat/indices?v)
# can verify that with GET /_cat/shards?v: the primary one has started
# the kibana shards will automatically add 1 replica once we have > 1 node
PUT /pages

# create a new document, with a default id
POST /prod/_doc
{ "date": "12-02", "timestamp": "12:00", "msg": "hello" }

# add or replace the document with id 'sample_id'
PUT /prod/_doc/sample_id
{ "date": "12-05", "timestamp": "3:00", "msg": "helloll" }

# check the document
GET /prod/_doc/sample_id

#######################################################################
# to change a field in the document, just POST with another field. Under the hood, each document is immutable,
# but elastic makes it look like a field has been updated/added
POST /prod/_doc/sample_id
{ "msg": "herrff" }

# post the update in a scripted way
# note you need _update, and you need ctx._source to update the field
# you can also do "source": "ctx._source.field += 1",
# or use params: "source": "ctx._source.field += params.foo", "params": { "foo": 2 }
POST /prod/_update/sample_id
{ "script": { "source": "ctx._source.msg = 'balalba'" } }

# post the change only if the primary term and sequence number match. This is for concurrency control
# note you need _update
POST /prod/_update/sample_id?if_primary_term=1&if_seq_no=19
{ "doc": { "msg": "baz2" } }

POST /prod/_delete_by_query
{ "query": { "match_all": {} } }

################################################
# list all documents under the same index
# without size, the default is 10
GET /prod/_search?size=100
{ "query": { "match_all": {} } }

# update all matching documents. Note we need to use a script
# internally, a snapshot is taken, and all replication groups are searched simultaneously
# "conflicts": "proceed" means skip docs with a version conflict instead of aborting the whole request
POST /prod/_update_by_query
{ "conflicts": "proceed", "script": { "source": "ctx._source.msg='sdf'" } }

#######################################################################
# delete the doc
DELETE /prod/_doc/sample_id

# post the change only if the primary term and sequence number match
POST /prod/_update/sample_id?if_primary_term=1&if_seq_no=19
{ "doc": { "msg": "baz2" } }

################################################
# bulk uses ndjson, not json. MUCH MORE EFFICIENT THAN a single POST per document
# "create" will fail with a version error if the document already exists
# "index" will replace it
POST /_bulk
{ "index": { "_index": "prod", "_id": "sample_id" } }
{ "name": "espresso", "price": 499 }
{ "index": { "_index": "prod", "_id": "sample_id" } }
{ "name": "espresso3", "price": 499 }

# update one field, using update + doc
POST /_bulk
{ "update": { "_index": "prod", "_id": "sample_id" } }
{ "doc": { "name": "espresso44" } }

# the index name can also be specified in the URL
POST /prod/_bulk
{ "update": { "_id": "sample_id" } }
{ "doc": { "name": "espresso55" } }
- Bulk API, text analysis & mapping
################################################
# send the bulk request from curl
# each line has a \r\n at the end
# the last line of the file should be empty
# the first part of the command is the cacert. Note we are using ndjson
# @ means a file in the current directory, not a path
curl --cacert ~/third_party_pkgs/elastic_stack/elasticsearch/config/certs/http_ca.crt -u elastic:TZ_dpNUD+goTGKzQ68h_ -H "Content-Type: application/x-ndjson" -XPOST https://localhost:9200/prod/_bulk --data-binary "@products-bulk.json"

################################################
# Text analysis
# text processing = character filter + tokenizer (break a sentence down to a list of words) + token filter (e.g. to lower case)
# outside of the analyzer, a token is called a "term"
POST /_analyze
{ "text": "I love BEer", "analyzer": "standard" }

# an inverted index is created for each text field. Maintained by Apache Lucene
# terms are sorted alphabetically, for relevance scoring
# numbers are stored in a BKD tree, for geospatial search
# https://drive.google.com/file/d/1SG7vPlKAqwuQjGVhhmbm0tDwFproSxKV/view?usp=sharing

# Mapping is the structure of the documents: fields, data types
# Datatypes:
# object is a JSON object, and objects can be nested as well.
# But if there's an array of objects, internally these objects are flattened into field_1 [], field_2 []...,
# so the pairing of fields across objects is lost. Use the nested type below instead.
# "properties" declares the fields of an object. The object is then transformed into something valid
# for Apache Lucene, e.g. "manufacturer.name"
DELETE /pages
GET /pages
PUT /pages
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "page_number": { "type": "double" },
      "manufacturer": {
        "properties": {
          "name": { "type": "text" }
        }
      }
    }
  }
}

# nested
# each nested object is stored in a separate hidden document on its own
PUT /pages
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "page_number": { "type": "double" },
      "manufacturer": { "type": "nested" }
    }
  }
}

# date - there's a date type as well

# keyword
# for searching exact values. The "text" type breaks a sentence down to words in the tokenizer,
# but keyword does not (the keyword analyzer is a no-op analyzer)
# useful in aggregations, filters, sorting. E.g., email addresses
POST /_analyze
{ "text": "[email protected]", "analyzer": "keyword" }

# ECS (Elastic Common Schema): uniform fields for common tasks such as logging

# type coercion - the first time we post the field, if you put 7.4, later inputs will be "coerced" into floats
# i.e., if you put "7.4", it may still be converted to 7.4 (float). But if you put "7.4m", there'd be trouble
PUT /pages/_doc/1
{ "price_f": "600" }
GET /pages

# Arrays
# a dedicated array type does not exist; arrays are just flattened
# elements in an array should be of the same data type, else type coercion kicks in
POST /pages/_doc
{ "tags": ["a", "b"] }
GET /pages/_search
- Term-level querying
################################################
# Term-level queries
# fetch by id
GET /prod/_search
{ "query": { "ids": { "values": [1, 2] } } }

# how to search for a partial date? Search by range
# default date format is 2001/12/31
GET /prod/_search
{
  "query": {
    "range": {
      "created": {
        "gte": "01-01-2001",
        "lte": "01-01-2005",
        "format": "dd-MM-yyyy"
      }
    }
  }
}

# match documents with non-null fields
# "" in Elasticsearch is NOT null
# so here, documents whose tags field is missing will be filtered out
GET /prod/_search
{ "query": { "exists": { "field": "tags" } } }

# prefix: only for text/keyword/wildcard types
GET /prod/_search
{ "query": { "prefix": { "tags": "win" } } }

# wildcard
GET /prod/_search
{ "query": { "wildcard": { "tags": "*lco*" } } }
- Date math
################################################
# date math: date +/- an interval, separated by ||
GET /prod/_search
{ "query": { "range": { "created": { "gte": "2001/01/01||-1y" } } } }

# you can also use "now" alone, or with relative math
GET /prod/_search
{ "query": { "range": { "created": { "gte": "now-1y" } } } }
- Full text queries
################################################
# Full text queries
# see fields & datatypes of an index
GET /prod/_mapping

# see results that partially contain the keywords, because by default the tokens inside the description are OR'ed
GET /prod/_search
{ "query": { "match": { "description": "Pellentesque asdfa" } } }

# now require all tokens to match, by changing the operator to "and"
# note that the order of "at" and "Pellentesque" still doesn't affect the result
GET /prod/_search
{
  "query": {
    "match": {
      "description": {
        "query": "at Pellentesque",
        "operator": "and"
      }
    }
  }
}

# now match the exact phrase (order matters)
GET /prod/_search
{ "query": { "match_phrase": { "description": "Pellentesque at" } } }

# search the same term in two fields
GET /prod/_search
{
  "query": {
    "multi_match": {
      "query": "at",
      "fields": ["description", "tags"]
    }
  }
}
========================================================================
========================================================================
- Logstash: if [field] evaluates to false if the field's value is false or if the field doesn't exist. A hack to distinguish the two:
mutate {
  add_field => { "[@metadata][some_field]" => "NULL" }
  copy => { "[MSG_FIELD]" => "[@metadata][some_field]" }
}
if [@metadata][some_field] != "NULL" {
  aggregate {
    task_id => "%{host}"
    code => "map['some_field'] = event.get('[MSG_FIELD]')"
    map_action => "create_or_update"
  }
}
========================================================================
========================================================================
- Metricbeat is much easier than Filebeat
@statsd.timer("foo.bar")
def foo(self):
    pass
=> becomes the field "statsd.foo.bar" in index metricbeat-<beatversion>-<date>, deleted manually
- Logs (Logstash):
- system module logs (preprocessed by Filebeat), like ssh history, etc.; the bash pipeline cleans up the system module formatting with grok
- input logs (moxi log): the logging pipeline, which aggregates logs together (most of the processing)
- both pipelines are run at the same time in a single process
- Logstash -> Elasticsearch
- a transform in Elasticsearch continuously summarizes existing indices into a stats index; useful when there is a constant need for stats (see the sketch below)
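A minimal sketch of a continuous transform, with made-up index names (prod_logs, prod_stats) and fields (host, duration, @timestamp):

```
PUT _transform/prod_stats
{
  "source": { "index": "prod_logs" },
  "dest": { "index": "prod_stats" },
  "pivot": {
    "group_by": { "host": { "terms": { "field": "host" } } },
    "aggregations": { "avg_duration": { "avg": { "field": "duration" } } }
  },
  "sync": { "time": { "field": "@timestamp", "delay": "60s" } },
  "frequency": "1m"
}

POST _transform/prod_stats/_start
```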
- Heartbeat
- Heartbeat sends data directly to Elasticsearch
- SSL?
- TLS: overhead?
- ELK has SSL support
- 3-month policy?
- index lifecycle management (see the sketch below)
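A minimal sketch of an index lifecycle policy that would enforce a ~3-month retention (the policy name logs_90d and the thresholds are made up):

```
PUT _ilm/policy/logs_90d
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_age": "30d", "max_primary_shard_size": "50gb" } }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```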
- in Elastic you can query on command keywords (KQL): bash.command.keyword : * and agent.hostname.keyword : moxi32
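Roughly the same filter written as a query DSL sketch (the filebeat-* index pattern is an assumption):

```
GET /filebeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "bash.command.keyword" } },
        { "term": { "agent.hostname.keyword": "moxi32" } }
      ]
    }
  }
}
```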