
ELK

========================================================================

Theory

========================================================================

  1. Basic architecture: (diagram: Elasticsearch_component_relation)

    • a shard is a piece of an index. Shards of multiple indices can be stored on a single node (server).
    • see link
  2. push/pull architecture: a pull architecture uses a broker (a message queue) and the server pulls work from it; in a push architecture the client pushes requests directly to the server

  3. Ref: https://pdai.tech/md/db/nosql-es/elasticsearch.html

  4. Roles of nodes

    1. Master node: responsible for cluster-wide actions such as creating & deleting indices
      • A node with the master role is not automatically the elected master (unless there are no other master-eligible nodes)
    2. Data node:
      • stores shards and performs queries
    3. Ingest node ("ingest" = absorb):
      • an ingest pipeline is a series of steps (processors) run when indexing docs
      • a simplified Logstash pipeline: you can add/change fields, etc. Enabled via node.ingest: true
    4. Machine learning node: node.ml; runs machine learning jobs
    5. Coordinating node: distributes queries and aggregates results
    6. See nodes: GET /_cat/nodes?v
      • node.role dim means data, ingest, master. The first master-eligible node launched is chosen as the master
  5. Sharding is done at the index level, not at the node/cluster level, because an index can hold an arbitrary number of documents while each node has a disk space limit.

    • each shard is an Apache Lucene index
    • allows searching shards in parallel
    • see shards: GET /_cat/indices?v. pri means the number of primary shards.
    • since Elasticsearch 7 the default is 1 shard per index. Before it was 5, which caused oversharding of small indices
    • there are split and shrink APIs to change the number of primary shards
    • routing: GET /INDEX/_doc/id -> shard = hash(_routing) % number_of_primary_shards (by default _routing is the document id) -> primary shard
    • Potential issues: replication is asynchronous, so things can go wrong, e.g.
      shard A updated, shard B not updated due to a network error

      1. "primary term": a counter of how many times the primary shard of a replication group has changed
      2. sequence number: a counter of the write operations on the index
      3. versioning: a number incremented on every modification to a doc. Primary terms + sequence numbers are now the better way
  6. replicas: for fault-tolerance.

    • replicas are copies of shards; the number is specified at index creation
    • primary shard + its replicas = 1 replication group
    • a replica is allocated on a different node (machine) than its primary shard; besides fault tolerance, replicas can also serve reads for higher throughput
    • snapshots of indices can be taken as well
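    • A minimal sketch (not in the original notes; the index name is hypothetical) of setting the shard and replica counts at index creation:
      PUT /my_index
      {
        "settings": {
          "number_of_shards": 2,
          "number_of_replicas": 1
        }
      }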

========================================================================

Hands-on

========================================================================

  1. Set up

    • bin/elasticsearch-create-enrollment-token --scope kibana
      • it's valid for 30 minutes
    • Install kibana https://www.elastic.co/guide/en/kibana/current/targz.html
    • the Elasticsearch server is on port 9200, Kibana is on localhost:5601; open the Kibana one in the browser
    • Log into the Elasticsearch machine and generate a new password: bin/elasticsearch-reset-password -u elastic
      • then log in as user elastic with the new password
  2. Curl queries

    1. Bash query curl --cacert config/certs/http_ca.crt -u elastic:TZ_dpNUD+goTGKzQ68h_ -X GET https://localhost:9200
      • you need to provide the CA certificate (http_ca.crt)
      • -u: user + password
      • -X GET is to get the basic info of the cluster
    2. Query to list all products: curl --cacert config/certs/http_ca.crt -u elastic:TZ_dpNUD+goTGKzQ68h_ -X GET -H "Content-Type:application/json" https://localhost:9200/products/_search -d '{ "query": { "match_all": {} } }'
      • Content-Type: how the data is serialized. application/json is a common one
      • products/_search is to search in products index
      • -d means data (the request body); curl defaults to form encoding, hence the explicit Content-Type header
  3. Spin up a second node:

    • extract the elastic tar.gz file
    • add this to config/jvm.options.d/custom.options on ALL NODES, including the main one
      -Xms1g
      -Xmx1g
      
    • run the main node
    • generate a node-level enrollment token in the main node's shell: ./bin/elasticsearch-create-enrollment-token --scope node
    • go to the second node, change node name in config/elasticsearch.yml, node.name
    • launch the node: ./bin/elasticsearch --enrollment-token <TOKEN>
    • Then by GET /_cluster/health, you should see status as green
    • check GET /_cat/shards?v; you should see replicas assigned to the 2 nodes
  4. Operations

    # Basics: node, index, shards info
    # _cluster is the cluster-level API; the response shows the cluster name, status, and how many nodes are running
    GET /_cluster/health
    # _cat is the compact & aligned text (CAT) API, human readable. nodes lists all nodes running inside the cluster; v adds column headers
    GET /_cat/nodes?v
    # see all indices, including the system indices used by Kibana. The ones starting with ```.``` are hidden
    # Kibana stores configuration such as dashboards in an index, so when you launch a Kibana instance, those dashboards get loaded
    GET /_cat/indices?v&expand_wildcards=all
    GET /_cat/indices?v
    GET /_cat/shards?v
    
    # create a new index explicitly (optional; indexing a document auto-creates the index)
    # automatically creates 1 primary and 1 replica shard.
    # The replica has not been assigned to a node yet, so the index status is yellow (in ```GET /_cat/indices?v```)
    # Verify with ```GET /_cat/shards?v```: the primary shard has STARTED, the replica is UNASSIGNED
    # The Kibana indices will automatically add 1 replica once we have > 1 node.
    PUT /pages
    # Create a new document with an auto-generated id
    POST /prod/_doc
    {
      "date": "12-02",
      "timestamp": "12:00",
      "msg": "hello"
    }
    # create or fully replace the document with id 'sample_id'
    PUT /prod/_doc/sample_id
    {
      "date": "12-05",
      "timestamp": "3:00",
      "msg": "helloll"
    }
    # Check the document
    GET /prod/_doc/sample_id
    
    #######################################################################
    # POST to /_doc/<id> with a new body REPLACES the whole document (it does not merge fields).
    # Under the hood, **each document is immutable**; Elasticsearch just makes it look like a field has been updated/added
    POST /prod/_doc/sample_id
    {
      "msg": "herrff"
    }
    
    # post the update in a scripted way
    # Note you need ```_update```, and you need ```ctx._source``` to update the field
    POST /prod/_update/sample_id
    {
      "script": {
        "source": "ctx._source.msg = 'balalba'"
      }
    }
    # you can also increment a field: "source": "ctx._source.field += 1"
    # or use params:
    POST /prod/_update/sample_id
    {
      "script": {
        "source": "ctx._source.field += params.foo",
        "params": {
          "foo": 2
        }
      }
    }
    # Post change only if the primary term and sequence number match. This is for concurrency control
    # Note you need _update
    POST /prod/_update/sample_id?if_primary_term=1&if_seq_no=19
    {
      "doc": {
        "msg": "baz2"
      }
    }
    
    # delete all documents matching the query
    POST /prod/_delete_by_query
    {
      "query": {
          "match_all": {}
      }
    }
    ################################################
    # list all documents under the same index
    # without size, default is 10
    GET /prod/_search?size=100
    {
        "query": {
            "match_all": {}
        }
    }
    # update all matching documents; note we need to use a script
    # Internally, a snapshot of the index is taken, and all replication groups are searched simultaneously
    # "conflicts": "proceed" means: on a version conflict, skip the conflicting doc and continue, rather than aborting the whole query
    POST /prod/_update_by_query
    {
      "conflicts": "proceed", 
      "script":{
        "source": "ctx._source.msg='sdf'"
      }
    }
    #######################################################################
    # Delete the doc
    DELETE /prod/_doc/sample_id
    
    
    ################################################
    # bulk API uses NDJSON, not JSON. MUCH MORE EFFICIENT THAN many single requests
    # "create" will fail with a version conflict error if the document already exists
    # "index" will create or replace
    POST /_bulk
    { "index":{ "_index":"prod", "_id":"sample_id" } }
    { "name":"espresso", "price": 499 }
    { "index":{ "_index":"prod", "_id":"sample_id" } }
    { "name":"espresso3", "price": 499 }
    
    # Update one field, using update, doc
    POST /_bulk
    { "update":{ "_index":"prod", "_id":"sample_id" } }
    { "doc": { "name":"espresso44" } }
    
    # the target index can also be specified in the URL, so each action line can omit _index
    POST /prod/_bulk
    { "update":{ "_id":"sample_id" } }
    { "doc": { "name":"espresso55" } }
    
  5. Bulk API, text Analysis & mapping

    ################################################
    # Send the bulk request from a file
    # each line must end with a newline (\n); the file must end with a trailing newline as well
    
    # --cacert points at the cluster's CA certificate. Note we are using the NDJSON content type.
    # @ tells curl to read the request body from the given file
    curl --cacert ~/third_party_pkgs/elastic_stack/elasticsearch/config/certs/http_ca.crt -u elastic:TZ_dpNUD+goTGKzQ68h_ -H "Content-Type: application/x-ndjson" -XPOST https://localhost:9200/prod/_bulk --data-binary "@products-bulk.json"
    
    ################################################
    # Text Analysis
    # Text processing = character filters + tokenizer (breaks the text into a list of tokens) + token filters (e.g. lowercase)
    # Once a token is stored in the index, it is called a "term"
    POST /_analyze
    {
      "text": "I love BEer",
      "analyzer": "standard"
    }
    
    # an inverted index is created for each text field; it is maintained by Apache Lucene
    # terms are sorted alphabetically, which helps relevance scoring
    # numeric, date, and geo fields are stored in BKD trees, which support range and geospatial search
    https://drive.google.com/file/d/1SG7vPlKAqwuQjGVhhmbm0tDwFproSxKV/view?usp=sharing
    # Mapping defines the structure of the documents: fields and their data types
    
    # Data types
    # object is a JSON object; objects can be nested.
    # But an array of objects is internally flattened into one array per field (field_1: [...], field_2: [...]), which loses track of which values belonged to the same object, hence the nested type below
    # "properties" holds the field data types; an object is flattened with dot notation (e.g. "manufacturer.name") for Apache Lucene
    DELETE /pages
    GET /pages
    PUT /pages
    {
      "mappings": {
        "properties": {
          "name": { "type": "text" },
          "page_numer":{ "type": "double" },
          "manufacturer": {
            "properties": {
              "name": { "type": "text" }
            }
          }
        }
      }
    }
    
    # nested
    # each object in a nested field is stored as a separate hidden document of its own
    PUT /pages
    {
      "mappings": {
        "properties": {
          "name": { "type": "text" },
          "page_numer":{ "type": "double" },
          "manufacturer": { "type": "nested" }
        }
      }
    }
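
    # (Sketch, not in the original notes; field and value are hypothetical)
    # querying a nested field requires the "nested" query with a "path" to the nested object
    GET /pages/_search
    {
      "query": {
        "nested": {
          "path": "manufacturer",
          "query": { "match": { "manufacturer.name": "acme" } }
        }
      }
    }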
    
    # date - there's a date type as well (by default it accepts ISO-8601 dates or epoch milliseconds)
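    # (Sketch, not in the original notes; index/field names are hypothetical)
    # a date field can be mapped with an explicit format
    PUT /pages_dates
    {
      "mappings": {
        "properties": {
          "created": { "type": "date", "format": "yyyy/MM/dd||epoch_millis" }
        }
      }
    }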
    
    # keyword
    # for searching exact values. The "text" type goes through the tokenizer, but "keyword" is stored as-is (the keyword analyzer is a no-op that emits the whole input as a single token)
    # useful for aggregations, filtering, and sorting, e.g. email addresses
    POST /_analyze
    {
      "text": "[email protected]",
      "analyzer": "keyword"
    }
    
    # ECS (Elastic Common Schema): uniform field names for common data such as logs
    
    # Type Coercion - the first value indexed for a field determines its mapped type (e.g. 7.4 -> float).
    # Later values are coerced into that type when possible: the string "7.4" is still stored as 7.4 (float),
    # but something like "7.4m" cannot be coerced, so indexing it fails
    PUT /pages/_doc/1
    {
      "price_f": "600"
    }
    GET /pages
    
    # Arrays
    # There is no dedicated array type; any field can hold one or more values, and arrays are simply flattened
    # Values in an array should be of the same data type; otherwise type coercion kicks in (or indexing fails)
    POST /pages/_doc
    {
      "tags": ["a", "b"]
    }
    
    GET /pages/_search
    
  6. Term-level Querying

    ################################################
    # Term level Query
    # fetch by id
    GET /prod/_search
    {
      "query": {
        "ids": {
          "values": [1,2]
        }
      }
    }
    
    # how do we search dates (or partial dates)? Use a range query
    # the "format" parameter tells Elasticsearch how the dates in the query are formatted; the dataset's dates look like 2001/12/31
    GET /prod/_search
    {
      "query": {
        "range": {
          "created": {
            "gte": "01-01-2001",
            "lte": "01-01-2005",
            "format": "dd-MM-yyyy"
          }
    
        }
      }
    }
    
    # Match documents where the field has a non-null value
    # an empty string "" is NOT null in Elasticsearch, so it still counts as a value
    # so here, documents whose tags field is missing (or null) are filtered out
    GET /prod/_search
    {
      "query": {
        "exists": {
          "field": "tags"
        }
      }
    }
    
    # prefix: only for text/keyword/wildcard types
    GET /prod/_search
    {
      "query": {
        "prefix": {
          "tags":  "win"
        }
      }
    }
    
    # wildcard
    GET /prod/_search
    {
      "query": {
        "wildcard": {
          "tags":  "*lco*"
        }
      }
    }
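
    # (Sketch, not in the original notes; field and value are hypothetical)
    # the basic term query matches an exact term on a keyword field; the search input is not analyzed
    GET /prod/_search
    {
      "query": {
        "term": {
          "tags.keyword": "Alcohol"
        }
      }
    }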
    
  7. Date math

    ################################################
    # date math: an anchor date +/- an interval (e.g. years), with || separating the anchor date from the math
    GET /prod/_search
    {
      "query": {
        "range": {
          "created": {
            "gte": "2001/01/01||-1y"
          }
        }
      }
    }
    # you can also use now on its own, or combine now with date math
    GET /prod/_search
    {
      "query": {
        "range": {
          "created": {
            "gte": "now-1y"
          }
        }
      }
    }
    
    
  8. Full text Queries

    ################################################
    # Full text Queries
    # see fields & datatypes of an index
    GET /prod/_mapping
    
    # Results may only partially contain the keywords, because by default the match query ORs the tokens of the query
    GET /prod/_search
    {
      "query": {
        "match": {
          "description": "Pellentesque asdfa"
        }
      }
    }
    
    # Now require all terms, by changing the operator to "and"
    # Note that the order of "at" and "Pellentesque" still doesn't affect the result
    GET /prod/_search
    {
      "query": {
        "match": {
          "description": {
            "query": "at Pellentesque" ,
            "operator": "and"
          }
        }
      }
    }
    
    # Now match the exact phrase (terms adjacent and in order)
    GET /prod/_search
    {
      "query": {
        "match_phrase": {
          "description": "Pellentesque at"
        }
      }
    }
    
    # Search the same term in two fields
    GET /prod/_search
    {
      "query": {
        "multi_match": {
          "query": "at",
          "fields": ["description", "tags"]
        }
      }
    }
    

========================================================================

Logstash

========================================================================

  1. Logstash's if [field] evaluates to false when the field's value is false OR when the field doesn't exist. A hack to tell the two apart is to give a metadata field a default value first, and only then copy the real field into it:
    # two mutate blocks so the default is set before the copy
    mutate {
      add_field => { "[@metadata][some_field]" => "NULL" }
    }
    mutate {
      copy => { "[MSG_FIELD]" => "[@metadata][some_field]" }
    }
    if [@metadata][some_field] != "NULL" {
      aggregate {
        task_id => "%{host}"
        code => "map['some_field'] = event.get('[MSG_FIELD]')"
        map_action => "create_or_update"
      }
    }

========================================================================

Whole System

========================================================================

  1. Metricbeat is much easier than Filebeat. A statsd timer such as:

    @statsd.timer("foo.bar")
    def foo(self):
        pass

    • becomes the field statsd.foo.bar, stored in the metricbeat-<beatversion>-<date> index (deleted manually)
  2. Logs (Logstash):

    • system module logs (preprocessed by Filebeat), like ssh history, etc.; the bash pipeline cleans up the system module formatting with grok
    • input logs (moxi log): the logging pipeline, which aggregates logs together (most of the processing)
    • both pipelines run at the same time in a single Logstash process
  3. Logstash -> Elasticsearch

    • a transform in Elasticsearch continuously computes summary statistics from a source index into a destination index (see the sketch at the end of this list)
  4. Heartbeat

    • Heartbeat sends its data directly to Elasticsearch
    • SSL?
  5. TLS - overhead?

    • the ELK stack supports SSL/TLS
    • 3-month policy?
    • index lifecycle management
  6. In Kibana you can query command keywords, e.g. bash.command.keyword : * and agent.hostname.keyword : moxi32
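
    A minimal sketch (not in the original notes) of the transform mentioned above; the transform, index, and field names are hypothetical:

    PUT _transform/hypothetical_host_stats
    {
      "source": { "index": "metricbeat-*" },
      "dest": { "index": "hypothetical_host_stats" },
      "pivot": {
        "group_by": {
          "host": { "terms": { "field": "agent.hostname.keyword" } }
        },
        "aggregations": {
          "avg_timer": { "avg": { "field": "statsd.foo.bar.mean" } }
        }
      },
      "sync": { "time": { "field": "@timestamp", "delay": "60s" } },
      "frequency": "1m"
    }
    # then start it
    POST _transform/hypothetical_host_stats/_start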
