DevOps Monitoring

Introduction

Purpose

References

  • Data

Installing and configuring

Set-up the host

debugging

  • sudo tcpdump -nn -i eno1 port 5601

ELK

  • sudo vi /etc/sysctl.conf
    • vm.max_map_count=262144
  • docker pull sebp/elk:5615
  • docker run --publish-all=true --name elk docker.io/sebp/elk:5615
  • docker ps
    • 5044 - logstash
    • 5601 - kibana
    • 9200 - elasticsearch - REST
    • 9300 - elasticsearch - for nodes communication
  • docker inspect 8a57a403e858 | grep IPAddress
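
To confirm the stack came up, a quick sketch (the container name elk matches the run command above; the published host ports will differ on your machine):

    # show which host ports Docker published for 5044/5601/9200/9300
    docker port elk

    # hit the Elasticsearch REST API through its published mapping
    # (replace 32772 with whatever 'docker port elk 9200' reports)
    curl http://localhost:32772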

Grafana

  1. docker pull grafana/grafana
  2. docker run --publish-all=true --name grafana grafana/grafana
  3. connect with a browser to Grafana (see the sketch below for finding the published port)
  4. log in as admin with password admin
  5. change the password
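
Since --publish-all picks random host ports, the sketch below shows how to find the one for the web UI (the container name grafana matches the run command above):

    # Grafana listens on 3000 inside the container; show the published host port
    docker port grafana 3000
    # e.g. 0.0.0.0:32768 -> browse to http://localhost:32768 and log in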

Add elasticsearch as source to grafana

  • click 'Add data source' and select Elasticsearch
  • HTTP
    • url: http://172.17.0.2:9200
      • 172.17.0.2 is the container's IP address on the Docker network
      • 9200 is the container's internal port; use it even though the published host port is e.g. 32772
    • Access: Server
  • Elasticsearch details
    • Index Name: [pipelineinfo-]YYYY.MM.DD
    • Pattern: Daily
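
Before saving the data source it can help to confirm that Elasticsearch answers on that address and that matching indices exist; a small sketch (the pipelineinfo- prefix comes from the index name above):

    # Elasticsearch should answer with its cluster name and version
    curl http://172.17.0.2:9200

    # list the daily indices the Grafana index pattern will match
    curl 'http://172.17.0.2:9200/_cat/indices/pipelineinfo-*?v'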

Graphite

  1. docker run --publish-all=true -e COLLECTD=1 --name graphite graphiteapp/graphite-statsd
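
A quick sketch for checking the container and pushing a test metric into Carbon's plaintext port (assuming the standard ports of the graphiteapp/graphite-statsd image: 80 for the web UI, 2003 for Carbon):

    # show the published host ports (web UI on 80, carbon plaintext on 2003)
    docker port graphite

    # send one test datapoint to the published carbon port
    # (replace 32770 with the host port mapped to 2003)
    echo "test.deploys.count 1 $(date +%s)" | nc -q0 localhost 32770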

Prometheus

Installing prometheus agents

sudo apt install -y prometheus-haproxy-exporter prometheus-node-exporter prometheus-libvirt-exporter
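
The exporters start as systemd services straight after the install; a quick check (the node exporter's default port is 9100):

    systemctl status prometheus-node-exporter
    # the node exporter serves its metrics on port 9100 by default
    curl -s http://localhost:9100/metrics | head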

Installing the Prometheus container

  1. docker pull prom/prometheus
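
To actually scrape the exporters installed above, the container needs a configuration file; a minimal prometheus.yml sketch, assuming the node exporter is reachable from the container via the Docker bridge gateway 172.17.0.1 on its default port 9100:

    # prometheus.yml - scrape the node exporter running on the Docker host
    scrape_configs:
      - job_name: node
        static_configs:
          - targets: ['172.17.0.1:9100']

Mount it over the image's default config path when starting the container:

    docker run --publish-all=true --name prometheus \
        -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
        prom/prometheus
    # the web UI is on the host port mapped to 9090 (docker port prometheus 9090)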

Grafana dashboards for prometheus

  • find Grafana dashboards at Dashboards (grafana.com/grafana/dashboards)
    • 1806 - node exporter
    • 12693 - haproxy (doesn't show any data)
    • 12538 - libvirt
    • 10530 - smartctl
  • to import a dashboard
    • click dashboards
    • click new
    • click import
    • enter the ID
    • click load
    • select 'prometheus' as the Prometheus data source
    • click import

Data Observability

Data Observability pillars

See:

  • Freshness: Freshness seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated.
    • Freshness is particularly important when it comes to decision making; after all, stale data is basically synonymous with wasted time and money.
  • Distribution: Distribution, in other words, a function of your data’s possible values, tells you if your data is within an accepted range.
    • Data distribution gives you insight into whether or not your tables can be trusted based on what can be expected from your data.
  • Volume: Volume refers to the completeness of your data tables and offers insights on the health of your data sources. If 200 million rows suddenly turns into 5 million, you should know.
  • Schema: Changes in the organization of your data, in other words its schema, often indicate broken data. Monitoring who makes changes to these tables and when is foundational to understanding the health of your data ecosystem.
  • Lineage: When data breaks, the first question is always “where?” Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it. Good lineage also collects information about the data (also referred to as metadata) that speaks to governance, business, and technical guidelines associated with specific data tables, serving as a single source of truth for all consumers.

Scratchpad

Fluent bit

Fluent bit installation

  • Installing with Helm Chart
    • helm repo add fluent https://fluent.github.io/helm-charts
    • helm search repo fluent
    • helm upgrade --install fluent-bit fluent/fluent-bit
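
A quick check that the chart deployed (app.kubernetes.io/name=fluent-bit is the label the fluent/fluent-bit chart applies by default):

    helm status fluent-bit
    kubectl get pods -l app.kubernetes.io/name=fluent-bit
    # tail one pod's log to see the inputs/outputs being initialised
    kubectl logs -l app.kubernetes.io/name=fluent-bit --tail=20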

Fluent bit configuration

  • [INPUT]
    • Name - name of the plugin
    • Tag - name of your tag
      • used in e.g. filters and outputs to match records
  • [SERVICE] - global settings for the Fluent Bit service itself
  • yaml (see the sketch after this list)
    • env
    • service
    • pipeline
      • inputs
      • filters
      • outputs
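
A minimal sketch of that YAML layout (a dummy input tagged app.log and a stdout output; the names and the tag are just illustrative):

    service:
      flush: 1
      log_level: info
    pipeline:
      inputs:
        - name: dummy        # plugin name; generates a test record every second
          tag: app.log       # tag that filters/outputs match on
      outputs:
        - name: stdout
          match: 'app.*'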

Plugins?

  • BackPressure

    • Mem_Buf_Limit
  • Monitoring

    • expose metrics (in json)
      • /api/v1/uptime
      • /api/v1/metrics
        • curl http://10.42.204.112:2020/api/v1/metrics | jq
      • /api/v1/metrics/prometheus - similar to metrics but in the prometheus format
      • /api/v1/health
  • Expect - enables us to validate that the data is formatted as expected (see the sketch after this list).

  • Fluentbit metrics

  • Health

  • Nginx exporter metrics

  • Node exporter metrics

  • Prometheus scrape metrics

    • Seems like this will do Prometheus scrapes; TODO: investigate
  • StatsD

  • Windows Exporter metrics

    • node exporter for windows?
  • OpenTelemetry input plugin

    • OTLP over HTTP
    • TCP port 4318
  • filter plugins?

    • Expect
    • GeoIP
    • Grep
    • k8s
    • record modifier
    • rewrite tag
    • throttle
    • modify
    • Nest
    • Lua scripts
  • output plugins

    • Prometheus remote write
    • prometheus exporter
    • OpenSearch
    • OpenTelemetry
    • Loki
    • stdout
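
A minimal sketch of the Expect filter mentioned above, checking that records carry the keys we rely on before they reach the output (the tag and key names are illustrative):

    [FILTER]
        Name        expect
        Match       json.*
        key_exists  level
        key_exists  msg
        action      warn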

Calyptia - can visualize the Fluent Bit configuration

fluentbit parsers

nested json

{
  "resourceLogs": [
    {
      "resource": {
        "attributes": [
          {
            "key": "service.name",
            "value": {
              "stringValue": "logs-basic-example"
            }
          }
        ]
      },
      "scopeLogs": [
        {
          "scope": {
            "name": "opentelemetry-log-appender",
            "version": "0.3.0"
          },
          "logRecords": [
            {
              "timeUnixNano": null,
              "time": null,
              "observedTimeUnixNano": 1712651426487589000,
              "observedTime": "2024-04-09 08:30:26.487",
              "severityNumber": 9,
              "severityText": "INFO",
              "body": {
                "stringValue": "Hello from logs-basic-example"
              },
              "attributes": [],
              "droppedAttributesCount": 0
            }
          ]
        }
      ]
    }
  ]
}

fluentbit filters

[FILTER]
    Name         parser
    Parser       simple_json
    Match        json.*
    Key_Name     msg
    Reserve_Data On
    Preserve_Key On
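
The simple_json parser referenced by the filter is not shown on this page; it would live in the parsers file and presumably looks something like this (an assumption, not the actual definition used):

    [PARSER]
        Name    simple_json
        Format  json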

with filter

{"create":{"_index":"logstash-2024.04.12"}}
{"@timestamp":"2024-04-12T09:10:58.305Z","log":"2024-04-12T09:10:58.305209409Z stderr F {\"level\":\"Info\",\"ts\":1712913058305,\"msg\":\"{ \\\"filename\\\": \\\"src/main.rs\\\", \\\"line\\\": 34, \\\"data\\\": \\\"time info\\\" }\"}"}
{"create":{"_index":"logstash-2024.04.12"}}
{"@timestamp":"2024-04-12T09:11:08.306Z","log":"2024-04-12T09:11:08.305943113Z stderr F {\"level\":\"Info\",\"ts\":1712913068305,\"msg\":\"{ \\\"filename\\\": \\\"src/main.rs\\\", \\\"line\\\": 34, \\\"data\\\": \\\"time info\\\" }\"}"}

without filter

{"@timestamp":"2024-04-12T09:19:08.321Z","log":"2024-04-12T09:19:08.320983428Z stderr F {\"level\":\"Info\",\"ts\":1712913548320,\"msg\":\"{ \\\"filename\\\": \\\"src/main.rs\\\", \\\"line\\\": 34, \\\"data\\\": \\\"time info\\\" }\"}"}

Fluentd

Troubleshooting

Troubleshooting grafana

Grafana getting 502 when trying to connect to elasticsearch

I had to use the IP address on the Docker container network and the actual container port 9200, not the port published on the public network.

http://172.17.0.2:9200
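
The container-network address can be read straight from the ELK container (same idea as the docker inspect | grep IPAddress above):

    # print the elk container's address on the default bridge network
    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' elk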

ELK Troubleshooting

vm.max_map_count

  • sudo sysctl -w vm.max_map_count=262144
    • for the elk stack to run

Troubleshooting fluentbit

illegal_argument_exception

Remove or replace the _type parameter: since _type is no longer supported, you should either remove it entirely or replace it with an appropriate parameter, depending on your Elasticsearch version. In Elasticsearch 7.x and later, documents are stored in a single _doc type by default, so you can simply remove the Type parameter altogether (or set Suppress_Type_Name on, as in the configs below).

    [OUTPUT]
        Name es
        Match kube.*
        Host elasticsearch-master
        Suppress_Type_Name on
        Logstash_Format On
        Retry_Limit False
        Type _doc

    [OUTPUT]
        Name es
        Match host.*
        Host elasticsearch-master
        Suppress_Type_Name on
        Logstash_Format On
        Logstash_Prefix node
        Retry_Limit False
        Type _doc
[2024/04/08 09:11:14] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Action/metadata line [1] contains an unknown parameter [_type]"}],"type":"illegal_argument_exception","reason":"Action/metadata line [1] contains an unknown parameter [_type]"},"status":400}

[2024/04/08 09:11:14] [ warn] [engine] failed to flush chunk '1-1712566940.860717429.flb', retry in 532 seconds: task_id=171, input=tail.0 > output=es.0 (out_id=0)
[2024/04/08 09:11:14] [ warn] [engine] failed to flush chunk '1-1712566838.861255530.flb', retry in 538 seconds: task_id=53, input=tail.0 > output=es.0 (out_id=0)

failed to flush chunk '1-1712829235.737404026.flb', retry in 8 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0)

[2024/04/11 10:22:26] [debug] [input chunk] update output instances with new chunk size diff=990, records=1, input=tail.0
[2024/04/11 10:22:26] [debug] [task] created task=0x7fb55a239f20 id=3 OK
[2024/04/11 10:22:26] [debug] [upstream] KA connection #125 to elasticsearch-master:9200 has been assigned (recycled)
[2024/04/11 10:22:26] [debug] [output:es:es.0] task_id=3 assigned to thread #0
[2024/04/11 10:22:26] [debug] [http_client] not using http_proxy for header
[2024/04/11 10:22:26] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2024/04/11 10:22:26] [debug] [output:es:es.0] Elasticsearch response
{"errors":false,"took":51,"items":[{"create":{"_index":"logstash-2024.04.11","_id":"vUWuzI4B-yGqFaeXHvYb","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":556031,"_primary_term":2,"status":201}}]}
[2024/04/11 10:22:26] [debug] [upstream] KA connection #125 to elasticsearch-master:9200 is now available
[2024/04/11 10:22:26] [debug] [out flush] cb_destroy coro_id=183
[2024/04/11 10:22:26] [debug] [task] destroy task=0x7fb55a239f20 (task_id=3)
[2024/04/11 10:22:27] [debug] [output:es:es.0] task_id=2 assigned to thread #1
[2024/04/11 10:22:27] [debug] [upstream] KA connection #122 to elasticsearch-master:9200 has been assigned (recycled)
[2024/04/11 10:22:27] [debug] [http_client] not using http_proxy for header
[2024/04/11 10:22:27] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2024/04/11 10:22:27] [debug] [upstream] KA connection #122 to elasticsearch-master:9200 is now available
[2024/04/11 10:22:27] [debug] [out flush] cb_destroy coro_id=183
[2024/04/11 10:22:27] [debug] [retry] re-using retry for task_id=2 attempts=2
[2024/04/11 10:22:27] [ warn] [engine] failed to flush chunk '1-1712830937.946819883.flb', retry in 9 seconds: task_id=2, input=tail.0 > output=es.0 (out_id=0)
[2024/04/11 10:22:27] [debug] [input:tail:tail.0] inode=312543, /var/log/containers/vanilla-log-7c7bdb9545-jvrqt_default_vanilla-log-58da14eb59c2e88f5dd1389f42fd80de64c21d14804a5c166d0bca5be163305d.log, events: IN_MODIFY 
[2024/04/11 10:22:27] [debug] [input chunk] update output instances with new chunk size diff=818, records=1, input=tail.0
[2024/04/11 10:22:28] [debug] [task] created task=0x7fb55a239c00 id=3 OK
[2024/04/11 10:22:28] [debug] [upstream] KA connection #123 to elasticsearch-master:9200 has been assigned (recycled)
[2024/04/11 10:22:28] [debug] [output:es:es.0] task_id=3 assigned to thread #0
[2024/04/11 10:22:28] [debug] [http_client] not using http_proxy for header
[2024/04/11 10:22:28] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2024/04/11 10:22:28] [debug] [upstream] KA connection #123 to elasticsearch-master:9200 is now available
[2024/04/11 10:22:28] [debug] [out flush] cb_destroy coro_id=184
[2024/04/11 10:22:28] [debug] [retry] new retry created for task_id=3 attempts=1
[2024/04/11 10:22:28] [ warn] [engine] failed to flush chunk '1-1712830947.947063732.flb', retry in 9 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)

To debug the failed flushes, add

        Trace_Error       On
        Trace_Output      On

to the [OUTPUT] section; this makes the es output print the bulk request and the Elasticsearch error response:

    [OUTPUT]
        Name es
        Match kube.*
        Host elasticsearch-master
        Logstash_Format On
        Retry_Limit False
        Trace_Error       On
        Trace_Output      On
        Suppress_Type_Name on
        Type _doc
[2024/04/11 21:10:55] [ info] [input:tail:tail.0] inotify_fs_add(): inode=311564 watch_fd=1 name=/var/log/containers/log-json-simple-7996b6c769-pdnb9_default_log-json-simple-0f06eff0c18da53700b06390f264163aac4652e4ae3acf6de395b1911dbc92f8.log
{"create":{"_index":"node-2024.04.11"}}
{"@timestamp":"2024-04-11T21:10:56.292Z","PRIORITY":"6","SYSLOG_FACILITY":"3","_UID":"0","_GID":"0","_CAP_EFFECTIVE":"1ffffffffff","_SELINUX_CONTEXT":"unconfined\n","_MACHINE_ID":"16f08a93c1904a3191927197e0cfbffb","_HOSTNAME":"worker1","_SYSTEMD_SLICE":"system.slice","_TRANSPORT":"stdout","SYSLOG_IDENTIFIER":"kubelet","_COMM":"kubelet","_EXE":"/usr/bin/kubelet","_CMDLINE":"/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock --pod-infra-container-image=registry.k8s.io/pause:3.9","_SYSTEMD_CGROUP":"/system.slice/kubelet.service","_SYSTEMD_UNIT":"kubelet.service","_PID":"822","_BOOT_ID":"1f405ae4c27d4f77b6aa322c97c976a7","_STREAM_ID":"c3ffcb16b5db48e0afffcab8690165d4","_SYSTEMD_INVOCATION_ID":"1b8e62ec16bb4d28b770bd2adfd3d194","MESSAGE":"I0411 21:10:56.292107     822 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"default/fluent-bit-bvc68\" podStartSLOduration=4.405401576 podStartE2EDuration=\"7.292045832s\" podCreationTimestamp=\"2024-04-11 21:10:49 +0000 UTC\" firstStartedPulling=\"2024-04-11 21:10:51.470213906 +0000 UTC m=+47501.068491933\" lastFinishedPulling=\"2024-04-11 21:10:54.356858132 +0000 UTC m=+47503.955136189\" observedRunningTime=\"2024-04-11 21:10:56.291283103 +0000 UTC m=+47505.889560930\" watchObservedRunningTime=\"2024-04-11 21:10:56.292045832 +0000 UTC m=+47505.890323619\""}
{"create":{"_index":"logstash-2024.04.11"}}
{"@timestamp":"2024-04-11T21:10:57.820Z","log":"2024-04-11T21:10:57.820318828Z stderr F {\"level\":\"Info\",\"ts\":1712869857819,\"msg\":\"Hello from logs-basic-example\"}","kubernetes":{"pod_name":"log-json-simple-7996b6c769-pdnb9","namespace_name":"default","pod_id":"39c7e953-7648-4d4a-99ee-45ff147e6c69","labels":{"app":"log-json-simple","pod-template-hash":"7996b6c769"},"annotations":{"cni.projectcalico.org/containerID":"1064c90cebfad72b17cb9abee89ed5e1c1965df1cffbe7b6c5ef309ea0fae422","cni.projectcalico.org/podIP":"10.42.235.138/32","cni.projectcalico.org/podIPs":"10.42.235.138/32"},"host":"worker1","container_name":"log-json-simple","docker_id":"0f06eff0c18da53700b06390f264163aac4652e4ae3acf6de395b1911dbc92f8","container_hash":"192.168.1.102:5000/log-json-simple@sha256:2bf291c81781a92c5ad7a5bbbfc7ba80974d928d07a681cd51a8708ff5100687","container_image":"192.168.1.102:5000/log-json-simple:0.1.0"}}
[2024/04/11 21:10:57] [error] [output:es:es.0] error: Output
{"errors":true,"took":0,"items":[{"create":{"_index":"logstash-2024.04.11","_id":"kWD_zo4B-yGqFaeX2nsq","status":400,"error":{"type":"document_parsing_exception","reason":"[1:325] object mapping for [kubernetes.labels.app] tried to parse field [app] as object, but found a concrete value"}}}]}
[2024/04/11 21:10:57] [ warn] [engine] failed to flush chunk '1-1712869857.820601044.flb', retry in 6 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
{
  "@timestamp": "2024-04-11T21:10:56.292Z",
  "PRIORITY": "6",
  "SYSLOG_FACILITY": "3",
  "_UID": "0",
  "_GID": "0",
  "_CAP_EFFECTIVE": "1ffffffffff",
  "_SELINUX_CONTEXT": "unconfined\n",
  "_MACHINE_ID": "16f08a93c1904a3191927197e0cfbffb",
  "_HOSTNAME": "worker1",
  "_SYSTEMD_SLICE": "system.slice",
  "_TRANSPORT": "stdout",
  "SYSLOG_IDENTIFIER": "kubelet",
  "_COMM": "kubelet",
  "_EXE": "/usr/bin/kubelet",
  "_CMDLINE": "/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock --pod-infra-container-image=registry.k8s.io/pause:3.9",
  "_SYSTEMD_CGROUP": "/system.slice/kubelet.service",
  "_SYSTEMD_UNIT": "kubelet.service",
  "_PID": "822",
  "_BOOT_ID": "1f405ae4c27d4f77b6aa322c97c976a7",
  "_STREAM_ID": "c3ffcb16b5db48e0afffcab8690165d4",
  "_SYSTEMD_INVOCATION_ID": "1b8e62ec16bb4d28b770bd2adfd3d194",
  "MESSAGE": "I0411 21:10:56.292107     822 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"default/fluent-bit-bvc68\" podStartSLOduration=4.405401576 podStartE2EDuration=\"7.292045832s\" podCreationTimestamp=\"2024-04-11 21:10:49 +0000 UTC\" firstStartedPulling=\"2024-04-11 21:10:51.470213906 +0000 UTC m=+47501.068491933\" lastFinishedPulling=\"2024-04-11 21:10:54.356858132 +0000 UTC m=+47503.955136189\" observedRunningTime=\"2024-04-11 21:10:56.291283103 +0000 UTC m=+47505.889560930\" watchObservedRunningTime=\"2024-04-11 21:10:56.292045832 +0000 UTC m=+47505.890323619\""
}
{
  "errors": true,
  "took": 0,
  "items": [
    {
      "create": {
        "_index": "logstash-2024.04.11",
        "_id": "kWD_zo4B-yGqFaeX2nsq",
        "status": 400,
        "error": {
          "type": "document_parsing_exception",
          "reason": "[1:325] object mapping for [kubernetes.labels.app] tried to parse field [app] as object, but found a concrete value"
        }
      }
    }
  ]
}