DevOps Monitoring

Introduction

Purpose

References

  • Data

Installing and configuring

Set-up the host

debugging

  • sudo tcpdump -nn -i eno1 port 5601

ELK

  • sudo vi /etc/sysctl.conf
    • vm.max_map_count=262144
  • docker pull sebp/elk:5615
  • docker run --publish-all=true --name elk docker.io/sebp/elk:5615
  • docker ps
    • 5044 - logstash
    • 5601 - kibana
    • 9200 - elasticsearch - REST
    • 9300 - elasticsearch - for nodes communication
  • docker inspect 8a57a403e858 | grep IPAddress
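
To confirm the stack came up, a quick sketch (the container name elk matches the run command above; the published host ports will differ on your machine):

    # show which host ports Docker published for 5044/5601/9200/9300
    docker port elk

    # hit the Elasticsearch REST API through its published mapping
    # (replace 32772 with whatever 'docker port elk 9200' reports)
    curl http://localhost:32772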

Grafana

  1. docker pull grafana/grafana
  2. docker run --publish-all=true --name grafana grafana/grafana
  3. connect with a browser to Grafana (see the sketch below for finding the published port)
  4. log in as admin with password admin
  5. change the password
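
Since --publish-all picks random host ports, the sketch below shows how to find the one for the web UI (the container name grafana matches the run command above):

    # Grafana listens on 3000 inside the container; show the published host port
    docker port grafana 3000
    # e.g. 0.0.0.0:32768 -> browse to http://localhost:32768 and log in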

Add elasticsearch as source to grafana

  • click 'Add data source' and select Elasticsearch
  • HTTP
    • url: http://172.17.0.2:9200
      • 172.17.0.2 is the container's IP address on the Docker network
      • 9200 is the container's internal port; use it even though the published host port is e.g. 32772
    • Access: Server
  • Elasticsearch details
    • Index Name: [pipelineinfo-]YYYY.MM.DD
    • Pattern: Daily
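
Before saving the data source it can help to confirm that Elasticsearch answers on that address and that matching indices exist; a small sketch (the pipelineinfo- prefix comes from the index name above):

    # Elasticsearch should answer with its cluster name and version
    curl http://172.17.0.2:9200

    # list the daily indices the Grafana index pattern will match
    curl 'http://172.17.0.2:9200/_cat/indices/pipelineinfo-*?v'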

Graphite

  1. docker run --publish-all=true -e COLLECTD=1 --name graphite graphiteapp/graphite-statsd
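
A quick sketch for checking the container and pushing a test metric into Carbon's plaintext port (assuming the standard ports of the graphiteapp/graphite-statsd image: 80 for the web UI, 2003 for Carbon):

    # show the published host ports (web UI on 80, carbon plaintext on 2003)
    docker port graphite

    # send one test datapoint to the published carbon port
    # (replace 32770 with the host port mapped to 2003)
    echo "test.deploys.count 1 $(date +%s)" | nc -q0 localhost 32770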

Prometheus

Installing prometheus agents

sudo apt install -y prometheus-haproxy-exporter prometheus-node-exporter prometheus-libvirt-exporter
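
The exporters start as systemd services straight after the install; a quick check (the node exporter's default port is 9100):

    systemctl status prometheus-node-exporter
    # the node exporter serves its metrics on port 9100 by default
    curl -s http://localhost:9100/metrics | head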

Installing the Prometheus container

  1. docker pull prom/prometheus
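
To actually scrape the exporters installed above, the container needs a configuration file; a minimal prometheus.yml sketch, assuming the node exporter is reachable from the container via the Docker bridge gateway 172.17.0.1 on its default port 9100:

    # prometheus.yml - scrape the node exporter running on the Docker host
    scrape_configs:
      - job_name: node
        static_configs:
          - targets: ['172.17.0.1:9100']

Mount it over the image's default config path when starting the container:

    docker run --publish-all=true --name prometheus \
        -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
        prom/prometheus
    # the web UI is on the host port mapped to 9090 (docker port prometheus 9090)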

Grafana dashboards for prometheus

  • find Grafana dashboards at Dashboards (grafana.com/grafana/dashboards)
    • 1806 - node exporter
    • 12693 - haproxy (doesn't show any data)
    • 12538 - libvirt
    • 10530 - smartctl
  • to import a dashboard
    • click dashboards
    • click new
    • click import
    • enter the ID
    • click load
    • select 'prometheus' as the Prometheus data source
    • click import

Data Observability

Data Observability pillars

See:

  • Freshness: Freshness seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated.
    • Freshness is particularly important when it comes to decision making; after all, stale data is basically synonymous with wasted time and money.
  • Distribution: Distribution, in other words, a function of your data’s possible values, tells you if your data is within an accepted range.
    • Data distribution gives you insight into whether or not your tables can be trusted based on what can be expected from your data.
  • Volume: Volume refers to the completeness of your data tables and offers insights on the health of your data sources. If 200 million rows suddenly turns into 5 million, you should know.
  • Schema: Changes in the organization of your data, in other words its schema, often indicate broken data. Monitoring who makes changes to these tables and when is foundational to understanding the health of your data ecosystem.
  • Lineage: When data breaks, the first question is always “where?” Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it. Good lineage also collects information about the data (also referred to as metadata) that speaks to governance, business, and technical guidelines associated with specific data tables, serving as a single source of truth for all consumers.

Scratchpad

Fluent bit

Fluent bit installation

  • Installing with Helm Chart
    • helm repo add fluent https://fluent.github.io/helm-charts
    • helm search repo fluent
    • helm upgrade --install fluent-bit fluent/fluent-bit
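
A quick check that the chart deployed (app.kubernetes.io/name=fluent-bit is the label the fluent/fluent-bit chart applies by default):

    helm status fluent-bit
    kubectl get pods -l app.kubernetes.io/name=fluent-bit
    # tail one pod's log to see the inputs/outputs being initialised
    kubectl logs -l app.kubernetes.io/name=fluent-bit --tail=20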

Fluent bit configuration

  • [INPUT]
    • Name - name of the plugin
    • Tag - name of your tag
      • used in e.g. filters and outputs to match records
  • [SERVICE] - global settings for the Fluent Bit service itself
  • yaml (see the sketch after this list)
    • env
    • service
    • pipeline
      • inputs
      • filters
      • outputs
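
A minimal sketch of that YAML layout (a dummy input tagged app.log and a stdout output; the names and the tag are just illustrative):

    service:
      flush: 1
      log_level: info
    pipeline:
      inputs:
        - name: dummy        # plugin name; generates a test record every second
          tag: app.log       # tag that filters/outputs match on
      outputs:
        - name: stdout
          match: 'app.*'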

Plugins?

  • BackPressure

    • Mem_Buf_Limit
  • Monitoring

    • expose metrics (in json)
      • /api/v1/uptime
      • /api/v1/metrics
        • curl http://10.42.204.112:2020/api/v1/metrics | jq
      • /api/v1/metrics/prometheus - similar to metrics but in the prometheus format
      • /api/v1/health
  • Expect - enables us to validate that the data is formatted as expected (see the sketch after this list).

  • Fluentbit metrics

  • Health

  • Nginx exporter metrics

  • Node exporter metrics

  • Prometheus scrape metrics

    • Seems like this will do Prometheus scrapes; TODO: investigate
  • StatsD

  • Windows Exporter metrics

    • node exporter for windows?
  • OpenTelemetry input plugin

    • OTLP over HTTP
    • TCP port 4318
  • filter plugins?

    • Expect
    • GeoIP
    • Grep
    • k8s
    • record modifier
    • rewrite tag
    • throttle
    • modify
    • Nest
    • Lua scripts
  • output plugins

    • Prometheus remote write
    • prometheus exporter
    • OpenSearch
    • OpenTelemetry
    • Loki
    • stdout
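
A minimal sketch of the Expect filter mentioned above, checking that records carry the keys we rely on before they reach the output (the tag and key names are illustrative):

    [FILTER]
        Name        expect
        Match       json.*
        key_exists  level
        key_exists  msg
        action      warn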

Calyptia - can visualize the Fluent Bit configuration

fluentbit parsers

nested json

{
  "resourceLogs": [
    {
      "resource": {
        "attributes": [
          {
            "key": "service.name",
            "value": {
              "stringValue": "logs-basic-example"
            }
          }
        ]
      },
      "scopeLogs": [
        {
          "scope": {
            "name": "opentelemetry-log-appender",
            "version": "0.3.0"
          },
          "logRecords": [
            {
              "timeUnixNano": null,
              "time": null,
              "observedTimeUnixNano": 1712651426487589000,
              "observedTime": "2024-04-09 08:30:26.487",
              "severityNumber": 9,
              "severityText": "INFO",
              "body": {
                "stringValue": "Hello from logs-basic-example"
              },
              "attributes": [],
              "droppedAttributesCount": 0
            }
          ]
        }
      ]
    }
  ]
}

fluentbit filters

[FILTER]
    Name         parser
    Parser       simple_json
    Match        json.*
    Key_Name     msg
    Reserve_Data On
    Preserve_Key On
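
The simple_json parser referenced by the filter is not shown on this page; it would live in the parsers file and presumably looks something like this (an assumption, not the actual definition used):

    [PARSER]
        Name    simple_json
        Format  json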

with filter

{"create":{"_index":"logstash-2024.04.12"}}
{"@timestamp":"2024-04-12T09:10:58.305Z","log":"2024-04-12T09:10:58.305209409Z stderr F {\"level\":\"Info\",\"ts\":1712913058305,\"msg\":\"{ \\\"filename\\\": \\\"src/main.rs\\\", \\\"line\\\": 34, \\\"data\\\": \\\"time info\\\" }\"}"}
{"create":{"_index":"logstash-2024.04.12"}}
{"@timestamp":"2024-04-12T09:11:08.306Z","log":"2024-04-12T09:11:08.305943113Z stderr F {\"level\":\"Info\",\"ts\":1712913068305,\"msg\":\"{ \\\"filename\\\": \\\"src/main.rs\\\", \\\"line\\\": 34, \\\"data\\\": \\\"time info\\\" }\"}"}

without filter

{"@timestamp":"2024-04-12T09:19:08.321Z","log":"2024-04-12T09:19:08.320983428Z stderr F {\"level\":\"Info\",\"ts\":1712913548320,\"msg\":\"{ \\\"filename\\\": \\\"src/main.rs\\\", \\\"line\\\": 34, \\\"data\\\": \\\"time info\\\" }\"}"}

Fluentd

Troubleshooting

Troubleshooting grafana

Grafana getting 502 when trying to connect to elasticsearch

I had to use the IP address on the Docker container network and the actual container port 9200, not the port published on the public network.

http://172.17.0.2:9200
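
The container-network address can be read straight from the ELK container (same idea as the docker inspect | grep IPAddress above):

    # print the elk container's address on the default bridge network
    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' elk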

ELK Troubleshooting

vm.max_map_count

  • sudo sysctl -w vm.max_map_count=262144
    • for the elk stack to run

Troubleshooting fluentbit

illegal_argument_exception

Remove or replace the _type parameter: since _type is no longer supported, you should either remove it entirely or replace it with an appropriate parameter, depending on your Elasticsearch version. In Elasticsearch 7.x and later, documents are stored in a single _doc type by default, so you can simply remove the Type parameter altogether (or set Suppress_Type_Name on, as in the configs below).

    [OUTPUT]
        Name es
        Match kube.*
        Host elasticsearch-master
        Suppress_Type_Name on
        Logstash_Format On
        Retry_Limit False
        Type _doc

    [OUTPUT]
        Name es
        Match host.*
        Host elasticsearch-master
        Suppress_Type_Name on
        Logstash_Format On
        Logstash_Prefix node
        Retry_Limit False
        Type _doc
[2024/04/08 09:11:14] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Action/metadata line [1] contains an unknown parameter [_type]"}],"type":"illegal_argument_exception","reason":"Action/metadata line [1] contains an unknown parameter [_type]"},"status":400}

[2024/04/08 09:11:14] [ warn] [engine] failed to flush chunk '1-1712566940.860717429.flb', retry in 532 seconds: task_id=171, input=tail.0 > output=es.0 (out_id=0)
[2024/04/08 09:11:14] [ warn] [engine] failed to flush chunk '1-1712566838.861255530.flb', retry in 538 seconds: task_id=53, input=tail.0 > output=es.0 (out_id=0)

failed to flush chunk '1-1712829235.737404026.flb', retry in 8 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0)

[2024/04/11 10:22:26] [debug] [input chunk] update output instances with new chunk size diff=990, records=1, input=tail.0
[2024/04/11 10:22:26] [debug] [task] created task=0x7fb55a239f20 id=3 OK
[2024/04/11 10:22:26] [debug] [upstream] KA connection #125 to elasticsearch-master:9200 has been assigned (recycled)
[2024/04/11 10:22:26] [debug] [output:es:es.0] task_id=3 assigned to thread #0
[2024/04/11 10:22:26] [debug] [http_client] not using http_proxy for header
[2024/04/11 10:22:26] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2024/04/11 10:22:26] [debug] [output:es:es.0] Elasticsearch response
{"errors":false,"took":51,"items":[{"create":{"_index":"logstash-2024.04.11","_id":"vUWuzI4B-yGqFaeXHvYb","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":556031,"_primary_term":2,"status":201}}]}
[2024/04/11 10:22:26] [debug] [upstream] KA connection #125 to elasticsearch-master:9200 is now available
[2024/04/11 10:22:26] [debug] [out flush] cb_destroy coro_id=183
[2024/04/11 10:22:26] [debug] [task] destroy task=0x7fb55a239f20 (task_id=3)
[2024/04/11 10:22:27] [debug] [output:es:es.0] task_id=2 assigned to thread #1
[2024/04/11 10:22:27] [debug] [upstream] KA connection #122 to elasticsearch-master:9200 has been assigned (recycled)
[2024/04/11 10:22:27] [debug] [http_client] not using http_proxy for header
[2024/04/11 10:22:27] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2024/04/11 10:22:27] [debug] [upstream] KA connection #122 to elasticsearch-master:9200 is now available
[2024/04/11 10:22:27] [debug] [out flush] cb_destroy coro_id=183
[2024/04/11 10:22:27] [debug] [retry] re-using retry for task_id=2 attempts=2
[2024/04/11 10:22:27] [ warn] [engine] failed to flush chunk '1-1712830937.946819883.flb', retry in 9 seconds: task_id=2, input=tail.0 > output=es.0 (out_id=0)
[2024/04/11 10:22:27] [debug] [input:tail:tail.0] inode=312543, /var/log/containers/vanilla-log-7c7bdb9545-jvrqt_default_vanilla-log-58da14eb59c2e88f5dd1389f42fd80de64c21d14804a5c166d0bca5be163305d.log, events: IN_MODIFY 
[2024/04/11 10:22:27] [debug] [input chunk] update output instances with new chunk size diff=818, records=1, input=tail.0
[2024/04/11 10:22:28] [debug] [task] created task=0x7fb55a239c00 id=3 OK
[2024/04/11 10:22:28] [debug] [upstream] KA connection #123 to elasticsearch-master:9200 has been assigned (recycled)
[2024/04/11 10:22:28] [debug] [output:es:es.0] task_id=3 assigned to thread #0
[2024/04/11 10:22:28] [debug] [http_client] not using http_proxy for header
[2024/04/11 10:22:28] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2024/04/11 10:22:28] [debug] [upstream] KA connection #123 to elasticsearch-master:9200 is now available
[2024/04/11 10:22:28] [debug] [out flush] cb_destroy coro_id=184
[2024/04/11 10:22:28] [debug] [retry] new retry created for task_id=3 attempts=1
[2024/04/11 10:22:28] [ warn] [engine] failed to flush chunk '1-1712830947.947063732.flb', retry in 9 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)

To debug the failed flushes, add

        Trace_Error       On
        Trace_Output      On

to the [OUTPUT] section; this makes the es output print the bulk request and the Elasticsearch error response:

    [OUTPUT]
        Name es
        Match kube.*
        Host elasticsearch-master
        Logstash_Format On
        Retry_Limit False
        Trace_Error       On
        Trace_Output      On
        Suppress_Type_Name on
        Type _doc
[2024/04/11 21:10:55] [ info] [input:tail:tail.0] inotify_fs_add(): inode=311564 watch_fd=1 name=/var/log/containers/log-json-simple-7996b6c769-pdnb9_default_log-json-simple-0f06eff0c18da53700b06390f264163aac4652e4ae3acf6de395b1911dbc92f8.log
{"create":{"_index":"node-2024.04.11"}}
{"@timestamp":"2024-04-11T21:10:56.292Z","PRIORITY":"6","SYSLOG_FACILITY":"3","_UID":"0","_GID":"0","_CAP_EFFECTIVE":"1ffffffffff","_SELINUX_CONTEXT":"unconfined\n","_MACHINE_ID":"16f08a93c1904a3191927197e0cfbffb","_HOSTNAME":"worker1","_SYSTEMD_SLICE":"system.slice","_TRANSPORT":"stdout","SYSLOG_IDENTIFIER":"kubelet","_COMM":"kubelet","_EXE":"/usr/bin/kubelet","_CMDLINE":"/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock --pod-infra-container-image=registry.k8s.io/pause:3.9","_SYSTEMD_CGROUP":"/system.slice/kubelet.service","_SYSTEMD_UNIT":"kubelet.service","_PID":"822","_BOOT_ID":"1f405ae4c27d4f77b6aa322c97c976a7","_STREAM_ID":"c3ffcb16b5db48e0afffcab8690165d4","_SYSTEMD_INVOCATION_ID":"1b8e62ec16bb4d28b770bd2adfd3d194","MESSAGE":"I0411 21:10:56.292107     822 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"default/fluent-bit-bvc68\" podStartSLOduration=4.405401576 podStartE2EDuration=\"7.292045832s\" podCreationTimestamp=\"2024-04-11 21:10:49 +0000 UTC\" firstStartedPulling=\"2024-04-11 21:10:51.470213906 +0000 UTC m=+47501.068491933\" lastFinishedPulling=\"2024-04-11 21:10:54.356858132 +0000 UTC m=+47503.955136189\" observedRunningTime=\"2024-04-11 21:10:56.291283103 +0000 UTC m=+47505.889560930\" watchObservedRunningTime=\"2024-04-11 21:10:56.292045832 +0000 UTC m=+47505.890323619\""}
{"create":{"_index":"logstash-2024.04.11"}}
{"@timestamp":"2024-04-11T21:10:57.820Z","log":"2024-04-11T21:10:57.820318828Z stderr F {\"level\":\"Info\",\"ts\":1712869857819,\"msg\":\"Hello from logs-basic-example\"}","kubernetes":{"pod_name":"log-json-simple-7996b6c769-pdnb9","namespace_name":"default","pod_id":"39c7e953-7648-4d4a-99ee-45ff147e6c69","labels":{"app":"log-json-simple","pod-template-hash":"7996b6c769"},"annotations":{"cni.projectcalico.org/containerID":"1064c90cebfad72b17cb9abee89ed5e1c1965df1cffbe7b6c5ef309ea0fae422","cni.projectcalico.org/podIP":"10.42.235.138/32","cni.projectcalico.org/podIPs":"10.42.235.138/32"},"host":"worker1","container_name":"log-json-simple","docker_id":"0f06eff0c18da53700b06390f264163aac4652e4ae3acf6de395b1911dbc92f8","container_hash":"192.168.1.102:5000/log-json-simple@sha256:2bf291c81781a92c5ad7a5bbbfc7ba80974d928d07a681cd51a8708ff5100687","container_image":"192.168.1.102:5000/log-json-simple:0.1.0"}}
[2024/04/11 21:10:57] [error] [output:es:es.0] error: Output
{"errors":true,"took":0,"items":[{"create":{"_index":"logstash-2024.04.11","_id":"kWD_zo4B-yGqFaeX2nsq","status":400,"error":{"type":"document_parsing_exception","reason":"[1:325] object mapping for [kubernetes.labels.app] tried to parse field [app] as object, but found a concrete value"}}}]}
[2024/04/11 21:10:57] [ warn] [engine] failed to flush chunk '1-1712869857.820601044.flb', retry in 6 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
{
  "@timestamp": "2024-04-11T21:10:56.292Z",
  "PRIORITY": "6",
  "SYSLOG_FACILITY": "3",
  "_UID": "0",
  "_GID": "0",
  "_CAP_EFFECTIVE": "1ffffffffff",
  "_SELINUX_CONTEXT": "unconfined\n",
  "_MACHINE_ID": "16f08a93c1904a3191927197e0cfbffb",
  "_HOSTNAME": "worker1",
  "_SYSTEMD_SLICE": "system.slice",
  "_TRANSPORT": "stdout",
  "SYSLOG_IDENTIFIER": "kubelet",
  "_COMM": "kubelet",
  "_EXE": "/usr/bin/kubelet",
  "_CMDLINE": "/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock --pod-infra-container-image=registry.k8s.io/pause:3.9",
  "_SYSTEMD_CGROUP": "/system.slice/kubelet.service",
  "_SYSTEMD_UNIT": "kubelet.service",
  "_PID": "822",
  "_BOOT_ID": "1f405ae4c27d4f77b6aa322c97c976a7",
  "_STREAM_ID": "c3ffcb16b5db48e0afffcab8690165d4",
  "_SYSTEMD_INVOCATION_ID": "1b8e62ec16bb4d28b770bd2adfd3d194",
  "MESSAGE": "I0411 21:10:56.292107     822 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"default/fluent-bit-bvc68\" podStartSLOduration=4.405401576 podStartE2EDuration=\"7.292045832s\" podCreationTimestamp=\"2024-04-11 21:10:49 +0000 UTC\" firstStartedPulling=\"2024-04-11 21:10:51.470213906 +0000 UTC m=+47501.068491933\" lastFinishedPulling=\"2024-04-11 21:10:54.356858132 +0000 UTC m=+47503.955136189\" observedRunningTime=\"2024-04-11 21:10:56.291283103 +0000 UTC m=+47505.889560930\" watchObservedRunningTime=\"2024-04-11 21:10:56.292045832 +0000 UTC m=+47505.890323619\""
}
{
  "errors": true,
  "took": 0,
  "items": [
    {
      "create": {
        "_index": "logstash-2024.04.11",
        "_id": "kWD_zo4B-yGqFaeX2nsq",
        "status": 400,
        "error": {
          "type": "document_parsing_exception",
          "reason": "[1:325] object mapping for [kubernetes.labels.app] tried to parse field [app] as object, but found a concrete value"
        }
      }
    }
  ]
}