DevOps Monitoring
Introduction
Purpose
References
Installing and configuring
Set-up the host
debugging
- sudo tcpdump -nn -i eno1 port 5601
ELK
- sudo vi /etc/sysctl.conf
- vm.max_map_count=262144
- docker pull sebp/elk:5615
- docker run --publish-all=true --name elk docker.io/sebp/elk:5615
- docker ps
  - 5044 - Logstash
  - 5601 - Kibana
  - 9200 - Elasticsearch - REST API
  - 9300 - Elasticsearch - node-to-node communication
- docker inspect 8a57a403e858 | grep IPAddress
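A quick sanity check once the container is up (the bridge IP comes from the `docker inspect` call above and will differ on your host):

```bash
# Elasticsearch should answer on its REST port 9200 via the container's bridge IP.
curl http://172.17.0.2:9200/_cluster/health?pretty

# Kibana answers on 5601 once it has finished starting.
curl -I http://172.17.0.2:5601
```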
Grafana
- docker pull grafana/grafana
- docker run --publish-all=true --name grafana grafana/grafana
- connect to Grafana with a browser
- log in as admin with password admin
- change the password
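Because the container was started with `--publish-all`, the host port mapped to Grafana's internal port 3000 is random; one way to find it (a sketch, using the container name from above):

```bash
# Show which host port was mapped to Grafana's web port 3000.
docker port grafana 3000
```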
Add Elasticsearch as a data source in Grafana
- click 'Add data source' and choose Elasticsearch
- HTTP
  - URL: http://172.17.0.2:9200
    - 172.17.0.2 is the IP address of the ELK container on the Docker bridge network
    - 9200 is the port exposed inside the container, even if the published host port is something like 32772
  - Access: Server
- Elasticsearch details
  - Index name: [pipelineinfo-]YYYY.MM.DD
  - Pattern: Daily
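To sanity-check the index pattern before saving the data source, list the matching indices directly in Elasticsearch (a sketch; `pipelineinfo-*` is the prefix used above and may not exist in your cluster yet):

```bash
# List the daily pipelineinfo-YYYY.MM.DD indices that Grafana will query.
curl 'http://172.17.0.2:9200/_cat/indices/pipelineinfo-*?v'
```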
Graphite
- docker run --publish-all=true -e COLLECTD=1 --name graphite graphiteapp/graphite-statsd
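The graphite-statsd image listens for plaintext carbon metrics on port 2003 (and StatsD on UDP 8125); a minimal sketch for pushing a test value, assuming you first look up the published host port with `docker ps`:

```bash
# Carbon plaintext protocol: "<metric.path> <value> <unix_timestamp>".
# Replace 32780 with the host port mapped to container port 2003.
echo "test.devops.demo 42 $(date +%s)" | nc -w1 localhost 32780
```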
Prometheus
Installing Prometheus agents
- sudo apt install -y prometheus-haproxy-exporter prometheus-node-exporter prometheus-libvirt-exporter
Installing the Prometheus container
- docker pull prom/prometheus
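A minimal sketch of a scrape config for the exporters installed above, and of running the container with it (targets, ports and paths are assumptions; the node exporter defaults to 9100 and the haproxy exporter typically to 9101):

```bash
# Write a minimal prometheus.yml (adjust the targets to your hosts).
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['192.168.1.10:9100']
  - job_name: haproxy
    static_configs:
      - targets: ['192.168.1.10:9101']
EOF

# Run the container with the config mounted; 9090 is the Prometheus web/API port.
docker run -d --name prometheus -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" prom/prometheus
```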
Grafana dashboards for Prometheus
- find Grafana dashboards at Dashboards
  - 1806 - node exporter
  - 12693 - haproxy (doesn't show any data)
  - 12538 - libvirt
  - 10530 - smartctl
- click Dashboards
- click New
- click Import
- enter the ID
- click Load
- select 'prometheus' as the Prometheus data source
- click Import
Data Observability
Data Observability pillars
- Freshness: Freshness seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated.
- Freshness is particularly important when it comes to decision making; after all, stale data is basically synonymous with wasted time and money.
- Distribution: Distribution, a function of your data’s possible values, tells you whether your data is within an accepted range.
- Data distribution gives you insight into whether or not your tables can be trusted based on what can be expected from your data.
- Volume: Volume refers to the completeness of your data tables and offers insights on the health of your data sources. If 200 million rows suddenly turns into 5 million, you should know.
- Schema: Changes in the organization of your data, in other words its schema, often indicate broken data. Monitoring who makes changes to these tables and when is foundational to understanding the health of your data ecosystem.
- Lineage: When data breaks, the first question is always “where?” Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it. Good lineage also collects information about the data (also referred to as metadata) that speaks to governance, business, and technical guidelines associated with specific data tables, serving as a single source of truth for all consumers.
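As a concrete (hedged) example of the freshness pillar, using the Elasticsearch instance from earlier on this page: ask for the newest `@timestamp` in today's index and compare it against the cadence you expect (host and index name are assumptions):

```bash
# Return the most recent @timestamp in today's logstash index; if it lags far behind
# "now", the pipeline feeding this index is stale.
curl -s "http://172.17.0.2:9200/logstash-$(date +%Y.%m.%d)/_search" \
  -H 'Content-Type: application/json' \
  -d '{"size": 0, "aggs": {"latest": {"max": {"field": "@timestamp"}}}}' \
  | jq '.aggregations.latest.value_as_string'
```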
Scratchpad
Fluent bit
Fluent bit installation
- helm repo add fluent https://fluent.github.io/helm-charts
- helm search repo fluent
- helm upgrade --install fluent-bit fluent/fluent-bit
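To check that the chart deployed (a sketch; the label is the one the upstream chart normally sets, so verify it for your release):

```bash
# The chart installs a DaemonSet, so expect one fluent-bit pod per node.
kubectl get pods -l app.kubernetes.io/name=fluent-bit -o wide

# Tail one pod's own log to confirm the inputs/outputs started without errors.
kubectl logs daemonset/fluent-bit --tail=20
```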
Fluent bit configuration
- [INPUT]
  - Name - name of the plugin
  - Tag - the name of your tag
    - used in e.g. a filter's Match rule
- [SERVICE] - this is the Fluent Bit service itself
- YAML configuration format (top-level keys; see the minimal example after this list)
  - env
  - service
  - pipeline
    - inputs
    - filters
    - outputs
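A minimal sketch tying the pieces above together, first in the classic `[SECTION]` format and then what I understand to be the equivalent YAML layout (the `dummy` input just emits a test record, which makes the Tag/Match relationship easy to see):

```
[SERVICE]
    Flush     1
    Log_Level info

[INPUT]
    Name dummy
    Tag  demo.log

[OUTPUT]
    Name  stdout
    Match demo.*
```

```yaml
service:
  flush: 1
  log_level: info
pipeline:
  inputs:
    - name: dummy
      tag: demo.log
  outputs:
    - name: stdout
      match: demo.*
```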
Plugins?
- Backpressure (several of these options are shown in the config sketch after this list)
  - Mem_Buf_Limit
- Monitoring
  - expose metrics (in JSON)
  - /api/v1/uptime
  - /api/v1/metrics
    - `curl http://10.42.204.112:2020/api/v1/metrics | jq`
  - /api/v1/metrics/prometheus - similar to metrics but in the Prometheus format
  - /api/v1/health
- Expect - enables us to validate that the data is formatted as expected
- Fluent Bit metrics
- Health
- Nginx exporter metrics
- Node exporter metrics
- Prometheus scrape metrics
  - Seems like this will do Prometheus scrapes. TODO: investigate
- StatsD
- Windows exporter metrics
  - node exporter for Windows?
- OpenTelemetry input plugin
  - OTLP HTTP
  - TCP 4318
- Filter plugins?
  - Expect
  - GeoIP
  - Grep
  - k8s
  - record modifier
  - rewrite tag
  - Throttle
  - Modify
  - Nest
  - Lua scripts
- Output plugins
  - Prometheus remote write
  - Prometheus exporter
  - OpenSearch
  - OpenTelemetry
  - Loki
  - stdout
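A hedged sketch that pulls several of the items above into one classic-format config: the monitoring endpoints are served by the HTTP server enabled in `[SERVICE]`, `Mem_Buf_Limit` caps the tail input's memory buffer to apply backpressure, the `expect` filter validates the record shape, and the `opentelemetry` input listens for OTLP/HTTP on 4318 (paths, the limit value and the expected key are placeholders):

```
[SERVICE]
    Flush        1
    # Enable the built-in HTTP server that serves /api/v1/metrics, /api/v1/health, ...
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name           tail
    Path           /var/log/containers/*.log
    Tag            kube.*
    # Backpressure: pause this input when the in-memory buffer reaches the limit.
    Mem_Buf_Limit  5MB

[INPUT]
    Name    opentelemetry
    Listen  0.0.0.0
    Port    4318

[FILTER]
    Name        expect
    Match       kube.*
    # Warn if a record arrives without the key the tail input is expected to produce.
    key_exists  log
    action      warn

[OUTPUT]
    Name   stdout
    Match  *
```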
Calyptia - visualizes the Fluent Bit configuration
fluentbit parsers
Nested JSON example (an OpenTelemetry log record):

```json
{
"resourceLogs": [
{
"resource": {
"attributes": [
{
"key": "service.name",
"value": {
"stringValue": "logs-basic-example"
}
}
]
},
"scopeLogs": [
{
"scope": {
"name": "opentelemetry-log-appender",
"version": "0.3.0"
},
"logRecords": [
{
"timeUnixNano": null,
"time": null,
"observedTimeUnixNano": 1712651426487589000,
"observedTime": "2024-04-09 08:30:26.487",
"severityNumber": 9,
"severityText": "INFO",
"body": {
"stringValue": "Hello from logs-basic-example"
},
"attributes": [],
"droppedAttributesCount": 0
}
]
}
]
}
]
}
```
fluentbit filters
```
[FILTER]
    Name          parser
    Parser        simple_json
    Match         json.*
    Key_Name      msg
    Reserve_Data  On
    Preserve_Key  On
```
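The `simple_json` parser referenced above is not defined on this page; a minimal guess at what it could look like in `parsers.conf` (or in the parsers file the Helm chart mounts) is:

```
[PARSER]
    Name    simple_json
    Format  json
```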
With the filter:

```
{"create":{"_index":"logstash-2024.04.12"}}
{"@timestamp":"2024-04-12T09:10:58.305Z","log":"2024-04-12T09:10:58.305209409Z stderr F {\"level\":\"Info\",\"ts\":1712913058305,\"msg\":\"{ \\\"filename\\\": \\\"src/main.rs\\\", \\\"line\\\": 34, \\\"data\\\": \\\"time info\\\" }\"}"}
{"create":{"_index":"logstash-2024.04.12"}}
{"@timestamp":"2024-04-12T09:11:08.306Z","log":"2024-04-12T09:11:08.305943113Z stderr F {\"level\":\"Info\",\"ts\":1712913068305,\"msg\":\"{ \\\"filename\\\": \\\"src/main.rs\\\", \\\"line\\\": 34, \\\"data\\\": \\\"time info\\\" }\"}"}
```

Without the filter:

```
{"@timestamp":"2024-04-12T09:19:08.321Z","log":"2024-04-12T09:19:08.320983428Z stderr F {\"level\":\"Info\",\"ts\":1712913548320,\"msg\":\"{ \\\"filename\\\": \\\"src/main.rs\\\", \\\"line\\\": 34, \\\"data\\\": \\\"time info\\\" }\"}"}
```

Fluentd
- Put Prometheus, ELK and Grafana on the same server (or behind a load balancer)
- How-To: Set up Fluentd, Elasticsearch and Kibana in Kubernetes
Troubleshooting
Troubleshooting Grafana
Grafana getting 502 when trying to connect to elasticsearch
I had to use the IP address of the container on the Docker bridge network and the actual container port 9200, not the published port on the host network.
http://172.17.0.2:9200
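One way to look that IP up without grepping the whole `docker inspect` output (a sketch, using the container name `elk` from the install section):

```bash
# Print only the IP address of the 'elk' container on its Docker network(s).
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' elk
```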
ELK Troubleshooting
vm.max_map_count - Elasticsearch refuses to start if this kernel setting is below 262144 (see the ELK install section above).
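To check and raise it without a reboot (262144 is the minimum Elasticsearch accepts):

```bash
# Show the current value.
sysctl vm.max_map_count

# Raise it immediately...
sudo sysctl -w vm.max_map_count=262144

# ...and persist it across reboots.
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf
```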
Troubleshooting fluentbit
illegal_argument_exception
Remove or replace the `_type` parameter: since `_type` is no longer supported, either remove it entirely or replace it with an appropriate parameter, depending on your Elasticsearch version. In Elasticsearch 7.x and later, documents are stored in a single `_doc` type by default, so you can simply remove the `Type` parameter altogether (or set `Suppress_Type_Name On` in the es output, as in the configuration below).
```
[OUTPUT]
    Name                es
    Match               kube.*
    Host                elasticsearch-master
    Suppress_Type_Name  on
    Logstash_Format     On
    Retry_Limit         False
    Type                _doc

[OUTPUT]
    Name                es
    Match               host.*
    Host                elasticsearch-master
    Suppress_Type_Name  on
    Logstash_Format     On
    Logstash_Prefix     node
    Retry_Limit         False
    Type                _doc
```

```
[2024/04/08 09:11:14] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Action/metadata line [1] contains an unknown parameter [_type]"}],"type":"illegal_argument_exception","reason":"Action/metadata line [1] contains an unknown parameter [_type]"},"status":400}
[2024/04/08 09:11:14] [ warn] [engine] failed to flush chunk '1-1712566940.860717429.flb', retry in 532 seconds: task_id=171, input=tail.0 > output=es.0 (out_id=0)
[2024/04/08 09:11:14] [ warn] [engine] failed to flush chunk '1-1712566838.861255530.flb', retry in 538 seconds: task_id=53, input=tail.0 > output=es.0 (out_id=0)
failed to flush chunk '1-1712829235.737404026.flb', retry in 8 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0)
[2024/04/11 10:22:26] [debug] [input chunk] update output instances with new chunk size diff=990, records=1, input=tail.0
[2024/04/11 10:22:26] [debug] [task] created task=0x7fb55a239f20 id=3 OK
[2024/04/11 10:22:26] [debug] [upstream] KA connection #125 to elasticsearch-master:9200 has been assigned (recycled)
[2024/04/11 10:22:26] [debug] [output:es:es.0] task_id=3 assigned to thread #0
[2024/04/11 10:22:26] [debug] [http_client] not using http_proxy for header
[2024/04/11 10:22:26] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2024/04/11 10:22:26] [debug] [output:es:es.0] Elasticsearch response
{"errors":false,"took":51,"items":[{"create":{"_index":"logstash-2024.04.11","_id":"vUWuzI4B-yGqFaeXHvYb","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":556031,"_primary_term":2,"status":201}}]}
[2024/04/11 10:22:26] [debug] [upstream] KA connection #125 to elasticsearch-master:9200 is now available
[2024/04/11 10:22:26] [debug] [out flush] cb_destroy coro_id=183
[2024/04/11 10:22:26] [debug] [task] destroy task=0x7fb55a239f20 (task_id=3)
[2024/04/11 10:22:27] [debug] [output:es:es.0] task_id=2 assigned to thread #1
[2024/04/11 10:22:27] [debug] [upstream] KA connection #122 to elasticsearch-master:9200 has been assigned (recycled)
[2024/04/11 10:22:27] [debug] [http_client] not using http_proxy for header
[2024/04/11 10:22:27] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2024/04/11 10:22:27] [debug] [upstream] KA connection #122 to elasticsearch-master:9200 is now available
[2024/04/11 10:22:27] [debug] [out flush] cb_destroy coro_id=183
[2024/04/11 10:22:27] [debug] [retry] re-using retry for task_id=2 attempts=2
[2024/04/11 10:22:27] [ warn] [engine] failed to flush chunk '1-1712830937.946819883.flb', retry in 9 seconds: task_id=2, input=tail.0 > output=es.0 (out_id=0)
[2024/04/11 10:22:27] [debug] [input:tail:tail.0] inode=312543, /var/log/containers/vanilla-log-7c7bdb9545-jvrqt_default_vanilla-log-58da14eb59c2e88f5dd1389f42fd80de64c21d14804a5c166d0bca5be163305d.log, events: IN_MODIFY
[2024/04/11 10:22:27] [debug] [input chunk] update output instances with new chunk size diff=818, records=1, input=tail.0
[2024/04/11 10:22:28] [debug] [task] created task=0x7fb55a239c00 id=3 OK
[2024/04/11 10:22:28] [debug] [upstream] KA connection #123 to elasticsearch-master:9200 has been assigned (recycled)
[2024/04/11 10:22:28] [debug] [output:es:es.0] task_id=3 assigned to thread #0
[2024/04/11 10:22:28] [debug] [http_client] not using http_proxy for header
[2024/04/11 10:22:28] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2024/04/11 10:22:28] [debug] [upstream] KA connection #123 to elasticsearch-master:9200 is now available
[2024/04/11 10:22:28] [debug] [out flush] cb_destroy coro_id=184
[2024/04/11 10:22:28] [debug] [retry] new retry created for task_id=3 attempts=1
[2024/04/11 10:22:28] [ warn] [engine] failed to flush chunk '1-1712830947.947063732.flb', retry in 9 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
```

Add `Trace_Error On` and `Trace_Output On` to the `[OUTPUT]` section:

```
[OUTPUT]
    Name                es
    Match               kube.*
    Host                elasticsearch-master
    Logstash_Format     On
    Retry_Limit         False
    Trace_Error         On
    Trace_Output        On
    Suppress_Type_Name  on
    Type                _doc
```

```
[2024/04/11 21:10:55] [ info] [input:tail:tail.0] inotify_fs_add(): inode=311564 watch_fd=1 name=/var/log/containers/log-json-simple-7996b6c769-pdnb9_default_log-json-simple-0f06eff0c18da53700b06390f264163aac4652e4ae3acf6de395b1911dbc92f8.log
{"create":{"_index":"node-2024.04.11"}}
{"@timestamp":"2024-04-11T21:10:56.292Z","PRIORITY":"6","SYSLOG_FACILITY":"3","_UID":"0","_GID":"0","_CAP_EFFECTIVE":"1ffffffffff","_SELINUX_CONTEXT":"unconfined\n","_MACHINE_ID":"16f08a93c1904a3191927197e0cfbffb","_HOSTNAME":"worker1","_SYSTEMD_SLICE":"system.slice","_TRANSPORT":"stdout","SYSLOG_IDENTIFIER":"kubelet","_COMM":"kubelet","_EXE":"/usr/bin/kubelet","_CMDLINE":"/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock --pod-infra-container-image=registry.k8s.io/pause:3.9","_SYSTEMD_CGROUP":"/system.slice/kubelet.service","_SYSTEMD_UNIT":"kubelet.service","_PID":"822","_BOOT_ID":"1f405ae4c27d4f77b6aa322c97c976a7","_STREAM_ID":"c3ffcb16b5db48e0afffcab8690165d4","_SYSTEMD_INVOCATION_ID":"1b8e62ec16bb4d28b770bd2adfd3d194","MESSAGE":"I0411 21:10:56.292107 822 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"default/fluent-bit-bvc68\" podStartSLOduration=4.405401576 podStartE2EDuration=\"7.292045832s\" podCreationTimestamp=\"2024-04-11 21:10:49 +0000 UTC\" firstStartedPulling=\"2024-04-11 21:10:51.470213906 +0000 UTC m=+47501.068491933\" lastFinishedPulling=\"2024-04-11 21:10:54.356858132 +0000 UTC m=+47503.955136189\" observedRunningTime=\"2024-04-11 21:10:56.291283103 +0000 UTC m=+47505.889560930\" watchObservedRunningTime=\"2024-04-11 21:10:56.292045832 +0000 UTC m=+47505.890323619\""}
{"create":{"_index":"logstash-2024.04.11"}}
{"@timestamp":"2024-04-11T21:10:57.820Z","log":"2024-04-11T21:10:57.820318828Z stderr F {\"level\":\"Info\",\"ts\":1712869857819,\"msg\":\"Hello from logs-basic-example\"}","kubernetes":{"pod_name":"log-json-simple-7996b6c769-pdnb9","namespace_name":"default","pod_id":"39c7e953-7648-4d4a-99ee-45ff147e6c69","labels":{"app":"log-json-simple","pod-template-hash":"7996b6c769"},"annotations":{"cni.projectcalico.org/containerID":"1064c90cebfad72b17cb9abee89ed5e1c1965df1cffbe7b6c5ef309ea0fae422","cni.projectcalico.org/podIP":"10.42.235.138/32","cni.projectcalico.org/podIPs":"10.42.235.138/32"},"host":"worker1","container_name":"log-json-simple","docker_id":"0f06eff0c18da53700b06390f264163aac4652e4ae3acf6de395b1911dbc92f8","container_hash":"192.168.1.102:5000/log-json-simple@sha256:2bf291c81781a92c5ad7a5bbbfc7ba80974d928d07a681cd51a8708ff5100687","container_image":"192.168.1.102:5000/log-json-simple:0.1.0"}}
[2024/04/11 21:10:57] [error] [output:es:es.0] error: Output
{"errors":true,"took":0,"items":[{"create":{"_index":"logstash-2024.04.11","_id":"kWD_zo4B-yGqFaeX2nsq","status":400,"error":{"type":"document_parsing_exception","reason":"[1:325] object mapping for [kubernetes.labels.app] tried to parse field [app] as object, but found a concrete value"}}}]}
[2024/04/11 21:10:57] [ warn] [engine] failed to flush chunk '1-1712869857.820601044.flb', retry in 6 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
```

```json
{
"@timestamp": "2024-04-11T21:10:56.292Z",
"PRIORITY": "6",
"SYSLOG_FACILITY": "3",
"_UID": "0",
"_GID": "0",
"_CAP_EFFECTIVE": "1ffffffffff",
"_SELINUX_CONTEXT": "unconfined\n",
"_MACHINE_ID": "16f08a93c1904a3191927197e0cfbffb",
"_HOSTNAME": "worker1",
"_SYSTEMD_SLICE": "system.slice",
"_TRANSPORT": "stdout",
"SYSLOG_IDENTIFIER": "kubelet",
"_COMM": "kubelet",
"_EXE": "/usr/bin/kubelet",
"_CMDLINE": "/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock --pod-infra-container-image=registry.k8s.io/pause:3.9",
"_SYSTEMD_CGROUP": "/system.slice/kubelet.service",
"_SYSTEMD_UNIT": "kubelet.service",
"_PID": "822",
"_BOOT_ID": "1f405ae4c27d4f77b6aa322c97c976a7",
"_STREAM_ID": "c3ffcb16b5db48e0afffcab8690165d4",
"_SYSTEMD_INVOCATION_ID": "1b8e62ec16bb4d28b770bd2adfd3d194",
"MESSAGE": "I0411 21:10:56.292107 822 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"default/fluent-bit-bvc68\" podStartSLOduration=4.405401576 podStartE2EDuration=\"7.292045832s\" podCreationTimestamp=\"2024-04-11 21:10:49 +0000 UTC\" firstStartedPulling=\"2024-04-11 21:10:51.470213906 +0000 UTC m=+47501.068491933\" lastFinishedPulling=\"2024-04-11 21:10:54.356858132 +0000 UTC m=+47503.955136189\" observedRunningTime=\"2024-04-11 21:10:56.291283103 +0000 UTC m=+47505.889560930\" watchObservedRunningTime=\"2024-04-11 21:10:56.292045832 +0000 UTC m=+47505.890323619\""
}
```

```json
{
"errors": true,
"took": 0,
"items": [
{
"create": {
"_index": "logstash-2024.04.11",
"_id": "kWD_zo4B-yGqFaeX2nsq",
"status": 400,
"error": {
"type": "document_parsing_exception",
"reason": "[1:325] object mapping for [kubernetes.labels.app] tried to parse field [app] as object, but found a concrete value"
}
}
}
]
}
```