K8S Monitoring - nijov/micro-services GitHub Wiki

https://www.outcoldsolutions.com/docs/monitoring-kubernetes/v5/prometheus/

curl -L http://localhost:2379/metrics | grep -v debugging
curl -L http://localhost:2379/health

https://coreos.com/blog/developing-prometheus-alerts-for-etcd.html

etcd metrics API Reference: https://etcd.io/docs/v3.4.0/metrics/

slunk k8s metrics info: https://www.grafanacon.org/2019/presentations/Bob_Cotton_GrafanaCon_2019.pdf

etcd is up and running - if the etcd metrics endpoint responds, etcd is running

http://127.0.0.1:2379/metrics

etcd_server_has_leader metric - the etcd cluster must have a leader. The value is 1 when a leader exists and 0 when it does not. The only time it should report no leader is while the cluster is coming up and the first leader has not yet been elected.

If this metric reports 0, generate an alert.
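The leader check above can be sketched as a small shell filter over the metrics output. This is only an illustration: the function name `check_leader` and the message wording are made up for this page, and it assumes the endpoint returns the usual Prometheus text format.

```shell
# Reads Prometheus text-format metrics on stdin and prints ALERT or OK
# depending on the value of etcd_server_has_leader.
# check_leader is a hypothetical helper name, not part of etcd.
check_leader() {
  awk '
    $1 == "etcd_server_has_leader" { seen = 1; value = $2 }
    END {
      if (!seen)           print "ALERT: etcd_server_has_leader not found"
      else if (value == 0) print "ALERT: etcd cluster has no leader"
      else                 print "OK: etcd cluster has a leader"
    }'
}

# Usage (endpoint taken from the notes above):
#   curl -sL http://127.0.0.1:2379/metrics | check_leader
```

A missing metric is treated as an alert as well, since an endpoint that does not report the metric at all is also worth a look.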

Unsafe number of peers in the cluster
- The cluster should have at least 3 peers running. If fewer than 3 peers are running, an alert should be generated.
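One rough way to check the peer count without Prometheus is to poll each member's /health endpoint and count the healthy answers. The sketch below assumes each response body contains `"health":"true"` when the member is healthy. The function name `check_peer_count` and the peer addresses in the usage comment are hypothetical.

```shell
# Counts healthy members from /health response bodies (one per line on stdin)
# and prints ALERT when fewer than the minimum (default 3) are healthy.
# Assumes a healthy member answers with a body containing "health":"true".
check_peer_count() {
  min="${1:-3}"
  # grep -c exits non-zero when the count is 0, so guard it with || true.
  healthy=$(grep -c '"health" *: *"true"' || true)
  if [ "$healthy" -lt "$min" ]; then
    echo "ALERT: only $healthy healthy peers (minimum $min)"
  else
    echo "OK: $healthy healthy peers"
  fi
}

# Usage with hypothetical peer addresses; replace with the real members:
#   for ep in http://10.0.0.1:2379 http://10.0.0.2:2379 http://10.0.0.3:2379; do
#     curl -sL --max-time 2 "$ep/health" || true
#     echo
#   done | check_peer_count 3
```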
etcd_http_successful_duration_seconds_bucket - use this metric to check whether HTTP requests are responding slowly

The average HTTP response time for the etcd cluster should not exceed the chosen threshold; alert when it does. Relevant metrics: etcd_http_failed_total, etcd_http_successful_duration_seconds_bucket, etcd_http_received_total
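The average response time can be approximated from the histogram's `_sum` and `_count` series, which Prometheus histograms expose alongside the `_bucket` series. This sketch makes two assumptions: the helper name `check_http_latency` and the default threshold of 0.15 seconds are placeholders, not values from the sources linked above.

```shell
# Reads Prometheus text-format metrics on stdin, sums the histogram _sum and
# _count series across all label sets, and compares the average with a threshold.
# The 0.15s default threshold is an assumed placeholder.
check_http_latency() {
  awk -v threshold="${1:-0.15}" '
    /^etcd_http_successful_duration_seconds_sum/   { sum += $NF }
    /^etcd_http_successful_duration_seconds_count/ { count += $NF }
    END {
      if (count == 0) { print "OK: no successful HTTP requests recorded"; exit }
      avg = sum / count
      status = (avg > threshold) ? "ALERT" : "OK"
      printf "%s: average HTTP response time %.4fs (threshold %ss)\n", status, avg, threshold
    }'
}

# Usage:
#   curl -sL http://127.0.0.1:2379/metrics | check_http_latency 0.15
```

This average covers the whole lifetime of the etcd process. A Prometheus alert rule would normally look at a recent window instead.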

etcd_http_failed_total and etcd_http_received_total

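Send a warning if more than 0.2% of the HTTP requests to etcd fail. The two requests below, one for a key that exists and one for a key that does not, can be used to generate successful and failed requests: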
curl http://127.0.0.1:2379/v2/keys/somekey_exist
curl http://127.0.0.1:2379/v2/keys/somekey_dont_exist
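The 0.2% failure warning can be sketched by dividing the two counters. The helper name `check_http_failures` is hypothetical. The script sums each counter across all label sets and uses lifetime totals rather than a recent rate.

```shell
# Reads Prometheus text-format metrics on stdin and warns when the share of
# failed HTTP requests exceeds the given percentage (default 0.2, from the note above).
check_http_failures() {
  awk -v max_pct="${1:-0.2}" '
    /^etcd_http_failed_total/   { failed += $NF }
    /^etcd_http_received_total/ { received += $NF }
    END {
      if (received == 0) { print "OK: no HTTP requests received"; exit }
      pct = 100 * failed / received
      status = (pct > max_pct) ? "WARNING" : "OK"
      printf "%s: %.2f%% of HTTP requests failed (limit %s%%)\n", status, pct, max_pct
    }'
}

# Usage:
#   curl -sL http://127.0.0.1:2379/metrics | check_http_failures 0.2
```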

process_open_fds and process_max_fds - these are standard Prometheus process metrics; they may also be available from VM-level agent metrics
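A file descriptor check can compare the two values as a percentage. The helper name `check_fd_usage` and the 80% default limit are assumptions chosen for illustration, not values from the linked sources.

```shell
# Reads Prometheus text-format metrics on stdin and alerts when open file
# descriptors exceed a percentage of the maximum (assumed default: 80).
check_fd_usage() {
  awk -v max_pct="${1:-80}" '
    $1 == "process_open_fds" { open_fds = $2 }
    $1 == "process_max_fds"  { max_fds = $2 }
    END {
      if (max_fds == 0) { print "UNKNOWN: process_max_fds not found"; exit }
      pct = 100 * open_fds / max_fds
      status = (pct > max_pct) ? "ALERT" : "OK"
      printf "%s: %.1f%% of file descriptors in use (limit %s%%)\n", status, pct, max_pct
    }'
}

# Usage:
#   curl -sL http://127.0.0.1:2379/metrics | check_fd_usage 80
```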