K8S Monitoring - nijov/micro-services GitHub Wiki
https://www.outcoldsolutions.com/docs/monitoring-kubernetes/v5/prometheus/
curl -L http://localhost:2379/metrics | grep -v debugging
curl -L http://localhost:2379/health
https://coreos.com/blog/developing-prometheus-alerts-for-etcd.html
etcd metrics API Reference: https://etcd.io/docs/v3.4.0/metrics/
slunk k8s metrics info: https://www.grafanacon.org/2019/presentations/Bob_Cotton_GrafanaCon_2019.pdf
- etcd is up and running: if the etcd metrics endpoint is responding, etcd is running.
- etcd_server_has_leader metric: the etcd cluster must have a leader. The only time it should report no leader is while the cluster is coming up and the first leader has not been elected yet.
  - If this metric reports 0, generate an alert.
- Unsafe number of peers in the cluster
  - The cluster should have at least 3 peers running. If fewer than 3 peers are running, generate an alert.
- etcd_http_successful_duration_seconds_bucket: use this metric to check whether HTTP requests are responding slowly.
  - Alert when the average HTTP response time for the etcd cluster rises above the chosen threshold. Related metrics: etcd_http_failed_total, etcd_http_successful_duration_seconds_bucket, etcd_http_received_total.
- etcd_http_failed_total and etcd_http_received_total
  - Send a warning if 0.2% of the HTTP requests to etcd fail.
  - To test with a successful request: curl http://127.0.0.1:2379/v2/keys/somekey_exist
  - To test with a failed request: curl http://127.0.0.1:2379/v2/keys/somekey_dont_exist
- process_open_fds and process_max_fds: these are typical Prometheus process metrics; they may be available from VM-level agent metrics.
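
The leader, HTTP failure, and file-descriptor checks in the list above can be sketched in Python roughly as follows. This is a minimal sketch, not an official etcd or Prometheus client. The function names (`fetch_metrics`, `parse_metrics`, `evaluate`) and the 80% file-descriptor threshold are assumptions made for illustration; only the metric names and the 0.2% failure threshold come from these notes. It compares counter totals since process start, whereas a Prometheus alert rule would normally evaluate a rate over a time window. The peer-count check is not covered, because it needs data from every member.

```python
import urllib.request


def fetch_metrics(url="http://localhost:2379/metrics", timeout=5):
    """Fetch the raw Prometheus text exposition from one etcd member."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Return {metric_name: sum of sample values across all label sets}.

    Assumes sample lines look like `name{labels} value` with no trailing
    timestamp. Summing is meaningful for plain gauges and counters only,
    not for histogram buckets.
    """
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # skip anything that is not a plain sample line
    return totals


def evaluate(totals, max_fail_ratio=0.002, max_fd_ratio=0.8):
    """Return a list of alert strings for the checks described in the notes.

    max_fail_ratio=0.002 is the 0.2% threshold from the notes;
    max_fd_ratio=0.8 is an assumed value, not taken from the notes.
    """
    alerts = []

    # A missing etcd_server_has_leader sample is treated as "no leader" (assumption).
    if totals.get("etcd_server_has_leader", 0.0) < 1:
        alerts.append("no leader")

    received = totals.get("etcd_http_received_total", 0.0)
    failed = totals.get("etcd_http_failed_total", 0.0)
    if received > 0 and failed / received >= max_fail_ratio:
        alerts.append("HTTP failure ratio at or above threshold")

    max_fds = totals.get("process_max_fds", 0.0)
    open_fds = totals.get("process_open_fds", 0.0)
    if max_fds > 0 and open_fds / max_fds >= max_fd_ratio:
        alerts.append("file descriptor usage at or above threshold")

    return alerts
```

On a host where etcd listens on localhost:2379, `evaluate(parse_metrics(fetch_metrics()))` would return the list of triggered alerts; an empty list means none of these checks fired.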