K8S Monitoring - nijov/micro-services GitHub Wiki

https://www.outcoldsolutions.com/docs/monitoring-kubernetes/v5/prometheus/

curl -L http://localhost:2379/metrics | grep -v debugging

curl -L http://localhost:2379/health

https://coreos.com/blog/developing-prometheus-alerts-for-etcd.html

etcd metrics API Reference: https://etcd.io/docs/v3.4.0/metrics/

Splunk k8s metrics info: https://www.grafanacon.org/2019/presentations/Bob_Cotton_GrafanaCon_2019.pdf

  1. etcd is up and running: if the etcd metrics endpoint is responding, etcd is running.
  2. etcd_server_has_leader: the etcd cluster must have a leader. The only time it will not report one is while the cluster is coming up and the first leader has not yet been elected.
    • If this metric reports 0, generate an alert.
  3. Unsafe number of peers in the cluster
    • At least 3 peers should be running. If fewer than 3 peers are running, generate an alert.
  4. etcd_http_successful_duration_seconds_bucket: use this metric to check whether HTTP requests are responding slowly.
    • Generate an alert when the average HTTP response time for the etcd cluster rises above the chosen threshold. Related metrics: etcd_http_failed_total, etcd_http_successful_duration_seconds_bucket, etcd_http_received_total.
  5. etcd_http_failed_total and etcd_http_received_total: track failed HTTP requests relative to the requests received.
  6. process_open_fds and process_max_fds: these are standard Prometheus process metrics; they may also be available from VM-level agent metrics.
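
The checks above can be sketched in Python as a rough illustration, not a production alerting setup. The sketch parses the text returned by the /metrics endpoint and returns a list of alert messages. The function names, the alert wording, and the thresholds (0.5 s average latency, 5% failure ratio, 80% file descriptor usage) are assumptions chosen for the example; only the metric names and the 3-peer minimum come from the notes above. The number of running peers is passed in by the caller, for example as the count of members whose metrics endpoint responded. The average latency uses the `_sum` and `_count` series that accompany a Prometheus histogram, so it is an average since process start rather than over a recent window.

```python
from typing import Dict, List


def parse_metrics(text: str) -> Dict[str, float]:
    """Sum sample values per metric name from Prometheus text-format output.

    Labels are ignored, so all series of one metric are added together.
    """
    totals: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{labels} value
            name = line.split("{", 1)[0]
            rest = line.rpartition("}")[2].split()
        else:
            # Unlabelled sample: name value
            fields = line.split()
            name, rest = fields[0], fields[1:]
        if not rest:
            continue
        try:
            value = float(rest[0])
        except ValueError:
            continue  # not a numeric sample; ignore it
        totals[name] = totals.get(name, 0.0) + value
    return totals


def evaluate(
    samples: Dict[str, float],
    peers_up: int,
    min_peers: int = 3,              # from the notes: at least 3 peers
    max_avg_latency_s: float = 0.5,  # assumed threshold
    max_fail_ratio: float = 0.05,    # assumed threshold
    max_fd_ratio: float = 0.8,       # assumed threshold
) -> List[str]:
    """Return one alert message per failed check (hypothetical wording)."""
    alerts: List[str] = []

    # A missing metric is treated the same as "no leader".
    if samples.get("etcd_server_has_leader", 0.0) < 1.0:
        alerts.append("no leader")

    if peers_up < min_peers:
        alerts.append("unsafe peer count")

    # Histogram _sum / _count gives the average duration since start.
    count = samples.get("etcd_http_successful_duration_seconds_count", 0.0)
    total = samples.get("etcd_http_successful_duration_seconds_sum", 0.0)
    if count > 0 and total / count > max_avg_latency_s:
        alerts.append("slow HTTP responses")

    received = samples.get("etcd_http_received_total", 0.0)
    failed = samples.get("etcd_http_failed_total", 0.0)
    if received > 0 and failed / received > max_fail_ratio:
        alerts.append("high HTTP failure ratio")

    max_fds = samples.get("process_max_fds", 0.0)
    open_fds = samples.get("process_open_fds", 0.0)
    if max_fds > 0 and open_fds / max_fds > max_fd_ratio:
        alerts.append("file descriptors nearly exhausted")

    return alerts


# Made-up sample input for illustration; not real etcd output.
SAMPLE = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 0
process_open_fds 900
process_max_fds 1000
etcd_http_received_total{method="GET"} 100
etcd_http_failed_total{method="GET",code="404"} 10
"""

print(evaluate(parse_metrics(SAMPLE), peers_up=2))
```

In practice the input text would come from the same endpoint the curl command above queries, and the thresholds would need tuning against the cluster's normal behaviour.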