Using Prometheus to monitor a Kubernetes cluster - xiaoxin01/Blog GitHub Wiki

This article describes how to use Prometheus to monitor a Kubernetes cluster.

Once a Kubernetes cluster is up and running, monitoring its health becomes an important task. The official documentation lists the available approaches:

Tools for Monitoring Resources

The following describes how to set this up with Prometheus.

Upgrading helm

Using Ubuntu as an example:

wget https://get.helm.sh/helm-v2.14.3-linux-amd64.tar.gz
tar -zxvf helm-v2.14.3-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
helm init --upgrade

Reference: https://helm.sh/docs/using_helm/#from-the-binary-releases

Installing Prometheus

helm install --name prometheus --namespace monitoring stable/prometheus-operator --set prometheusOperator.createCustomResource=false -f values.yaml

Note: the install may fail with an error like "xxx already exists".

Workaround: delete the leftover CRDs first:

kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
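The five delete commands above can be collapsed into a loop. The sketch below only prints the commands (a dry run); remove the `echo`, or pipe the output to sh, to actually delete the CRDs:

```shell
# Dry run: print the cleanup commands for the prometheus-operator CRDs.
# Remove the "echo" (or pipe the output to sh) to actually run them.
crds="prometheusrules servicemonitors alertmanagers prometheuses podmonitors"
for crd in $crds; do
  echo kubectl delete crd "${crd}.monitoring.coreos.com"
done
```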

Reference:

https://github.com/helm/charts/issues/10316

values.yaml:

prometheusOperator:
  nodeSelector:
    mediatek/role: db
  admissionWebhooks:
    patch:
      nodeSelector:
        mediatek/role: job
prometheus:
  prometheusSpec:
    nodeSelector:
      mediatek/role: db
alertmanager:
  alertmanagerSpec:
    nodeSelector:
      mediatek/role: app
grafana:
  persistence:
    enabled: true
    type: pvc
    existingClaim: prometheus-prometheus-migration-prometheus-db-prometheus-prometheus-migration-prometheus-0
    accessModes:
      - ReadWriteMany
    subPath: grafana
#  ingress:
#    enabled: true
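The nodeSelector entries above only take effect if the nodes actually carry a mediatek/role label. A sketch of the labeling commands follows; the node names (node-1/node-2/node-3) are placeholders, and the block prints the commands as a dry run rather than applying them:

```shell
# Dry run: print kubectl commands that attach the "mediatek/role" label
# referenced by the nodeSelector entries. Node names are placeholders.
for entry in node-1=db node-2=job node-3=app; do
  node="${entry%%=*}"   # part before "=" is the node name
  role="${entry#*=}"    # part after "=" is the role value
  echo kubectl label node "$node" "mediatek/role=$role" --overwrite
done
```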

pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-prometheus-migration-prometheus-0
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 12Gi
  nfs:
    path: /data/prometheus
    server: xxxxx
  persistentVolumeReclaimPolicy: Retain
  storageClassName: prometheus
  volumeMode: Filesystem

Note: the /data/prometheus path must be created on the NFS server in advance.
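The NFS server must also export that path. A minimal /etc/exports sketch; the client subnet and mount options here are assumptions, not values from the original setup:

```
/data/prometheus  10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)
```

After editing /etc/exports, run `exportfs -ra` on the NFS server to apply the change.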

pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: prometheus
    prometheus: prometheus-migration-prometheus
  name: prometheus-prometheus-migration-prometheus-db-prometheus-prometheus-migration-prometheus-0
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  dataSource: null
  resources:
    requests:
      storage: 12Gi
  storageClassName: prometheus
  volumeMode: Filesystem
  volumeName: pvc-prometheus-migration-prometheus-0

If you mount an NFS path, the NFS client services must be enabled on every node:

sudo apt-get install nfs-common -y
sudo su
systemctl enable rpcbind && systemctl start rpcbind
systemctl enable rpc-statd && systemctl start rpc-statd

Creating an Ingress

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
  - host: grafana.monitoring.xxxxxx.xip.io
    http: &http_rules
      paths:
      - backend:
          serviceName: prometheus-grafana
          servicePort: 80
  - host: grafana.example.com
    http: *http_rules
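The &http_rules / *http_rules pair is a standard YAML anchor and alias, expanded at parse time, so both hosts share the same path rules. Note that the extensions/v1beta1 Ingress API was removed in Kubernetes 1.22; on newer clusters the equivalent manifest (same names and hosts as above) would look roughly like this:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
  - host: grafana.monitoring.xxxxxx.xip.io
    http: &http_rules
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-grafana
            port:
              number: 80
  - host: grafana.example.com
    http: *http_rules
```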

The sections below describe the older setup.

Installing helm

Reference: deploying services with helm on Kubernetes

Installing Prometheus

# Install helm https://docs.helm.sh/using_helm/ then run:
helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring

After this completes, check that the services are running:

$ kubectl get service -n monitoring

NAME                                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
alertmanager-operated                 ClusterIP   None           <none>        9093/TCP,6783/TCP   2d
kube-prometheus                       ClusterIP   10.43.214.80   <none>        9090/TCP            2d
kube-prometheus-alertmanager          ClusterIP   10.43.177.76   <none>        9093/TCP            2d
kube-prometheus-exporter-kube-state   ClusterIP   10.43.191.75   <none>        80/TCP              2d
kube-prometheus-exporter-node         ClusterIP   10.43.94.42    <none>        9100/TCP            2d
kube-prometheus-grafana               ClusterIP   10.43.19.253   <none>        80/TCP              2d
prometheus-operated                   ClusterIP   None           <none>        9090/TCP            2d

Creating an Ingress to access Grafana

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitoring
spec:
  rules:
  - host: prometheus.monitoring.xxx.xxx.xxx.xip.io
    http:
      paths:
      - backend:
          serviceName: kube-prometheus-grafana
          servicePort: 80

Once that is done, the Kubernetes cluster can be inspected at http://prometheus.monitoring.xxx.xxx.xxx.xip.io

Using Grafana

Grafana's default administrator credentials are admin/admin; change the password after logging in.

Viewing cluster information

A number of dashboards are created by default for inspecting the cluster:

  • Deployment
  • Kubernetes Capacity Planning (cluster load)
  • Kubernetes Cluster Health (cluster health)
  • …

Viewing service status

The default dashboards do show the status of Deployments and Pods, but they are not convenient when you want an overall view of one particular service. In that case you can build a generic dashboard and add the data you need to it.

Click "+" --> "Create" --> "Dashboard" to create a new dashboard, then in the top menu click "Add panel" --> "Graph" --> "Metrics".

Enter the following query in field A:

avg by(pod_name) (container_memory_usage_bytes{pod_name=~"service_a.*", container_name!="POD"})

This shows the memory usage of all pods belonging to service_a.
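A similar panel for CPU can use the cAdvisor counter with rate(). This is a sketch, assuming the same older cAdvisor label names (pod_name, container_name) used in the query above:

```
sum by(pod_name) (rate(container_cpu_usage_seconds_total{pod_name=~"service_a.*", container_name!="POD"}[5m]))
```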

To make the dashboard generic, add a service-name variable: settings --> variables --> new

Enter service_name as the Name, and the following as the Query:

label_values(kube_deployment_metadata_generation{namespace="default"}, deployment)

Then adjust query A to:

avg by(pod_name) (container_memory_usage_bytes{pod_name=~"$service_name.*", container_name!="POD"})

You can then use the dashboard's service drop-down to view the memory usage of each service.

Permissions

Grafana provides permissions at two levels, folder and dashboard. By default anonymous access is allowed, so anyone can view the dashboards. If different teams in the Kubernetes cluster should only see their own services, permissions need to be separated.

A good approach is to separate permissions by folder.

Hiding the cluster-information dashboards

The folder containing the default dashboards cannot have its permissions changed. Instead, create a new folder, move all the default dashboards into it, and then configure that folder to disallow anonymous access.
