Monitoring a Kubernetes Cluster with Prometheus
This article describes how to monitor a Kubernetes cluster with Prometheus.

Once a Kubernetes cluster is up and running, monitoring its health becomes important. The official documentation lists the available options: Tools for Monitoring Resources

The following walks through a Prometheus setup.

First, install Helm (using Ubuntu as an example):
```shell
wget https://get.helm.sh/helm-v2.14.3-linux-amd64.tar.gz
tar -zxvf helm-v2.14.3-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
helm init --upgrade
```
Reference: https://helm.sh/docs/using_helm/#from-the-binary-releases

Then install the prometheus-operator chart:

```shell
helm install --name prometheus --namespace monitoring stable/prometheus-operator --set prometheusOperator.createCustomResource=false -f values.yaml
```
Note: the install may fail with an error like `xxx already exists`. To fix it, delete the leftover CRDs:
```shell
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
```
Reference: https://github.com/helm/charts/issues/10316
values.yaml (note that in a values file, unlike with `--set`, keys must be nested rather than written as dotted paths):

```yaml
prometheusOperator:
  nodeSelector:
    mediatek/role: db
  admissionWebhooks:
    patch:
      nodeSelector:
        mediatek/role: job
prometheus:
  prometheusSpec:
    nodeSelector:
      mediatek/role: db
alertmanager:
  alertmanagerSpec:
    nodeSelector:
      mediatek/role: app
grafana:
  persistence:
    enabled: true
    type: pvc
    existingClaim: prometheus-prometheus-migration-prometheus-db-prometheus-prometheus-migration-prometheus-0
    accessModes: ReadWriteMany
    subPath: grafana
  # ingress:
  #   enabled: true
```
pv.yaml:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-prometheus-migration-prometheus-0
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 12Gi
  nfs:
    path: /data/prometheus
    server: xxxxx
  persistentVolumeReclaimPolicy: Retain
  storageClassName: prometheus
  volumeMode: Filesystem
```
Note: the `/data/prometheus` path must be created on the NFS server in advance.
pvc.yaml:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: prometheus
    prometheus: prometheus-migration-prometheus
  name: prometheus-prometheus-migration-prometheus-db-prometheus-prometheus-migration-prometheus-0
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  dataSource: null
  resources:
    requests:
      storage: 12Gi
  storageClassName: prometheus
  volumeMode: Filesystem
  volumeName: pvc-prometheus-migration-prometheus-0
```
If the NFS path is to be mounted, the NFS client services must be enabled on every node:
```shell
sudo apt-get install nfs-common -y
sudo su
systemctl enable rpcbind && systemctl start rpcbind
systemctl enable rpc-statd && systemctl start rpc-statd
```
Create an Ingress for Grafana:
```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
  - host: grafana.monitoring.xxxxxx.xip.io
    http: &http_rules
      paths:
      - backend:
          serviceName: prometheus-grafana
          servicePort: 80
  - host: grafana.example.com
    http: *http_rules
```
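The `&http_rules` anchor on the first rule and the `*http_rules` alias on the second are plain YAML references: both hosts reuse the same `http` section, so the routing rules only need to be written once. Expanded, the second rule is equivalent to:

```yaml
  - host: grafana.example.com
    http:
      paths:
      - backend:
          serviceName: prometheus-grafana
          servicePort: 80
```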
The older setup used the CoreOS charts:

```shell
# Install helm (https://docs.helm.sh/using_helm/) then run:
helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring
```
After installation, check that the services are running:

```shell
$ kubectl get service -n monitoring
NAME                                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
alertmanager-operated                 ClusterIP   None           <none>        9093/TCP,6783/TCP   2d
kube-prometheus                       ClusterIP   10.43.214.80   <none>        9090/TCP            2d
kube-prometheus-alertmanager          ClusterIP   10.43.177.76   <none>        9093/TCP            2d
kube-prometheus-exporter-kube-state   ClusterIP   10.43.191.75   <none>        80/TCP              2d
kube-prometheus-exporter-node         ClusterIP   10.43.94.42    <none>        9100/TCP            2d
kube-prometheus-grafana               ClusterIP   10.43.19.253   <none>        80/TCP              2d
prometheus-operated                   ClusterIP   None           <none>        9090/TCP            2d
```
Then expose Grafana through an Ingress:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitoring
spec:
  rules:
  - host: prometheus.monitoring.xxx.xxx.xxx.xip.io
    http:
      paths:
      - backend:
          serviceName: kube-prometheus-grafana
          servicePort: 80
```
After that, the Kubernetes cluster can be inspected at http://prometheus.monitoring.xxx.xxx.xxx.xip.io.

Grafana's default administrator credentials are admin/admin; change the password after logging in.
Several dashboards are created by default for inspecting the cluster:

- Deployment
- Kubernetes Capacity Planning (cluster load)
- Kubernetes Cluster Health (cluster health)
- ……
The default dashboards cover Deployments and Pods, but they are inconvenient when you want an overview of one particular service. In that case, build a generic dashboard and put the metrics you need on it.

Click "+" --> "Create" --> "Dashboard" to create a new dashboard, then in the top menu click "Add panel" --> "Graph" --> "Metrics".

Enter the following query in A:

```
avg by(pod_name) (container_memory_usage_bytes{pod_name=~"service_a.*", container_name!="POD"})
```

This shows the memory usage of all pods of service_a.
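A CPU panel can be built the same way. The sketch below assumes the same cAdvisor metric set as the memory query above (`pod_name`/`container_name` labels); since `container_cpu_usage_seconds_total` is a counter, it is wrapped in `rate()` to get cores used per second:

```
sum by(pod_name) (rate(container_cpu_usage_seconds_total{pod_name=~"service_a.*", container_name!="POD"}[5m]))
```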
To make the dashboard generic, add a service-name variable: Settings -> Variables --> New. Enter service_name as the Name and the following as the Query:

```
label_values(kube_deployment_metadata_generation{namespace="default"}, deployment)
```

Then change query A to:

```
avg by(pod_name) (container_memory_usage_bytes{pod_name=~"$service_name.*", container_name!="POD"})
```

Now the memory usage of any service can be inspected by picking it from the dashboard's service drop-down.
Grafana offers permissions at two levels, folders and dashboards. Anonymous access is allowed by default, so anyone can view the dashboards. If different teams in the Kubernetes cluster should only see their own services, permissions need to be separated, and the cleanest way to do that is per folder.

The folder holding the default dashboards cannot have its permissions changed. Instead, create a new folder, move all the default dashboards into it, and then restrict that folder to disallow anonymous access.