# Monitoring
## Cloudigrade Dashboards

These are maintained by the cloudigrade devs to provide information in support of our SLOs.
- Stage: https://grafana.stage.devshift.net/d/O6v4rMpizda/cloudigrade?orgId=1&var-datasource=crcs02ue1-prometheus&var-namespace=cloudigrade-stage&var-datasource_rds=app-sre-stage-01-prometheus
- Prod: https://grafana.app-sre.devshift.net/d/O6v4rMpizda/cloudigrade?orgId=1&var-datasource=crcp01ue1-prometheus&var-namespace=cloudigrade-prod&var-datasource_rds=app-sre-prod-01-prometheus
## k8s compute resources

These are provided by the platform and show CPU, memory, and network usage.
- Stage: https://grafana.stage.devshift.net/d/k8s-compute-resources-namespace-pods/kubernetes-compute-resources-namespace-pods?orgId=1&var-datasource=crcs02ue1-cluster-prometheus&var-namespace=cloudigrade-stage
- Prod: https://grafana.app-sre.devshift.net/d/k8s-compute-resources-namespace-pods/kubernetes-compute-resources-namespace-pods?orgId=1&var-datasource=crcp01ue1-cluster-prometheus&var-namespace=cloudigrade-prod
## Insights SLO dashboards for cloudigrade

These are provided by the platform and show total request count and status count proportions over time.
- Stage: https://grafana.stage.devshift.net/d/slo-dashboard/slo-dashboard?orgId=1&var-datasource=crcs02ue1-prometheus&var-label=cloudigrade
- Prod: https://grafana.app-sre.devshift.net/d/slo-dashboard/slo-dashboard?orgId=1&var-datasource=crcp01ue1-prometheus&var-label=cloudigrade
## DVO validation errors

These are provided by AppSRE to show which DVO validation errors are triggered.
- Stage: https://grafana.app-sre.devshift.net/d/dashdotdb/dash-db?orgId=1&var-datasource=dashdotdb-rds&var-cluster=crcs02ue1&var-namespace=cloudigrade-stage
- Prod: https://grafana.app-sre.devshift.net/d/dashdotdb/dash-db?orgId=1&var-datasource=dashdotdb-rds&var-cluster=crcp01ue1&var-namespace=cloudigrade-prod
## Where does Grafana live? Other URLs that matter?
- Stage Prometheus: https://prometheus.crcs02ue1.devshift.net/
- Prod Prometheus: https://prometheus.crcp01ue1.devshift.net/
## How to update the dashboard?

The best way to work on the dashboard is to head to the stage Grafana instance and open the Cloudigrade dashboard there. You'll be able to make and test all the changes you want, with one catch: you cannot save your dashboard directly there. When you go to save, you'll be presented with a JSON representation of the dashboard. Paste that JSON whole into `grafana-dashboard-clouddot-insights-cloudigrade.configmap.yaml`, being careful not to break the ConfigMap formatting.

After the code is merged to master, it should automatically deploy to stage. To promote the dashboard from stage to production, update the ref here.
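For orientation, the ConfigMap generally looks like the sketch below: the exported dashboard JSON lives under a single `data` key as an indented block scalar, which is why indentation is easy to break when pasting. The exact metadata (name, labels, JSON key) shown here is illustrative, not copied from the real file:

```yaml
# Illustrative sketch of a Grafana dashboard ConfigMap; the real
# metadata, labels, and data key may differ from what is shown here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-clouddot-insights-cloudigrade
  labels:
    grafana_dashboard: "true"   # marker label so Grafana picks it up (assumed)
data:
  cloudigrade.json: |-
    {
      "title": "Cloudigrade",
      "panels": []
    }
```

When pasting the exported JSON, every line of it must stay indented under the `|-` block scalar; a single unindented line breaks the ConfigMap.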
## How do I make my dashboards not suck? Any resources?
- Use thanos to test your PromQL queries
- Prometheus Querying Basics Docs
- A blog on counters
- Grafana Introduction to PromQL
- Official python prometheus client readme
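As a quick illustration of the counter pitfalls those resources cover: never graph a raw counter, since it only ever increases; wrap it in `rate()` first. The metric name below is made up for the example:

```promql
# Per-second request rate over the last 5 minutes, split by status code.
# "api_http_requests_total" is an illustrative metric name, not one
# cloudigrade necessarily exports.
sum by (status) (rate(api_http_requests_total[5m]))
```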
## Testing SRE/app-interface alerts
Although you cannot directly test an alert in the stage/prod environments, you must write unit tests for all alerts.
Note that running tests locally requires you to run the `qontract-reconcile` server in one shell while calling `qontract-cli` from another. Unfortunately, the server is not smart enough to detect changes to files while it's running. So, if you are actively developing a test, each time you save you'll need to stop the server and run `make server`, which rebuilds files and restarts the server. Yes, this is painfully slow. :shrug: If we can find a faster way to test, we'll update these notes.
The arguments for `qontract-cli` are not entirely obvious. Here is one set of commands that should work:
```shell
$ qontract-cli --config config.promtool.toml run-prometheus-test -s config \
    resources/insights-prod/cloudigrade-prod/cloudigrade.prometheusrules-test.yaml \
    crcp01ue1
$ qontract-cli --config config.promtool.toml run-prometheus-test -s config \
    resources/insights-stage/cloudigrade-stage/cloudigrade.prometheusrules-test.yaml \
    crcs02ue1
```
Note that you need to specify which cluster (e.g. `crcp01ue1`) the test would run in. This command doesn't actually call anything in the cluster; the cluster name is just data for the testing.