# Monitoring
## Cloudigrade Dashboards

These are maintained by the cloudigrade devs to provide information in support of our SLOs.
- Stage: https://grafana.stage.devshift.net/d/O6v4rMpizda/cloudigrade?orgId=1&var-datasource=crcs02ue1-prometheus&var-namespace=cloudigrade-stage&var-datasource_rds=app-sre-stage-01-prometheus
- Prod: https://grafana.app-sre.devshift.net/d/O6v4rMpizda/cloudigrade?orgId=1&var-datasource=crcp01ue1-prometheus&var-namespace=cloudigrade-prod&var-datasource_rds=app-sre-prod-01-prometheus
## k8s compute resources

These are provided by the platform and show CPU, memory, and network usage.
- Stage: https://grafana.stage.devshift.net/d/k8s-compute-resources-namespace-pods/kubernetes-compute-resources-namespace-pods?orgId=1&var-datasource=crcs02ue1-cluster-prometheus&var-namespace=cloudigrade-stage
- Prod: https://grafana.app-sre.devshift.net/d/k8s-compute-resources-namespace-pods/kubernetes-compute-resources-namespace-pods?orgId=1&var-datasource=crcp01ue1-cluster-prometheus&var-namespace=cloudigrade-prod
## Insights SLO dashboards for cloudigrade

These are provided by the platform and show total request count and status count proportions over time.
- Stage: https://grafana.stage.devshift.net/d/slo-dashboard/slo-dashboard?orgId=1&var-datasource=crcs02ue1-prometheus&var-label=cloudigrade
- Prod: https://grafana.app-sre.devshift.net/d/slo-dashboard/slo-dashboard?orgId=1&var-datasource=crcp01ue1-prometheus&var-label=cloudigrade
## DVO validation errors

These are provided by AppSRE to show which DVO validation errors are triggered.
- Stage: https://grafana.app-sre.devshift.net/d/dashdotdb/dash-db?orgId=1&var-datasource=dashdotdb-rds&var-cluster=crcs02ue1&var-namespace=cloudigrade-stage
- Prod: https://grafana.app-sre.devshift.net/d/dashdotdb/dash-db?orgId=1&var-datasource=dashdotdb-rds&var-cluster=crcp01ue1&var-namespace=cloudigrade-prod
## Where does Grafana live? Other URLs that matter?
- Stage Prometheus: https://prometheus.crcs02ue1.devshift.net/
- Prod Prometheus: https://prometheus.crcp01ue1.devshift.net/
## How to update the dashboard?

The best way to work on the dashboard is to head to the stage Grafana instance and open the Cloudigrade dashboard there. You'll be able to make and test all the changes you want, with one catch: you cannot save your dashboard directly there. When you go to save, you'll be presented with a JSON representation of the dashboard. Paste that JSON whole into `grafana-dashboard-clouddot-insights-cloudigrade.configmap.yaml`, being careful not to break the ConfigMap formatting.

After the code is merged to master, it should automatically deploy to stage. To promote the dashboard from stage to production, update the ref here.
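For orientation, the ConfigMap generally looks like the sketch below: the exported dashboard JSON lives under a single `data` key as an indented block scalar, which is why indentation is easy to break when pasting. The exact metadata (name, labels, JSON key) shown here is illustrative, not copied from the real file:

```yaml
# Illustrative sketch of a Grafana dashboard ConfigMap; the real
# metadata, labels, and data key may differ from what is shown here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-clouddot-insights-cloudigrade
  labels:
    grafana_dashboard: "true"   # marker label so Grafana picks it up (assumed)
data:
  cloudigrade.json: |-
    {
      "title": "Cloudigrade",
      "panels": []
    }
```

When pasting the exported JSON, every line of it must stay indented under the `|-` block scalar; a single unindented line breaks the ConfigMap.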
## How do I make my dashboards not suck? Any resources?
- Use thanos to test your PromQL queries
- Prometheus Querying Basics Docs
- A blog on counters
- Grafana Introduction to PromQL
- Official python prometheus client readme
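As a quick illustration of the counter pitfalls those resources cover: never graph a raw counter, since it only ever increases; wrap it in `rate()` first. The metric name below is made up for the example:

```promql
# Per-second request rate over the last 5 minutes, split by status code.
# "api_http_requests_total" is an illustrative metric name, not one
# cloudigrade necessarily exports.
sum by (status) (rate(api_http_requests_total[5m]))
```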
## Testing SRE/app-interface alerts
Although you cannot directly test an alert in the stage/prod environments, you must write unit tests for all alerts.
Note that running tests locally requires you to run the `qontract-reconcile` server in one shell while calling `qontract-cli` from another. Unfortunately, the server is not smart enough to detect changes to files while it's running. So, if you are actively developing a test, each time you save you'll need to stop the server and run `make server`, which rebuilds files and restarts the server. Yes, this is painfully slow. :shrug: If we can find a faster way to test, we'll update these notes.
The arguments for `qontract-cli` are not entirely obvious. Here is one set of commands that should work:
```shell
$ qontract-cli --config config.promtool.toml run-prometheus-test -s config \
    resources/insights-prod/cloudigrade-prod/cloudigrade.prometheusrules-test.yaml \
    crcp01ue1
$ qontract-cli --config config.promtool.toml run-prometheus-test -s config \
    resources/insights-stage/cloudigrade-stage/cloudigrade.prometheusrules-test.yaml \
    crcs02ue1
```
Note that you need to specify which cluster (e.g. `crcp01ue1`) the test would run in. This command doesn't actually call anything in the cluster; the cluster name is just data for the testing.