Monitoring of CMSWEB services with Prometheus AlertManager - dmwm/WMCore GitHub Wiki

Central services running in the CMSWEB Kubernetes cluster are monitored with Prometheus, either through standard exporters (for example for process monitoring or for CouchDB) or through custom CMS monitoring scripts such as the liveness probe k8s service.

Prometheus and these exporters collect node and service metrics, which are then made available in a centralized database (Elasticsearch?). Those metrics are continuously evaluated against the service rules defined in the CMSKubernetes repository. Further information has been provided by the Monitoring team HERE.

Please check the CMSMonitoring documentation HERE for more information on these alerts, where the rules are stored, and how to check which rules are enforced on the Prometheus server.

Updating rules for a given service

Whenever we want to update the Prometheus/AlertManager-based rules and alerts, the changes must be submitted to the CMSKubernetes repository. Two files need to be considered:

  • your_service_name.rules: contains the rule definition, i.e. the conditions that trigger an alert, the alert definition itself, and the time interval at which the rule is evaluated
  • your_service_name.test: a unit test for your rules
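
As an illustration of what such a pair of files might contain, the sketch below writes a minimal rule file and a matching unit test into a scratch directory. Everything service-specific here is hypothetical: the service name myservice, the alert name MyServiceDown, the severity label, the host name and the use of the up metric are placeholders, not the actual CMSWEB rules. It only assumes the standard Prometheus YAML layout for rule groups and promtool unit tests.

```shell
#!/bin/sh
# Sketch only: writes a hypothetical rule file and its unit test to a scratch dir.
set -eu
workdir="${TMPDIR:-/tmp}/myservice-rules-example"
mkdir -p "$workdir"

# Rule file: one alert that fires when the (hypothetical) myservice target
# has been down for 5 minutes; the group is evaluated every minute.
cat > "$workdir/myservice.rules" <<'EOF'
groups:
- name: myservice
  interval: 1m
  rules:
  - alert: MyServiceDown
    expr: up{job="myservice"} == 0
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "myservice is down on {{ $labels.instance }}"
EOF

# Unit test: feed a series that stays at 0 and expect the alert to fire at 10m.
cat > "$workdir/myservice.test" <<'EOF'
rule_files:
  - myservice.rules
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="myservice", instance="host1:8080"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: MyServiceDown
        exp_alerts:
          - exp_labels:
              severity: high
              job: myservice
              instance: "host1:8080"
            exp_annotations:
              summary: "myservice is down on host1:8080"
EOF

echo "wrote $workdir/myservice.rules and $workdir/myservice.test"
```

The existing files in the CMSKubernetes rules directory remain the reference for labels, severities and annotations actually used in CMSWEB.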

Once these changes have been made, we should validate the rule definition and run our unit test file against it. For that, promtool has been deployed in CVMFS. To check the rule definitions, we can run:

amaltaro@lxplus751:~/CMSKubernetes $ /cvmfs/cms.cern.ch/cmsmon/promtool check rules kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules 
Checking kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules
  SUCCESS: 4 rules found

and to run the unit tests we have defined:

amaltaro@lxplus751:~/CMSKubernetes $ /cvmfs/cms.cern.ch/cmsmon/promtool test rules kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.test 
Unit Testing:  kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.test
  SUCCESS
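
The two promtool invocations above could be wrapped in a small helper so that both checks run for a service before a pull request is opened. This is only a sketch: check_service, PROMTOOL and RULES_DIR are hypothetical names, the promtool path and rules directory are the ones used in the commands above, and it assumes the files follow the service.rules / service.test naming.

```shell
# Sketch only: validate one service's rule file, then run its unit tests.
# Defaults are the paths used in the commands above; override via environment.
PROMTOOL="${PROMTOOL:-/cvmfs/cms.cern.ch/cmsmon/promtool}"
RULES_DIR="${RULES_DIR:-kubernetes/cmsweb/monitoring/prometheus/rules}"

# check_service SERVICE
# Returns non-zero as soon as either promtool step fails.
check_service() {
    service="$1"
    "$PROMTOOL" check rules "$RULES_DIR/$service.rules" || return 1
    "$PROMTOOL" test rules "$RULES_DIR/$service.test" || return 1
    echo "$service: rules and unit tests OK"
}

# Usage, from the top of a CMSKubernetes checkout:
#   check_service reqmgr2
```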

Once everything looks good on our side, we make a pull request against the CMSKubernetes repository and ask the HTTP team to deploy these changes to CMSWEB.

Querying Monit and AlertManager

In some use cases it is more useful to fetch all the alerts as uploaded to Monit in the form of a .json file than to work with the monitoring page itself (e.g. https://cms-monitoring.cern.ch/alertmanager/#/alerts?receiver=dmwm-admins&filter={service%3D%22ms-rulecleaner%22}).

For this, any machine behind the CERN firewall can be used, together with standard tools for making the HTTP calls. Here is an example of fetching all alerts produced by a particular service and sent to a single receiver group:

curl -o MSRulecleanerAlarms.json  http://cms-monitoring.cern.ch:30093/api/v2/alerts\?receiver=dmwm-admins\&service=ms-rulecleaner
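
A possible variation on the call above is sketched below. It assumes the filter query parameter of the AlertManager v2 API, which takes label matchers such as service="ms-rulecleaner" (the same matcher that appears in the web UI link), and lets curl do the URL encoding. The function names fetch_alerts and summarize_alerts and the ALERTS_URL variable are hypothetical. The summary step additionally assumes that jq is installed and that each returned alert carries labels.alertname, status.state and startsAt.

```shell
# Sketch only: endpoint taken from the curl example above.
ALERTS_URL="${ALERTS_URL:-http://cms-monitoring.cern.ch:30093/api/v2/alerts}"

# fetch_alerts SERVICE RECEIVER OUTFILE
# -G turns the --data-urlencode pairs into a URL-encoded query string.
fetch_alerts() {
    curl -sS -G "$ALERTS_URL" \
        --data-urlencode "receiver=$2" \
        --data-urlencode "filter=service=\"$1\"" \
        -o "$3"
}

# summarize_alerts FILE
# Prints one tab-separated line per alert: name, state, start time (needs jq).
summarize_alerts() {
    jq -r '.[] | [.labels.alertname, .status.state, .startsAt] | @tsv' "$1"
}

# Usage:
#   fetch_alerts ms-rulecleaner dmwm-admins MSRulecleanerAlarms.json
#   summarize_alerts MSRulecleanerAlarms.json
```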