Prometheus AlertManager wrapper APIs - dmwm/WMCore GitHub Wiki

This is basic documentation of the Prometheus AlertManager service and of how WMCore wraps some of its APIs in https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/AlertManager/AlertManagerAPI.py . This wiki is based on the gdoc documentation created by Erik.

Background

The email alerting we typically use in WMCore (sendmail via localhost) does not work in Kubernetes pods, see: https://github.com/dmwm/WMCore/issues/10234 . Since we do not want to install sendmail in each of our pods, we looked into alternatives for alerting. We plan to switch to the MONIT/Prometheus AlertManager API to handle alerts that require notification via email or Slack, or that need to be viewed in a dashboard; this service is supported by CMS Monitoring and many other services have already adopted it.

AlertManager

AlertManager is highly flexible: alerts are sent with an alertname and a set of labels that determine how the alert will be routed. The alertname - or the combination of alertname and labels - uniquely identifies an alert. For example, if ms-transferor needs to alert on two transfers, each alert needs a unique name, or the same name with a unique set of labels; otherwise, the second alert will overwrite the first. Alerts can also be sent with annotations providing additional information about the alert. Annotations do not uniquely identify an alert.
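
As an illustration of this identity rule, the two (hypothetical) label sets below share the same alertname but count as two distinct alerts, because their label sets differ (the dataset label and its values are made up for this example):

{"labels": {"alertname": "stuck_transfer", "service": "ms-transferor", "tag": "wmcore", "dataset": "/PrimA/ProcA/RAW"}}
{"labels": {"alertname": "stuck_transfer", "service": "ms-transferor", "tag": "wmcore", "dataset": "/PrimB/ProcB/RAW"}}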

The routing and type of notifications (email/Slack/Grafana dashboard/etc.) have to be configured by the CMS Monitoring team, and a JIRA ticket needs to be opened for such requests.

Alerts can be seen at: https://cms-monitoring.cern.ch/alertmanager/#/alerts and further information can be found at: https://prometheus.io/docs/alerting/latest/alertmanager/

WMCore alert structure

Alerts can have a fairly complex structure. Here we expand on it and describe how WMCore uses alerts (and how this has been implemented in our service wrapper). An example follows:

{
    "annotations": {
        "hostname": "esg-dmwm-dev1.cern.ch",
        "summary": "[MSOutput] Campaign test not found in central CouchDB",
        "description": "Dataset: test cannot have an output transfer rule because its campaign: test cannot be found in central CouchDB."
    },
    "labels": {
        "tag": "wmcore",
        "service": "ms-output",
        "alertname": "a_random_workflow_name",
        "severity": "high"
    },
    "endsAt": "2021-03-05T17:57:47.256281+01:00",
    "generatorURL": "https://cmsweb.cern.ch"
}

where:

  • annotations.hostname: the host triggering this alert (filled in automatically when an instance of the AlertManagerAPI is created).
  • annotations.summary: a very short summary of the problem (meant to be the email title, if email notification has been set up).
  • annotations.description: a longer description of the problem (meant to be the email body, if email notification has been set up).
  • labels.tag: wmcore by default, so that all these alerts can be grouped and routed together.
  • labels.service: the name of the service triggering this alert (workqueue, reqmgr2, ms-transferor, etc.). We will need to add some validation/sanitization in the future.
  • labels.alertname: will likely be a workflow name (if the alert concerns a workflow failure) or a service name (if it concerns a general problem with a component/service).
  • labels.severity: one of ["high", "medium", "low"]. For WMCore usage, medium severity triggers a Slack notification only, while high triggers both Slack and email notifications.

From the AlertManagerAPI implementation, the following parameters are mandatory when sending an alert: alertName, severity, summary, description, service.
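
A minimal usage sketch is shown below. The constructor argument and the sendAlert signature are assumptions based on the parameter list above; please check AlertManagerAPI.py for the actual interface.

from WMCore.Services.AlertManager.AlertManagerAPI import AlertManagerAPI

# URL of the AlertManager endpoint (assumed value, normally taken from the service configuration)
alertManagerUrl = "https://cms-monitoring.cern.ch/alertmanager"
alertManagerAPI = AlertManagerAPI(alertManagerUrl)

# Send an alert providing the mandatory parameters listed above
# (the values are taken from the example alert document in this wiki).
alertManagerAPI.sendAlert(alertName="a_random_workflow_name",
                          severity="high",
                          summary="[MSOutput] Campaign test not found in central CouchDB",
                          description="Dataset: test cannot have an output transfer rule because its campaign: test cannot be found in central CouchDB.",
                          service="ms-output")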

Listing alerts with CLI tools

As reported by Valentin, there is an alert CLI tool available in CVMFS: /cvmfs/cms.cern.ch/cmsmon/alert, which can be used to look up all our alerts. It has a -help option explaining the supported options and how to use them. A few examples:

# Get all alerts matching multiple filters, e.g. tag=wmcore and severity=high:
alert -tag=wmcore -severity=high

# Get all alerts for a specific service/severity/tag/name:
alert -service=ms-transferor -severity=medium -tag=wmcore -name=Campaign

Alerts can be viewed in plain (human-friendly) and JSON formats. The tool can be used on any Linux machine, inside or outside of the CERN network (for the latter, a valid token is required).
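
For instance, machine-readable output could be requested as in the sketch below; the -json flag name is an assumption, so please verify the exact option with alert -help.

# List all WMCore alerts in JSON format (flag name assumed, verify with -help):
alert -tag=wmcore -json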

Alert routing

Rules need to be defined for AlertManager so that it knows how to route notifications created by the applications. Those rules are defined in this GitLab alertmanager.yaml file.

These rules contain the labels and values that we want to match, in addition to the receivers to which matching alerts should be routed. For further information and details, a JIRA ticket can be opened with the CMSMonitoring team.
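
For illustration only, a routing rule in an AlertManager configuration could look like the sketch below. The receiver name, email address and Slack channel are made up; the actual WMCore rules live in the GitLab alertmanager.yaml file mentioned above.

route:
  routes:
    # Route high-severity WMCore alerts to both email and Slack (hypothetical receiver)
    - match:
        tag: wmcore
        severity: high
      receiver: wmcore-email-and-slack

receivers:
  - name: wmcore-email-and-slack
    email_configs:
      - to: some-egroup@cern.ch      # hypothetical address
    slack_configs:
      - channel: '#some-channel'     # hypothetical channel; the Slack webhook URL is omitted here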