ReqMgr2 monitoring - dmwm/WMCore GitHub Wiki

This wiki describes the information monitored in ReqMgr2. That information ends up in WMStats (agentInfo couch view) and is also pushed to the MonIT infrastructure, using a different data structure that eases data aggregation and visualization in Kibana/Grafana.

This information is collected every 10 minutes by a specific CherryPy thread running in a specific CMSWEB backend: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/CherryPyThreads/HeartbeatMonitor.py

What is monitored and published to WMStats

Most of the following metrics are collected only for active workflows, that is, workflows in any status other than the three archived statuses. The metrics collected and uploaded to WMStats are:

  • requestsByStatus: an overview of the number of workflows in the system, grouped by their current status.
  • requestsByStatusAndCampaign: the number of active workflows grouped by their current status and campaign. Workflows with multiple campaigns are counted once per campaign.
  • requestsByStatusAndNumEvts: the total number of events requested (RequestNumEvents) across all active workflows, grouped by their current status. Note that this does not take FilterEfficiency into account. In addition, workflows with an input dataset don't have this parameter, so they contribute 0 to the total.
  • requestsByStatusAndPrio: the number of active workflows in each status, grouped by their priority (RequestPriority).
  • the basic information collected for any WMCore service: agent_url, agent_version, total_query_time (how long it took to collect all these metrics from the database), down_components (the list of components/threads that are down) and timestamp.
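The grouping described in the list above can be sketched in a few lines of Python. This is only an illustration, not the HeartbeatMonitor code: the function name, the input layout (a list of dictionaries with RequestStatus, Campaign, RequestNumEvents and RequestPriority keys), the names of the archived statuses and the choice to count every workflow in requestsByStatus are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed names for the three archived statuses (not listed in this wiki).
ARCHIVED_STATUSES = {"normal-archived", "aborted-archived", "rejected-archived"}


def summarize_requests(workflows, archived=ARCHIVED_STATUSES):
    """Aggregate workflow dicts into the four WMStats-style metrics."""
    by_status = defaultdict(int)
    by_campaign = defaultdict(lambda: defaultdict(int))
    by_events = defaultdict(int)
    by_prio = defaultdict(lambda: defaultdict(int))

    for wf in workflows:
        status = wf["RequestStatus"]
        # Assumption: the status overview counts every workflow.
        by_status[status] += 1
        if status in archived:
            continue  # the remaining metrics cover active workflows only

        campaigns = wf.get("Campaign") or []
        if isinstance(campaigns, str):
            campaigns = [campaigns]
        # A workflow with several campaigns is counted once per campaign.
        for campaign in set(campaigns):
            by_campaign[status][campaign] += 1

        # Workflows without RequestNumEvents contribute 0 events.
        by_events[status] += wf.get("RequestNumEvents") or 0
        by_prio[status][wf["RequestPriority"]] += 1

    return {
        "requestsByStatus": dict(by_status),
        "requestsByStatusAndCampaign": {s: dict(c) for s, c in by_campaign.items()},
        "requestsByStatusAndNumEvts": dict(by_events),
        "requestsByStatusAndPrio": {s: dict(p) for s, p in by_prio.items()},
    }
```

Nested dictionaries keyed by status keep the per-status grouping explicit; the real service may store these metrics in a different shape.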

What is monitored and published to MonIT

The monitoring information posted to the MonIT systems comes from the same metrics that are posted to WMStats, so the metric names are reused here and their descriptions can be found above. This wiki also shows a sample of each document posted to AMQ/Elasticsearch, which makes it easier to look them up in ES via Kibana/Grafana. Every single document posted to MonIT has the following key/value pairs in addition to the payload:

{"agent_url": "reqmgr2",
 "timestamp": 12345}

where timestamp is set just before uploading the documents to AMQ (note that timestamp is placed under data.metadata by the AMQ client). The actual monitoring data is always available under data.payload, so an example ES query for a ReqMgr2 metric would be:

data.payload.agent_url:reqmgr2 AND data.payload.type:reqmgr2_status
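
Queries of this form can also be assembled programmatically. The helper below is a minimal sketch with a hypothetical name (build_payload_query); it only builds the query string and does not talk to Elasticsearch. Quoting values that contain characters such as hyphens is an assumption about how the index is mapped, so adjust it if the fields behave differently in Kibana.

```python
import re


def build_payload_query(doc_type, agent_url="reqmgr2", **filters):
    """Build a query string that matches fields under data.payload."""
    terms = {"agent_url": agent_url, "type": doc_type}
    terms.update(filters)

    parts = []
    for key, value in terms.items():
        text = str(value)
        # Quote anything that is not a plain word, e.g. "running-closed".
        if not re.fullmatch(r"\w+", text):
            text = f'"{text}"'
        parts.append(f"data.payload.{key}:{text}")
    return " AND ".join(parts)


print(build_payload_query("reqmgr2_status"))
print(build_payload_query("reqmgr2_prio", request_status="running-closed"))
```

The document types and their fields are described below:
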
  • requestsByStatus: represented by the reqmgr2_status document type. A single document per status is created every cycle, regardless of whether there are any workflows in that status.
 "payload": {
        "agent_url": "reqmgr2",
        "type": "reqmgr2_status",
        "request_status": "aborted-completed",
        "num_requests": 0
      }
  • requestsByStatusAndCampaign: represented by the reqmgr2_campaign document type. A document is created for each combination of request_status and campaign every cycle. A status without any workflows (and hence without Campaign data) is skipped.
 "payload": {
        "agent_url": "reqmgr2",
        "type": "reqmgr2_campaign",
        "request_status": "assignment-approved",
        "num_requests": 7,
        "campaign": "HG1804_Validation"
      }
  • requestsByStatusAndNumEvts: represented by the reqmgr2_events document type. A single document per status is created every cycle, regardless of whether there are any workflows (RequestNumEvents) in that status.
 "payload": {
        "agent_url": "reqmgr2",
        "type": "reqmgr2_events",
        "total_num_events": 100000,
        "request_status": "running-closed"
      }
  • requestsByStatusAndPrio: represented by the reqmgr2_prio document type. A document is created for each combination of request_status and request_priority every cycle. A status without any workflows (and hence without RequestPriority data) is skipped.
 "payload": {
        "agent_url": "reqmgr2",
        "type": "reqmgr2_prio",
        "request_status": "failed",
        "request_priority": 600000,
        "num_requests": 2
      }
  • basic ReqMgr2 information (as described for WMStats) is not currently posted to MonIT.
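
The mapping from the WMStats metrics to the flat MonIT documents listed above can be sketched as follows. This is an illustration under stated assumptions, not the actual ReqMgr2 code: build_monit_docs is a hypothetical name, the metrics are assumed to be nested dictionaries keyed by status, and the caller is assumed to supply the full list of statuses. The timestamp is left out because it is set at upload time.

```python
def build_monit_docs(summary, all_statuses, agent_url="reqmgr2"):
    """Flatten nested per-status metrics into one payload dict per document."""
    docs = []

    # reqmgr2_status and reqmgr2_events: one document per status, even if empty.
    for status in all_statuses:
        docs.append({
            "agent_url": agent_url,
            "type": "reqmgr2_status",
            "request_status": status,
            "num_requests": summary["requestsByStatus"].get(status, 0),
        })
        docs.append({
            "agent_url": agent_url,
            "type": "reqmgr2_events",
            "request_status": status,
            "total_num_events": summary["requestsByStatusAndNumEvts"].get(status, 0),
        })

    # reqmgr2_campaign: statuses without campaign data produce no documents.
    for status, campaigns in summary["requestsByStatusAndCampaign"].items():
        for campaign, count in campaigns.items():
            docs.append({
                "agent_url": agent_url,
                "type": "reqmgr2_campaign",
                "request_status": status,
                "campaign": campaign,
                "num_requests": count,
            })

    # reqmgr2_prio: statuses without priority data produce no documents.
    for status, priorities in summary["requestsByStatusAndPrio"].items():
        for priority, count in priorities.items():
            docs.append({
                "agent_url": agent_url,
                "type": "reqmgr2_prio",
                "request_status": status,
                "request_priority": priority,
                "num_requests": count,
            })

    return docs
```

Emitting zero-valued reqmgr2_status and reqmgr2_events documents keeps time series continuous in dashboards, while skipping empty campaign and priority combinations avoids producing documents for combinations that hold no workflows.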