Global WorkQueue monitoring - dmwm/WMCore GitHub Wiki

This wiki describes which information is monitored in Global WorkQueue, which of it ends up in WMStats (the agentInfo couch view), and which is also pushed to the MonIT infrastructure with a different data structure, to ease data aggregation and visualization in Kibana/Grafana.

This information is collected by a specific CMSWEB backend running a dedicated CherryPy thread, which executes every 10 minutes: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/GlobalWorkQueue/CherryPyThreads/HeartbeatMonitor.py

What is monitored and published to WMStats

Global workqueue monitoring data is only collected for workflows that are active in the system; archived workflows are not considered anywhere in this monitoring. Before going into the specific metrics, it is worth defining some parameters and terminology:

  • num_elem: the count of GQEs - short for global workqueue elements - grouped by another metric.
  • sum_jobs: the sum of the estimated top-level jobs (the "Jobs" field of the GQE) over all the GQEs grouped by another metric.
  • max_jobs_elem: the largest number of jobs (the "Jobs" field of the GQE) found in any single GQE among all the GQEs grouped by another metric.
  • possible vs unique: "possible" jobs assume that all jobs in a GQE can run at every single site that is in the SiteWhitelist and not in the SiteBlacklist, while "unique" evenly distributes the jobs of a GQE among all those sites (SiteWhitelist - SiteBlacklist).
  • AAA: metrics with the AAA suffix assume that jobs can run on any resource in SiteWhitelist - SiteBlacklist. Metrics without AAA instead use the possibleSites() function, which also evaluates data locality (including parent and pileup data, if needed) in addition to the site white and black lists.
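To make the "possible" vs "unique" distinction concrete, here is a minimal sketch (hypothetical code, not part of WMCore) of how the two accounting modes count jobs of a single GQE per site:

```python
# Hypothetical illustration (not WMCore code) of "possible" vs "unique"
# job accounting for a single global workqueue element (GQE).
def possible_and_unique(jobs, site_whitelist, site_blacklist):
    """Return per-site job counts under the two accounting modes."""
    sites = sorted(set(site_whitelist) - set(site_blacklist))
    # "possible": every eligible site is assumed able to run ALL the jobs
    possible = {site: jobs for site in sites}
    # "unique": jobs are evenly distributed among the eligible sites
    unique = {site: jobs // len(sites) for site in sites}
    return possible, unique

possible, unique = possible_and_unique(500, ["T1_US_FNAL", "T2_CH_CERN"], [])
# possible -> {"T1_US_FNAL": 500, "T2_CH_CERN": 500}
# unique   -> {"T1_US_FNAL": 250, "T2_CH_CERN": 250}
```

This matches the FNAL/CERN example given for uniqueJobsPerSite and possibleJobsPerSite below; the real code additionally applies data locality for the non-AAA metrics.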

With these definitions in mind, the following metrics are collected at Global WorkQueue level:

  • workByStatus: very basic global workqueue metric grouping amount of work in each workqueue element status.
  • workByStatusAndPriority: lists the amount of work in each GQE status and their priority.
  • workByAgentAndPriority: lists the amount of work in terms of GQE count, the sum of all their jobs, and the maximum number of jobs found in a single GQE. This information is available for every pair of agent_name (representing which agent is processing those GQEs; "AgentNotDefined" is used when the GQE has not yet been pulled down by any agent) and GQE priority.
  • workByAgentAndStatus: lists the amount of work grouped by status and agent that has acquired those GQE.
  • uniqueJobsPerSite: reports the number of unique jobs and workqueue elements for GQEs in one of the following statuses: Available, Acquired and Negotiating, considering data locality constraints and evenly distributing the jobs among the final list of possible sites. E.g., a workflow containing a single GQE with 500 jobs, assigned to FNAL and CERN, would be reported as 250 jobs/1 GQE for FNAL and 250 jobs/1 GQE for CERN.
  • possibleJobsPerSite: reports the number of possible jobs and workqueue elements for GQEs in one of the following statuses: Available, Acquired and Negotiating, considering data locality constraints and assuming every site in SiteWhitelist - SiteBlacklist could run all those jobs. Data is grouped by status and site. E.g., a workflow containing a single GQE with 500 jobs, assigned to FNAL and CERN, would be reported as 500 jobs/1 GQE for FNAL and 500 jobs/1 GQE for CERN.
  • possibleJobsPerSiteAAA: reports the number of possible jobs and workqueue elements for GQEs in one of the following statuses: Available, Acquired and Negotiating, assuming every site in SiteWhitelist - SiteBlacklist could run all those jobs (no data locality evaluation). Information is grouped by status and site.
  • uniqueJobsPerSiteAAA: reports the number of unique jobs and workqueue elements for GQEs in one of the following statuses: Available, Acquired and Negotiating, assuming any site in SiteWhitelist - SiteBlacklist is capable of running those jobs, which get evenly distributed among the final list of possible sites.
  • plus the basic information collected for any WMCore service: agent_url, agent_version, total_query_time (how long it took to collect all these metrics from the database), down_components (the list of components/threads that are down) and timestamp.
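The grouping metrics above all reduce to the same three aggregates per group. A minimal sketch (hypothetical code, not the HeartbeatMonitor implementation) of deriving num_elem, sum_jobs and max_jobs_elem for GQEs grouped by status:

```python
# Hypothetical sketch (not WMCore code): deriving num_elem, sum_jobs and
# max_jobs_elem when GQEs are grouped by their status.
from collections import defaultdict

def summarize_by_status(elements):
    """elements: list of dicts with at least 'Status' and 'Jobs' keys."""
    summary = defaultdict(lambda: {"num_elem": 0, "sum_jobs": 0, "max_jobs_elem": 0})
    for elem in elements:
        stats = summary[elem["Status"]]
        stats["num_elem"] += 1                                  # count of GQEs
        stats["sum_jobs"] += elem["Jobs"]                       # total estimated jobs
        stats["max_jobs_elem"] = max(stats["max_jobs_elem"], elem["Jobs"])
    return dict(summary)

gqes = [{"Status": "Available", "Jobs": 125},
        {"Status": "Available", "Jobs": 10},
        {"Status": "Acquired", "Jobs": 220}]
result = summarize_by_status(gqes)
# result["Available"] -> {"num_elem": 2, "sum_jobs": 135, "max_jobs_elem": 125}
```

The same reduction applies when grouping by (status, priority), (agent, status), (agent, priority) or (status, site) instead of status alone.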

What is monitored and published to MonIT

The monitoring information posted to the MonIT systems comes from the same metrics that are posted to WMStats, so the same metric names are referenced here; check their descriptions above. This wiki also shows a sample of each of the documents posted to AMQ/Elasticsearch, to make them easier to look up in ES via Kibana/Grafana. Every single document posted to MonIT has the following key/value pairs in addition to the payload:

{"agent_url": "global_workqueue",
 "timestamp": 12345}

where timestamp is set just before uploading the documents to AMQ (note the timestamp is placed under data.metadata by the AMQ client). The actual monitoring data is always available under data.payload, so an example ES query for a Global WorkQueue metric would be:

data.payload.agent_url:global_workqueue AND data.payload.type:work_info
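A minimal sketch of how one such document could be assembled before upload (hypothetical code; the function and field layout here are illustrative, not the actual WMCore AMQ client):

```python
# Hypothetical sketch (not the actual WMCore/AMQ client): wrapping one
# monitoring record into the envelope described above. The timestamp is
# set just before upload; the AMQ client then moves it under data.metadata.
import json
import time

def build_document(payload):
    """Attach the common key/value pairs to a metric payload."""
    payload = dict(payload, agent_url="global_workqueue")
    return {"payload": payload, "timestamp": int(time.time())}

doc = build_document({"type": "work_info", "status": "Acquired",
                      "num_elem": 0, "sum_jobs": 0, "max_jobs_elem": 0})
print(json.dumps(doc, indent=2))
```

Once indexed, such a document is what the `data.payload.agent_url:global_workqueue AND data.payload.type:work_info` query above would match.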
  • workByStatus: represented by the work_info document type. A single document per status is created every cycle, regardless of whether there are any workflows in that status.
 "payload": {
        "agent_url": "global_workqueue",
        "type": "work_info",
        "status": "Acquired",
        "max_jobs_elem": 0,
        "sum_jobs": 0,
        "num_elem": 0
      }
  • workByStatusAndPriority: represented by the work_prio_status document type. Creates a document for each combination of status AND priority every cycle. Statuses without any workflows (hence no priority data) are skipped.
 "payload": {
        "agent_url": "global_workqueue",
        "type": "work_prio_status",
        "status": "Available",
        "max_jobs_elem": 125,
        "sum_jobs": 135,
        "num_elem": 2,
        "priority": 316000
      }
  • workByAgentAndStatus: represented by the work_agent_status document type. Creates a document for each combination of status AND agent_name every cycle. If the system does not have any work in those statuses, this metric is not pushed to MonIT.
 "payload": {
        "agent_url": "global_workqueue",
        "type": "work_agent_status"
        "status": "Available",
        "agent_name": "AgentNotDefined",
        "sum_jobs": 880,
        "num_elem": 34,
        "max_jobs_elem": 220
      }
  • workByAgentAndPriority: represented by the work_agent_prio document type. Creates a document for each combination of agent_name (defaults to AgentNotDefined) AND priority every cycle. If the system does not have any work in those statuses, this metric is not pushed to MonIT.
 "payload": {
        "agent_url": "global_workqueue",
        "type": "work_agent_prio",
        "max_jobs_elem": 42,
        "agent_name": "AgentNotDefined",
        "sum_jobs": 75,
        "num_elem": 3,
        "priority": 170000
      }
  • uniqueJobsPerSite: represented by the work_site_unique document type. Creates a document for each combination of status AND site_name every cycle. If the system does not have any work in those three statuses (Available, Acquired, Negotiating), this metric is not pushed to MonIT.
 "payload": {
        "agent_url": "global_workqueue",
        "type": "work_site_unique",
        "status": "Available",
        "site_name": "T1_US_FNAL",
        "sum_jobs": 419,
        "num_elem": 13
      }
  • possibleJobsPerSite: represented by the work_site_possible document type. Creates a document for each combination of status AND site_name every cycle. If the system does not have any work in those three statuses (Available, Acquired, Negotiating), this metric is not pushed to MonIT.
 "payload": {
        "agent_url": "global_workqueue",
        "type": "work_site_possible",
        "status": "Available",
        "site_name": "T2_CH_CERN",
        "sum_jobs": 824,
        "num_elem": 31
      }
  • uniqueJobsPerSiteAAA: represented by the work_site_uniqueAAA document type. Creates a document for each combination of status AND site_name every cycle. If the system does not have any work in those three statuses (Available, Acquired, Negotiating), this metric is not pushed to MonIT.
 "payload": {
        "agent_url": "global_workqueue",
        "type": "work_site_uniqueAAA",
        "status": "Available",
        "site_name": "T1_US_FNAL",
        "sum_jobs": 437,
        "num_elem": 31
      }
  • possibleJobsPerSiteAAA: represented by the work_site_possibleAAA document type. Creates a document for each combination of status AND site_name every cycle. If the system does not have any work in those three statuses (Available, Acquired, Negotiating), this metric is not pushed to MonIT.
 "payload": {
        "agent_url": "global_workqueue",
        "type": "work_site_possibleAAA",
        "status": "Available",
        "site_name": "T3_US_NERSC",
        "sum_jobs": 15,
        "num_elem": 1
      }
  • the basic Global WorkQueue information (as described for WMStats) is not currently posted to MonIT.