WMAgent monitoring - dmwm/WMCore GitHub Wiki

This wiki describes which information is monitored on each WMAgent node. That information ends up in WMStats (agentInfo couch view) and is also pushed to the MonIT infrastructure, using a different data structure in each case to ease data aggregation and visualisation in Kibana/Grafana.

This information is collected by the AgentStatusPoller component every 5 minutes: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/AgentStatusWatcher/AgentStatusPoller.py

The data monitored in each agent can be classified in 5 groups:

priority-wise information
site-wise information
work-wise information
agent general information
agent health information

What is monitored and published to WMStats

The agent collects information from several data sources: the local Couch databases, the resource control, the WMBS tables and the BossAir tables.

The monitoring metrics made available by the agent are:

workers: lists every single worker thread registered in the local database, publishing the following parameters:
- name: name of the worker thread.
- state: the current state of the worker thread (Running means everything is fine).
- cycle_time: contains the time spent by the worker thread to run its last cycle (in seconds).
- poll_interval: contains the polling cycle of the worker thread (in seconds), meaning the thread algorithm is run every X seconds after the previous cycle has completed.
- last_updated: the last time there was a heartbeat for this worker thread. Keep in mind the thread doesn't have any heartbeat while it's running its algorithm code.
LocalWQ_INFO/uniqueJobsPerSite: reports the number of unique jobs and workqueue elements for LQEs in the Available or Acquired status, considering data locality constraints and evenly distributing the jobs among the final list of possible sites. Data is grouped by status and site. E.g., a workflow containing a single LQE with 500 jobs and assigned to FNAL and CERN would be reported as 250 jobs/1 LQE for FNAL and 250 jobs/1 LQE for CERN.
LocalWQ_INFO/possibleJobsPerSite: reports the number of possible jobs and workqueue elements for LQEs in the Available or Acquired status, considering data locality constraints and assuming every site left after applying SiteWhitelist minus SiteBlacklist could run all those jobs. Data is grouped by status and site. E.g., a workflow containing a single LQE with 500 jobs and assigned to FNAL and CERN would be reported as 500 jobs/1 LQE for FNAL and 500 jobs/1 LQE for CERN.
LocalWQ_INFO/workByStatusAndPriority: lists the amount of work in each LQE status and their priority, including all possible WorkQueue Element statuses.
LocalWQ_INFO/workByStatus: lists the amount of work in each LQE status, including all possible WorkQueue Element statuses.
LocalWQ_INFO/total_query_time: time spent (in secs) querying the local WorkQueue databases (which possibly includes the workqueue_inbox too).
WMBS_INFO/wmbsCreatedTypeCount: contains the total amount of jobs in the agent grouped by the job type.
WMBS_INFO/wmbsExecutingTypeCount: lists the amount of WMBS jobs in the executing state grouped by the job type.
WMBS_INFO/sitePendCountByPrio: lists all the sites available in the current agent database, with the number of pending jobs at each site grouped by priority.
WMBS_INFO/wmbsCountByState: lists the amount of jobs in each WMBS job state.
WMBS_INFO/activeRunJobByStatus: lists the number of BossAir jobs that are still active (status=1), grouped by the schedd status name (all schedd statuses are published, even if there are no jobs in that status).
WMBS_INFO/completeRunJobByStatus: lists the number of BossAir jobs that are completed (status=0), grouped by the schedd status name (all schedd statuses are published, even if there are no jobs in that status).
WMBS_INFO/thresholds: lists all the sites available in the current agent database, providing their state, their site-wide running and pending thresholds (in terms of jobs) in the local database.
WMBS_INFO/thresholdsGQ2LQ: contains the site thresholds for job creation (as defined by the function freeSlots(minusRunning=True)). These are the thresholds used for pulling work from the global workqueue (respecting other constraints like QueueDepth, agent status, schedd status, work priority, etc.).
WMBS_INFO/total_query_time: time spent (in secs) querying the local WMBS/BossAir tables.
and the basic information collected for any WMCore service, like:
- agent_url: contains the FQDN of the host
- agent_version: contains the WMAgent version being used
- agent_team: the name of the team the agent is connected to.
- status: one of "ok" (everything is fine), "warning" (non-critical issues) or "error" (components misbehaving or down).
- drain_mode: shows whether the agent is in drain mode or not.
- down_components: contains a list of worker threads that are having problems (or empty list if everything is fine)
- down_component_detail: contains the last 50 lines of the component's log, which should in principle include the error/exception.
- data_last_update: last time this monitoring information was collected and pushed to WMStats
- data_error: not documented yet; its exact meaning still needs to be confirmed.
- disk_warning: empty list if everything is fine, otherwise it reports the partition which is above 85% filled.
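
The difference between uniqueJobsPerSite and possibleJobsPerSite described above can be illustrated with a minimal sketch. The `jobs_per_site` helper and the `{"jobs": ..., "sites": ...}` element layout are hypothetical and only used for illustration; they are not the real LQE schema nor the agent's actual code.

```python
from collections import defaultdict


def jobs_per_site(elements):
    """Return (unique, possible) job counts per site.

    `elements` is a hypothetical list of dicts with the keys "jobs" and
    "sites"; it is NOT the real workqueue element schema.
    """
    unique = defaultdict(float)
    possible = defaultdict(int)
    for elem in elements:
        sites = elem["sites"]
        if not sites:
            continue  # no site can run this element
        for site in sites:
            # unique: jobs are split evenly among the candidate sites
            unique[site] += elem["jobs"] / len(sites)
            # possible: every candidate site could run all the jobs
            possible[site] += elem["jobs"]
    return dict(unique), dict(possible)


unique, possible = jobs_per_site([{"jobs": 500, "sites": ["FNAL", "CERN"]}])
print(unique)    # {'FNAL': 250.0, 'CERN': 250.0}
print(possible)  # {'FNAL': 500, 'CERN': 500}
```

Summing the unique counts over all sites gives back the real number of jobs, while summing the possible counts overestimates it whenever an element can run at more than one site.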

What is monitored and published to MonIT

The monitoring information posted to MonIT comes from the same metrics that are posted to WMStats, so those metric names are referenced here as well and you can check their descriptions above. This wiki also shows a sample of each document posted to AMQ/Elasticsearch, which makes it easier to look them up in ES via Kibana/Grafana. The actual monitoring data is always available under the data.payload. prefix, so an example ES query for a WMAgent metric would be:

data.payload.agent_url:vocms0192.cern.ch AND data.payload.type:wma_prio_info
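
Queries like the one above can also be assembled programmatically. The sketch below uses a hypothetical `monit_query` helper (not part of WMCore) that prefixes every field with data.payload. and joins the terms with AND; it does no escaping of special characters, so it only suits simple values like the ones shown here.

```python
def monit_query(**fields):
    """Build a query string ANDing data.payload.<field>:<value> terms.

    Hypothetical helper for illustration; values are not escaped.
    """
    return " AND ".join(
        f"data.payload.{key}:{value}" for key, value in fields.items()
    )


print(monit_query(agent_url="vocms0192.cern.ch", type="wma_prio_info"))
# data.payload.agent_url:vocms0192.cern.ch AND data.payload.type:wma_prio_info
```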

sitePendCountByPrio: represented by wma_prio_info document type. A single document gets created for each site with jobs pending.

 "payload": {
        "agent_url": "vocms0192.cern.ch",
        "type": "wma_prio_info",
        "job_count": 17,
        "site_name": "T1_US_FNAL",
        "priority": 123000
      }
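
As a rough sketch of how the sitePendCountByPrio metric could be flattened into documents like the one above, assume it is a nested `{site: {priority: job_count}}` mapping. That layout and the `prio_docs` helper are assumptions made for illustration; check AgentStatusPoller.py for the actual implementation.

```python
def prio_docs(agent_url, site_pend_count_by_prio):
    """Flatten an assumed {site: {priority: job_count}} mapping
    into a list of wma_prio_info payloads."""
    docs = []
    for site_name, by_prio in site_pend_count_by_prio.items():
        for priority, job_count in by_prio.items():
            if not job_count:
                continue  # only report entries with jobs pending
            docs.append({
                "agent_url": agent_url,
                "type": "wma_prio_info",
                "job_count": job_count,
                "site_name": site_name,
                "priority": int(priority),
            })
    return docs


print(prio_docs("vocms0192.cern.ch", {"T1_US_FNAL": {123000: 17}}))
```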

thresholds, thresholdsGQ2LQ, possibleJobsPerSite and uniqueJobsPerSite: represented by wma_site_info document type. Creates a document for every single site aggregating data about the site thresholds, state, workqueue elements, possible and unique jobs assigned to it.

  "payload": {
        "agent_url": "vocms0192.cern.ch",
        "type": "wma_site_info",
        "unique_available_jobs": 0,
        "unique_acquired_jobs": 0,
        "site_name": "T2_MY_UPM_BIRUNI",
        "possible_acquired_jobs": 0,
        "thresholdsGQ2LQ": 0,
        "thresholds": {
          "running_slots": 50,
          "pending_slots": 50
        },
        "possible_available_jobs": 0,
        "num_acquired_elem": 0,
        "state": "Down",
        "num_available_elem": 0
      }

workByStatus: represented by wma_work_info document type. Creates a document for each LQE status, regardless of whether there is any work in that status.

  "payload": {
        "agent_url": "vocms0192.cern.ch",
        "type": "wma_work_info",
        "status": "Running",
        "sum_jobs": 1455,
        "num_elem": 19
      }

wmbsCreatedTypeCount and wmbsExecutingTypeCount: represented by wma_wmbs_info document type. It creates a document for every single job type in the system and shows the amount of created and executing jobs in each of those job types.

  "payload": {
        "agent_url": "vocms0192.cern.ch",
        "type": "wma_wmbs_info",
        "job_type": "Cleanup",
        "created_jobs": 0,
        "executing_jobs": 2
      }
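
Since each wma_wmbs_info document combines two WMStats metrics, building it amounts to merging two per-job-type counters. The sketch below assumes both metrics are plain `{job_type: count}` mappings and uses a hypothetical `wmbs_type_docs` helper; neither is taken from the WMCore code.

```python
def wmbs_type_docs(agent_url, created_by_type, executing_by_type):
    """Merge two assumed {job_type: count} mappings into
    a list of wma_wmbs_info payloads, one per job type."""
    docs = []
    # Sorted union, so a job type present in only one mapping still gets a document
    for job_type in sorted(set(created_by_type) | set(executing_by_type)):
        docs.append({
            "agent_url": agent_url,
            "type": "wma_wmbs_info",
            "job_type": job_type,
            "created_jobs": created_by_type.get(job_type, 0),
            "executing_jobs": executing_by_type.get(job_type, 0),
        })
    return docs


for doc in wmbs_type_docs("vocms0192.cern.ch", {"Cleanup": 0}, {"Cleanup": 2}):
    print(doc)
```

The same merge pattern would apply to activeRunJobByStatus and completeRunJobByStatus, keyed by schedd status instead of job type.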

activeRunJobByStatus and completeRunJobByStatus: represented by wma_agent_info document type. It creates a document for every single BossAir job state (schedd_status) and shows the amount of active and completed jobs in that status.

  "payload": {
        "agent_url": "vocms0192.cern.ch",
        "type": "wma_agent_info",
        "completed_jobs": 168,
        "schedd_status": "Running",
        "active_jobs": 453
      }

wmbsCountByState: represented by wma_wmbs_state_info document type. It creates a document for each wmbs job state and the count of jobs in it.

  "payload": {
        "agent_url": "vocms0192.cern.ch",
        "type": "wma_wmbs_state_info",
        "wmbs_status": "complete",
        "num_jobs": 18
      }

workers: represented by wma_health_info document type. Creates a document for every single worker thread, reporting its status and a few other metrics as described in the section above.

  "payload": {
        "agent_url": "vocms0192.cern.ch",
        "type": "wma_health_info",
        "worker_last_hb": 1544006483,
        "worker_poll": 300,
        "worker_cycle_time": 0.7498,
        "worker_state": "Running",
        "worker_name": "ArchiveDataPoller"
      }
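
The wma_health_info fields can be combined to flag worker threads whose heartbeat looks overdue. The sketch below is only one possible check: the `stale_workers` helper, the `grace` factor and the staleness rule itself are assumptions, not the logic the agent uses to fill down_components. The cycle time is added to the allowance because a thread does not send heartbeats while running its algorithm.

```python
import time


def stale_workers(docs, now=None, grace=2.0):
    """Return the names of workers whose heartbeat is older than expected.

    Assumed rule (not from WMCore): a worker is stale when the time since
    worker_last_hb exceeds grace * (worker_poll + worker_cycle_time).
    """
    now = time.time() if now is None else now
    stale = []
    for doc in docs:
        allowed = grace * (doc["worker_poll"] + doc["worker_cycle_time"])
        if now - doc["worker_last_hb"] > allowed:
            stale.append(doc["worker_name"])
    return stale


sample = {
    "worker_last_hb": 1544006483,
    "worker_poll": 300,
    "worker_cycle_time": 0.7498,
    "worker_state": "Running",
    "worker_name": "ArchiveDataPoller",
}
print(stale_workers([sample], now=1544006483 + 1000))  # ['ArchiveDataPoller']
```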

Finally, some basic agent information, like team name and general status, is represented by the wma_summary_info document type. A single document per agent is created every cycle.

  "payload": {
        "agent_url": "vocms0192.cern.ch",
        "type": "wma_summary_info",
        "agent_team": "testbed-vocms0192",
        "down_components": [],
        "drain_mode": false,
        "agent_status": "ok",
        "agent_version": "1.1.18.patch2",
        "wq_query_time": 0,
        "wmbs_query_time": 0
      }