WMAgent monitoring (legacy) - dmwm/WMCore GitHub Wiki

This wiki is meant to describe some aspects of WMAgent that require better monitoring, aiming to ease the debugging process. In general, there will be two major tasks (and possibly a third):

  1. fetch the necessary information with AgentStatusWatcher component (following a defined polling cycle and keeping in mind the additional load on the agent)
  2. publish and make this information retrievable via http
  3. work on data visualization (possibly in the far future)

The main idea behind these improvements is to make the agent status information (internal queues, database tables, job status, site/task thresholds, component status, etc.) easily accessible and, in turn, make the operator's/developer's life easier.

How to proceed

A proposal on how to progress on this development would be (Alan's opinion):

  1. implement a reasonable part (50%?) of the monitoring points as described below
  2. make this information initially available in the component log (or a JSON file; see the sketch after this list)
  3. publish this information either to LogDB or to the HTCondor collector (condor seems to be the easier and better option); still to be investigated
  4. information needs to be queryable (API) and accessible via HTTP
  5. At this point, stop sending this info to the component log.
  6. In the future (> several months ahead), we could work on another service/tool that would make those json data user friendly (graphs, tables, etc) and centralized, hoping to get rid of most of the personal monitoring scripts.
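
As a rough illustration of steps 2 and 4, below is a minimal sketch of how the metrics collected in a polling cycle could be dumped to a JSON file for later publication; the file path and dictionary layout are assumptions, not the actual AgentStatusWatcher implementation.

    import json
    import time

    def dumpMetrics(metrics, outFile="/data/srv/wmagent/current/monitoring.json"):
        """Persist the metrics gathered in this polling cycle as JSON.
        'metrics' is assumed to be a plain dictionary built by the polling
        thread; the output file path is only an example."""
        metrics["timestamp"] = int(time.time())
        with open(outFile, "w") as fObj:
            json.dump(metrics, fObj, indent=2)

    # hypothetical usage at the end of a polling cycle
    dumpMetrics({"wmbsCountByState": {"created": 1200, "executing": 5400}})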

What to monitor

Most of what is listed here belongs to WMAgent; however, there are some aspects of the central Global Queue that need to be monitored as well. The initial and most important aspects are listed below (a rough sketch of the resulting JSON document follows the list):

  1. For a given agent, WMBS info for:
    1. number of jobs in WMBS in each state. (wmbsCountByState)
    2. number of jobs in WMBS in 'created' status, sorted by job type (wmbsCreatedTypeCount)
      1. BONUS: also sort by job priority and site
    3. number of jobs in WMBS in 'executing' status, sorted by job type (should be equal to the numbers of jobs in condor) (wmbsExecutingTypeCount)
    4. thresholds for data acquisition from GQ to LQ (thresholdsGQ2LQ)
    5. thresholds for job creation (LQ to WMBS)
    6. thresholds for job submission (WMBS to condor global pool)
    7. total running and pending thresholds for each site in Drain or Normal status (and their status) (thresholds)
  2. For a given agent, WorkQueue info for:
    1. number of local workqueue elements in each state (+ the total number of possible AND unique number of jobs per site) (already available in the GQ monitoring)
    2. number of local workqueue_inbox elements in each state (+ the total number of possible AND unique number of jobs per site) (workByStatus and possibleJobsPerSite + uniqueJobsPerSite)
    3. worst workqueue element offenders (single elements that create > 30k jobs) in workqueue_inbox in Acquired or Running status.
    4. BONUS: number of workqueue elements in workqueue and number of jobs sorted by priority (workByStatusAndPriority)
      1. NOTE: the number of jobs estimated from workqueue elements cannot be taken for granted, since it does not count utilitarian jobs nor gives a precise number for chained requests.
  3. Agent health:
    1. for each registered thread in a Daemon, the current state (running/idle), the time since last successful execution, the time in the current state, the length of the last cycle, and the status of the last cycle (success / failure).
    2. For WorkQueueManager, the # of blocks obtained from GQ and # of blocks obtained by LQ
    3. For JobCreator, the number of jobs created in the last cycle.
    4. For JobSubmitter, the number of jobs submitted in the last cycle.
  4. Central Global Queue (job counts are not accurate since they do not count further steps):
    1. for workqueue elements in [Available, Acquired and Running] status:
      1. total number of elements and estimated jobs (workByStatus)
      2. by team name: number of WQE, estimated number of possible AND unique jobs
      3. by agent: number of WQE, estimated number of possible AND unique jobs (workByAgentAndPriority and workByStatusAndPriority and workByAgentAndStatus)
      4. by priority: number of WQE, estimated number of possible AND unique jobs (workByAgentAndPriority and workByStatusAndPriority)
      5. by site: number of WQE, estimated number of possible AND unique jobs (uniqueJobsPerSite and possibleJobsPerSite)
      6. WQE that create > 50k jobs (or whatever number we decide)
    2. specific use cases for workqueue elements in Available status
      1. WQE without a common site list (which does not pass the work restrictions)
      2. WQE older than 7 days (or whatever number we decide)
    3. specific use cases for workqueue elements in Acquired status
      1. WQE older than 7 days (or whatever number we decide)
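
A rough sketch of the JSON document an agent could publish for the metrics above is shown below; the key names follow the metric names in this list, but the exact layout and the values are assumptions for illustration only.

    # Hypothetical shape of the document AgentStatusWatcher could publish;
    # key names follow the metrics listed above, values are made up.
    agentDoc = {
        "agent_url": "vocms0xxx.cern.ch",
        "WMBS_INFO": {
            "wmbsCountByState": {"created": 1200, "executing": 5400, "jobcooloff": 30},
            "wmbsCreatedTypeCount": {"Processing": 800, "Production": 300, "Merge": 100},
            "thresholds": {"T1_US_FNAL": {"state": "Normal",
                                          "pending_slots": 3000,
                                          "running_slots": 9000}},
        },
        "LocalWQ_INFO": {
            "workByStatus": {"Available": {"sum": 25000, "count": 12}},
        },
    }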

HTCondor Data Path (approach not used!)

One channel through which the monitoring data could be reported is the HTCondor ClassAd. The ClassAd is a simple key-expression format (here, we use it as key-values) that is slightly more expressive than JSON. The monitoring data from AgentStatusWatcher would be converted from JSON to ClassAd format and sent as part of the schedd ClassAd to the global pool collector.

Once in the collector, the key/values can be queried by various monitoring infrastructure (Ganglia, Kibana, Grafana, ElasticSearch) utilized by the Submit Infrastructure team.

To integrate with HTCondor, AgentStatusWatcher would ship with a script that converts the current JSON output to ClassAd format. Periodically, the condor_schedd process would call this script in a non-blocking manner; the schedd would take the output ClassAd, merge it with its internal ClassAd, and send the merged version in the next collector update.
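
A minimal sketch of such a converter is shown below: it flattens the JSON document into 'Key = value' lines (the old ClassAd format). The WMAgent_ attribute prefix and the underscore-based key flattening are illustrative assumptions, not the actual script that would ship with AgentStatusWatcher.

    import json

    def toClassAd(data, prefix="WMAgent_"):
        """Flatten a nested JSON document into old ClassAd format, i.e. one
        'Key = value' line per attribute (lists are not handled in this sketch)."""
        lines = []

        def _walk(obj, key):
            if isinstance(obj, dict):
                for name, value in obj.items():
                    _walk(value, "%s_%s" % (key, name) if key else name)
            elif isinstance(obj, bool):
                lines.append("%s%s = %s" % (prefix, key, "true" if obj else "false"))
            elif isinstance(obj, (int, float)):
                lines.append("%s%s = %s" % (prefix, key, obj))
            else:
                lines.append('%s%s = "%s"' % (prefix, key, obj))

        _walk(data, "")
        return "\n".join(lines)

    # the schedd cron mechanism would pick up these lines from stdout
    with open("monitoring.json") as fObj:
        print(toClassAd(json.load(fObj)))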

Description of the local agent metrics collected

These metrics are collected every 5 minutes inside each agent. This list is still expanding; however, a short description of the metrics already available as of the 1.1.2.patch4 release can be found below. For WMBS_INFO data, which is mostly information coming from the relational database (wmbs and bossair tables); a small sketch using these metrics follows the list:

  • activeRunJobByStatus: represents all the jobs active in the BossAir database, which are also in the condor pool. Jobs should not remain in the “New” status longer than a JobStatusLite component cycle.
  • completeRunJobByStatus: represents jobs no longer active in the BossAir database, which are also gone from the condor pool. TODO: how do the cleanup and state transition happen in this case?
  • sitePendCountByPrio: freeSlots(minusRunning=True) code-wise. Number of pending jobs to be processed (not jobs pending in condor) keyed by the request priority, for each site. Affects data acquisition from LQ to WMBS.
  • thresholds: provides the state, the pending and running thresholds for each site.
  • thresholdsGQ2LQ: freeSlots(minusRunning=True) code-wise. Calculates the number of free slots for each site, the same call used for data acquisition from GQ to LQ. Divided into two large queries whose results are aggregated to define the final available thresholds.
    • assigned jobs: checks jobs with a location assigned and in several states, like created, *cooloff, executing, etc. Skips jobs Running in condor.
    • unassigned jobs: checks jobs without a location assigned and in all states but killed and cleanup.
  • wmbsCountByState: number of wmbs jobs in each status. Data is cleaned up as workflows get archived.
  • wmbsCreatedTypeCount: number of wmbs jobs created for each job type.
  • wmbsExecutingTypeCount: number of wmbs jobs in executing state for each job type. An executing job can be either pending, running or just gone from condor.
  • total_query_time: time it took to collect all these SQL database information.
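
As a small illustration of how these WMBS_INFO metrics can be used together, below is a hedged sketch of a sanity check an operator might run on the collected dictionary; the dictionary layout is assumed to mirror the metric names above.

    def checkExecutingJobs(wmbsInfo):
        """Compare the number of wmbs jobs in 'executing' state against the
        jobs BossAir still considers active in condor. The two numbers are
        expected to be close, but not identical, since an executing job may
        already be gone from condor (see wmbsExecutingTypeCount above)."""
        executing = sum(wmbsInfo["wmbsExecutingTypeCount"].values())
        activeInCondor = sum(wmbsInfo["activeRunJobByStatus"].values())
        print("wmbs executing: %d, BossAir active: %d" % (executing, activeInCondor))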

For LocalWQ_INFO, which shows information from the local workqueue (not from workqueue_inbox!) couchdb database:

  • workByStatus: collects top level Jobs information grouped by WQE Status from the local workqueue database (which is not replicated from cmsweb). Thus there can be work units in central workqueue in Acquired while they are all Available in local workqueue.
  • workByStatusAndPriority: collects top level Jobs information grouped by WQE Status and Priority from the local workqueue database (which is not replicated).
  • uniqueJobsPerSite: given WQEs in status Available or Acquired from the local workqueue database, calculates the number of unique top level jobs per site and status. Jobs are normalised among all the common sites (dataLocality=True, thus the intersection of inputLocation + parentLocation + pileupLocation + sitewhitelist). E.g. a WQE with 100 jobs and 4 common sites contributes 25 jobs and 1 WQ element to each site (see the sketch after this list).
  • possibleJobsPerSite: given WQEs in status Available or Acquired from the local workqueue database, calculates the number of possible top level jobs per site and status. Any common site (dataLocality=True) is considered a potential site to run all the jobs. E.g. a WQE with 100 jobs and 4 common sites contributes 100 jobs and 1 WQ element to each site.
  • total_query_time: time spent collecting all these workqueue metrics in the agent.
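
The unique vs. possible counting described above can be illustrated with the small sketch below; the element layout is simplified (in the real code the common site list comes from the data-locality intersection).

    from collections import defaultdict

    def jobsPerSite(elements):
        """For each site, count 'unique' jobs (the element's jobs split evenly
        among its common sites) and 'possible' jobs (every common site credited
        with all of the element's jobs)."""
        unique, possible = defaultdict(float), defaultdict(int)
        for elem in elements:
            sites, jobs = elem["commonSites"], elem["jobs"]
            for site in sites:
                unique[site] += float(jobs) / len(sites)
                possible[site] += jobs
        return dict(unique), dict(possible)

    # a WQE with 100 jobs and 4 common sites: 25 unique / 100 possible per site
    print(jobsPerSite([{"jobs": 100, "commonSites": ["T1_US_FNAL", "T2_CH_CERN",
                                                     "T2_DE_DESY", "T2_IT_Bari"]}]))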

Description of the global workqueue metrics collected

These metrics are collected every 10 minutes from the Global WorkQueue service that runs on CMSWEB.

  • workByStatus: collects top level Jobs information grouped by WQE Status from the global workqueue database (which gets replicated to local workqueue_inbox one).
  • workByStatusAndPriority: collects top level Jobs information grouped by WQE Status and Priority from the global workqueue database (which gets replicated to local workqueue_inbox one).
  • workByAgentAndStatus: collects top level Jobs information grouped by Agent and Status from the global workqueue database (which gets replicated to the local workqueue_inbox one). If no agent has pulled the work down yet (the element is still Available), the agent is reported as Null. "sum" is the total number of top level jobs, "count" is the number of WQE, and "min" and "max" correspond to the minimum and maximum number of Jobs found in a single WQE (illustrated in the sketch after this list).
  • workByAgentAndPriority: collects top level Jobs information grouped by Agent and Priority from the global workqueue database (which gets replicated to the local workqueue_inbox one). If no agent has pulled the work down yet, the agent is reported as Null. "sum", "count", "min" and "max" carry the same meaning as above.
  • uniqueJobsPerSiteAAA: given WQEs in status Available, Negotiating or Acquired from the global workqueue database, calculates the number of unique top level jobs per site and status. Jobs are normalised among all the possible sites (sitewhitelist - siteblacklist). E.g. WQE with 100 jobs with 4 possible sites, each site has 25 jobs and 1 WQ element.
  • possibleJobsPerSiteAAA: given WQEs in status Available, Negotiating or Acquired from the global workqueue database, calculates the number of possible top level jobs per site and status. Any possible site (sitewhitelist - siteblacklist) is considered a potential site to run all these Jobs. E.g. a WQE with 100 jobs and 4 possible sites has 100 jobs and 1 WQ element for each site.
  • uniqueJobsPerSite: given WQEs in status Available, Negotiating or Acquired from the global workqueue database, calculates the number of unique top level jobs per site and status. Jobs are normalised among all the common sites (dataLocality=True, thus intersection of inputLocation + parentLocation + pileupLocation + sitewhitelist). E.g. WQE with 100 jobs with 4 common sites, each site has 25 jobs and 1 WQ element.
  • possibleJobsPerSite: given WQEs in status Available, Negotiating or Acquired from the global workqueue database, calculates the number of possible top level jobs per site and status. Any common site (dataLocality=True) is considered a potential site to run all the Jobs. E.g. a WQE with 100 jobs and 4 common sites has 100 jobs and 1 WQ element for each site.
  • total_query_time: time spent collecting all these global workqueue metrics on the cmsweb backend.
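
The "sum"/"count"/"min"/"max" aggregation mentioned for the by-agent metrics can be illustrated with the sketch below; the input element layout is an assumption.

    from collections import defaultdict

    def workByAgentAndStatus(elements):
        """Group WQEs by (agent, status) and report the number of elements
        ('count'), the total number of top level jobs ('sum') and the
        smallest/largest number of jobs found in a single WQE ('min'/'max').
        Elements not yet pulled by any agent carry agent=None (Null)."""
        result = defaultdict(lambda: {"count": 0, "sum": 0, "min": None, "max": None})
        for elem in elements:
            entry = result[(elem.get("agent"), elem["status"])]
            entry["count"] += 1
            entry["sum"] += elem["jobs"]
            entry["min"] = elem["jobs"] if entry["min"] is None else min(entry["min"], elem["jobs"])
            entry["max"] = elem["jobs"] if entry["max"] is None else max(entry["max"], elem["jobs"])
        return dict(result)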

Use cases to be implemented at the visualization layer

Use case 1 - amount of work in the agents local queue

  • sub case 1: getting the number of ‘created’ jobs from ‘wmbsCountByState’. This gives us a hint of how much work/jobs JobSubmitter has in its local queue for submission to condor.
  • sub case 2: we could even split this info into ‘CPU-bound’ vs ‘IO-bound’ jobs by looking at ‘wmbsCreatedTypeCount’: the sum of all ‘Processing’ and ‘Production’ jobs gives the CPU-bound count, while all other job types are considered IO-bound (see the sketch after this list).
  • note: this information is useful both in an agent basis and aggregated to all the agents.
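
A minimal sketch of sub case 2, assuming wmbsCreatedTypeCount is a plain mapping of job type to number of jobs:

    def splitCreatedJobs(wmbsCreatedTypeCount):
        """Split created jobs into CPU-bound (Processing/Production) and
        IO-bound (all other job types: Merge, Cleanup, LogCollect, ...)."""
        cpuBound = sum(count for jobType, count in wmbsCreatedTypeCount.items()
                       if jobType in ("Processing", "Production"))
        ioBound = sum(wmbsCreatedTypeCount.values()) - cpuBound
        return cpuBound, ioBound

    print(splitCreatedJobs({"Processing": 800, "Production": 300, "Merge": 50}))  # (1100, 50)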

Use case 2 - amount of job failures

  • sub case 1: plot the number of non-final failures for each agent by looking at keys like ‘*cooloff’ (e.g. jobcooloff), from wmbsCountByState.
  • sub case 2: plot the number of final failures for each agent by looking at keys like ‘*failed’ (e.g. jobfailed), from wmbsCountByState.
  • note: both agent and aggregate view are interesting.
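
A small sketch covering both sub cases, matching the ‘*cooloff’ and ‘*failed’ keys of wmbsCountByState:

    def failureCounts(wmbsCountByState):
        """Return (non-final, final) failure counts: states ending in 'cooloff'
        are retryable failures, states ending in 'failed' are final failures."""
        cooloff = sum(count for state, count in wmbsCountByState.items()
                      if state.endswith("cooloff"))
        failed = sum(count for state, count in wmbsCountByState.items()
                     if state.endswith("failed"))
        return cooloff, failed

    print(failureCounts({"jobcooloff": 30, "submitcooloff": 5, "jobfailed": 12}))  # (35, 12)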

Use case 3 - amount of work in condor queue

  • sub case 1: checking the number of jobs running in condor by looking at the ‘Running’ value of ‘activeRunJobByStatus’.
  • sub case 2: checking the number of jobs pending in condor by summing up the ‘New’ and ‘Idle’ values of ‘activeRunJobByStatus’.
  • note: both agent and aggregated view are important.
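
A corresponding sketch for the condor queue view, using the activeRunJobByStatus keys named above:

    def condorQueueView(activeRunJobByStatus):
        """Jobs running in condor vs jobs still waiting for a slot (New + Idle)."""
        running = activeRunJobByStatus.get("Running", 0)
        pending = activeRunJobByStatus.get("New", 0) + activeRunJobByStatus.get("Idle", 0)
        return running, pending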

Use case 4 - idea of the wmagent load

  • sub case 1: the ‘total_query_time’ metric should give us a rough idea of how the agent and the SQL database are performing. If this rises to the level of minutes, then something is not healthy in the agent (see the sketch below).
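
A trivial sketch of such a check; the 120-second alarm threshold is an assumption, not an agreed value.

    QUERY_TIME_ALARM = 120  # seconds; the threshold is an assumption

    def checkQueryTime(totalQueryTime):
        """Warn when the metric collection itself becomes suspiciously slow."""
        if totalQueryTime > QUERY_TIME_ALARM:
            print("WARNING: metric collection took %ss, agent/DB may be overloaded" % totalQueryTime)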