WMAgent - dmwm/WMCore GitHub Wiki

The Basics

The WMAgent software is a distributed component of the production system. In a nutshell, its functions are:

  • Splitting WorkQueue elements into smaller basic work units, known as jobs.
  • Creating jobs and controlling the flow of work according to the tasks defined in the workload of a request.
  • Submitting jobs to a batch system (e.g. HTCondor, LSF).
  • Tracking the submitted jobs and keeping tabs on their outcome.
  • Registering the produced data into the CMS catalogs (i.e. DBS2/3, PhEDEx).
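
The first function in the list, splitting a unit of work into jobs, can be illustrated with a minimal sketch. It is not WMCore code: the function name `split_into_jobs`, the `files_per_job` parameter, the job dictionary layout, and the file names are all assumptions made for illustration, and a simple fixed-size grouping of input files stands in for whatever splitting the workload actually defines.

```python
from typing import Dict, List


def split_into_jobs(files: List[str], files_per_job: int) -> List[Dict]:
    """Group input files into fixed-size jobs (hypothetical file-based splitting)."""
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    jobs = []
    for start in range(0, len(files), files_per_job):
        chunk = files[start:start + files_per_job]
        # Each job is the basic work unit: an id plus the files it will process.
        jobs.append({"job_id": len(jobs) + 1, "input_files": chunk})
    return jobs


# Made-up input files standing in for the content of one WorkQueue element.
element_files = [f"/store/data/file_{i}.root" for i in range(5)]
jobs = split_into_jobs(element_files, files_per_job=2)
for job in jobs:
    print(job["job_id"], job["input_files"])
```

With five input files and two files per job, the sketch yields three jobs, the last one holding the single leftover file.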

The Databases

The WMAgent relies on two services for its operation:

  • A relational database to keep the WMAgent state, known as WMBS.
  • A non-relational database for monitoring and document storage; the current implementation uses CouchDB.

WMBS (Workload Management Bookkeeping System)

Document storage (CouchDB)

The WMAgent as a state transition system for jobs

WMAgent State Transition Diagram
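
The idea of treating jobs as a state transition system can be sketched as a table of allowed transitions plus a check on every state change. The state names and edges below are a simplified illustration and an assumption of this sketch, not the authoritative WMAgent transition table; the `Job` class and `InvalidTransition` exception are likewise hypothetical.

```python
# Illustrative transition table: state names and edges are assumptions,
# not the authoritative WMAgent definition.
TRANSITIONS = {
    "new": {"created", "createfailed"},
    "created": {"executing", "submitfailed"},
    "executing": {"complete", "jobfailed"},
    "complete": {"success", "jobfailed"},
    "jobfailed": {"jobcooloff", "exhausted"},
    "jobcooloff": {"created"},
    "success": set(),
    "exhausted": set(),
}


class InvalidTransition(Exception):
    """Raised when a job is asked to move along an edge that is not in the table."""


class Job:
    def __init__(self, name):
        self.name = name
        self.state = "new"
        self.history = ["new"]

    def transition(self, new_state):
        # Only edges listed in TRANSITIONS are accepted.
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise InvalidTransition(f"{self.state} -> {new_state} is not allowed")
        self.state = new_state
        self.history.append(new_state)


job = Job("example-job")
# One failed attempt followed by a successful retry.
for state in ("created", "executing", "jobfailed", "jobcooloff",
              "created", "executing", "complete", "success"):
    job.transition(state)
print(" -> ".join(job.history))
```

Keeping the allowed edges in one table means each component only needs to request a transition; an attempt to skip a step (for example `new` directly to `success`) is rejected in a single place.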

The WMComponents

The WMAgent is made up of threaded WMComponents that function independently and use WMBS and CouchDB as their sources of information. Some of them also interact with external services such as PhEDEx, ReqMgr, DBS, WorkQueue, or SiteDB.
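
The pattern described above, independent threaded components that coordinate only through shared storage, can be sketched as follows. This is a toy illustration and not the WMCore component framework: the `PollingComponent` class, the `max_cycles` limit, and the in-memory `shared_store` dictionary (standing in for WMBS and CouchDB) are assumptions of the sketch.

```python
import threading


class PollingComponent(threading.Thread):
    """Hypothetical component: runs a work function periodically in its own thread."""

    def __init__(self, name, work, interval=0.01, max_cycles=3):
        super().__init__(name=name, daemon=True)
        self.work = work
        self.interval = interval
        self.max_cycles = max_cycles  # bounded here so the example terminates
        self.cycles = 0
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set() and self.cycles < self.max_cycles:
            self.work(self.name)
            self.cycles += 1
            # Sleep between cycles, but wake up immediately if stop() is called.
            self._stop_event.wait(self.interval)

    def stop(self):
        self._stop_event.set()


# In-memory stand-in for the shared databases; components never call each other.
store_lock = threading.Lock()
shared_store = {}


def record_cycle(component_name):
    with store_lock:
        shared_store[component_name] = shared_store.get(component_name, 0) + 1


components = [PollingComponent(name, record_cycle)
              for name in ("JobCreator", "JobSubmitter")]
for component in components:
    component.start()
for component in components:
    component.join()
print(shared_store)
```

Because the components share no direct references, one of them can be stopped or restarted without touching the others.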

WorkQueueManager

JobCreator

JobSubmitter

JobStatusLite

JobTracker

JobUpdater

JobAccountant

ErrorHandler

RetryManager

JobArchiver

TaskArchiver

AnalyticsDataCollector

AgentStatusWatcher

ArchiveDataReporter

DBS3Upload

PhEDExInjector

Future developments

WMAgentRefactor

A set of useful initial commands to use once logged in to an agent:

  • Initial login to the machine:
[user@vocms0290]$ cmst1 
cmst1@vocms0290:/afs/cern.ch/user$ agentenv
  • Machine and components status management:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage status
cmst1@vocms0290:/data/srv/wmagent/current$ $manage stop-services
cmst1@vocms0290:/data/srv/wmagent/current$ $manage start-services
cmst1@vocms0290:/data/srv/wmagent/current$ $manage stop-agent
cmst1@vocms0290:/data/srv/wmagent/current$ $manage start-agent
  • Restart a subset of the agent's components:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmcoreD --restart --component JobAccountant,RucioInjector
  • Unregister an agent from WMCore central services:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmagent-unregister-wmstats `hostname -f`
  • Check or add resources to the agent's resource control database:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_HLT -p
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmagent-resource-control --plugin=SimpleCondorPlugin --opportunistic --pending-slots=1000 --running-slots=2000 --add-one-site T3_ES_PIC_BSC
  • Use the internal configuration and SQL client to connect to the current agent's database:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage db-prompt wmagent

Optionally, if the rlwrap tool is available on the agent, you can use it to get line editing and command history at the database prompt, e.g.:

cmst1@vocms0290:/data/srv/wmagent/current$ rlwrap -m -pgreen -H /data/tmp/.sqlplus.hist $manage db-prompt

  • Kill a workflow at the agent:
cmst1@vocms0290:/data/srv/wmagent/current $ $manage execute-agent kill-workflow-in-agent <FIXME:workflow-name> 

The WMAgent tree

  • Minimal depth of the WMAgent tree, starting from the current deployment:
cmst1@vocms0290:/data/srv/wmagent/current $ tree -lL 3
.
├── apps -> apps.sw
│   ├── wmagent -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3
│   │   ├── bin
│   │   ├── data
│   │   ├── doc
│   │   ├── etc
│   │   ├── lib
│   │   ├── xbin
│   │   ├── xdata
│   │   ├── xdoc
│   │   └── xlib
│   └── wmagentpy3 -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3  [recursive, not followed]
├── apps.sw
│   ├── wmagent -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3  [recursive, not followed]
│   └── wmagentpy3 -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3  [recursive, not followed]
├── auth
├── bin
├── config
│   ├── couchdb
│   │   └── local.ini
│   ├── mysql
│   │   └── my.cnf
│   ├── rucio
│   │   └── etc
│   ├── wmagent -> ../config/wmagentpy3
│   │   ├── config.py
│   │   ├── config.py~
│   │   ├── config-template.py
│   │   ├── deploy
│   │   ├── local.ini
│   │   ├── manage
│   │   ├── my.cnf
│   │   ├── __pycache__
│   │   └── rucio.cfg
│   └── wmagentpy3
│       ├── config.py
│       ├── config.py~
│       ├── config-template.py
│       ├── deploy
│       ├── local.ini
│       ├── manage
│       ├── my.cnf
│       ├── __pycache__
│       └── rucio.cfg
├── install
│   ├── couchdb
│   │   ├── certs
│   │   ├── database
│   │   └── logs
│   ├── mysql
│   │   ├── database
│   │   └── logs
│   └── wmagentpy3
│       ├── AgentStatusWatcher
│       ├── AnalyticsDataCollector
│       ├── ArchiveDataReporter
│       ├── DBS3Upload
│       ├── ErrorHandler
│       ├── JobAccountant
│       ├── JobArchiver
│       ├── JobCreator
│       ├── JobStatusLite
│       ├── JobSubmitter
│       ├── JobTracker
│       ├── JobUpdater
│       ├── RetryManager
│       ├── RucioInjector
│       ├── TaskArchiver
│       └── WorkQueueManager
└── sw
    ├── bin
    │   ├── cmsarch -> ../common/cmsarch
    │   ├── cmsos -> ../common/cmsarch
    │   └── scramv1 -> ../common/scramv1
    ├── bootstrap.sh
    ├── bootstrap-slc7_amd64_gcc630.log
    ├── bootstraptmp
    ├── cmsset_default.csh
    ├── cmsset_default.sh
    ├── common
    │   ├── cmsarch
    │   ├── cmsos
    │   ├── cmspkg
    │   ├── migrate-cvsroot
    │   ├── scram
    │   ├── scramv0 -> scram
    │   └── scramv1 -> scram
    ├── data -> /data
    │   ├── admin
    │   ├── certs
    │   ├── khurtado
    │   ├── lost+found
    │   ├── srv
    │   └── tmp
    ├── etc
    │   └── cms-common
    ├── share
    │   └── cms
    └── slc7_amd64_gcc630
        ├── cms
        ├── etc
        ├── external
        ├── tmp
        └── var

  • All component logs can be found here:
cmst1@vocms0290:/data/srv/wmagent/current $ ls -ls /data/srv/wmagent/current/install/wmagentpy3/*/ComponentLog
827896 -rw-r--r--. 1 cmst1 zh 847759271 Aug 24 19:54 /data/srv/wmagent/current/install/wmagentpy3/AgentStatusWatcher/ComponentLog
 13484 -rw-r--r--. 1 cmst1 zh  13799746 Oct 19 08:38 /data/srv/wmagent/current/install/wmagentpy3/AnalyticsDataCollector/ComponentLog
  4244 -rw-r--r--. 1 cmst1 zh   4337901 Oct 19 08:40 /data/srv/wmagent/current/install/wmagentpy3/ArchiveDataReporter/ComponentLog
  4092 -rw-r--r--. 1 cmst1 zh   4182158 Sep  1 16:23 /data/srv/wmagent/current/install/wmagentpy3/DBS3Upload/ComponentLog
 11412 -rw-r--r--. 1 cmst1 zh  11680500 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/ErrorHandler/ComponentLog
  3560 -rw-r--r--. 1 cmst1 zh   3640859 Oct 19 08:42 /data/srv/wmagent/current/install/wmagentpy3/JobAccountant/ComponentLog
 17716 -rw-r--r--. 1 cmst1 zh  18136882 Oct 19 08:43 /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/ComponentLog
 11240 -rw-r--r--. 1 cmst1 zh  11504668 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/JobCreator/ComponentLog
 21708 -rw-r--r--. 1 cmst1 zh  22220852 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/JobStatusLite/ComponentLog
 49336 -rw-r--r--. 1 cmst1 zh  50512403 Oct 19 08:43 /data/srv/wmagent/current/install/wmagentpy3/JobSubmitter/ComponentLog
 26964 -rw-r--r--. 1 cmst1 zh  27606966 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/JobTracker/ComponentLog
 16576 -rw-r--r--. 1 cmst1 zh  16966263 Oct 19 08:43 /data/srv/wmagent/current/install/wmagentpy3/JobUpdater/ComponentLog
 14368 -rw-r--r--. 1 cmst1 zh  14707697 Oct 19 08:45 /data/srv/wmagent/current/install/wmagentpy3/RetryManager/ComponentLog
 55756 -rw-r--r--. 1 cmst1 zh  57089235 Oct 19 08:41 /data/srv/wmagent/current/install/wmagentpy3/RucioInjector/ComponentLog
 22684 -rw-r--r--. 1 cmst1 zh  23221159 Oct 19 08:42 /data/srv/wmagent/current/install/wmagentpy3/TaskArchiver/ComponentLog
600168 -rw-r--r--. 1 cmst1 zh 614565975 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/WorkQueueManager/ComponentLog
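
Since the listing above shows that a few component logs can grow much larger than the rest, a small helper can rank them by size. This is a sketch under stated assumptions: the function names `component_log_sizes` and `largest_logs` are hypothetical, and the install directory is the one shown in the listing, which may differ on other agents.

```python
import glob
import os


def component_log_sizes(install_dir):
    """Return {component_name: size_in_bytes} for every <component>/ComponentLog."""
    sizes = {}
    for path in glob.glob(os.path.join(install_dir, "*", "ComponentLog")):
        # The component name is the directory that contains the log file.
        component = os.path.basename(os.path.dirname(path))
        sizes[component] = os.path.getsize(path)
    return sizes


def largest_logs(sizes, top=3):
    """Return the `top` (component, size) pairs, largest first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Path taken from the listing above; adjust it for a different deployment.
    install_dir = "/data/srv/wmagent/current/install/wmagentpy3"
    for component, size in largest_logs(component_log_sizes(install_dir)):
        print(f"{size / 1024**2:10.1f} MiB  {component}")
```

If the directory does not exist, the glob matches nothing and the script prints no lines.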