Draining Steps

What's the impact of an agent in drain mode

When an agent is set to drain mode, this is what happens:

  • it will NOT pull any new work from central WorkQueue (but it will still acquire local workqueue elements, WQE)
  • it maximizes the pending slots for all sites, as if it were the only agent connected to that team (normally, pending thresholds are distributed among the agents sharing the same team).

How to drain an agent

In short, perform the following steps:

  1. log into your agent
  2. execute the following commands:
agentExtraConfig='{"UserDrainMode":true}'
$manage execute-agent wmagent-upload-config $agentExtraConfig

This will put the agent into drain mode. For more details, please keep reading.

This sets the agent to drain mode via its reqmgr_aux agent document:

"UserDrainMode": True

This means the agent won't pull any workqueue elements from global workqueue. It will only process what is already in the local queue (LQ), maximize the site thresholds, and try to finish those jobs as soon as possible.

The draining configuration used to live in the local config.py file; since ~1.1.15 it has been moved to the central CouchDB, in the reqmgr_auxiliary database. If you decide to take the same agent out of drain, you need to run the same command but set the flag to false instead of true.
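
For example, reusing the same upload command from above with the flag flipped:

agentExtraConfig='{"UserDrainMode":false}'
$manage execute-agent wmagent-upload-config $agentExtraConfig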

Once the agent is in drain mode, it starts running a DrainStatusPoller thread which checks the status of the agent and reports it back to WMStats. When it reports that:

  • all workflows have completed;
  • there are no more condor jobs;
  • all dbs files have been updated and blocks closed;
  • and all files have been injected into PhEDEx;

then the agent should be ready to be recycled and shut down. In order to completely shut down the agent and stop tracking it in WMStats, we need to run:

$manage stop-agent
$manage stop-services
$manage execute-agent wmagent-unregister-wmstats `hostname -f`
### and if the agent is running on an Oracle backend, then we also need to clean up the database with
### WARNING!!!! this cannot be recovered
$manage execute-agent clean-oracle
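
Before running the commands above, a quick way to confirm that no condor jobs are left is to query the local schedd (a minimal check, assuming the HTCondor client tools are available on the agent node):

### should report zero jobs in every state before the agent is shut down
condor_q -totals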

Automatic speedy draining

This process has been automated and integrated into WMCore in https://github.com/dmwm/WMCore/pull/8555, which added a new flag to the configuration in central couch:

"SpeedDrainMode": false,

and this flag is NOT meant to be touched by humans; it is only set by the agent itself once the thresholds are hit. If we want to manually enable the speed draining process, we need to set the following thresholds to a very high number (so that the job counts in the agent immediately fall below them):

   "SpeedDrainConfig": {
       "NoJobRetries": {"Threshold": 200, "Enabled": false},
       "EnableAllSites": {"Threshold": 200, "Enabled": false},
       "CondorPriority": {"Threshold": 500, "Enabled": false}
   },
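
A minimal sketch of how that could be done, reusing the upload command shown earlier and assuming the SpeedDrainConfig key can be uploaded the same way as UserDrainMode (the threshold values below are only illustrative):

agentExtraConfig='{"SpeedDrainConfig": {"NoJobRetries": {"Threshold": 100000, "Enabled": false}, "EnableAllSites": {"Threshold": 100000, "Enabled": false}, "CondorPriority": {"Threshold": 100000, "Enabled": false}}}'
$manage execute-agent wmagent-upload-config $agentExtraConfig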

Once the agent enters speed drain mode, it compares the number of jobs in the system against the thresholds set; for each condition that passes, it can automatically apply the corresponding change to the agent:

  • ALL Production and Processing pending jobs have their JobPrio parameter updated to the highest possible priority (999999); this happens every time the DrainStatusPoller thread runs;
  • MaxRetries is set to 0 in the central couch configuration, so none of the jobs will be retried again;
  • it enables all sites, meaning that JobSubmitter can submit jobs to any site regardless of its status in the resource-control database.

When any of those thresholds is hit, the agent also flips the corresponding Enabled flag to true.
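
To verify that the priority bump was applied, the pending jobs can be inspected directly (a hedged check, assuming the HTCondor client tools are available on the agent node):

### print the priority of every idle (pending) job known to the local schedd
condor_q -constraint 'JobStatus == 1' -af ClusterId JobPrio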

Manual speed draining

If someone wants to speed up the draining process without directly touching the speed draining configuration, any of the following steps can be applied.

Setting maxRetries to 0

Some component functionalities were moved to the reqmgr_auxiliary central DB, so the new way to set the maximum number of job retries to 0 is to update the specific agent document in central CouchDB. In order to accomplish that, run the following command:

curl --cert $X509_USER_CERT --key $X509_USER_KEY -X PUT -H "Content-type: application/json" -d '{"MaxRetries":0}' https://cmsweb.cern.ch/reqmgr2/data/wmagentconfig/`hostname -f`

There is no need to restart any components; they will pick up the new configuration in their next cycle.
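
To double check that the change was applied, the same endpoint can be queried (a hedged example, assuming the wmagentconfig resource also serves GET requests). The returned document should then show MaxRetries equal to 0:

curl --cert $X509_USER_CERT --key $X509_USER_KEY -H "Accept: application/json" https://cmsweb.cern.ch/reqmgr2/data/wmagentconfig/`hostname -f`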

Setting maxRetries to 0 in legacy/T0 agents

Set maxRetries to 0 such that any failures will be terminal and ACDC documents will be created. In the agent configuration file config/wmagent/config.py, replace this line:

config.ErrorHandler.maxRetries = {'default': 3, 'Merge': 4, 'Cleanup': 2, 'LogCollect': 1, 'Harvesting': 2}

by:

config.ErrorHandler.maxRetries = 0
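
If you prefer to make this change from the command line, a minimal sketch (assuming the config file path mentioned above):

### replace the maxRetries dictionary with a flat 0 in the agent configuration
sed -i 's/^config.ErrorHandler.maxRetries = .*/config.ErrorHandler.maxRetries = 0/' config/wmagent/config.py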

Now restart ErrorHandler and RetryManager:

$manage execute-agent wmcoreD --restart --component ErrorHandler,RetryManager

Enabling all sites

In order to make these changes, we first need to disable the AgentStatusWatcher component, otherwise it will automatically set sites back to their respective status in SSB (drain, normal, etc.) and set the thresholds accordingly. In the agent configuration file config/wmagent/config.py, replace this line:

config.AgentStatusWatcher.enabled = True

by:

config.AgentStatusWatcher.enabled = False

Then restart AgentStatusWatcher:

$manage execute-agent wmcoreD --restart --component AgentStatusWatcher

Now we can manually update a couple of tables in the SQL database. Open the MariaDB/Oracle prompt with:

$manage db-prompt wmagent

and execute the following update statements (they enable ALL sites and set the thresholds to a reasonable number):

UPDATE wmbs_location SET state=(SELECT id FROM wmbs_location_state WHERE name='Normal') WHERE state!=(SELECT id FROM wmbs_location_state WHERE name='Normal');
UPDATE wmbs_location SET running_slots=1000, pending_slots=1000;
UPDATE rc_threshold SET max_slots=1000, pending_slots=1000;

Rolling a new WMAgent version to production

This section contains a short description of how we push a new WMAgent release to production, which is different from simply upgrading an agent.

First, some background information which is useful for a better understanding of this procedure.

  • A new WMAgent stable version is released every 2 months (sometimes 2.5 months);
  • a new version of the central services is released and deployed to CMSWEB every month;
  • a new WMAgent release candidate is made available right after the CMSWEB production upgrade (within 2 days);
  • then the validation process starts and it takes on average a week to validate and fix any issues, before we cut the final stable release;
  • from this point on, we have another branch to maintain and make sure that important fixes are backported to it (in addition to the master branch).

Given the background information above, we can have two different WMAgent upgrade scenarios:

  1. there are (severe) breaking changes in the agents (usually related to reqmgr2 and/or workqueue, or to the database schema) and we can't have a mix of agents pulling work from WorkQueue. Thus all the old agents have to be put into drain at the same time and agents running the new release are brought up in their place.
  2. the most common scenario, where there are no breaking changes and both WMAgent versions can coexist and pull work from global workqueue.

No matter the scenario, the first goal is to have the old release replaced by the new one ASAP, such that we don't need to maintain 3 different branches, debug old issues, backport bug fixes to different releases, etc.
In addition to that, we should not let WMAgents fall (too) far behind the new developments made in ReqMgr2 and WorkQueue. Ideally, ALL old releases should be at least in drain mode by the next CMSWEB upgrade date, and even better completely off the grid. Expanding on it: there is a new CMSWEB release every ~4 weeks and validation/bug fixing takes roughly a week, which gives us 3 weeks to get all agents rotated and about 5 weeks left to run a pool of agents on the new WMAgent release (until the next upgrade starts).
In addition to that, we should not have WMAgents falling (too) behind with new developments made in ReqMgr2 and WorkQueue, thus, ideally, we should have ALL old releases at least in drain mode until the next CMSWEB upgrade date, much better if they are off of the grid, of course. Expanding on it, there is a new CMSWEB release every ~4 weeks, validation/bug fix takes a week'ish, that gives us 3 weeks to get all agents rotated and 5 weeks left to run a pool of new WMAgent releases (until the next upgrade starts).