Monitoring the Jobs - TreeMaker/TreeMaker GitHub Wiki
-
Log into CMS Connect and go to your working area, but do not enter a container and do not call cmsenv (replace [username] with your CMS Connect username):
ssh [username]@login-el7.uscms.org
cd /scratch/`whoami`/CMSSW_10_6_29_patch1/src/TreeMaker/Production/test/myProd
-
As often as you prefer (at least several times per day), check on the status of the jobs:
condor_q `whoami` -totals
-
If there are any held jobs, check why they are held:
python manageJobs.py -howm -u `whoami`
It is important to cd to your myProd directory and run cmsenv before running manageJobs.py, in order to pick up the proper configuration from your .prodconfig file. This specific command lists, for every held job (-h): the output log name (-o), why it is held (-w), and the site and machine where it ran (-m). If you see a large number of failed jobs at a specific site, it may be a "black hole". Please report this to the list so others can avoid it. Common error codes for xrootd failures are 84 and 85. Other exit codes most often indicate transient failures and can be ignored.
-
Release any held jobs to run again:
python manageJobs.py -hs -u `whoami`
If you need to remove a black hole site, you can use an extra argument (filling in [site1,site2,...] with a comma-separated list of black hole sites):
python manageJobs.py -hs --rm-sites [site1,site2,...] -u `whoami`
-
Every day, you should also check for stuck jobs, which are still running on a worker node but are no longer active. To do this, replace -h with -t in the commands from steps 3 and 4 (checking and releasing held jobs).
-
Reply to the list when most of your jobs are finished to receive further instructions.
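The flag substitution described for stuck jobs can be sketched as a small Python helper that assembles the manageJobs.py command lines for checking and releasing jobs. The function name build_commands and its parameters are hypothetical, invented for illustration; only the flags themselves (-howm, -hs, -t, --rm-sites, -u) come from the steps above. The helper only builds argument lists and does not run anything.

```python
import getpass


def build_commands(user=None, stuck=False, rm_sites=None):
    """Return (check, release) argument lists for manageJobs.py.

    Hypothetical helper: the name and signature are illustrative only.
    stuck=False selects held jobs (-h); stuck=True selects stuck jobs (-t).
    """
    # Plays the same role as `whoami` in the commands above.
    user = user or getpass.getuser()
    select = "t" if stuck else "h"
    # Check command: log name (-o), reason (-w), site and machine (-m).
    check = ["python", "manageJobs.py", "-" + select + "owm", "-u", user]
    # Release command, optionally with a list of black hole sites to remove.
    release = ["python", "manageJobs.py", "-" + select + "s"]
    if rm_sites:
        release += ["--rm-sites", ",".join(rm_sites)]
    release += ["-u", user]
    return check, release


if __name__ == "__main__":
    # "alice" is a placeholder username.
    check, release = build_commands(user="alice", stuck=True)
    print(" ".join(check))
    print(" ".join(release))
```

Building the two commands from one switch keeps the held and stuck variants in sync, so the only difference between them is the -h or -t selector.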
Tip: if you want to avoid the need for -u `whoami` in your manageJobs.py commands, you can edit your "global" configuration file ~/.prodconfig to include the following lines:
[common]
user = username
replacing username with your username.
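The same change to ~/.prodconfig can also be made programmatically. The sketch below uses Python's standard configparser module, on the assumption that ~/.prodconfig is a plain INI-style file, as the [common] snippet suggests. The function name set_prodconfig_user is hypothetical. Note that configparser discards comments when it rewrites a file, so editing by hand is safer if your file contains any.

```python
import configparser
import os


def set_prodconfig_user(username, path="~/.prodconfig"):
    """Set `user` in the [common] section, keeping other sections and options.

    Hypothetical helper; assumes the file is INI-formatted.
    """
    path = os.path.expanduser(path)
    # Disable interpolation so '%' characters in existing values are left alone.
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # preserve the case of option names
    config.read(path)  # a missing file is silently skipped
    if not config.has_section("common"):
        config.add_section("common")
    config.set("common", "user", username)
    with open(path, "w") as handle:
        config.write(handle)
```

Calling set_prodconfig_user("yourname") once should then let you drop -u `whoami` from the manageJobs.py commands, under the same assumption about the file format.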