Running a WM job interactively ‐ manually

Requirements

In order to run a job interactively, you will need 3 pieces of information:

  1. a workflow name and job id (the latter also called the WMBS ID). This contains the job index to be used in the job package.
  2. the sandbox tarball for a given workflow
  3. the correct job package for a given workflow (and given job index)

Finding item 1 is simple: you can get it through WMStats, Condor Job Monitoring, or possibly WMArchive. However, you still need to know which agent was processing the job, so that you can find the relevant job package pickle file.

With the workflow name, you need to connect to the WMAgent that created that job and fetch the workflow sandbox, e.g.:

$ ls install/wmagentpy3/WorkQueueManager/cache/WFLOW_NAME/WFLOW_NAME-Sandbox.tar.bz2

The job package is indexed by the WMBS job id, e.g. (for job id 487517):

$ ls install/wmagentpy3/WorkQueueManager/cache/WFLOW_NAME/PackageCollection_0/batch_487517-0/JobPackage.pkl
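
The two lookups above can be combined in a small helper. This is only a sketch: the function name `find_job_files` is hypothetical, and the directory layout (`PackageCollection_*/batch_<job id>-*`) is inferred from the two example paths above rather than taken from WMCore itself. If no package matches, the layout on your agent may differ from this assumption.

```python
from pathlib import Path


def find_job_files(cache_root, workflow, wmbs_job_id):
    """Look up the sandbox and job package(s) for one job in the agent cache.

    cache_root is the agent cache directory, e.g.
    install/wmagentpy3/WorkQueueManager/cache (taken from the examples above).
    Returns (sandbox_path_or_None, list_of_matching_JobPackage_paths).
    """
    wf_dir = Path(cache_root) / workflow
    sandbox = wf_dir / f"{workflow}-Sandbox.tar.bz2"
    # Assumption: the batch directory name starts with the WMBS job id,
    # as in the batch_487517-0 example.
    pattern = f"PackageCollection_*/batch_{wmbs_job_id}-*/JobPackage.pkl"
    packages = sorted(wf_dir.glob(pattern))
    return (sandbox if sandbox.is_file() else None), packages
```

For example, `find_job_files("install/wmagentpy3/WorkQueueManager/cache", "WFLOW_NAME", 487517)` would return the sandbox path and a list with the matching `JobPackage.pkl`, if both exist.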

Re-running a failed job interactively

The following example can be used with any job, provided that you have the sandbox and job package for the failed job (their locations can be found in the job logfiles).

For example, the wmagentJob.log logfile will have the following job instance information:

2021-09-13 15:43:02,357:INFO:Startup:Loading job definition
2021-09-13 15:43:02,368:INFO:Bootstrap:Job Index = 808
Job Instance = {..., 'sandbox': '/data/tier0/admin/Specs/Express_Run317696_StreamALCALUMIPIXELSEXPRESS_Tier0_REPLAY_2021_v2109131538_210913_1538/Express_Run317696_StreamALCALUMIPIXELSEXPRESS_Tier0_REPLAY_2021_v2109131538_210913_1538-Sandbox.tar.bz2', 'jobType': 'Express', 'taskType': 't0', 'spec': '/data/tier0/admin/Specs/Express_Run317696_StreamALCALUMIPIXELSEXPRESS_Tier0_REPLAY_2021_v2109131538_210913_1538/WMSandbox/WMWorkload.pkl', 'counter': 181, 'agentNumber': 0, 'ownerGroup': 'DEFAULT', 'ownerRole': 'DEFAULT', 'numberOfCores': 1, 'allowOpportunistic': False}

and the condor stdout logfile will have information like the following:

condor.33798.37.out:  TransferInput = "/data/tier0/admin/Specs/Express_Run317696_StreamALCALUMIPIXELSEXPRESS_Tier0_REPLAY_2021_v2109131538_210913_1538/Express_Run317696_StreamALCALUMIPIXELSEXPRESS_Tier0_REPLAY_2021_v2109131538_210913_1538-Sandbox.tar.bz2,/data/tier0/admin/Specs/Express_Run317696_StreamALCALUMIPIXELSEXPRESS_Tier0_REPLAY_2021_v2109131538_210913_1538/PackageCollection_0/batch_628-0/JobPackage.pkl,/data/tier0/srv/wmagent/3.0.0/sw.jamadova/slc7_amd64_gcc630/cms/t0/3.0.0/lib/python3.8/site-packages/WMCore/WMRuntime/Unpacker.py"
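
Picking the two paths out of such a long line by eye is error-prone, so a short parser can help. The sketch below assumes the `TransferInput` value is a double-quoted, comma-separated list as in the example above, and that the sandbox and package can be recognised by their file name suffixes; the function name `parse_transfer_input` is hypothetical.

```python
import re


def parse_transfer_input(line):
    """Extract the sandbox and JobPackage paths from a TransferInput log line.

    Returns (sandbox, package); either element is None if it is not listed.
    """
    match = re.search(r'TransferInput\s*=\s*"([^"]*)"', line)
    if match is None:
        raise ValueError("no TransferInput attribute found in line")
    # The attribute value is a comma-separated list of file paths.
    files = [item.strip() for item in match.group(1).split(",") if item.strip()]
    sandbox = next((f for f in files if f.endswith("-Sandbox.tar.bz2")), None)
    package = next((f for f in files if f.endswith("/JobPackage.pkl")), None)
    return sandbox, package
```

Passing the condor log line above to `parse_transfer_input` would give the sandbox tarball path and the `PackageCollection_0/batch_628-0/JobPackage.pkl` path.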

Creating the environment

We will use a T0 job as an example:

  • Log in to an interactive node, e.g. lxplus
  • From a particular job, get the sandbox and JobPackage files. E.g.:
sandbox=/afs/cern.ch/user/c/cmst0/public/PausedJobs/12_X_X/job_808/Express_Run317696_StreamALCALUMIPIXELSEXPRESS_Tier0_REPLAY_2021_v2109131538_210913_1538-Sandbox.tar.bz2
package=/afs/cern.ch/user/c/cmst0/public/PausedJobs/12_X_X/job_808/JobPackage.pkl

Next, get the job index. It is printed in the condor job stderr and looks like this:

INFO:root:Job Index = 808
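
If the log is long, the index can be pulled out with a regular expression. This is a minimal sketch that assumes the `Job Index = <number>` wording shown in the log excerpts on this page; `find_job_index` is a hypothetical helper name.

```python
import re

# Matches both "INFO:root:Job Index = 808" and
# "...:INFO:Bootstrap:Job Index = 808" style lines.
JOB_INDEX_RE = re.compile(r"Job Index = (\d+)")


def find_job_index(log_text):
    """Return the first job index found in the log text, or None."""
    match = JOB_INDEX_RE.search(log_text)
    return int(match.group(1)) if match else None
```

It can be applied to the whole content of a logfile, e.g. `find_job_index(open("wmagentJob.log").read())`.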

On lxplus, download the Unpacker script and run it with the job index you need:

sandbox=/afs/cern.ch/user/c/cmst0/public/PausedJobs/12_X_X/job_808/Express_Run317696_StreamALCALUMIPIXELSEXPRESS_Tier0_REPLAY_2021_v2109131538_210913_1538-Sandbox.tar.bz2
package=/afs/cern.ch/user/c/cmst0/public/PausedJobs/12_X_X/job_808/JobPackage.pkl
index=808
wget https://raw.githubusercontent.com/dmwm/WMCore/master/src/python/WMCore/WMRuntime/Unpacker.py
python Unpacker.py --sandbox=$sandbox --package=$package --index=$index

The Unpacker script will create the interactive environment you need to run your job.
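
A typo in one of the three variables only shows up once the Unpacker fails, so it can be convenient to check the inputs first. The wrapper below is a sketch, not part of WMCore: `unpacker_command` and `run_unpacker` are hypothetical names, only the `--sandbox`, `--package` and `--index` options from the command above are used, and it assumes the Unpacker should run with the same interpreter as the wrapper.

```python
import subprocess
import sys
from pathlib import Path


def unpacker_command(sandbox, package, index, unpacker="Unpacker.py"):
    """Build the Unpacker command line after checking that all inputs exist."""
    for label, path in (("sandbox", sandbox), ("package", package),
                        ("unpacker", unpacker)):
        if not Path(path).is_file():
            raise FileNotFoundError(f"{label} not found: {path}")
    return [sys.executable, str(unpacker),
            f"--sandbox={sandbox}",
            f"--package={package}",
            f"--index={int(index)}"]


def run_unpacker(sandbox, package, index, unpacker="Unpacker.py"):
    """Run the Unpacker and raise CalledProcessError if it exits non-zero."""
    subprocess.run(unpacker_command(sandbox, package, index, unpacker),
                   check=True)
```

With the variables from the example, this would be called as `run_unpacker(sandbox, package, 808)`.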

Running your job in the new environment

The above will produce a job directory that you can transfer to an interactive worker node. lxplus also works as the interactive worker, unless the job expects something (such as local data) that you can only get from a particular site. Then, set up Python and run the job:

# Source python3
source /cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/etc/profile.d/init.sh
source /cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/py3-future/0.18.2/etc/profile.d/init.sh
source /cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/py3-setuptools/39.2.0/etc/profile.d/init.sh

# Run job:
cd job
export WMAGENTJOBDIR=$PWD
export PYTHONPATH=$PWD/WMCore.zip:$PWD:$PYTHONPATH
python3 Startup.py

If you need to edit the WMCore code: unpack WMCore.zip, edit whatever files you need, and zip the WMCore directory again. You will then be running the modified WMCore code the next time you run the Startup.py script.
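
The unpack and re-zip steps can be sketched with the standard `zipfile` module. The helper names `unpack_zip` and `repack_zip` are hypothetical, and the sketch assumes that the archive must keep the same internal layout after editing, which it does by storing every file under its path relative to the unpack directory. Use an unpack directory that does not contain the zip file being written.

```python
import zipfile
from pathlib import Path


def unpack_zip(zip_path, work_dir):
    """Extract the whole archive (e.g. WMCore.zip) into work_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(work_dir)


def repack_zip(work_dir, zip_path):
    """Rebuild the archive from work_dir, keeping the original relative paths."""
    work_dir = Path(work_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(work_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(work_dir).as_posix())
```

A typical sequence would be `unpack_zip("WMCore.zip", "wmcore_src")`, editing the files under `wmcore_src/`, and then `repack_zip("wmcore_src", "WMCore.zip")` before running Startup.py again.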