Project to migrate WMAgent to a work unit level handling through late materialization of jobs - dmwm/WMCore GitHub Wiki

Overview

The idea is for the Workflow Management and Submission system to be able to dynamically adjust the size of jobs as needed. For example, if we match on a resource with high number of CPUs per node but limited number of jobs in the batch system (typical HPC case), we should be able to increase the length of many core jobs to use the CPUs as efficiently as possible. This requires being able to materialize the jobs only after the workflow has been matched/assigned to a set of resources, with the job length depending on some inputs defined by in the resource (e.g.: the job length in traditional resources vs HPC resources would be different).

We would like to have the WMAgents think as much as they can at a work unit/package level without thinking in the job level layer. Migrating WMAgent to manage these work units, such that each job in the end can be composed of one or more work units would be a solution.

Also, partial job success needs to handled in the system, returning the non-processed work units back to the system to create other jobs(s). Checkpointing within the CMSSW framework would need to be investigated and used here. Additionally, we need to think about how StepChain (multiple cmsRuns per job) impacts this process.

Regarding late materialization

HTCondor has a new feature that allows the late materialization of jobs in the Scheduler to enable submission of very large sets of jobs. More jobs are materialized once the number of idle jobs drops below a threshold, in a similar fashion to DAGMan throttling.

HTCondor versions supporting this

  • HTCondor 8.6.x (current versions on CERN schedds) does not support late materialization.
  • HTCondor 8.7.1 supports late materialization, except for condor_dagman.
  • HTCondor >8.7.4 / has condor_dagman support.

This is becoming part of the new 8.8 stable release branch

Documentation

At present, there is no documentation regarding late materialization, but there have been live demos of this functionality during HTCondor Weeks (8.8.0 was released in January, 2019, including this feature, but there is still no documentation of this).

However, the HTCondor github repo has some perl and python-based examples of it. So, it looks like not only condor_submit but also the python bindings support this feature.

Pool Configuration

Schedulers need to include the following in the condor configuration:

SCHEDD_ALLOW_LATE_MATERIALIZE = True
SCHEDD_MATERIALIZE_LOG = /path/to/file
SUBMIT_FACTORY_JOBS_BY_DEFAULT = True

Job submission examples

Using condor submit

Using this feature requires a single additional classad in the submit file:

# submit.jdl
universe = vanilla
executable = x_sleep.pl
arguments = 3
log = $submitfile.log
notification = never
max_materialize = 4 # Maximum number of materialized jobs in the queue
queue 10

and the submission is done via $ condor_submit <submit.jdl> -factory

Using python bindings

Python bindings can be used in the regular way, including the new classad option. E.g.:

import classad
import htcondor
sub = htcondor.Submit("""
	universe = vanilla
	executable = x_sleep.pl
	arguments = 3
	notification = never
	max_materialize = 4
	queue 10
	""")

logfile = 'logfile.log'
sub['log'] = logfile

schedd = htcondor.Schedd()
with schedd.transaction() as txn:
	cluster = sub.queue(txn)

Note: There also seems to be a max_idle option that can replace max_materialize.

Update regarding late materialization status

Multiple Submission Infrastructure test schedds have been upgraded to HTCondor 8.8.1 (including vocms0263.cern.ch), which include support for late materialization. After some tests, it looks like this feature doesn't seem to work with the Global Pool though. The reason is described below:

As of March 03 of 2019, late materialization does not support x509 proxy certificates and other features the regular submission mode does (it only works for the simplest job structures, according to developers). There is no defined schedule to complete this work, but the hopes are for it to be completed by the end of the 8.9 series.

  • At present, WMAgent uses the Schedd.SubmitMany(), which works with classads directly, so even though this should in principle bypass the x509 proxy issue, Schedd.SubmitMany() is not compatible with late materialization and we need htcondor.Submit() instead. In his words: "Late materialization is when you send the submit file to the schedd instead of the job classads – because creating job classads from a submit description IS materialization.".

Update: A fix has been added on github, this will likely be included on 8.9.1 or 8.9.2 A workaround for older versions requires modifying the schedd configuration as followed:

Edit e.g.: /etc/condor/config.d/99_local_tweaks.config, add:

use_x509userproxy = False
SYSTEM_SECURE_JOB_ATTRS =

Then, x509 classads can be added, e.g:

# submit.jdl
import classad
import htcondor
sub = htcondor.Submit("""
	universe = vanilla
	executable = test.sh
	arguments = 3
	notification = never
	max_materialize = 4
        +UserLogUseXML = True
        +x509userproxy = "/tmp/x509up_u77409"
        +x509UserProxyEmail = "[email protected]"
        +x509UserProxyExpiration = 1554577789
        +x509UserProxyFirstFQAN = "/cms/Role=NULL/Capability=NULL"
        +x509UserProxyFQAN = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=khurtado/CN=764581/CN=Kenyi Paolo Hurtado Anampa,/cms/Role=NULL/Capability=NULL,/cms/uscms/Role=NULL/Capability=NULL"
        +x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=khurtado/CN=764581/CN=Kenyi Paolo Hurtado Anampa"
        +x509UserProxyVOName = "cms"
        +DESIRED_Sites = "T2_US_Nebraska,T2_US_UCSD,T1_US_FNAL,T2_US_Wisconsin,T1_ES_PIC,T2_IT_LegnaroTest,T2_CH_CERN"
        Queue 10
	""")

logfile = "log/logfile-$(Cluster).log"
sub['log'] = logfile

schedd = htcondor.Schedd()
with schedd.transaction() as txn:
	cluster = sub.queue(txn)
        print(cluster)

Note x509userproxy needs to prepend +, so the Schedd doesn't attempt to read its content, but treat it as a common classad insteadn.

We have a ticket to migrate to htcondor.Submit() in the agent.

Submitting jobs with late materialization enabled

Status: This is working by adding, e.g.:

max_materialize = 4

here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L618-L632

Results: HTCondor does the materialization part properly. I.e.: Only 4 jobs per clusterId are materialized. WMAgent then fails with:

error code: 99303 (Could not find jobReport)


WMAgent injects multiple jobs within the same clusterId. With "max_materialize = 4", Jobs with ProcId 0 to 3 are m aterialized. The other jobs are not even seen in the queue because they haven't materialized yet, so JobStatusLite fails on ProcId 4 and forward: 

2019-09-18 15:07:03,740:139940742534912:INFO:BossAirAPI:About to complete 387 jobs
2019-09-18 15:07:03,933:139940742534912:ERROR:SimpleCondorPlugin:No job report for job with id 34 and gridid 2557.4
2019-09-18 15:07:03,935:139940742534912:ERROR:SimpleCondorPlugin:No job report for job with id 35 and gridid 2557.5
2019-09-18 15:07:03,939:139940742534912:ERROR:SimpleCondorPlugin:No job report for job with id 36 and gridid 2557.6
2019-09-18 15:07:03,940:139940742534912:ERROR:SimpleCondorPlugin:No job report for job with id 37 and gridid 2557.7
2019-09-18 15:07:03,941:139940742534912:ERROR:SimpleCondorPlugin:No job report for job with id 38 and 

##### Investigation on the issue above

1. X different workflows are injected into WMAgent
2. At some point, WMAgent will submit N jobs, coming from those X workflows, all within the same condor ClusterId

This is because submit all jobs parameters collected in a cycle at once, without splitting per taskName:

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L162-L164

E.g.:

$ condor_history cmst1 -constraint 'ClusterId == 465' -af:h clusterid procid Arguments clusterid procid Arguments 465 30 wmagent_ReReco_badBlocks_khurtado_condorp883_ReReco_badBlocks_190829_171758_8086-Sandbox.tar.bz2 3800 0 465 4 wmagent_TaskChain_RelVal_Multicore_khurtado_condorp883_TaskChain_RelVal_Multicore_190829_172425_678-Sandbox.tar.bz2 3793 0 465 27 wmagent_ReReco_badBlocks_khurtado_condorp883_ReReco_badBlocks_190829_171758_8086-Sandbox.tar.bz2 3797 0 465 2 wmagent_TaskChain_RelVal_Multicore_khurtado_condorp883_TaskChain_RelVal_Multicore_190829_172425_678-Sandbox.tar.bz2 3789 0 465 8 wmagent_TC_PreMix_khurtado_condorp883_TC_PreMix_190829_171917_9684-Sandbox.tar.bz2 3795 0 465 31 wmagent_ReReco_badBlocks_khurtado_condorp883_ReReco_badBlocks_190829_171758_8086-Sandbox.tar.bz2 3801 0 465 1 wmagent_TaskChain_RelVal_Multicore_khurtado_condorp883_TaskChain_RelVal_Multicore_190829_172425_678-Sandbox.tar.bz2 3788 0 465 3 wmagent_TaskChain_RelVal_Multicore_khurtado_condorp883_TaskChain_RelVal_Multicore_190829_172425_678-Sandbox.tar.bz2 3792 0


3. If late materialization is enabled, with e.g.: max_materialize = 4, that means CONDOR_Q or the Schedd.xquery() method will only see 4 jobs in the queue at the beginning.
 
4. Then, BossAirAPI will track the jobs:
https://github.com/dmwm/WMCore/blob/8eee0ae655752aedf06f905294c1f55ccd505641/src/python/WMCore/BossAir/BossAirAPI.py#L473

Which will call the track method in SimpleCondorPlugin. So now, we can have jobs that are not materialized in the queue, and they will be flagged as completed here:

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L232-L237

5.The job report for each job will be searched, and there will be no PKL file for the jobs not materialized yet, so the complete() method will fail here:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L292

 

##### HTCondor Classads when Late Materialization is enabled

The following classads are created on all of the jobs within the same ClusterId, when late materialization is enabled:

JobMaterializeDigestFile = "/data/srv/glidecondor/condor_local/spool/2557/condor_submit.2557.digest" JobMaterializeItemsFile = "/data/srv/glidecondor/condor_local/spool/2557/condor_submit.2557.items" JobMaterializeLimit = 4 JobMaterializeNextProcId = 4 JobMaterializeNextRow = 4

We could in principle use this information in the track logic

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L230-L238

so that we can recognize if a job has been materialized already or not, but is this enough?
⚠️ **GitHub.com Fallback** ⚠️