WMAgent refactoring to track work not jobs - dmwm/WMCore GitHub Wiki

Requirements

We are targeting a new mode of operation for WMAgent, in place during LS2 and ready for Run3. We wish to make WMAgent more flexible in how it assigns work, so that it can leverage opportunistic or limited-duration resources as well as new processing models such as event processors. These requirements are divided into stages of proposed development, delivering benefits along the way while working towards the final goal. A further goal is to reduce or eliminate the need for ACDC and recovery workflows by integrating their functionality into the main workflow.

Foundational requirements

These are statements of how development will proceed and the guarantees on data integrity we will keep at all times.

  1. Development will proceed in stages, where the results from each stage can be used in the production system even if not all requirements are met.

  2. Data structures, and as much code as feasible, will be designed so that sub-lumi work units can be supported should such functionality become possible with CMSSW.

  3. No job will process part of one lumi and part of another. Either whole/multiple lumis or a fraction of a single lumi will be processed.

  4. Jobs declared failed must not, in any way, interact with or pollute subsequent jobs assigned (some of) the same work. Work from jobs declared failed must not be accepted or appear in final datasets. This covers pre-emption among other cases.

Reimplementing current capabilities while tracking by lumi

  1. WMAgent will track the work to be done for each task in a workflow. Work is defined as the events and/or lumis to be processed or generated. Work will be tracked in the system by task (or another workflow-related unit), run, lumi (or lumi range), and event range.

  2. It will be possible to mark work from jobs as successful, failed, or not attempted.

  3. Requested work will be timestamped and timeouts applied, such that work is declared failed if it does not complete in time. Stopping the initial job is not a prerequisite.

  4. Monitoring will be by work units rather than by jobs.

  5. We will keep the ability to do resubmissions of jobs as-is before returning work to the queue for maximum flexibility.
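The tracking scheme above could be sketched as follows. This is a minimal illustration, not the WMCore data model: class names, fields, and the timeout helper are all assumptions made here for clarity.

```python
from dataclasses import dataclass
from enum import Enum
import time

class WorkState(Enum):
    NOT_ATTEMPTED = "not_attempted"
    RUNNING = "running"
    SUCCESSFUL = "successful"
    FAILED = "failed"

@dataclass
class WorkUnit:
    """One trackable unit of work: a (task, run, lumi, event-range) tuple."""
    task: str
    run: int
    lumi: int                # could become a lumi range for sub-lumi support
    first_event: int = 0
    last_event: int = -1     # -1 => whole lumi
    state: WorkState = WorkState.NOT_ATTEMPTED
    started_at: float = 0.0  # timestamp set when work is handed to a job

    def assign(self):
        """Mark the work as handed to a job and record the timestamp."""
        self.state = WorkState.RUNNING
        self.started_at = time.time()

    def check_timeout(self, timeout_seconds):
        """Declare the work failed if it has run past its deadline.

        Note that this does not require stopping the job itself."""
        if self.state is WorkState.RUNNING and time.time() - self.started_at > timeout_seconds:
            self.state = WorkState.FAILED
            return True
        return False
```

A periodic pass over all `RUNNING` units calling `check_timeout` would implement requirement 3 without ever needing to reach the job.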

Dynamically splitting resubmitted work

  1. The work from failed jobs is returned to the list of work needed to be done.

  2. Job splitting for a workflow will be a continuous process until all work is either completed or failed.

  3. The amount of work assigned to a job will be dynamic, adapting to the results of the workflow so far, with an attempt to minimize "tails".

  4. WMAgent will continue to attempt to complete the work in a workflow until the smallest work unit (initially a lumi) has been assigned to a dedicated job at least once. Only then will it be marked "not completable".

  5. Work will be assigned to jobs or pseudo-jobs (hereafter "jobs") in a late-binding process, such that the relationship between jobs and work is not static.
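The splitting loop described above could look roughly like this. Everything here is illustrative, not the WMCore API: `run_job` stands in for submission and returns the set of units that succeeded, `max_attempts` approximates the "dedicated job at least once" criterion, and work units are assumed hashable.

```python
import collections

def drive_workflow(work_units, run_job, target_size=4, max_attempts=3):
    """Late-binding job splitter: work is bound to a job only at submission
    time, failed units return to the pool, and batches shrink as the pool
    drains to limit workflow tails."""
    pending = collections.deque(work_units)
    attempts = collections.Counter()
    not_completable = []
    while pending:
        size = min(target_size, len(pending))
        # Near the end of the workflow, shrink batches to avoid long tails.
        if len(pending) < target_size * 2:
            size = max(1, len(pending) // 2)
        # Late binding: the job is composed only now, from whatever is pending.
        batch = [pending.popleft() for _ in range(size)]
        succeeded = run_job(batch)
        for unit in batch:
            if unit in succeeded:
                continue
            attempts[unit] += 1
            if attempts[unit] >= max_attempts:
                not_completable.append(unit)  # retries exhausted
            else:
                pending.append(unit)          # failed work re-enters the pool
    return not_completable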

Accommodating partially complete jobs

  1. Not all work assigned to a job has to be marked with the same outcome; a single job may report some work units successful and others failed.
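
As a concrete illustration (field names are invented here, not a WMCore report format), a single job report could carry per-unit outcomes:

```python
# One job, mixed per-unit outcomes: keys are (task, run, lumi) tuples.
job_report = {
    "job_id": "job-1234",
    "outcomes": {
        ("TaskA", 316000, 10): "successful",
        ("TaskA", 316000, 11): "successful",
        ("TaskA", 316000, 12): "failed",
    },
}

# Only the failed units go back into the pending pool; the successful
# ones are accounted for without resubmitting the whole job's work.
failed = [unit for unit, state in job_report["outcomes"].items()
          if state == "failed"]
```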

Sub-lumi processing and non-grid jobs

  1. Pseudo-jobs will be supported, where units of work may be delegated to intermediate systems for processing. That is, WMAgent does not submit a job itself; it registers the state of the work, and an API is provided to request and update available work.

  2. Partially reconstructed lumis must be merged together into a single file in the next step or at the end of the workflow. A failure in any part of a lumi means the entire lumi is discarded to maintain full-lumi accounting.
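
The pull-style API in point 1 could be sketched as below. This is an assumption-laden illustration, not the WMCore interface: the class and method names (`WorkPool`, `acquire`, `report`) are invented here, and persistence, authentication, and timeouts are omitted.

```python
class WorkPool:
    """Minimal pull-style work API: an external processor asks for
    available work and reports results; the agent only tracks state."""

    def __init__(self, units):
        self._available = list(units)   # work not yet handed out
        self._leased = {}               # worker_id -> units in flight
        self.done = {}                  # unit -> worker that completed it

    def acquire(self, worker_id, n=1):
        """Hand up to n units of work to an external worker."""
        batch = [self._available.pop()
                 for _ in range(min(n, len(self._available)))]
        self._leased.setdefault(worker_id, []).extend(batch)
        return batch

    def report(self, worker_id, unit, success):
        """Record the outcome; failed work becomes available again."""
        self._leased[worker_id].remove(unit)
        if success:
            self.done[unit] = worker_id
        else:
            self._available.append(unit)
```

An intermediate system (an event processor, say) would loop over `acquire`/`report` without the agent ever submitting a grid job on its behalf.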

Stretch requirements

  1. The ability to have multiple workers attempting to complete the same work, with all of them considered active.

Milestones

  • Requirements agreed to
  • Requirements ordered by when functionality is needed
  • Data structures defined and prototyped to verify scalability
  • Agent with static binding to jobs but using new data structures works
  • ???
  • Profit

Other ideas

  • To minimize tails, we could bump the priority of every retried job by its retry number. We need to sort out the impact of such a scheme on Condor (e.g. by limiting the number of unique job-requirement sets).
  • Data-location lookup would be a continuous process for active workflows, so that the assigner no longer needs to set a site white list (only a black list, if needed). The agent would constantly update the database and submit jobs to wherever a block is available. It should also be mindful of data federation, without hacks in the agent.
  • Still on data location, we could apply the same approach to intermediate output samples (taskChain cases). These intermediate outputs could be transferred to other sites in order to spread the work for subsequent tasks.
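
The priority-bump idea in the first bullet could be as simple as the sketch below; the cap is one way to bound the number of distinct priority values (and hence unique Condor requirement sets). Function and parameter names are invented here.

```python
def job_priority(base_priority, retry_count, bump=1, cap=10):
    """Raise a retried job's priority by its retry number, capped so the
    set of distinct priorities submitted to Condor stays small."""
    return base_priority + min(retry_count * bump, cap)
```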