WMAgent State Machine

This page describes the different states a job can be in within WMAgent.

State transitions

State diagram not displayed
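Since the diagram image is not shown here, the transitions described in the sections below can be summarized roughly as follows. This is an illustrative Python mapping, not the actual WMCore transitions code:

```python
# Rough summary of the job state transitions described on this page
# (illustrative only; not the actual WMCore implementation).
JOB_STATE_TRANSITIONS = {
    "new":           ["created", "killed"],
    "created":       ["executing", "createfailed", "submitfailed", "killed"],
    "executing":     ["complete", "jobfailed", "killed"],
    "complete":      ["success", "jobfailed", "killed"],
    "createfailed":  ["createcooloff", "retrydone", "killed"],
    "submitfailed":  ["submitcooloff", "retrydone", "killed"],
    "jobfailed":     ["jobcooloff", "retrydone", "killed"],
    "createcooloff": ["created", "killed"],
    "submitcooloff": ["created", "killed"],
    "jobcooloff":    ["created", "killed"],
    "retrydone":     ["exhausted", "killed"],
    "killed":        ["cleanout"],
    "success":       ["cleanout"],
    "exhausted":     ["cleanout"],
    "cleanout":      [],
}
```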

Before a job is created

WorkQueueManager pulls the input files down from DBS (when the workflow has input files) or creates fake files, and feeds them into WMBS. Files are grouped into a fileset and associated with a subscription, which is organized by task, location, and splitting algorithm.
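As a rough illustration (simplified classes, not the real WMBS schema or API), the relationship between files, filesets, and subscriptions looks like this:

```python
from dataclasses import dataclass, field
from typing import List

# Simplified picture of how WMBS groups input data; illustrative only.
@dataclass
class File:
    lfn: str                                  # logical file name (real, or fake for MC)
    locations: List[str] = field(default_factory=list)

@dataclass
class Fileset:
    name: str
    files: List[File] = field(default_factory=list)

@dataclass
class Subscription:
    fileset: Fileset                          # the files to be processed
    task: str                                 # workflow task the files belong to
    split_algo: str                           # e.g. "FileBased", "LumiBased"
```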

BossAir is the database that tracks batch system status, i.e. the sub-states of jobs submitted to the batch system (e.g. Condor).

States

new

When the JobCreator creates jobs using the selected splitting algorithm, the jobs are set to the new state.

created

The JobCreator organizes the jobs into job groups and sets their state to created. The RetryManager can also re-create a job that previously failed (it only moves the state back to created; the job splitting is not run again).

executing

The JobSubmitter picks up the created jobs, submits them to the batch system, and changes their state to executing. This state contains multiple sub-states in the batch system, which map to "Pending", "Running", "Complete", and "Error" and are tracked by JobStatusLite.
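For illustration only, a hypothetical mapping from Condor job-status codes to these sub-states could look like the sketch below; the actual BossAir plugin mapping may differ.

```python
# Hypothetical example of folding raw Condor job-status codes into the four
# sub-states that JobStatusLite tracks in BossAir (not the real mapping).
CONDOR_TO_BOSSAIR = {
    1: "Pending",    # Idle
    2: "Running",    # Running
    4: "Complete",   # Completed
    5: "Error",      # Held (treated as an error here, for illustration)
}
```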

complete

The JobTracker collects job status information from the BossAir database (which is updated by JobStatusLite) and sets the state to complete, or to jobfailed for jobs with an "Error" status in BossAir and no job report.
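A minimal sketch of that decision (illustrative, not the actual JobTracker code):

```python
def track_job(bossair_status, has_job_report):
    """Simplified version of the JobTracker decision described above."""
    if bossair_status == "Error" and not has_job_report:
        return "jobfailed"   # batch system reported an error and no job report exists
    return "complete"        # otherwise the job is handed on to the JobAccountant
```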

success

The JobAccountant polls the completed jobs and parses the framework job report to determine each job's outcome, then moves the jobs to success or jobfailed.

jobfailed

The JobAccountant moves a complete job to the jobfailed state if the framework job report indicates a failure. Under the specific condition described in the complete section, the JobTracker also moves jobs to jobfailed.
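Taken together with the success section above, the JobAccountant decision can be sketched as follows, assuming a boolean flag that summarizes whether the framework job report indicates a failure (illustrative only):

```python
def account_job(report_indicates_failure):
    """Simplified JobAccountant decision for a job in the 'complete' state:
    the framework job report determines success vs jobfailed."""
    return "jobfailed" if report_indicates_failure else "success"
```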

createfailed

The JobCreator can fail a job (e.g. if errors are found during the splitting). In this case the jobs are still created, but have the failedOnCreation parameter set to True. A failed framework job report is created for those jobs before the state transition created -> createfailed occurs.
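A rough sketch of that flow, using a plain dictionary and hypothetical helper functions for illustration:

```python
def write_failed_fwjr(job):
    """Placeholder for writing a failed framework job report (illustration only)."""
    print(f"writing failed FWJR for job {job['id']}")

def change_state(job, new_state):
    job["state"] = new_state

# The JobCreator still creates the job, but marks it as failed on creation.
job = {"id": 1234, "state": "created", "failedOnCreation": True}

if job["failedOnCreation"]:
    write_failed_fwjr(job)               # failed FWJR written before the transition
    change_state(job, "createfailed")    # created -> createfailed
```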

submitfailed

The JobSubmitter can fail a job in some cases (e.g. no site available): created -> submitfailed.

jobcooloff, submitcooloff, createcooloff

The ErrorHandler collects all the failed jobs and moves them to the corresponding cooloff state, depending on the retry count settings.

retrydone

The ErrorHandler moves failed jobs to retrydone when the maximum number of retries is reached.
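Taken together with the cooloff section above, the ErrorHandler routing can be sketched roughly as follows. MAX_RETRIES is an assumed value for illustration; the real limit comes from the retry count settings:

```python
MAX_RETRIES = 3   # assumed value; the actual limit comes from the configuration

COOLOFF_STATE = {
    "jobfailed": "jobcooloff",
    "submitfailed": "submitcooloff",
    "createfailed": "createcooloff",
}

def handle_failed_job(failed_state, retry_count):
    """Route a failed job either to its cooloff state or to retrydone (sketch)."""
    if retry_count < MAX_RETRIES:
        return COOLOFF_STATE[failed_state]   # will later be re-created by the RetryManager
    return "retrydone"                       # maximum number of retries reached
```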

exhausted

The ErrorHandler moves retrydone jobs to the exhausted state after all failed files are uploaded to ACDC.

killed

Jobs can be killed from any state except the success, exhausted, and cleanout states. Killing is triggered manually by aborting or force-completing a request; the WorkQueueManager then picks this up and triggers the killing process.
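A minimal sketch of that restriction:

```python
# States from which a job cannot be killed, per the description above.
NON_KILLABLE_STATES = {"success", "exhausted", "cleanout"}

def can_kill(job_state):
    """Return True if a job in the given state may be moved to 'killed'."""
    return job_state not in NON_KILLABLE_STATES
```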

cleanout

The JobArchiver archives the logs and other job information (pickled job and report) for jobs in the success, exhausted, and killed states, then moves the jobs to the cleanout state.

Post procedure (after a job reaches the cleanout state)

The TaskArchiver collects information per task and then per workflow, updates the summary in the workloadsummary database, and propagates the state to the GlobalWorkQueue. It then cleans up all the data in the local databases (Oracle/MariaDB, CouchDB) as well as temporary files on disk.