JobManagement - radiasoft/sirepo GitHub Wiki

Job Management

Components and Communication

sirepo.job_api is the entry point into the job system for the GUI. This consists of HTTP requests to the supervisor, which runs as a separate Tornado process.

The supervisor sends Ops (search: class _Op) via drivers, which will start a job_agent process.

Ops are sent asynchronously over a single Websocket connection to each job_agent. An Op has at least one response, and sometimes multiple responses (repeated status updates). This means that Ops are a "state" to be managed.
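As a rough illustration of why an Op is state rather than a single request/response pair, the sketch below models an Op that collects several replies. The class `Op`, its methods, and the terminal state names are hypothetical stand-ins for this example, not sirepo's actual `_Op`.

```python
import asyncio


class Op:
    """Hypothetical stand-in for the supervisor's _Op; not sirepo's class."""

    def __init__(self, op_name):
        self.op_name = op_name
        self._replies = asyncio.Queue()

    def reply_put(self, reply):
        # Called when a message for this Op arrives on the agent's websocket.
        self._replies.put_nowait(reply)

    async def reply_get(self):
        # Wait for the next reply; an Op may produce several over its lifetime.
        return await self._replies.get()


async def collect_until_done(op):
    """Gather reply states until a terminal one arrives (assumed state names)."""
    res = []
    while True:
        r = await op.reply_get()
        res.append(r["state"])
        if r["state"] in ("completed", "canceled", "error"):
            return res


async def _demo():
    op = Op("run")
    # Simulate two status updates followed by the final reply.
    for s in ("running", "running", "completed"):
        op.reply_put({"state": s})
    return await collect_until_done(op)


demo_states = asyncio.run(_demo())
```

The Op has to stay alive in the supervisor until the terminal reply arrives, which is what makes it something to manage.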

Some API calls result in multiple Ops, but most map one-to-one to a single Op.

For legacy reasons in how the GUI works, some APIs have modal responses. Specifically, runSimulation replies with either a status or a result (the latter for sequential executions only).

Ops can be queued or pending (already sent to job_agent). API replies are always synchronous.
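The queued/pending distinction could be tracked roughly as follows. This is a minimal sketch: `ops_pending_send` is the queue name used later on this page, while `DriverSketch`, `ops_pending_done`, and the method names are assumptions made for the example.

```python
import collections


class DriverSketch:
    """Hypothetical driver bookkeeping, not sirepo's driver class."""

    def __init__(self):
        self.ops_pending_send = collections.deque()  # queued: not yet sent
        self.ops_pending_done = {}  # pending: sent, awaiting reply (assumed name)

    def queue(self, op_id, op_name):
        self.ops_pending_send.append((op_id, op_name))

    def send_next(self):
        # Move the oldest queued Op to pending. A real driver would also
        # write it to the agent's websocket at this point.
        op_id, op_name = self.ops_pending_send.popleft()
        self.ops_pending_done[op_id] = op_name
        return op_id

    def reply_received(self, op_id):
        # The final reply arrived, so the Op is no longer pending.
        return self.ops_pending_done.pop(op_id)

    def state_of(self, op_id):
        if op_id in self.ops_pending_done:
            return "pending"
        if any(i == op_id for i, _ in self.ops_pending_send):
            return "queued"
        return None
```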

API/Op State Transitions

Issue #2164 caused us to step back and evaluate the state transitions. The following maps state transitions due to API events when a job has a particular Op pending or queued.

OP_ANALYSIS is a single entry point, which results in a subprocess call to job_cmd with an entry point for each jobCmd: cancel, compute, get_simulation_frame, prepare_simulation, sbatch_status, sequential_result. These will be referred to by name with the prefix job_cmd. to imply OP_ANALYSIS. More confusing is that OP_RUN results in job_cmd.compute and sometimes also a job_cmd.sbatch_status, but for state transitions in the Supervisor, a pending OP_RUN is sufficient.

The tables below have links to actions which are defined further down.

API: downloadDataFile

Op	Action
None	send_or_not_found
OP_CANCEL	queue
OP_RUN	send_or_not_found
job_cmd.get_data_file	queue
job_cmd.get_simulation_frame	queue
job_cmd.sequential_result	queue

API: jobSupervisorPing

Special API: HTTP request to return state of supervisor synchronously, independent of a particular job.

API: runCancel

Op	Action
None	reply_canceled
OP_CANCEL	reply_canceled
OP_RUN	cancel
job_cmd.get_data_file	reply_normally
job_cmd.get_simulation_frame	reply_normally
job_cmd.sequential_result	discard_and_cancel

API: runSimulation

Op	Action
None	run_or_result
OP_CANCEL	queue
OP_RUN	status_or_collision
job_cmd.get_data_file	queue_or_reply
job_cmd.get_simulation_frame	queue_or_reply
job_cmd.sequential_result	queue_or_reply

API: runStatus

Op	Action
None	status_or_result
OP_CANCEL	status
OP_RUN	status
job_cmd.get_data_file	status
job_cmd.get_simulation_frame	status
job_cmd.sequential_result	status

API: sbatchLogin

All other APIs should be queued until the sbatchLogin completes.

If sbatchLogin arrives while there are ops pending or queued, reply with an error, because the agent is alive.

API: simulationFrame

Op	Action
None	send_or_not_found
OP_CANCEL	queue
OP_RUN	send_or_not_found
job_cmd.get_data_file	queue
job_cmd.get_simulation_frame	queue
job_cmd.sequential_result	queue
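The per-API tables above can be collapsed into a single lookup keyed by API and by the Op that is pending or queued. The sketch below only transcribes those tables. The names `OPS`, `TRANSITIONS`, and `action_for` are made up for this example and do not come from the sirepo source.

```python
# Column order shared by every table on this page. None means no Op is
# pending or queued for the job.
OPS = (
    None,
    "OP_CANCEL",
    "OP_RUN",
    "job_cmd.get_data_file",
    "job_cmd.get_simulation_frame",
    "job_cmd.sequential_result",
)


def _row(*actions):
    # Pair each Op with the action listed for it in one table.
    assert len(actions) == len(OPS)
    return dict(zip(OPS, actions))


# downloadDataFile and simulationFrame have identical tables.
_ANALYSIS = _row(
    "send_or_not_found", "queue", "send_or_not_found", "queue", "queue", "queue"
)

TRANSITIONS = {
    "downloadDataFile": _ANALYSIS,
    "runCancel": _row(
        "reply_canceled",
        "reply_canceled",
        "cancel",
        "reply_normally",
        "reply_normally",
        "discard_and_cancel",
    ),
    "runSimulation": _row(
        "run_or_result",
        "queue",
        "status_or_collision",
        "queue_or_reply",
        "queue_or_reply",
        "queue_or_reply",
    ),
    "runStatus": _row(
        "status_or_result", "status", "status", "status", "status", "status"
    ),
    "simulationFrame": _ANALYSIS,
}


def action_for(api, op):
    """Return the action name for an API arriving while op is pending or queued."""
    return TRANSITIONS[api][op]
```

For example, `action_for("runCancel", "OP_RUN")` yields `"cancel"`. jobSupervisorPing and sbatchLogin are left out because they have no table.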

Action: cancel

If the computeJobHash and computeJobSerial are valid or the job has already exited, set the state to canceled. Send an OP_CANCEL to the agent and wait for the reply.

If an OP_RUN is in the queue or pending, discard_and_cancel.

Action: discard_and_cancel

Perform reply_canceled, then discard the normal reply when it comes in.

If the op is queued, then discard the op, and reply_canceled.
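One way to picture discard_and_cancel is a small tracker that remembers which sent ops should have their normal reply dropped. This is a sketch under assumptions: `CancelTracker`, its attributes, and the string `"canceled"` are illustrative names only.

```python
class CancelTracker:
    """Hypothetical bookkeeping for discard_and_cancel."""

    def __init__(self):
        self.queued = []  # op ids not yet sent to the agent
        self.discard = set()  # op ids already sent whose reply must be dropped

    def discard_and_cancel(self, op_id):
        if op_id in self.queued:
            # Never sent: discarding the op itself is enough.
            self.queued.remove(op_id)
        else:
            # Already sent: remember to drop its normal reply later.
            self.discard.add(op_id)
        return "canceled"  # reply_canceled in both cases

    def on_reply(self, op_id, reply):
        if op_id in self.discard:
            self.discard.remove(op_id)
            return None  # normal reply is discarded
        return reply
```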

Action: queue

Put the translated op in the driver's queue (ops_pending_send). For example, runCancel enqueues OP_CANCEL.

Action: queue_or_reply

If the request has a valid hash and computeJobSerial or force, queue the Op.

Otherwise, reply collision.
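The queue_or_reply rule reduces to one condition. The sketch below reads it as "(hash and serial valid) or force", which is an interpretation of the sentence above. How validity is determined is out of scope, so it is passed in as a boolean, and the function name and return strings are illustrative.

```python
def queue_or_reply(ids_valid, force=False):
    """Decide between queueing the Op and replying collision.

    ids_valid: True when the request's computeJobHash and computeJobSerial
        are valid for the job (the check itself is not shown here).
    force: True when the request asks to override the check.
    """
    if ids_valid or force:
        return "queue"
    return "collision"
```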

Action: reply_canceled

Reply canceled (even if the hash/serial is invalid).

Action: reply_normally

If parallel, wait for the Op to complete and return the reply as received (no need to check hash).

A cancel will not discard a queued API bound to this action.

If sequential, discard_and_cancel.

Action: status

If the computeJobSerial and computeJobHash are valid, reply immediately with the status of the ComputeJob.

Otherwise, reply missing.

Action: status_or_collision

If running or pending and force or the computeJobHash/Serial does not match, reply collision. The job is already running from another GUI.

If force or mismatch or not in completed state, run the simulation.

Otherwise, status_or_result.

Action: run_or_result

If force or computeJobHash/Serial does not match, send an OP_RUN and reply with status.

Otherwise, status_or_result.

Action: send

Send the translated op and wait for the reply.

Action: send_or_not_found

If the job is parallel, the computeJobHash/Serial matches, and the status is running, pending, canceled, or completed:

If no other OP_ANALYSIS is in the queue, send.
Otherwise, queue.

If the job is sequential and the status is completed, send, unless there is an OP_ANALYSIS ahead of this API, in which case queue.

Otherwise, reply not found, because the job is still running or does not match.
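The send_or_not_found rules can be sketched as a pure function. It follows the text literally: the hash/serial match is only checked for parallel jobs, because the sequential rule above does not mention it. The function name, parameters, and return strings are assumptions for this example.

```python
# Statuses in which a parallel job can serve an analysis request,
# as listed in the rule above.
_PARALLEL_OK = ("running", "pending", "canceled", "completed")


def send_or_not_found(is_parallel, ids_match, status, analysis_queued):
    """Return "send", "queue", or "not_found".

    analysis_queued: True when another OP_ANALYSIS is ahead of this API.
    """
    if is_parallel:
        eligible = ids_match and status in _PARALLEL_OK
    else:
        # Sequential jobs must have completed before data can be fetched.
        eligible = status == "completed"
    if not eligible:
        return "not_found"
    return "queue" if analysis_queued else "send"
```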

Action: status_or_result

If the computeJobSerial and computeJobHash are valid: for parallel jobs, reply with the status; for sequential jobs, send job_cmd.sequential_result and reply with the value received.

Otherwise, reply missing.
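Since run_or_result falls through to status_or_result, the two actions can be sketched together. The function signatures and return strings below are illustrative, and the sketch assumes that a "matching" hash/serial in run_or_result is the same check as a "valid" one in status_or_result.

```python
def status_or_result(ids_valid, is_parallel):
    """Sketch of status_or_result; returns a description of what happens."""
    if not ids_valid:
        return "reply missing"
    if is_parallel:
        return "reply status"
    # Sequential jobs return the computed value rather than a status.
    return "send job_cmd.sequential_result"


def run_or_result(force, ids_match, is_parallel):
    """Sketch of run_or_result, which applies when no Op is pending or queued."""
    if force or not ids_match:
        return "send OP_RUN, reply status"
    # Assumption: a matching hash/serial counts as valid here.
    return status_or_result(ids_valid=True, is_parallel=is_parallel)
```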

Types of Jobs

Sequential Jobs

Sequential jobs begin running immediately when a user visits the simulation page. The computeModel for a sequential job is a report. On the frontend, they are known as transient.

Parallel Jobs

Parallel jobs run when a user presses a "Start new simulation" button. The computeModel for a parallel job is an animation. On the frontend, they are known as persistent.