JobManagement - radiasoft/sirepo GitHub Wiki
Job Management
Components and Communication
sirepo.job_api is the entry point into the job system for the GUI. It issues HTTP requests to the supervisor, which runs as a separate Tornado process.
The supervisor sends Ops (search: class _Op) via drivers, which will start a job_agent process.
Ops are sent asynchronously over a single Websocket connection to each job_agent. An Op has at least one response, and sometimes multiple responses (repeated status updates). This means that Ops are a "state" to be managed.
Some API calls result in multiple Ops, but most map one-to-one to a single Op.
For legacy reasons in how the GUI works, some APIs have modal responses. Specifically, runSimulation returns either a status response or a result (the latter for sequential executions only).
Ops can be queued or pending (already sent to job_agent). API replies are always synchronous.
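The Op lifecycle described above can be illustrated with a minimal sketch. The `Op` and `OpState` names, the `DONE` state, and the reply dictionaries are hypothetical and are not taken from the supervisor's `_Op` class; only the queued/pending distinction and the "one or more replies" behavior come from the text.

```python
import enum
from dataclasses import dataclass, field


class OpState(enum.Enum):
    QUEUED = "queued"    # waiting in the driver's queue
    PENDING = "pending"  # already sent to the job_agent
    DONE = "done"        # hypothetical: final reply received


@dataclass
class Op:
    """Hypothetical stand-in for the supervisor's _Op class."""

    name: str
    state: OpState = OpState.QUEUED
    replies: list = field(default_factory=list)

    def mark_sent(self):
        # The Op has gone out over the agent's WebSocket connection.
        self.state = OpState.PENDING

    def add_reply(self, reply, final):
        # An Op has at least one reply; status updates may repeat.
        self.replies.append(reply)
        if final:
            self.state = OpState.DONE


op = Op("OP_RUN")
op.mark_sent()
op.add_reply({"state": "running"}, final=False)
op.add_reply({"state": "completed"}, final=True)
```

Because replies can repeat, the supervisor has to keep each Op around until its final reply arrives, which is why an Op is a piece of state rather than a single message.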
API/Op State Transitions
Issue #2164 caused us to step back and evaluate the state transitions. The following maps state transitions due to API events when a job has a particular Op pending or queued.
OP_ANALYSIS is a single entry point, which results in a subprocess call to job_cmd with an entry point for each jobCmd: cancel, compute, get_simulation_frame, prepare_simulation, sbatch_status, sequential_result. These will be referred to by name with the prefix job_cmd. to imply OP_ANALYSIS. More confusingly, OP_RUN results in job_cmd.compute and sometimes also a job_cmd.sbatch_status, but for state transitions in the supervisor, a pending OP_RUN is sufficient.
The tables below refer to actions that are defined further down.
API: downloadDataFile
| Op | Action |
|---|---|
| None | send_or_not_found |
| OP_CANCEL | queue |
| OP_RUN | send_or_not_found |
| job_cmd.get_data_file | queue |
| job_cmd.get_simulation_frame | queue |
| job_cmd.sequential_result | queue |
API: jobSupervisorPing
Special API: HTTP request to return state of supervisor synchronously, independent of a particular job.
API: runCancel
| Op | Action |
|---|---|
| None | reply_canceled |
| OP_CANCEL | reply_canceled |
| OP_RUN | cancel |
| job_cmd.get_data_file | reply_normally |
| job_cmd.get_simulation_frame | reply_normally |
| job_cmd.sequential_result | discard_and_cancel |
API: runSimulation
| Op | Action |
|---|---|
| None | run_or_result |
| OP_CANCEL | queue |
| OP_RUN | status_or_collision |
| job_cmd.get_data_file | queue_or_reply |
| job_cmd.get_simulation_frame | queue_or_reply |
| job_cmd.sequential_result | queue_or_reply |
API: runStatus
| Op | Action |
|---|---|
| None | status_or_result |
| OP_CANCEL | status |
| OP_RUN | status |
| job_cmd.get_data_file | status |
| job_cmd.get_simulation_frame | status |
| job_cmd.sequential_result | status |
API: sbatchLogin
All other APIs should be queued until the sbatchLogin completes.
If sbatchLogin arrives while ops are pending or queued, reply with an error: the agent is alive.
API: simulationFrame
| Op | Action |
|---|---|
| None | send_or_not_found |
| OP_CANCEL | queue |
| OP_RUN | send_or_not_found |
| job_cmd.get_data_file | queue |
| job_cmd.get_simulation_frame | queue |
| job_cmd.sequential_result | queue |
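The five tables above can be collapsed into a single lookup keyed by API name and by the Op that is currently pending or queued. This is only a sketch of one way to encode the tables; `TRANSITIONS`, `_row`, and `action_for` are hypothetical names, and the supervisor may well organize this logic differently.

```python
# Column order shared by every table; None means no Op is pending or queued.
OPS = (
    None,
    "OP_CANCEL",
    "OP_RUN",
    "job_cmd.get_data_file",
    "job_cmd.get_simulation_frame",
    "job_cmd.sequential_result",
)


def _row(*actions):
    """Pair one action with each Op in OPS, in table order."""
    return dict(zip(OPS, actions))


TRANSITIONS = {
    "downloadDataFile": _row(
        "send_or_not_found", "queue", "send_or_not_found",
        "queue", "queue", "queue",
    ),
    "runCancel": _row(
        "reply_canceled", "reply_canceled", "cancel",
        "reply_normally", "reply_normally", "discard_and_cancel",
    ),
    "runSimulation": _row(
        "run_or_result", "queue", "status_or_collision",
        "queue_or_reply", "queue_or_reply", "queue_or_reply",
    ),
    "runStatus": _row(
        "status_or_result", "status", "status",
        "status", "status", "status",
    ),
    "simulationFrame": _row(
        "send_or_not_found", "queue", "send_or_not_found",
        "queue", "queue", "queue",
    ),
}


def action_for(api, op):
    """Return the action name for an API arriving while op is active."""
    return TRANSITIONS[api][op]
```

For example, `action_for("runCancel", "OP_RUN")` returns `"cancel"`. jobSupervisorPing and sbatchLogin are left out because they are described in prose rather than in a table.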
Action: cancel
If the computeJobHash and computeJobSerial are valid, or the job has already exited, set the state to canceled. Send an OP_CANCEL to the agent and wait for the reply.
If an OP_RUN is in the queue or pending, discard_and_cancel.
Action: discard_and_cancel
Perform reply_canceled, and discard the normal reply when it comes in.
If the op is queued, discard the op and then reply_canceled.
Action: queue
Put the translated op in the driver's queue (ops_pending_send). For example, runCancel enqueues OP_CANCEL.
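A minimal sketch of the queue action follows. The `Driver` class, the `API_TO_OP` mapping, and the first-in, first-out ordering are assumptions made for illustration; only the `ops_pending_send` name and the runCancel to OP_CANCEL translation come from the text, and the runSimulation to OP_RUN entry is inferred from the run_or_result action.

```python
from collections import deque

# Hypothetical translation table from API name to Op name.
API_TO_OP = {
    "runCancel": "OP_CANCEL",
    "runSimulation": "OP_RUN",
}


class Driver:
    """Hypothetical driver that holds Ops until they can be sent."""

    def __init__(self):
        self.ops_pending_send = deque()

    def queue(self, api):
        # Translate the API call to its Op and append it to the queue.
        op = API_TO_OP[api]
        self.ops_pending_send.append(op)
        return op

    def next_to_send(self):
        # Assumes FIFO order: the oldest queued Op is sent first.
        if self.ops_pending_send:
            return self.ops_pending_send.popleft()
        return None


driver = Driver()
driver.queue("runCancel")
```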
Action: queue_or_reply
If the request has a valid computeJobHash and computeJobSerial, or force is set, queue the Op.
Otherwise, reply collision.
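The queue_or_reply rule can be sketched as a small function. This reads the condition as "(hash and serial are valid) or force", which is an interpretation of the text. The dictionary layout, the `op` key, and the `{"state": ...}` reply shape are hypothetical.

```python
def queue_or_reply(request, job, queue):
    """Queue the request's Op if it matches the job or is forced.

    Returns None when the Op was queued, else a collision reply.
    All field names other than computeJobHash, computeJobSerial,
    and force are hypothetical.
    """
    valid = (
        request.get("computeJobHash") == job["computeJobHash"]
        and request.get("computeJobSerial") == job["computeJobSerial"]
    )
    if valid or request.get("force", False):
        queue.append(request["op"])
        return None
    return {"state": "collision"}


job = {"computeJobHash": "abc", "computeJobSerial": 7}
pending = []
stale = {"computeJobHash": "old", "computeJobSerial": 7, "op": "OP_RUN"}
reply = queue_or_reply(stale, job, pending)
```

Here `reply` is the collision reply and `pending` stays empty, because the stale request neither matches the job nor sets force.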
Action: reply_canceled
Reply canceled (even if the hash/serial is invalid).
Action: reply_normally
If parallel, wait for the Op to complete and return the reply as received (no need to check hash).
A cancel will not discard a queued API bound to this action.
If sequential, discard_and_cancel.
Action: status
If the computeJobSerial and computeJobHash are valid, reply immediately with the status of the ComputeJob.
Otherwise, reply missing.
Action: status_or_collision
If running or pending and force or the computeJobHash/Serial does not match, reply collision. The job is already running from another GUI.
If force or mismatch or not in completed state, run the simulation.
Otherwise, status_or_result.
Action: run_or_result
If force or computeJobHash/Serial does not match, send an OP_RUN and reply with status.
Otherwise, status_or_result.
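As a rough sketch, run_or_result might look like the function below. The callbacks `send_op` and `status_or_result`, the dictionary layout, and the choice of `"pending"` as the status reported right after sending OP_RUN are all assumptions; the text only says that an OP_RUN is sent and a status is returned.

```python
def run_or_result(request, job, send_op, status_or_result):
    """Hypothetical sketch of the run_or_result action."""
    mismatch = (
        request.get("computeJobHash") != job.get("computeJobHash")
        or request.get("computeJobSerial") != job.get("computeJobSerial")
    )
    if request.get("force", False) or mismatch:
        # Start a new run and report its status (assumed: pending).
        send_op("OP_RUN")
        job["status"] = "pending"
        return {"state": job["status"]}
    # The request matches the existing job: fall through.
    return status_or_result(request, job)


sent = []
job_state = {"computeJobHash": "h1", "computeJobSerial": 1, "status": "completed"}
first = run_or_result(
    {"computeJobHash": "h2", "computeJobSerial": 1},
    job_state,
    sent.append,
    lambda req, j: {"state": "result"},
)
```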
Action: send
Send the translated op and wait for the reply.
Action: send_or_not_found
If the job is parallel, the computeJobHash/Serial matches, and the status is running, pending, canceled, or completed:
If the job is sequential and the status is completed, send unless there is an OP_ANALYSIS ahead of this API; in that case, queue.
Otherwise, reply not found, because the job is still running or does not match.
Action: status_or_result
If the computeJobSerial and computeJobHash are valid,
reply status for parallel jobs, else send job_cmd.sequential_result and
reply with value received.
Otherwise, reply missing.
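The status_or_result rule can be sketched as follows. The `isParallel` and `status` keys, the reply shape, and the `send_sequential_result` callback (standing in for sending job_cmd.sequential_result to the agent) are hypothetical.

```python
def status_or_result(request, job, send_sequential_result):
    """Hypothetical sketch of the status_or_result action."""
    valid = (
        request.get("computeJobHash") == job["computeJobHash"]
        and request.get("computeJobSerial") == job["computeJobSerial"]
    )
    if not valid:
        return {"state": "missing"}
    if job["isParallel"]:
        # Parallel jobs only report status.
        return {"state": job["status"]}
    # Sequential jobs return whatever job_cmd.sequential_result yields.
    return send_sequential_result(job)


parallel_job = {
    "computeJobHash": "abc",
    "computeJobSerial": 3,
    "isParallel": True,
    "status": "running",
}
matching = {"computeJobHash": "abc", "computeJobSerial": 3}
status_reply = status_or_result(matching, parallel_job, lambda j: {"state": "completed"})
```

The status action defined earlier is the same check without the sequential branch: it replies with the status when the hash and serial are valid, and with missing otherwise.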
Types of Jobs
Sequential Jobs
Sequential jobs begin running immediately when a user visits the simulation page. The computeModel for a sequential job is a report. On the frontend, they are known as transient.
Parallel Jobs
Parallel jobs run when a user presses a "Start new simulation" button. The computeModel for a parallel job is an animation. On the frontend, they are known as persistent.