Suspend and resume simulations - UCL/TLOmodel GitHub Wiki

Being able to suspend a simulation and resume it later is useful for a variety of reasons. For example, you may have only a limited allocation of running time on a shared compute facility, so you need to suspend the simulation before it is terminated.

TLOmodel allows a simulation to be suspended at a specified simulation date. The program will save the simulation, and all the associated state, to a file and terminate. The saved file can then be used to resume running the simulation at a later time.

Setup

TLOmodel can resume scenarios in two ways: (1) resuming all the runs in a scenario, continuing from the point at which they were suspended, and (2) using the suspended simulations of the runs from one specific draw to resume all runs in all draws.

To understand the second type of resumed run, imagine you are investigating a scenario that has five draws. Each draw is mapped to a particular intervention: (0) no intervention, (1) intervention 1, starting 2020, (2) intervention 2, starting 2020, and so on. In this case, the part of the simulation before the 2020 intervention date is identical across all draws, i.e. for a given run, the period 2010-2020 is the same in every draw.

Here, we can reduce the compute time by doing the following:

1. Run the scenario with one draw (i.e. draw number 0) and the desired number of runs (say, 10 runs per draw).
2. Suspend the simulation at a date before the intervention is applied. So, if the intervention starts on 2020-01-01, we suspend the simulation at 2019-12-01.
3. Once these runs have completed, update the scenario to include all 5 draws, keeping the same number of runs (the number of runs can't be changed between suspending and resuming).
4. Resume the scenario runs, specifying the path to the suspended simulation and the draw from which to resume runs.
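
As a rough illustration of the saving from the steps above, the sketch below counts simulated days with and without suspend/resume. It is only a back-of-the-envelope estimate: it assumes compute cost is proportional to simulated time, ignores the cost of saving and loading the suspended simulation, and borrows the 2010-01-01 to 2031-01-01 period from the long_run_all_diseases example used later on this page. The function name simulated_days is hypothetical and not part of TLOmodel.

```python
from datetime import date


def simulated_days(draws, runs, start, end, suspend=None):
    """Total simulated days summed over all runs of all draws.

    With suspend=None, every run of every draw covers start..end.
    Otherwise, one draw covers start..suspend and all draws cover suspend..end.
    """
    if suspend is None:
        return draws * runs * (end - start).days
    shared = runs * (suspend - start).days          # run once, for draw 0 only
    resumed = draws * runs * (end - suspend).days   # run for every draw
    return shared + resumed


# Assumed example values: 5 draws, 10 runs per draw, suspended at 2019-12-01.
start, suspend, end = date(2010, 1, 1), date(2019, 12, 1), date(2031, 1, 1)
full = simulated_days(5, 10, start, end)
split = simulated_days(5, 10, start, end, suspend)
print(f"full: {full} days, suspend/resume: {split} days, saved: {1 - split / full:.0%}")
```

Under these assumptions, the saving is the pre-intervention period multiplied by the number of runs and by the number of draws that no longer need to simulate it (here, four of the five).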

Examples

Local run

To run a scenario, use the tlo scenario-run command, e.g.:

tlo scenario-run src/scripts/calibration_analyses/scenarios/long_run_all_diseases.py

This particular scenario runs from 2010-01-01 to 2031-01-01. To suspend the simulation run at 2020-06-01, we would provide the --suspend-date argument. Note that we have to supply the argument after the path to the scenario file.

tlo scenario-run src/scripts/calibration_analyses/scenarios/long_run_all_diseases.py --suspend-date 2020-06-01

When the program terminates, there will be two output files: the log file and a file named suspended_simulation.pickle. The last line in the log file will be a message like:

{"uuid": "8159d34ec0", "date": "2020-06-01T00:00:00", "values": ["Suspending simulation at 2020-06-01 00:00:00 and saving to outputs/long_run_all_diseases-2025-08-26T120743Z/0/0/suspended_simulation.pickle.Note, output file handle will be closed first & no more output logged"]}
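
Before resuming, it can be worth checking that every run actually wrote its suspended simulation. The following is a minimal sketch, assuming the output directory is laid out as <output>/<draw>/<run>/suspended_simulation.pickle, as in the path shown in the log message above. The helper name missing_suspended_runs is hypothetical and not part of TLOmodel.

```python
from pathlib import Path


def missing_suspended_runs(output_dir):
    """Return run directories (<output_dir>/<draw>/<run>) with no saved simulation."""
    root = Path(output_dir)
    missing = []
    for draw_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for run_dir in sorted(p for p in draw_dir.iterdir() if p.is_dir()):
            # Each suspended run is expected to contain this file.
            if not (run_dir / "suspended_simulation.pickle").is_file():
                missing.append(run_dir)
    return missing
```

For example, missing_suspended_runs("outputs/long_run_all_diseases-2025-08-26T120743Z") returns an empty list when every run directory contains the file.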

To resume the suspended scenario, we supply the --resume-simulation argument with the path to the output of the suspended scenario. Note, the argument is added after the path to the scenario file.

tlo scenario-run src/scripts/calibration_analyses/scenarios/long_run_all_diseases.py --resume-simulation outputs/long_run_all_diseases-2025-08-26T120743Z

To specify the draw from which to resume, we add it to the end. Instead of --resume-simulation outputs/long_run_all_diseases-2025-08-26T120743Z we use:

--resume-simulation outputs/long_run_all_diseases-2025-08-26T120743Z/0

where the draw number is appended to the end of the path (in this case "0").
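
If you script your runs, the two forms of the argument can be assembled programmatically. This is a small sketch that only builds the command line described above (it does not run it). The function resume_command is a hypothetical helper, not part of the tlo CLI.

```python
import shlex


def resume_command(scenario_file, output_dir, draw=None):
    """Build the `tlo scenario-run ... --resume-simulation ...` argument list.

    If draw is given, it is appended to the output path so that all draws
    resume from the suspended runs of that draw.
    """
    target = str(output_dir).rstrip("/")
    if draw is not None:
        target = f"{target}/{draw}"
    # The option goes after the scenario file path, as noted above.
    return ["tlo", "scenario-run", str(scenario_file), "--resume-simulation", target]


cmd = resume_command(
    "src/scripts/calibration_analyses/scenarios/long_run_all_diseases.py",
    "outputs/long_run_all_diseases-2025-08-26T120743Z",
    draw=0,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) if you want to launch the resumed scenario from Python.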

The simulation will resume and run to completion (the simulation end_date). At the top of the log file of the resumed run, you will see:

{"uuid": "8159d34ec0", "date": "2011-01-01T00:00:00", "values": ["Loading suspended simulation from outputs/long_run_all_diseases-2025-08-26T120743Z/0/0/suspended_simulation.pickle"]}
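
The suspend and resume messages can also be detected automatically, for example to confirm that a run resumed from the expected file. The sketch below assumes each log line is a JSON object with a "values" list, as in the two examples on this page, and matches only the message prefixes shown in them. The function classify_log_line is hypothetical.

```python
import json

SUSPEND_PREFIX = "Suspending simulation at"
RESUME_PREFIX = "Loading suspended simulation from"


def classify_log_line(line):
    """Return 'suspended', 'resumed', or None for a single JSON log line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    values = record.get("values") if isinstance(record, dict) else None
    if not isinstance(values, list):
        return None
    for value in values:
        if isinstance(value, str):
            if value.startswith(SUSPEND_PREFIX):
                return "suspended"
            if value.startswith(RESUME_PREFIX):
                return "resumed"
    return None


# The resume message shown above, copied verbatim.
example = (
    '{"uuid": "8159d34ec0", "date": "2011-01-01T00:00:00", "values": '
    '["Loading suspended simulation from outputs/'
    'long_run_all_diseases-2025-08-26T120743Z/0/0/suspended_simulation.pickle"]}'
)
print(classify_log_line(example))
```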

Azure Batch run

Simulation runs can be submitted to run on Azure Batch (configuration required) using tlo batch-submit.

To submit a scenario and have it suspend at a specified date, run the command with the --suspend-date argument. Note, the argument is added after the scenario file path.

tlo batch-submit src/scripts/calibration_analyses/scenarios/long_run_all_diseases.py --suspend-date 2020-06-01

When this command successfully executes, it will print the job identifier. This ID is needed to resume the suspended run, so make note of it!

Imagine the ID of the suspended run is long_run_all_diseases-2025-08-12T133044Z.

Once the job has completed, the output can be downloaded. Each run will have its log files and the saved suspended simulation, e.g.:

% ls -1 long_run_all_diseases-2025-08-12T133044Z/0/0
long_run_all_diseases__2025-08-12T133252.log.gz
stderr.txt.gz
stdout.txt.gz
suspended_simulation.pickle
tlo.methods.contraception.log.gz
tlo.methods.contraception.pickle
tlo.methods.demography.log.gz
etc.

Warning: Every run in every draw will have its own suspended simulation file. If you have a large simulation and many runs, this may use significant space.
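
To see how much space the suspended simulations occupy in a downloaded job, you can total the sizes of the saved files. This is a minimal sketch that relies only on the file name suspended_simulation.pickle shown in the listing above. The function name is hypothetical.

```python
from pathlib import Path


def suspended_pickle_usage(output_dir):
    """Return (file count, total bytes) of suspended simulations under output_dir."""
    files = list(Path(output_dir).rglob("suspended_simulation.pickle"))
    return len(files), sum(f.stat().st_size for f in files)
```

For example, count, size = suspended_pickle_usage("long_run_all_diseases-2025-08-12T133044Z") gives the number of saved simulations and their combined size in bytes.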

To resume the run, use the --resume-simulation argument and supply the job identifier (not the path to the output):

tlo batch-submit src/scripts/calibration_analyses/scenarios/long_run_all_diseases.py --resume-simulation long_run_all_diseases-2025-08-12T133044Z

If we want to specify the draw from which to resume, we add it to the end. Instead of --resume-simulation long_run_all_diseases-2025-08-12T133044Z we would use:

--resume-simulation long_run_all_diseases-2025-08-12T133044Z/0

where the draw number is appended to the end of the job identifier (in this case "0").

When the job is submitted successfully, it will print the job identifier for the resumed run. This will be different from the identifier of the suspended job. Once the job completes, the results can be downloaded as normal.

TODO

Analysing results requires having/downloading the outputs for the suspended (2010-2019) and resumed (2019-2031) simulation runs
How to join the outputs to get the full output from 2010-2031 for the run.