Running network models with checkpoints - EpiModel/EpiModeling GitHub Wiki
Checkpointed Models
A checkpointed model is a model that saves its state at regular intervals and is able to restart from the last checkpoint. A checkpoint system prevents losing all computation if the model is interrupted (SIGINT, power loss, time limit exceeded on a computation cluster).
From version 2.3.0, EpiModel::netsim offers a checkpoint system, which is
enabled by setting two control arguments in EpiModel::control.net:
- ".checkpoint.steps": a positive number of steps to be run between each save
- ".checkpoint.dir": the path to a directory to save the checkpoints
# Minimal checkpoint example with the builtin SI model
library(EpiModel)

nw <- network_initialize(n = 50)
est <- netest(
  nw,
  formation = ~edges,
  target.stats = 24,
  coef.diss = dissolution_coefs(~ offset(edges), 10, 0),
  verbose = FALSE
)
param <- param.net(inf.prob = 0.3, act.rate = 0.5)
init <- init.net(i.num = 10)
# the checkpoint arguments are set here
control <- control.net(
  type = "SI", nsims = 3, nsteps = 1000, ncores = 2, verbose = FALSE,
  .checkpoint.dir = "cp_tmp", .checkpoint.steps = 100
)
mod <- netsim(est, param, init, control)
In the above example, we ask EpiModel::netsim to save the state of each
simulation every 100 steps. The checkpoint files (RDS files) are saved in the
"cp_tmp" folder. This folder will be created by EpiModel::netsim if it does
not exist.
To restart from the checkpoints (e.g. if the code was interrupted), one simply
needs to re-run the above code. Because the "cp_tmp" folder now contains
the checkpoint files, EpiModel::netsim will load them and restart from there.
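The save-and-resume behaviour described above can be illustrated with a generic base R sketch. This is not EpiModel's internal implementation: `run_with_checkpoints`, its arguments, and the toy "simulation step" are hypothetical names chosen for illustration, and the `stop_at` argument only exists to fake an interruption.

```r
# Generic checkpoint/restart pattern (illustration only, not EpiModel code).
# `run_with_checkpoints` and all its arguments are hypothetical.
run_with_checkpoints <- function(nsteps, cp_steps, cp_file, stop_at = Inf) {
  # Resume from the last checkpoint if one exists, otherwise start fresh
  if (file.exists(cp_file)) {
    state <- readRDS(cp_file)
  } else {
    state <- list(step = 0, total = 0)
  }

  while (state$step < nsteps) {
    state$step <- state$step + 1
    state$total <- state$total + state$step  # stand-in for one simulation step

    # Save the state every `cp_steps` steps
    if (state$step %% cp_steps == 0) {
      saveRDS(state, cp_file)
    }

    # Fake an interruption (for demonstration only)
    if (state$step == stop_at) {
      return(invisible(NULL))
    }
  }

  # On successful completion, remove the checkpoint and return the result
  unlink(cp_file)
  state$total
}

cp_file <- tempfile(fileext = ".rds")

# The first run is "interrupted" at step 250; the last checkpoint is at step 200
run_with_checkpoints(nsteps = 1000, cp_steps = 100, cp_file = cp_file, stop_at = 250)

# Re-running the same call resumes from step 200 and runs to completion
resumed <- run_with_checkpoints(nsteps = 1000, cp_steps = 100, cp_file = cp_file)
```

Work done after the last checkpoint (steps 201 to 250 here) is lost and recomputed on restart, which is the trade-off behind the choice of ".checkpoint.steps": a smaller value loses less work but writes to disk more often.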
By default, if EpiModel::netsim reaches the end of all simulations, the
checkpoint directory and its contents are removed before returning the "netsim"
object. The ".checkpoint.keep" argument to EpiModel::control.net can be set to
TRUE to prevent this removal. This can be useful to inspect the raw simulation
objects for debugging purposes.
Finally, a ".checkpoint.compress" argument can be set to override the
"compress" argument of saveRDS used to save the checkpointed simulations. The
current default for saveRDS is gzip compression, which is fast and usually
works well on network simulation objects.
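To get a feel for the options, the base R sketch below writes the same object with each value that saveRDS accepts for "compress" and compares the resulting file sizes. The object `obj` is a made-up stand-in, not a real simulation object, and it is an assumption here that ".checkpoint.compress" is passed through to saveRDS unchanged and therefore accepts the same values.

```r
# Compare saveRDS compression options on a stand-in object
# (`obj` is a hypothetical placeholder for a checkpointed simulation)
obj <- list(mat = matrix(rep(0:1, 50000), nrow = 1000), ids = seq_len(1000))

# Values accepted by the `compress` argument of saveRDS
methods <- list(none = FALSE, gzip = "gzip", bzip2 = "bzip2", xz = "xz")

sizes <- vapply(methods, function(m) {
  f <- tempfile(fileext = ".rds")
  on.exit(unlink(f))
  saveRDS(obj, f, compress = m)
  # the object must survive the round trip whatever the compression
  stopifnot(identical(readRDS(f), obj))
  file.size(f)
}, numeric(1))

print(sizes)  # file sizes in bytes, one per method
```

Stronger methods such as "xz" tend to produce smaller files at the cost of slower writes, which matters when checkpoints are saved frequently.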
Checkpointing Models on an HPC with SLURM
When working with SLURM interactively through srun, the checkpointing system
can be used as is.
When running R in a batch job with sbatch, the "sim.R" file must add the
following lines just before the call to EpiModel::netsim:
# definition of `est1`, `param`, `init`, `control`
# ...
#
if (!is.null(control[[".checkpoint.dir"]])) {
  batch_num <- Sys.getenv("SLURM_ARRAY_TASK_ID")
  control[[".checkpoint.dir"]] <- paste0(
    control[[".checkpoint.dir"]], "/batch_", batch_num, ""
  )
}
mod1 <- netsim(est1, param, init, control)
This snippet gets the job array task ID defined by the --array=1-2 argument
to sbatch and uses it to append "/batch_1" and "/batch_2" to
control[[".checkpoint.dir"]] in order to create a unique checkpoint
sub-directory for each job in the array.
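One detail worth noting is that Sys.getenv returns an empty string when SLURM_ARRAY_TASK_ID is not set, so outside of a job array the snippet would point to a sub-directory named "batch_". A slightly more defensive variant could look like the sketch below, where `batch_checkpoint_dir` is a hypothetical helper (not part of EpiModel or EpiModelHPC) and a plain list stands in for the object returned by control.net.

```r
# Hypothetical helper: build a per-job checkpoint directory, leaving the
# path untouched when checkpointing is off or when not in a job array.
batch_checkpoint_dir <- function(dir,
                                 batch_num = Sys.getenv("SLURM_ARRAY_TASK_ID")) {
  # No checkpoint directory set: nothing to do
  if (is.null(dir)) {
    return(NULL)
  }
  # Sys.getenv returns "" when the variable is unset (not in a job array)
  if (!nzchar(batch_num)) {
    return(dir)
  }
  file.path(dir, paste0("batch_", batch_num))
}

# Stand-in for the control object (a plain list here, for illustration)
control <- list(.checkpoint.dir = "cp_tmp", .checkpoint.steps = 100)
control[[".checkpoint.dir"]] <- batch_checkpoint_dir(
  control[[".checkpoint.dir"]],
  batch_num = "2"  # what SLURM would provide for the second array task
)
print(control[[".checkpoint.dir"]])
```

In a real "sim.R" the `batch_num` argument would be left at its default so that the value comes from the environment set by sbatch.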
When working with EpiModelHPC::step_tmpl_netsim_scenarios as shown in the
EpiModelHPC vignette about slurmworkflow, no additional step is required.
Setting ".checkpoint.dir" and ".checkpoint.steps" is enough: the
EpiModelHPC::step_tmpl_netsim_scenarios function takes care of creating unique
subdirectories for each job in the array.