Running network models with checkpoints

Checkpointed Models

A checkpointed model is a model that saves its state at regular intervals and is able to restart from the last checkpoint. A checkpoint system prevents losing all computation if the model is interrupted (SIGINT, power loss, time limit exceeded on a computation cluster).

Starting with version 2.3.0, EpiModel::netsim offers a checkpoint system, which is enabled by setting two control arguments in EpiModel::control.net:

  • ".checkpoint.steps": a positive number of steps to be run between each save
  • ".checkpoint.dir": the path to a directory to save the checkpoints
# Minimal checkpoint example with the builtin SI model

nw <- network_initialize(n = 50)
est <- netest(
  nw,
  formation = ~edges,
  target.stats = 24,
  coef.diss = dissolution_coefs(~ offset(edges), 10, 0),
  verbose = FALSE
)

param <- param.net(inf.prob = 0.3, act.rate = 0.5)
init <- init.net(i.num = 10)

# the checkpoint arguments are set here
control <- control.net(
  type = "SI", nsims = 3, nsteps = 1000, ncores = 2, verbose = FALSE,
  .checkpoint.dir = "cp_tmp", .checkpoint.steps = 100
)

mod <- netsim(est, param, init, control)

In the above example, we ask EpiModel::netsim to save the state of each simulation every 100 steps. The checkpoint files (RDS files) are saved in the "cp_tmp" folder. This folder will be created by EpiModel::netsim if it does not exist.

To restart from the checkpoints (i.e. if the code was interrupted), one simply needs to re-run the above code. Because the "cp_tmp" folder now contains the checkpoint files, EpiModel::netsim will load them and restart from there.
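The save-and-resume behavior described above can be illustrated with a toy loop in base R. This is only a sketch of the general pattern, not EpiModel's internal code: the function `run_with_checkpoint`, its `interrupt_at` argument, and the single "state.rds" file are hypothetical names chosen for the illustration.

```r
# Toy illustration of checkpoint-and-resume; NOT how EpiModel implements it.
# `run_with_checkpoint`, `interrupt_at` and "state.rds" are hypothetical.
run_with_checkpoint <- function(nsteps, cp_steps, cp_dir, interrupt_at = Inf) {
  cp_file <- file.path(cp_dir, "state.rds")
  if (file.exists(cp_file)) {
    state <- readRDS(cp_file)               # resume from the last saved state
    message("Resuming from step ", state$step)
  } else {
    dir.create(cp_dir, showWarnings = FALSE, recursive = TRUE)
    state <- list(step = 0, total = 0)      # fresh start
  }
  while (state$step < nsteps) {
    if (state$step >= interrupt_at) stop("simulated interruption")
    state$step <- state$step + 1
    state$total <- state$total + state$step # stand-in for one simulation step
    if (state$step %% cp_steps == 0) saveRDS(state, cp_file)
  }
  unlink(cp_dir, recursive = TRUE)          # clean up once the run completes
  state
}

cp_dir <- file.path(tempdir(), "cp_toy")
unlink(cp_dir, recursive = TRUE)            # make sure the demo starts clean

# The first run is interrupted at step 250; the last checkpoint is at step 200.
first <- try(
  run_with_checkpoint(1000, 100, cp_dir, interrupt_at = 250),
  silent = TRUE
)

# Re-running the same call picks up from step 200 instead of step 0.
res <- run_with_checkpoint(1000, 100, cp_dir)
```

As in the EpiModel workflow, the caller does nothing special to resume: the second call is identical to the first, and the presence of the checkpoint file decides whether the run starts fresh or continues.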

By default, when EpiModel::netsim reaches the end of all simulations, the checkpoint directory and its contents are removed before the "netsim" object is returned. The ".checkpoint.keep" argument to EpiModel::control.net can be set to TRUE to prevent this removal. This can be useful to inspect the raw simulation objects for debugging purposes.
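When the checkpoint directory is kept, its files can be loaded back into R for inspection. The helper below is a minimal sketch: `read_checkpoints` is a hypothetical name, and because the file names and layout inside the directory are not documented here, it simply reads every file it finds and assumes they are all RDS files.

```r
# Hypothetical helper: load every file found under a checkpoint directory.
# Assumes the directory contains only RDS files; no file naming is assumed.
read_checkpoints <- function(cp_dir) {
  files <- list.files(cp_dir, full.names = TRUE, recursive = TRUE)
  objs <- lapply(files, readRDS)
  names(objs) <- basename(files)
  objs
}

# Possible usage after a run with `.checkpoint.keep = TRUE`:
# raw_sims <- read_checkpoints("cp_tmp")
# str(raw_sims[[1]], max.level = 1)
```

Limiting `str()` to the first level keeps the output readable, since raw simulation objects can be deeply nested.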

Finally, a ".checkpoint.compress" argument can be set to override the "compress" argument of saveRDS used to save the checkpointed simulations. The current default for saveRDS is "gzip" (gz), which provides a fast compression that usually works well on network simulation objects.
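To get a feel for the trade-off before choosing a value, the compression methods accepted by saveRDS can be compared on a sample object in base R. This is a rough sketch: the object `obj` is made up for the illustration, and real simulation objects may compress quite differently.

```r
# Compare the size on disk of one object under each saveRDS compression method.
# `obj` is a made-up stand-in, not a real simulation object.
obj <- list(x = rep(c(0, 1), 5e4), y = seq_len(1e5))

sizes <- vapply(c("gzip", "bzip2", "xz"), function(method) {
  f <- tempfile(fileext = ".rds")
  on.exit(unlink(f))                  # remove the temporary file afterwards
  saveRDS(obj, f, compress = method)
  file.size(f)                        # size in bytes
}, numeric(1))

sizes
```

The chosen method would then be passed as, for example, `.checkpoint.compress = "xz"` in the call to control.net. Stronger compression generally costs more time at every checkpoint, so it mostly pays off when disk space is the limiting factor.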

Checkpointing Models on an HPC with SLURM

When working interactively on SLURM with srun, the checkpointing system can be used exactly as described above. When running R as a batch job with sbatch, the "sim.R" file must include the following lines just before the call to EpiModel::netsim:

# definition of `est1`, `param`, `init`, `control`
# ...
#

if (!is.null(control[[".checkpoint.dir"]])) {
  batch_num <- Sys.getenv("SLURM_ARRAY_TASK_ID")
  control[[".checkpoint.dir"]] <- paste0(
    control[[".checkpoint.dir"]], "/batch_", batch_num
  )
}

mod1 <- netsim(est1, param, init, control)

This snippet gets the job array number defined by the --array argument to sbatch (e.g. --array=1-2) and uses it to append "/batch_1" or "/batch_2" to control[[".checkpoint.dir"]] in order to create a unique checkpoint sub-directory for each job in the array.
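The same path logic can be wrapped in a small function, which also makes it easy to check outside of SLURM. This is a sketch under assumptions: `batch_checkpoint_dir` is a hypothetical helper, and falling back to the base directory when SLURM_ARRAY_TASK_ID is unset is a design choice made for this example (Sys.getenv returns an empty string for unset variables, which would otherwise produce a directory ending in "batch_").

```r
# Hypothetical helper building a per-job checkpoint directory.
batch_checkpoint_dir <- function(base_dir,
                                 task_id = Sys.getenv("SLURM_ARRAY_TASK_ID")) {
  if (is.null(base_dir)) return(NULL)      # checkpointing is not enabled
  if (!nzchar(task_id)) return(base_dir)   # not running inside a job array
  file.path(base_dir, paste0("batch_", task_id))
}

batch_checkpoint_dir("cp_tmp", task_id = "2")  # "cp_tmp/batch_2"
batch_checkpoint_dir("cp_tmp", task_id = "")   # "cp_tmp"
```

In "sim.R" this could be applied as `control[[".checkpoint.dir"]] <- batch_checkpoint_dir(control[[".checkpoint.dir"]])` just before the call to netsim.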

When working with EpiModelHPC::step_tmpl_netsim_scenarios, as shown in the EpiModelHPC vignette about slurmworkflow, no additional step is required. Setting ".checkpoint.dir" and ".checkpoint.steps" is enough: the EpiModelHPC::step_tmpl_netsim_scenarios function takes care of creating unique subdirectories for each job in the array.