# Running R interactively with srun
## Slurm
Slurm is the job scheduling software on the HPC; it matches your requested computational needs to the available resources on the cluster. Slurm is involved both in interactive jobs, where you are running a single instance of R, and in batch jobs, in which you may have hundreds of different model scenarios to run in a queue.
## Partitions
The resources on the HPC are effectively divided into different partitions that are available to different users. In addition to the partitions listed in the RSPH HPC Guide (e.g., the `day-long` and `week-long` partitions), here are the two that you will typically use for your jobs:

- `epimodel`: we currently have 10 32-core nodes with priority access for our EpiModel lab. It is rare that all 10 nodes will be in use, so this should be your starting choice of partition unless its queue is large.
- `preemptable`: these are nodes "owned" by other research groups but available to us when those owners are not currently using them. There are approximately 40 nodes in this category. The caveat is that if a node owner starts a job, you will get booted. For a batch job this is not a huge concern, as the job will be requeued and restarted. Because of this, the `preemptable` partition should be your second choice, for the case when the `epimodel` partition is full.
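If you want a quick view of how many nodes in each partition are idle versus allocated, the standard Slurm `sinfo` command can be used (shown here only as a sketch of the idea; the exact output depends on the cluster):

```
# Summarize node states (idle, allocated, etc.) for both partitions
sinfo -p epimodel,preemptable
```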
You can check whether the `epimodel` partition is full with:

```
squeue -p epimodel
```

This will show all the current jobs on the queue. If the number of jobs is larger than 10, you will see pending jobs that are waiting. In that case, you might submit your job to the `preemptable` queue instead.
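If you just want a count of jobs to compare against the 10 `epimodel` nodes, a small sketch using standard `squeue` options is:

```
# Count jobs (running and pending) on the epimodel partition;
# -h suppresses the header line so only jobs are counted
squeue -p epimodel -h | wc -l
```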
## Starting R on the `epimodel` partition
The core Slurm command to start an interactive compute job is `srun`. It is used like this:
```
srun --cpus-per-task=32 -p epimodel --time=24:00:00 --mem=0 --pty /bin/bash
```
This asks Slurm to:

- start one new job with 32 cores available (`--cpus-per-task=32`; all nodes have up to 32 cores),
- on the `epimodel` partition (`-p epimodel`),
- with a maximum run time (wall time) of 24 hours (`--time=24:00:00`),
- with all available memory on the node (`--mem=0`),
- and to start up an interactive session (`--pty /bin/bash`).
You will use that exact syntax every time, with the exception of changing the partition or wall time. When running on the `preemptable` partition, you might consider lowering the wall time to something that more closely matches your expected use time (that could speed up how quickly a node becomes available to you). There are few cases in which you would want fewer than the full number of cores per job, but you could lower `--cpus-per-task` to 1 if you are not running any parallel code.
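As a sketch of those adjustments (the specific times and core counts here are only examples), an interactive request on the `preemptable` partition might look like:

```
# Preemptable partition, shorter 8-hour wall time, full node
srun --cpus-per-task=32 -p preemptable --time=8:00:00 --mem=0 --pty /bin/bash

# Single core for non-parallel work (default memory allocation)
srun --cpus-per-task=1 -p preemptable --time=8:00:00 --pty /bin/bash
```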
Once the interactive `srun` request has completed, you will see something like this:
```
[sjennes@clogin01 ~]$ srun --cpus-per-task=32 -p preemptable --time=24:00:00 --mem=0 --pty /bin/bash
srun: job 4358478 queued and waiting for resources
srun: job 4358478 has been allocated resources
[sjennes@node26 ~]$
```
Slurm has found you a node and your interactive job is ready to go. Note that the command prompt has changed to your node number.
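A minimal sanity check once you are on the node, assuming the full-node request above, is to confirm what was allocated:

```
# Number of CPUs Slurm allocated to this task
echo $SLURM_CPUS_PER_TASK

# Total and free memory on the node
free -h
```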
## Installing R Packages
Once you have started an interactive session, you can load R and start doing more computational tasks with it. To test this out, I recommend just installing `EpiModel` and all of its dependent packages. Remember that you are working with Spack, so you first need to load Spack before loading R:
```
lspack
spack load r@<version>
```
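If you are unsure which R versions Spack has installed on the cluster, you can list them and then load one by its version string (a sketch using standard Spack commands; the versions available will depend on the cluster):

```
# List the R installations that Spack knows about
spack find r

# Load a specific installed version, for example:
# spack load r@4.2.2

# Start R
R
```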
And then within R:

```
install.packages("EpiModel")
```
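To confirm that the installation worked and that R can see the full node, a quick check in the same R session might look like this (a minimal sketch; nothing here is specific to the cluster):

```
# Load EpiModel and report the number of cores visible to R
library("EpiModel")
parallel::detectCores()
```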
With that, you can run on the HPC the code you have been running locally, but with the full 32 cores and ~190 GB of memory on a node. Enjoy! Note also that a good place to get started with a project here, after you have cloned your project repository, is to set up its environment on the HPC with `renv::init()` if you are using the `renv` R package manager; this installs all the specific packages you need (you may need to manually install `renv` itself first with `install.packages("renv")`).
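As a sketch of that `renv` bootstrap, run from an R session started inside the cloned project directory:

```
# Install renv itself if it is not already available
install.packages("renv")

# Initialize the project environment and install the packages it needs
renv::init()
```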