SLURM - nthu-ioa/cluster GitHub Wiki
The batch queue system on our cluster is called Slurm. This manages the resources allocated to jobs running on the compute nodes.
- Quick Start
- Basic commands
- Interactive jobs
- What queues are available?
- GPUs
- Logs
- Getting information about jobs
- Environment variables
- Important notes on jobs that use multiple threads
- Important notes on using Python, Conda and Jupyter with Slurm
- Job arrays
- Slurm and GPUS
- Moving from Torque to Slurm
There is lots of useful information about Slurm on the web. If you find helpful information, please add a link here.
- https://helpdesk.strw.leidenuniv.nl/wiki/doku.php?id=slurm_tutorial
- https://arc-ts.umich.edu/beta/slurm-user-guide/
- https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/
- https://www.mpcdf.mpg.de/services/computing/software/languages-1/python
- https://rcc.uchicago.edu/docs/tutorials/kicp-tutorials/running-jobs.html
Remember that our system might not be configured in the same way as those in the pages above. Our current Slurm version is 19.05.4. In particular, if your job uses MPI, you should use `mpirun` rather than `srun` in your job script (i.e. our cluster does not yet have the capability described here).
- Get an interactive shell: `srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=1:00:00 --mem=32G --pty bash` (more examples below)
- Write a script `my_job.sh` to start your program and specify the resources you need (examples below)
- `sbatch my_job.sh` to submit the job
- Check the job is running and see its JOBID with `squeue -u ${USER}`
- STDOUT output is in `slurm-${JOBID}.out`
See also the Slurm manual: https://slurm.schedmd.com/quickstart.html
Inspect the current state of the queues with `squeue`:

```
fomalhaut> squeue
 JOBID PARTITION NAME     USER ST     TIME NODES NODELIST(REASON)
   173       mem  zsh apcooper  R 15:57:57     1 m01
```
JOBID is a unique identifier for the job. NAME is a non-unique label given by the owner of the job. Jobs run on PARTITIONS, which are groups of NODES. Nodes are individual servers with resources including CPUs, memory, and in some cases GPUs.
When you want to run a job, you have to ask Slurm to allocate the resources you need, using the `sbatch` and `srun` commands. The standard way to do this is to write a 'job script', which is just an ordinary shell script with some special lines that Slurm can interpret. See below for instructions on writing a job script.
To submit a job script to Slurm: `sbatch my_job.sh`
See `sbatch --help` or `srun --help` for the options to request different resources. Your job will go into the queue until enough resources are available, then it will run automatically.
To kill a running job or cancel a job waiting in the queue, use its job ID: `scancel 12345`
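`scancel` also accepts a user filter. For example (the job ID is illustrative, and `-u` cancels every job you own, so use it with care):

```shell
scancel 12345        # cancel one job by its ID
scancel -u "${USER}" # cancel all of your own jobs
```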
The important things are: use a single node, get a terminal with `--pty`, and start a shell (bash in these examples).
The simplest command to get an interactive shell is:
srun --pty bash
With no extra options, you will be allocated 2 threads on the CPU queue with access to 8 GB of RAM. Your session will have a lifetime of 10 days. You can (and ideally should) request a shorter runtime and only the resources you need for your job.
Single node, reserve 8 logical cores for 1 hour, start bash:
srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --time=1:00:00 --pty bash
Some options can be shortened, e.g. `-N1` instead of `--nodes=1` (see `srun --help`).
Specify a partition and request a specific amount of RAM:
srun -N1 --ntasks-per-node=1 --time=1:00:00 --mem=32GB -p mem --pty bash
For the `-p` (long version: `--partition`) option, please see "queues" below.
If you are curious about why `srun` is used for interactive jobs and what the difference is between `srun` and `sbatch`, see this StackOverflow post.
We prefer you only use interactive sessions for really interactive work, and we strongly discourage starting long-running jobs in interactive terminals.
The following Slurm partitions are available:
- `cpu`: General-purpose queue, including all the CPU nodes (`c01` to `c13`). Use this for general MPI and OpenMP jobs, and serial jobs. Most users can use this queue without worrying about the difference between node types. Users who need to maximize performance for very intensive jobs should consider specializing their jobs to one of the two different types of CPU nodes. All nodes have roughly 2 GB of RAM per logical core.
- `c6140`, `c6240`: subsets of the `cpu` queue containing only one of the two CPU types. For maximum performance, compile for only one of these types and use the appropriate queue.
- `mem`: Jobs using this queue should try to make use of the larger memory available on the constituent nodes (`m01`, `m02`). Jupyter notebook jobs should use `m01`.
- `gpu`: 3 nodes (`g01` to `g03`) equivalent to the Xeon 6140 CPU nodes, but with 2-4 GPU cards available per node (see the cluster specification). See also below.
Notes: all nodes use hyper-threading, so the minimum number of logical cores allocated will be 2. All nodes pin jobs to cores.
When using the GPU nodes, the `--gres` option (or `--gpus-per-node`) can be used to request use of the GPUs. For example, for an interactive job, to request 2 GPUs of any type:
srun -N1 --ntasks-per-node=1 --time=1:00:00 --mem=32GB -p gpu --gres=gpu:2 --pty bash
To restrict to a specific type of GPU, either restrict to a particular node based on the specification (e.g. `--nodelist=g02`) or add to the `--gres` option as follows:

- `--gres=gpu:rtx_a4000:1` for one RTX A4000 on g03 (change the final 1 if you want more);
- `--gres=gpu:rtx_2080_ti:1` for the GPUs on g01/g02;
- `--gres=gpu:rtx_3080:1` for the GPUs on g04.
`--gpus-per-task=2` is equivalent to `--gres=gpu:2`.
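Putting the options above together, a batch script for a GPU job might look like the following sketch (the executable and the `cuda` module name are placeholders; check `module avail` for the real module names):

```shell
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00:00
#SBATCH --mem=32G
#SBATCH --gres=gpu:rtx_2080_ti:1   # one 2080 Ti, i.e. run on g01/g02
#SBATCH --job-name=my_gpu_job

module purge
module load cuda                   # placeholder module name

srun ./my_gpu_program              # placeholder executable
```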
By default, STDOUT and STDERR are combined into one log stream. To specify a file for this log:

```
#SBATCH -o %x.%a.out
```

The `%x` will be replaced with the job name, and the `%a` will be replaced with the array task ID for job arrays (see below); use `%j` for the job ID. You don't have to use these wildcards (e.g. `-o mylogfile.out` is fine), but they can be helpful (for example, if you don't want to overwrite old logs with new ones).
To write STDERR to a separate file, add:

```
#SBATCH -e %x.%a.err
```
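The `%a` wildcard comes into its own with job arrays (see the job arrays section). A minimal sketch, in which the index range and script name are placeholders:

```shell
#!/bin/bash
#SBATCH --array=0-9            # ten array tasks, indices 0..9
#SBATCH --job-name=my_array_job
#SBATCH -o %x.%a.out           # one log file per array task
#SBATCH -e %x.%a.err

# Each task sees its own index in SLURM_ARRAY_TASK_ID
srun python do_something.py --index "${SLURM_ARRAY_TASK_ID}"
```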
:warning: Write your log files to the `/data` filesystem, not into `/cluster/home`!
`sbatch` sets lots of useful environment variables that allow your jobs to learn about the Slurm environment they're running in: https://slurm.schedmd.com/sbatch.html#lbAK
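As a minimal sketch, a script can read a few of these variables to adapt to its allocation (the fallback defaults are this example's assumption, useful when testing outside Slurm):

```shell
#!/bin/bash
# Read Slurm-provided variables, with fallbacks so the snippet
# also works when run outside a Slurm allocation.
NTASKS=${SLURM_NTASKS:-1}
CPUS=${SLURM_CPUS_PER_TASK:-1}
echo "tasks=${NTASKS} cpus_per_task=${CPUS}"
```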
For a single-node python job:

```shell
#!/bin/bash
#SBATCH --ntasks=1               # 1 task
#SBATCH --time=1:00:00           # 1 hour
#SBATCH --mem=32G                # This much memory
#SBATCH --partition=cpu          # CPU queue
#SBATCH --job-name=my_slurm_job

# Set up modules
module purge
module load python

# Pick a conda environment
source activate my_conda_environment

# Run
srun python do_something.py
```
You are encouraged to write your job scripts in bash if possible (currently you may run into trouble loading modules if you use other shells).
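For jobs that use multiple threads (e.g. OpenMP), a common pattern (a sketch, not cluster-specific advice) is to set the thread count from the allocation rather than hard-coding it:

```shell
#!/bin/bash
# Match the OpenMP thread count to the cores Slurm allocated;
# the fallback of 1 is only for running outside Slurm.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```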
While the job is running, use `squeue` for an overview, or `squeue -u ${USER}` to show only your jobs.
Use `sstat` for details on resource usage (the number on the end should be your job ID):

```
sstat -a -o JobID,MaxRSS,AveRSS,MaxPages,AvePages,AveCPU,MaxDiskRead,MaxDiskWrite JOBID
```
After the job has run, use `sacct`. This will only be available for ~5 minutes after the job has completed.
If you know the job ID, you can get a nice summary of essential information for completed jobs using:

```
seff JOBID
```

This will work even when the job no longer shows up in `sacct`.
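While a job is still visible to `sacct`, you can pull similar fields yourself with a format string (the job ID is illustrative; the field names are standard `sacct` ones):

```shell
sacct -j 12345 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State,ExitCode
```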