SLURM - nthu-ioa/cluster GitHub Wiki

The batch queue system on our cluster is called Slurm. This manages the resources allocated to jobs running on the compute nodes.

Basic information (on this page)

Other pages with details about important topics

Other examples / tutorials for Slurm

There is lots of useful information about Slurm on the web. If you find helpful information, please add a link here.

Remember that our system might not be configured in the same way as those in the pages above. Our current Slurm version is 19.05.4. In particular, if your job uses MPI, you should use mpirun rather than srun in your job script (i.e. our cluster does not yet have the capability described here).
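
For example (a sketch only; the executable name is a placeholder), the launch line inside an MPI job script should look like:

mpirun -np ${SLURM_NTASKS} ./my_mpi_program

rather than srun ./my_mpi_program. Slurm sets SLURM_NTASKS to the total number of tasks requested (see "Environment variables" below).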


Quick start

Interactive job

  • srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=1:00:00 --mem=32G --pty bash
  • More examples below

Batch script

  • Write a script my_job.sh to start your program and specify the resources you need: examples below
  • sbatch my_job.sh to submit the job
  • Check job is running and see its JOBID with squeue -u ${USER}
  • STDOUT output is in slurm-${JOBID}.out

Basic commands

See also the Slurm manual: https://slurm.schedmd.com/quickstart.html

Inspect the current state of the queues: squeue

fomalhaut> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               173       mem      zsh apcooper  R   15:57:57      1 m01

JOBID is a unique identifier for the job. NAME is a non-unique label given by the owner of the job. Jobs run on PARTITIONS, which are groups of NODES. Nodes are individual servers with resources including CPUs, memory and, in some cases, GPUs.
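
If you are curious which partitions and nodes exist and what state they are in, the standard sinfo command gives a summary (the output on our cluster will differ from this sketch):

sinfo                 # one line per partition and node state
sinfo -p cpu -N -l    # node-by-node listing for a single partition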

When you want to run a job, you have to ask Slurm to allocate the resources you need, using the sbatch and srun commands. The standard way to do this is to write a 'job script', which is just an ordinary shell script with some special lines that Slurm can interpret. See below for instructions on writing a job script.

To submit a job script to slurm: sbatch my_job.sh

See sbatch --help or srun --help for the options to request different resources. Your job will go into the queue until enough resources are available, then it will run automatically.

To kill a running job or cancel a job waiting in the queue, use scancel with its job ID: scancel 12345
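
scancel also accepts a user filter, which is handy for cancelling all of your own jobs at once:

scancel -u ${USER}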

Interactive jobs

The important things are: use a single node, get a terminal with --pty, start a shell (bash in these examples).

The simplest command to get an interactive shell is:

srun --pty bash

With no extra options, you will be allocated 2 threads on the CPU queue with access to 8 GB of RAM. Your session will have a lifetime of 10 days. You can (and ideally should) request a shorter runtime and only the resources you need for your job.

Single node, reserve 8 logical cores for 1 hour, start bash:

srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --time=1:00:00 --pty bash

Some options can be shortened, e.g. -N1 instead of --nodes=1 (see srun --help).
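
For example, an essentially equivalent short form of the command above is (-n sets --ntasks, which on a single node amounts to the same request):

srun -N1 -n1 -c8 -t 1:00:00 --pty bash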

Specify a partition and request a specific amount of RAM:

srun -N1 --ntasks-per-node=1 --time=1:00:00 --mem=32GB -p mem --pty bash

For the -p (long version: --partition) option, please see "queues" below.

If you are curious about why srun is used for interactive jobs and what the difference is between srun and sbatch, see this StackOverflow post.

We prefer that you use interactive sessions only for genuinely interactive work, and we strongly discourage starting long-running jobs in interactive terminals.

Queues (partitions)

The following Slurm partitions are available:

  • cpu: General-purpose queue, including all the CPU nodes (c01 to c13). Use this for general MPI and OpenMP jobs, and for serial jobs. Most users can use this queue without worrying about the difference between node types. Users who need to maximize performance for very intensive jobs should consider specializing their jobs to one of the two types of CPU node. All nodes have roughly 2 GB of RAM per logical core.
  • c6140, c6240: subsets of the cpu queue containing only one of the two CPU types. For maximum performance, compile for only one of these types and use the appropriate queue (see the sketch below the notes).
  • mem: Jobs using this queue should try to make use of the larger memory available on the constituent nodes (m01, m02). Jupyter notebook jobs should use m01.
  • gpu: 3 nodes (g01 to g03) equivalent to the Xeon 6140 CPU nodes, but with 2-4 GPU cards available per node (see cluster specification). See also below.

Notes: all nodes use hyper-threading, so the minimum number of logical cores allocated will be 2. All nodes pin jobs to cores.
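
As a sketch (using the partition names above; the program name is a placeholder), a job script tuned for one CPU type only needs to name that partition:

#!/bin/bash
#SBATCH --partition=c6140 # run only on the Xeon 6140 nodes (or c6240 for the other type)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=2:00:00

srun ./my_program # a binary compiled specifically for this CPU type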

GPUs

When using the GPU nodes, the --gres (or --gpus-per-node) option can be used to request GPUs. For example, for an interactive job, to request 2 GPUs of any type:

srun -N1 --ntasks-per-node=1 --time=1:00:00 --mem=32GB -p gpu --gres=gpu:2 --pty bash

To restrict to a specific type of GPU, either restrict the job to a particular node based on the specification (e.g. --nodelist=g02) or extend the --gres option as follows:

  • --gres=gpu:rtx_a4000:1 for one RTX A4000 on g03 (change the final 1 if you want more...);
  • --gres=gpu:rtx_2080_ti:1 for the GPUs on g01/g02;
  • --gres=gpu:rtx_3080:1 for the GPUs on g04.

--gpus-per-task=2 is equivalent to --gres=gpu:2 (with one task per node, as in the example above).
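
In a batch script the same kind of request looks like the sketch below (the job name, conda environment, and script name are placeholders; the module lines mirror the CPU example later on this page, so add a CUDA module as well if your code needs one):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=4:00:00
#SBATCH --gres=gpu:rtx_2080_ti:1 # one RTX 2080 Ti on g01/g02 (see the list above)
#SBATCH --job-name=my_gpu_job

module purge
module load python

source activate my_conda_environment

srun python train_model.py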

Logs

By default, STDOUT and STDERR are combined into one log stream. To specify a file for this log:

#SBATCH -o %x.%j.out

The %x will be replaced with the job name, and the %j with the job ID (see also below). You don't have to use these patterns (e.g. -o mylogfile.out is fine), but they can be helpful, for example if you don't want to overwrite old logs with new ones.

To write STDERR to a separate file, add:

#SBATCH -e %x.%j.err

⚠️ IMPORTANT: Please write your logs into the /data filesystem, not into /cluster/home!
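
Putting this together with the warning above (the directory below is only an illustration and must already exist; %u is replaced with your user name):

#SBATCH --job-name=my_slurm_job
#SBATCH -o /data/%u/logs/%x.%j.out
#SBATCH -e /data/%u/logs/%x.%j.err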

Environment variables

sbatch sets lots of useful environment variables that allow your jobs to learn about the Slurm environment they're running in: https://slurm.schedmd.com/sbatch.html#lbAK
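
For example (a sketch; these variables are standard sbatch exports), a job script can report its own allocation and pass it on to OpenMP:

echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) running on ${SLURM_JOB_NODELIST}"
echo "${SLURM_NTASKS:-1} task(s), ${SLURM_CPUS_PER_TASK:-1} CPU(s) per task"

# Common pattern: let OpenMP use exactly the CPUs allocated to this task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}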

Example job scripts

For a single-node Python job:

#!/bin/bash
#SBATCH --ntasks=1 # 1 task
#SBATCH --time=1:00:00 # 1 hour
#SBATCH --mem=32G # This much memory
#SBATCH --partition=cpu # CPU queue
#SBATCH --job-name=my_slurm_job

# Set up modules
module purge
module load python

# Pick a conda environment
source activate my_conda_environment

# Run
srun python do_something.py

You are encouraged to write your job scripts in bash if possible (currently you may run into trouble loading modules if you use other shells).
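
For a multi-node MPI job, a sketch might look like this (the openmpi module and my_mpi_program names are placeholders; note the remark near the top of this page about using mpirun rather than srun on our cluster):

#!/bin/bash
#SBATCH --nodes=2 # 2 nodes
#SBATCH --ntasks-per-node=16 # 16 MPI ranks per node
#SBATCH --time=2:00:00 # 2 hours
#SBATCH --partition=cpu # CPU queue
#SBATCH --job-name=my_mpi_job

# Set up modules
module purge
module load openmpi # assumed module name; check `module avail`

# Use mpirun, not srun, to launch MPI programs on this cluster
mpirun -np ${SLURM_NTASKS} ./my_mpi_program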

Information about jobs

While the job is running, use squeue for an overview, or squeue -u ${USER} to show only your jobs.

Use sstat for details on resource usage (replace JOBID at the end with your job ID):

sstat -a -o JobID,MaxRSS,AveRSS,MaxPages,AvePages,AveCPU,MaxDiskRead,MaxDiskWrite JOBID

After the job has run, use sacct. This will only be available for ~5 minutes after the job has completed.
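
For example (standard sacct fields; adjust the list to taste):

sacct -j JOBID -o JobID,JobName,Partition,Elapsed,MaxRSS,State,ExitCode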

If you know the job ID, you can get a nice summary of essential information for completed jobs using:

seff JOBID

This will work even when the job no longer shows up in sacct.
