SLURM and Job Submission
Introduction
The SLURM Workload Manager coordinates the scheduling of jobs across the cluster to ensure that resources are allocated efficiently on each node. These jobs can range from simple, short commands to complex, long-running scripts.
Sometimes, you might have a particular command or shell script that you want to run without needing a desktop open (e.g., you want it to run overnight, you want to run multiple jobs at once, or you just need more power). To do this, we can send shell scripts as batch jobs directly to the cluster via SLURM.
Basic use
The command ‘sbatch <your_job.sh>’ submits the script named <your_job.sh> as a batch job to SLURM. The command ‘squeue --user $USER’ displays the current job queue, showing the status of your running and pending jobs. A basic introduction and list of SLURM commands can be found here.
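As a quick illustrative sketch (the script name my_analysis.sh is just a placeholder), submitting and then keeping track of a job might look like this:

```bash
# Submit a shell script as a batch job; SLURM prints the ID it assigns to the job
sbatch my_analysis.sh

# Show the status of your running and pending jobs
squeue --user $USER

# Cancel a job if needed, using the job ID reported by sbatch
scancel <job_id>
```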
Providing certain arguments within your job script can also help SLURM allocate resources to the job. These should be placed at the start of the shell script as ‘#SBATCH --[argument]’. A full list of SBATCH options can be found here, but a few common ones are listed below (a complete example script is sketched after this list):
#SBATCH --account=
- This manages which project the job is submitted to. We typically use kg98.
#SBATCH --job-name=
- Naming your jobs can be helpful if you’re submitting multiple jobs at once that do different things
#SBATCH --time=
- The maximum time your job will run. The job will end normally if your script finishes in less time than specified, but the job will be killed if it goes beyond this time
#SBATCH --ntasks=
#SBATCH --cpus-per-task=
#SBATCH --mem-per-cpu=
- These options control how many CPUs and how much memory are allocated to your job. Requesting more resources can mean your job spends longer in the queue, but not requesting enough can cause your job to fail, e.g., with a segmentation fault or an out-of-memory error.
#SBATCH --mail-user=
#SBATCH --mail-type=
- These options will notify you by email when your job successfully finishes or if your job fails
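Putting these options together, a minimal job script might look like the sketch below. The job name, time limit, resource requests, and email address are illustrative values only, and process_data.sh is a hypothetical command standing in for your own work:

```bash
#!/bin/bash
#SBATCH --account=kg98
#SBATCH --job-name=my_analysis
#SBATCH --time=0-02:00:00            # 2-hour limit; the job is killed if it runs longer
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8G
#SBATCH --mail-user=your.name@example.com
#SBATCH --mail-type=END,FAIL         # email when the job ends or fails

# The actual work goes below the #SBATCH header
./process_data.sh
```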
Advanced techniques
Sometimes you might want to run a particular job for each subject in a dataset. While you can do this with a for loop in bash, the subjects will run serially (one at a time, in sequence), which can be slow for large samples. Instead, you can parallelise these jobs so they run together using job arrays.
This can be done in SBATCH with ‘#SBATCH --array=’, where you specify the range of task indices in your array
- e.g., #SBATCH --array=1-11 if you have 11 subjects to include in your array
Next, point your script to a text file containing all of your subject IDs (one per line)
- subject_list="/path/to/sublist.txt"
Finally, use the array task ID to assign each line of sublist.txt to a job within your array
- #SLURM_ARRAY_TASK_ID=1 (kept commented out; uncommenting it lets you test the script outside SLURM with a fixed task ID)
- subject=$(sed -n "${SLURM_ARRAY_TASK_ID}p" ${subject_list})
This functions the same as running a for loop over your sublist file (e.g., for subject in $(cat /path/to/sublist.txt)); however, submitting each subject as a separate batch job will be much faster in most instances.
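As a rough sketch of how these pieces fit together in a single array job script, assuming sublist.txt contains 11 subject IDs and process_subject.sh is a hypothetical per-subject command:

```bash
#!/bin/bash
#SBATCH --account=kg98
#SBATCH --job-name=subject_array
#SBATCH --time=0-04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8G
#SBATCH --array=1-11                 # one task per subject in sublist.txt

# Text file with one subject ID per line
subject_list="/path/to/sublist.txt"

# Each array task picks out the line of sublist.txt matching its task ID
subject=$(sed -n "${SLURM_ARRAY_TASK_ID}p" ${subject_list})

# Hypothetical per-subject processing command
./process_subject.sh ${subject}
```

Submitting this single script with sbatch launches 11 independent tasks, one per subject.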
Other things to keep in mind
- Only 1000 jobs per user can be submitted at any one time (including currently pending or running jobs and desktops)
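A quick way to see how close you are to this limit is to count your queued jobs, for example:

```bash
# Count your currently pending and running jobs (--noheader drops the header line)
squeue --user $USER --noheader | wc -l
```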