# Using the Cluster CPU
Department documentation: https://www.csb.pitt.edu/using-the-cluster/. Here we will focus on the SLURM details.
You will need to be on the Pitt network or connected to the VPN with the CompBio role.
When running on the cluster, remember that this is a SHARED resource. As such, you don't want to hog the queue. This is typically accomplished by using a `%` modifier on job arrays, which sets the maximum number of jobs from that array that can be running on the cluster at once. Typical limits are 100-200 for CPU jobs (there are a lot of cores available, and CPU jobs are typically not intensive). Keep an eye on the queue, and if people are waiting behind your jobs, lower your maximum.
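For example, the throttle can be set directly in the job script; a minimal sketch (the job name, array range, and limit are placeholders):

```bash
#!/bin/bash
#SBATCH -J throttled_array
#SBATCH --partition=dept_cpu
# Run tasks 1-500, but allow at most 100 of them to run at the same time.
#SBATCH --array=1-500%100

echo "Running task ${SLURM_ARRAY_TASK_ID}"
```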
Another snag concerns network load. For example, lots of `rsync` commands copying large files will bog down the server (there is only so much bandwidth, after all). Try not to have more than 20 jobs rsyncing large files at once.
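One way to stay under that limit is the same `%` throttle; a minimal sketch, assuming a hypothetical `files_to_copy.txt` with one source path per line and a destination directory `dest/`:

```bash
#!/bin/bash
#SBATCH -J rsync_array
#SBATCH --partition=dept_cpu
# Submit with e.g. `sbatch --array=1-200%20 rsync_array.slurm` so that
# at most 20 of these copy jobs run at the same time.

# Pick this task's file (lines are 1-indexed).
src=$(sed -n "${SLURM_ARRAY_TASK_ID}p" files_to_copy.txt)
rsync -a "$src" dest/
```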
To get an interactive shell on a compute node:

```bash
srun --pty -p dept_cpu /bin/bash -i
```
Change any flag as needed.
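For instance, a sketch of an interactive session requesting more resources (the CPU, memory, and time values are only illustrative):

```bash
# 4 cores, 8 GB of memory, and a 2 hour limit on the dept_cpu partition
srun --pty -p dept_cpu --cpus-per-task=4 --mem=8G -t 2:00:00 /bin/bash -i
```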
There are many partitions available, which can be viewed with `sinfo`. We will mainly be using the dept resources:
* `dept_cpu` - for non-GPU jobs
* `big_memory` - this node has 1TB of memory; only use it if you actually need that much memory
* `any_cpu` - the job will be scheduled on any available CPU (including nodes in group queues and the GPU queue). Jobs can be preempted (killed if another job needs the resource). Use when you have a ton of little jobs, or larger jobs that save checkpoints (see the sketch after this list).
* `any_gpu` - the job will be scheduled on any available GPU node (including nodes in group queues). Jobs can be preempted. Use when you have a ton of little jobs, or larger jobs that save checkpoints.
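For the preemptable partitions, a rough sketch of a job that asks SLURM to put it back in the queue after preemption; whether preempted jobs are requeued or simply cancelled depends on the cluster's preemption settings, and the script must be able to resume from its own checkpoints (the `--resume` flag on `train.py` is hypothetical):

```bash
#!/bin/bash
#SBATCH -J preemptable_job
#SBATCH --partition=any_cpu
# Ask SLURM to requeue this job rather than drop it if it is preempted.
#SBATCH --requeue

# Assumes train.py writes checkpoints and can pick up where it left off.
python train.py --resume
```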
Array jobs launch many jobs using the same run script, changing only the `SLURM_ARRAY_TASK_ID` variable. They are a good way to launch large numbers of jobs without overburdening the queue system; you might use this to run multiple docking jobs. Array jobs also support an easy pattern for running many commands listed in a text file (one per line):
```bash
#!/bin/bash
#SBATCH -J example_array
#SBATCH -t 14-00:00:00
#SBATCH --partition=dept_cpu
#SBATCH --cpus-per-task=1

# Run the line of your_cmds.txt that matches this task's array index.
cmd=$(sed -n "${SLURM_ARRAY_TASK_ID}p" your_cmds.txt)
eval $cmd
exit
```
`your_cmds.txt` contains one command per line (line numbers are 1-indexed):

```
python train.py 1 --do-stuff
python train.py 2 --do-stuff
...
```
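If the commands follow a simple pattern, the file can be generated with a short loop; a sketch with placeholder arguments:

```bash
# Write one command per line for tasks 1..1000.
for i in $(seq 1 1000); do
    echo "python train.py $i --do-stuff"
done > your_cmds.txt
```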
Submit the script as an array job; the `%24` modifier caps the number of concurrently running tasks at 24:

```bash
sbatch --array 1-1000%24 your_array_job_script.slurm
```
If you have already submitted an array job and need to change the `%` modifier:

```bash
scontrol update ArrayTaskThrottle=<my % limit> JobId=<my array id>
```
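For example, to lower an already submitted array job to 50 concurrent tasks (the job id here is made up):

```bash
scontrol update ArrayTaskThrottle=50 JobId=1234567
```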
To check the queue:

```bash
squeue -u USERNAME
```

The `-u USERNAME` flag shows only jobs from a certain user.
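`squeue` also takes other useful filters; for example, to list only your pending jobs along with the reason they are still waiting (the format string is just one option):

```bash
# %i=job id, %P=partition, %j=name, %T=state, %R=reason/nodelist
squeue -u USERNAME -t PENDING --format="%.12i %.12P %.20j %.8T %R"
```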
Anything the job prints to stdout/stderr is written to a file `slurm-<jobid>.out` in the directory you submitted from. This is the first place to look for errors when jobs end abruptly.
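If you want the output written elsewhere, set the file name with the `--output` option; for array jobs, `%A` expands to the array job id and `%a` to the task id, so each task gets its own file:

```bash
#SBATCH --output=slurm-%A_%a.out
```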
To see the number of available nodes/cores:
```bash
sinfo --format="%.12P %.12F %C"
```