# Using the Cluster CPU
Department documentation: https://www.csb.pitt.edu/using-the-cluster/. Here we will focus on the SLURM details.
You will need to be on the Pitt network or connected to the VPN with the CompBio role.
When running on the cluster, remember that this is a SHARED resource. As such, you don't want to hog the queue. This is typically accomplished by using a `%` modifier on job arrays, which sets the maximum number of jobs from that array that can be running on the cluster at once. Typical limits are 100-200 for CPU jobs (there are a lot of cores available, and CPU jobs are typically not intensive). Keep an eye on the queue, and if people are waiting behind your jobs, lower your maximum.
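For example, the throttle can be set directly in the job script; a minimal sketch (the job name, array range, and limit are placeholders):

```bash
#!/bin/bash
#SBATCH -J throttled_array
#SBATCH --partition=dept_cpu
# Run tasks 1-500, but allow at most 100 of them to run at the same time.
#SBATCH --array=1-500%100

echo "Running task ${SLURM_ARRAY_TASK_ID}"
```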
Another snag concerns network load. For example, lots of `rsync` commands copying large files will bog down the server (there is only so much bandwidth, after all). Try not to have more than 20 jobs rsyncing large files at once.
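One way to stay under that limit is the same `%` throttle; a minimal sketch, assuming a hypothetical `files_to_copy.txt` with one source path per line and a destination directory `dest/`:

```bash
#!/bin/bash
#SBATCH -J rsync_array
#SBATCH --partition=dept_cpu
# Submit with e.g. `sbatch --array=1-200%20 rsync_array.slurm` so that
# at most 20 of these copy jobs run at the same time.

# Pick this task's file (lines are 1-indexed).
src=$(sed -n "${SLURM_ARRAY_TASK_ID}p" files_to_copy.txt)
rsync -a "$src" dest/
```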
To get an interactive shell on a compute node:

```bash
srun --pty -p dept_cpu /bin/bash -i
```
Change any flag as needed.
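For instance, a sketch of an interactive session requesting more resources (the CPU, memory, and time values are only illustrative):

```bash
# 4 cores, 8 GB of memory, and a 2 hour limit on the dept_cpu partition
srun --pty -p dept_cpu --cpus-per-task=4 --mem=8G -t 2:00:00 /bin/bash -i
```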
There are many partitions available, which can be viewed with `sinfo`. We will mainly be using the dept resources:
* `dept_cpu` - for non-GPU jobs
* `big_memory` - this node has 1TB of memory; only use it if you actually need that much memory
* `any_cpu` - the job will be scheduled on any available CPU (including nodes in group queues and the GPU queue). Jobs can be preempted (killed if another job needs the resource). Use when you have a ton of little jobs, or larger jobs that save checkpoints (see the sketch after this list).
* `any_gpu` - the job will be scheduled on any available GPU node (including nodes in group queues). Jobs can be preempted. Use when you have a ton of little jobs, or larger jobs that save checkpoints.
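For the preemptable partitions, a rough sketch of a job that asks SLURM to put it back in the queue after preemption; whether preempted jobs are requeued or simply cancelled depends on the cluster's preemption settings, and the script must be able to resume from its own checkpoints (the `--resume` flag on `train.py` is hypothetical):

```bash
#!/bin/bash
#SBATCH -J preemptable_job
#SBATCH --partition=any_cpu
# Ask SLURM to requeue this job rather than drop it if it is preempted.
#SBATCH --requeue

# Assumes train.py writes checkpoints and can pick up where it left off.
python train.py --resume
```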
Array jobs launch many jobs using the same run script, changing only the `SLURM_ARRAY_TASK_ID` variable. They are a good way to launch large numbers of jobs without overburdening the queue system; you might use this to run multiple docking jobs. Array jobs also support an easy pattern for running many commands listed in a text file (one per line):
```bash
#!/bin/bash
#SBATCH -J example_array
#SBATCH -t 14-00:00:00
#SBATCH --partition=dept_cpu
#SBATCH --cpus-per-task=1

# Run the line of your_cmds.txt that matches this task's array index.
cmd=$(sed -n "${SLURM_ARRAY_TASK_ID}p" your_cmds.txt)
eval $cmd
exit
```
`your_cmds.txt` contains one command per line (line numbers are 1-indexed):

```
python train.py 1 --do-stuff
python train.py 2 --do-stuff
...
```
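If the commands follow a simple pattern, the file can be generated with a short loop; a sketch with placeholder arguments:

```bash
# Write one command per line for tasks 1..1000.
for i in $(seq 1 1000); do
    echo "python train.py $i --do-stuff"
done > your_cmds.txt
```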
Submit the script as an array job; the `%24` modifier caps the number of concurrently running tasks at 24:

```bash
sbatch --array 1-1000%24 your_array_job_script.slurm
```
If you have already submitted an array job and need to change the `%` modifier:

```bash
scontrol update ArrayTaskThrottle=<my % limit> JobId=<my array id>
```
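For example, to lower an already submitted array job to 50 concurrent tasks (the job id here is made up):

```bash
scontrol update ArrayTaskThrottle=50 JobId=1234567
```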
To check the queue:

```bash
squeue -u USERNAME
```

The `-u USERNAME` flag shows only jobs from a certain user.
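`squeue` also takes other useful filters; for example, to list only your pending jobs along with the reason they are still waiting (the format string is just one option):

```bash
# %i=job id, %P=partition, %j=name, %T=state, %R=reason/nodelist
squeue -u USERNAME -t PENDING --format="%.12i %.12P %.20j %.8T %R"
```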
Anything the job prints to stdout/stderr is written to a file `slurm-<jobid>.out` in the directory you submitted from. This is the first place to look for errors when jobs end abruptly.
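If you want the output written elsewhere, set the file name with the `--output` option; for array jobs, `%A` expands to the array job id and `%a` to the task id, so each task gets its own file:

```bash
#SBATCH --output=slurm-%A_%a.out
```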
To see the number of available nodes/cores:
```bash
sinfo --format="%.12P %.12F %C"
```