Using the Cluster GPU - dkoes/docs GitHub Wiki
Department Documentation: https://www.csb.pitt.edu/using-the-cluster/
When running on the cluster, remember that it is a SHARED resource, so you don't want to hog the queue. Limiting your footprint is typically accomplished with a % modifier on job arrays, which caps the number of jobs from that array that can run on the cluster at once. A typical limit is 20 for GPU jobs.
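For example, this submits 100 array tasks but lets at most 20 run at a time (your_array_job_script.slurm stands in for your own batch script, as in the array job example below):
sbatch -a 1-100%20 your_array_job_script.slurm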
Another snag concerns network load. For example, many simultaneous rsync commands copying large files will bog down the file server (there is only so much bandwidth, after all). Try not to have more than 20 jobs rsyncing large files at the same time.
srun --gres=gpu:1 --pty -p dept_gpu /bin/bash -i
This will request one GPU from the dept_gpu queue.
You should only use interactive jobs for debugging and testing. For actually running time-consuming jobs, launch a batch job as a bash script:
sbatch --gres=gpu:1 -p dept_gpu myscript.slurm
Additional configuration can be done by setting #SBATCH directives within the Slurm script (see below and the department cluster documentation for examples).
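As a minimal sketch, the command-line options above can equivalently go at the top of myscript.slurm as directives (the full scripts below show more complete examples):
#!/bin/bash
#SBATCH --partition=dept_gpu
#SBATCH --gres=gpu:1
# ... your commands here ...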
dept_gpu
All departmental GPU nodes. Anyone can use.
any_gpu
Job will be scheduled on any available GPU (including nodes in group queues), but has a time limit of 24 hours. Jobs can be preempted (killed and restarted) if the node is needed by a job in the regular queue. Use this if you have a lot of short jobs or jobs with frequent checkpoints.
The following aren't queues, but node features that can be specified as constraints, even using boolean operators (e.g. -C "C6&M12"); see the example after the list.
M12: GPU memory >= 12 GB (excludes 11 GB cards)
gtx1080Ti, Volta, TitanV, etc.: a specific class of video card
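For example, to request a GPU with at least 12 GB of memory on the departmental queue (myscript.slurm again standing in for your own script):
sbatch --gres=gpu:1 -p dept_gpu -C M12 myscript.slurm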
You can request a specific node. Make sure that node is in the queue you are submitting to. If you specify a list of nodes, your job will be allocated all of them (not just one of them), so don't do this.
-w n154
You can also exclude hosts:
-x n079,n072
If your job requires a lot of memory, you should request a memory reservation. This prevents it from being scheduled on a node without enough memory and getting killed by the out-of-memory (OOM) killer. If your job uses more than the requested amount of memory, it will get killed (even if there is enough memory available on the machine), so be conservative in your estimate of how much is needed.
--mem 4G
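If job accounting is enabled (an assumption about the cluster's Slurm configuration), you can check how much memory a finished job actually used to calibrate future requests:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,State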
Array jobs launch many jobs using the same script, changing only the SLURM_ARRAY_TASK_ID variable. They are a good way to launch large numbers of jobs without overburdening the queue system. You might use this to run three replicates of an MD simulation with the same run script. It also supports an easy pattern for running many commands listed in a text file (one per line):
#!/bin/bash
#SBATCH --job-name=I_forgot_to_name_my_job
#SBATCH --partition=dept_gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
echo Running on `hostname`
echo workdir $SLURM_SUBMIT_DIR
echo ld_library_path $LD_LIBRARY_PATH
cd $SLURM_SUBMIT_DIR
#the following sets up your environment for running caffe/gnina
export PATH=/net/pulsar/home/koes/dkoes/local/bin:$PATH
export LD_LIBRARY_PATH=/net/pulsar/home/koes/dkoes/local/lib
export PYTHONPATH=/net/pulsar/home/koes/dkoes/local/python
module load cuda
#if necessary make a scratch (/scr) directory and copy files to it
cmd=`sed -n "${SLURM_ARRAY_TASK_ID}p" your_cmds.txt`
eval $cmd
One command per line. Line numbers are 1-indexed.
python train.py --do-stuff
python train.py --do-stuff
...
sbatch -a 1-24%5 your_array_job_script.slurm
You can also specify individual array tasks:
sbatch -a 12,14 your_array_job_script.slurm
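A convenient pattern (assuming your_cmds.txt from above) is to size the array from the number of lines in the file while keeping the throttle:
sbatch -a 1-$(wc -l < your_cmds.txt)%20 your_array_job_script.slurm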
To change the throttle of an already-submitted array job:
scontrol update ArrayTaskThrottle=<my maximum number of jobs> JobId=<my array id>
The following script assumes a simulation was prepared using prepareamber.py conventions and that the base name of the simulation is identical to the name of the directory it is stored in.
#!/bin/bash
#SBATCH --job-name=I_forgot_to_name_my_jobs
#SBATCH --nodes=1
#SBATCH --partition=dept_gpu
#SBATCH --gres=gpu:1
echo Running on `hostname`
echo workdir $SLURM_SUBMIT_DIR
echo ld_library_path $LD_LIBRARY_PATH
cd $SLURM_SUBMIT_DIR
#scratch drive folder to work in
SCRDIR=/scr/${SLURM_JOB_ID}
module load amber/22
#if the scratch drive doesn't exist (it shouldn't) make it.
if [[ ! -e $SCRDIR ]]; then
mkdir $SCRDIR
fi
chmod +rX $SCRDIR
echo scratch drive ${SCRDIR}
cp $SLURM_SUBMIT_DIR/*.in ${SCRDIR}
cp $SLURM_SUBMIT_DIR/*.prmtop ${SCRDIR}
cp $SLURM_SUBMIT_DIR/*_md2.rst ${SCRDIR}
cp $SLURM_SUBMIT_DIR/*.inpcrd ${SCRDIR}
cd ${SCRDIR}
#setup to copy files back to working dir on exit
trap "mv *md3.nc $SLURM_SUBMIT_DIR" EXIT
#run the MD, default to name of directory CHANGE THIS IF YOUR DIRECTORY ISN'T YOUR PREFIX
prefix=${SLURM_SUBMIT_DIR##*/}
pmemd.cuda -O -i ${prefix}_md3.in -o $SLURM_SUBMIT_DIR/${prefix}_md3.out -p ${prefix}.prmtop -c ${prefix}_md2.rst -r ${prefix}_md3.rst -x ${prefix}_md3.nc -inf $SLURM_SUBMIT_DIR/mdinfo
It is common to want to run multiple simulations of the same system. This can be easily accomplished by using an array job (sbatch -a 1-3) and modifying the last line above to:
pmemd.cuda -O -i ${prefix}_md3.in -o $SLURM_SUBMIT_DIR/${prefix}_${SLURM_ARRAY_TASK_ID}_md3.out -p ${prefix}.prmtop -c ${prefix}_md2.rst -r ${prefix}_${SLURM_ARRAY_TASK_ID}_md3.rst -x ${prefix}_${SLURM_ARRAY_TASK_ID}_md3.nc -inf $SLURM_SUBMIT_DIR/mdinfo.${SLURM_ARRAY_TASK_ID}
Note this only has the desired result if the Amber configuration generates initial velocities randomly (irest=0 with a random seed, ig=-1).
alias q='squeue -o "%10i %11P %12j %8u %2t %10M %9l %3D %3C %R"'
alias qd='squeue -u dkoes -o "%10i %11P %12j %8u %2t %10M %9l %3D %3C %R"'
alias qg='squeue -o "%10i %11P %12j %8u %2t %10M %9l %3D %3C %R" -p dept_gpu,any_gpu'
alias g='~dkoes/git/scripts/slurm_gpus.py'
alias gd='~dkoes/git/scripts/slurm_gpus.py dept_gpu'
These are my aliases to get nicely formatted and filtered queue info. q shows all the jobs in the queue, qd shows only my jobs (substitute your own username), and qg shows only the GPU jobs.
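To make these aliases available in every new shell, you could add them to your ~/.bashrc (a sketch, assuming bash is your login shell):
cat >> ~/.bashrc <<'EOF'
alias q='squeue -o "%10i %11P %12j %8u %2t %10M %9l %3D %3C %R"'
alias qg='squeue -o "%10i %11P %12j %8u %2t %10M %9l %3D %3C %R" -p dept_gpu,any_gpu'
EOF
source ~/.bashrc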
/net/pulsar/home/koes/dkoes/git/scripts/slurm_gpus.py
Shows per-GPU status. Can limit output to a provided queue (e.g. slurm_gpus.py dept_gpu).
Anything a batch job prints to stdout/stderr is written to a file slurm-<jobid>.out in the directory you submitted from. This is the first place to look for errors when jobs end abruptly.
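For example, to follow the output of a running job (substituting the actual job id):
tail -f slurm-<jobid>.out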
You will need to set up your local environment:
module load cuda
export PYTHONPATH=$PYTHONPATH:/net/pulsar/home/koes/dkoes/local/lib/python3.6/site-packages
export LD_LIBRARY_PATH=/net/pulsar/home/koes/dkoes/local/lib:$LD_LIBRARY_PATH
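As a quick sanity check (a sketch, assuming PyTorch is importable from the paths above and you are inside an interactive GPU job), verify that the GPU is visible:
python3 -c "import torch; print(torch.cuda.is_available())"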
Node g019 contains four NVIDIA A100 GPUs that are very fast and have 40 GB of memory each. However, they use a newer CUDA architecture (sm_80) that requires CUDA 11. If you try to run PyTorch on these GPUs with an older version of CUDA loaded, you will get the following error message:
A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
Then get the appropriate command from the PyTorch website to update to the latest stable version for CUDA 11.1. As of this writing, that command is:
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
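To confirm the upgraded install supports the A100 (torch.cuda.get_arch_list() reports the compute capabilities the build was compiled for):
python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_arch_list())"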