Running Jobs - calab-ntu/gpu-cluster GitHub Wiki
This page covers the following topics: Common Commands, Example Job Scripts, Job Queues, and Monitoring Jobs.
Common Commands
- Submit a job: `qsub YOUR_JOB_SCRIPT`, which will return a unique `JOB_ID` (e.g., `7235`). See qsub documentation for details.
- Query jobs: `qstat [-n]`, `showq`, `node`
- Delete a job: `qdel JOB_ID`
- Remove a job dependency: `qalter -W depend=afterany JOB_ID`
- Interactive queue:
  - Eureka: `qsub -X -I -lnodes=1:ppn=16 -lwalltime=4:0:0`
  - Spock: `qsub -X -I -lnodes=1:ppn=32 -lwalltime=4:0:0`
- Query node status: `pbsnodes NODE_NAME`
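The `JOB_ID` returned by `qsub` can be captured in a script, for example to chain jobs. The following is a minimal sketch, not part of the cluster's tooling: `extract_job_id` and `depend_option` are hypothetical helper names, and the sketch assumes that `qsub` prints the numeric ID optionally followed by a `.server` suffix (check the actual output on your cluster).

```shell
#!/bin/bash

# Hypothetical helper: keep only the part of qsub's output before the first dot,
# so "7235.someserver" and "7235" both yield "7235".
extract_job_id() {
    local qsub_output="$1"
    printf '%s\n' "${qsub_output%%.*}"
}

# Hypothetical helper: build the option that makes a new job wait until
# job $1 has finished successfully.
depend_option() {
    printf '%s\n' "-W depend=afterok:$1"
}

# Possible usage (script names are placeholders):
#   first=$(extract_job_id "$(qsub step1.sh)")
#   qsub $(depend_option "$first") step2.sh
```

The helpers only manipulate strings, so they can be tried without submitting anything.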
Example Job Scripts
Eureka
#!/bin/bash
#PBS -N YOUR_JOB_NAME
#PBS -M YOUR_EMAIL
#PBS -m abe
#PBS -q workq
#PBS -k n
#PBS -l walltime=4:00:00
#PBS -l nodes=1:ppn=16
cd $PBS_O_WORKDIR
mpirun -map-by ppr:16:socket:pe=1 ./YOUR_EXECUTABLE 1>>stdout 2>>stderr
Spock
#!/bin/bash
#PBS -N YOUR_JOB_NAME
#PBS -M YOUR_EMAIL
#PBS -m abe
#PBS -q workq
#PBS -k n
#PBS -l walltime=4:00:00
#PBS -l nodes=1:ppn=32
cd $PBS_O_WORKDIR
mpirun -map-by ppr:32:socket:pe=1 ./YOUR_EXECUTABLE 1>>stdout 2>>stderr
echo "Terminating CUDA MPS server"
mpirun -map-by ppr:1:node:pe=1 kill_nvidia_MPS_local.sh
- `PBS -N` → Your job name
- `PBS -M` → Your email address
- `PBS -m` → Email notification
- `PBS -q` → Job queue (see Job Queues)
- `PBS -k` → Whether or not to retain standard output and/or standard error on the execution host
- `PBS -l walltime` → Total wall time
- `PBS -l nodes=X:ppn=Y` → Number of nodes (`nodes=X`) and processors per node (`ppn=Y`)
- `ppr:n` → Launch `n` MPI processes on each node
- `pe=m` → Bind `m` threads per MPI process
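Since the two example scripts differ only in the `ppn` value, the directive header can be generated rather than copied. This is only a sketch under stated assumptions: `pbs_header` is a hypothetical function name, the queue is fixed to `workq`, and the email directives are left out for brevity.

```shell
#!/bin/bash

# pbs_header CLUSTER JOB_NAME [WALLTIME]
# Print the #PBS directives for a single-node job (hypothetical helper).
# The ppn values are the ones used in the example scripts on this page.
pbs_header() {
    local cluster="$1"
    local name="$2"
    local walltime="${3:-4:00:00}"
    local ppn
    case "$cluster" in
        eureka) ppn=16 ;;
        spock)  ppn=32 ;;
        *) echo "unknown cluster: $cluster" >&2; return 1 ;;
    esac
    cat <<EOF
#!/bin/bash
#PBS -N $name
#PBS -q workq
#PBS -k n
#PBS -l walltime=$walltime
#PBS -l nodes=1:ppn=$ppn
EOF
}
```

For instance, `pbs_header spock my_run > job.sh` writes the header to a file, after which the `cd $PBS_O_WORKDIR` and `mpirun` lines still need to be appended by hand.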
Caution:
- Always use `ppn=16` on eureka and `ppn=32` on spock to request the entire node and avoid job conflicts.
- Ensure `n*m=16` on eureka and `n*m=32` on spock.
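The `n*m` rule above can be checked before submitting. The sketch below is illustrative only: `cores_per_node` and `check_layout` are hypothetical helper names, and the core counts are taken from the `ppn` values on this page.

```shell
#!/bin/bash

# cores_per_node CLUSTER: cores to request per node (hypothetical helper).
cores_per_node() {
    case "$1" in
        eureka) echo 16 ;;
        spock)  echo 32 ;;
        *) echo "unknown cluster: $1" >&2; return 1 ;;
    esac
}

# check_layout CLUSTER N M: succeed only if N MPI processes with M threads
# each fill the node exactly, i.e. N*M equals the node's core count.
check_layout() {
    local cores
    cores=$(cores_per_node "$1") || return 1
    if [ $(( $2 * $3 )) -ne "$cores" ]; then
        echo "n*m = $(( $2 * $3 )), expected $cores on $1" >&2
        return 1
    fi
}
```

For example, `check_layout spock 4 8` succeeds, while `check_layout eureka 4 8` fails with a message on standard error.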
Example job scripts for running GAMER:
Job Queues
Use `#PBS -q queue_name` to specify a queue.
Queue Name | Property |
---|---|
workq | Daily usage (default queue) |
stableq | Simulations requiring high stability |
unstableq | Unstable nodes (use with care) |
allq | All nodes |
rtx3080tiq | A single node (eureka32) with an NVIDIA GeForce RTX 3080 Ti GPU (set GPU_ARCH=AMPERE in GAMER) |
gtx1080tiq | A single node (eureka33) with two NVIDIA GeForce GTX 1080 Ti GPUs (set GPU_ARCH=PASCAL in GAMER) |
maintenanceq | Maintenance only |
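When building GAMER for one of the single-node GPU queues, the matching `GPU_ARCH` can be looked up from the queue name. This is a small sketch that encodes only the two rows of the table that state a `GPU_ARCH`; `gpu_arch_for_queue` is a hypothetical helper name, and the function deliberately fails for every other queue because the table gives no value for them.

```shell
#!/bin/bash

# gpu_arch_for_queue QUEUE: print the GAMER GPU_ARCH listed for a queue.
# Returns non-zero for queues whose GPU_ARCH is not stated in the table.
gpu_arch_for_queue() {
    case "$1" in
        rtx3080tiq) echo AMPERE ;;
        gtx1080tiq) echo PASCAL ;;
        *) return 1 ;;
    esac
}
```

A build script could then use something like `arch=$(gpu_arch_for_queue "$queue") || echo "set GPU_ARCH manually"`.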
Monitoring Jobs
- `node`: Monitor node