Running Jobs

This page covers the following topics:

  • Common Commands
  • Example Job Scripts
  • Job Queues
  • Monitoring Jobs
  • Links

Common Commands

  • Submit a job: qsub YOUR_JOB_SCRIPT, which returns a unique JOB_ID (e.g., 7235). See the qsub documentation for details.
  • Query jobs: qstat [-n], showq, node
  • Delete a job: qdel JOB_ID
  • Remove a job dependency: qalter -W depend=afterany JOB_ID
  • Interactive queue:
    • Eureka: qsub -X -I -lnodes=1:ppn=16 -lwalltime=4:0:0
    • Spock : qsub -X -I -lnodes=1:ppn=32 -lwalltime=4:0:0
  • Query node status: pbsnodes NODE_NAME
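
For reference, a typical submit-monitor-delete cycle looks like the following sketch (the job ID 7235 is just an example):

# Submit the job script; qsub prints the assigned JOB_ID (e.g., 7235)
qsub YOUR_JOB_SCRIPT

# Check the job state and the nodes it occupies
qstat -n 7235
showq

# Delete the job if it is no longer needed
qdel 7235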

Example Job Scripts

Eureka

#!/bin/bash

#PBS -N YOUR_JOB_NAME
#PBS -M YOUR_EMAIL
#PBS -m abe
#PBS -q workq
#PBS -k n
#PBS -l walltime=4:00:00
#PBS -l nodes=1:ppn=16

# The job starts in $HOME by default; change to the directory from which it was submitted
cd $PBS_O_WORKDIR

mpirun -map-by ppr:16:socket:pe=1 ./YOUR_EXECUTABLE 1>>stdout 2>>stderr

Spock

#!/bin/bash

#PBS -N YOUR_JOB_NAME
#PBS -M YOUR_EMAIL
#PBS -m abe
#PBS -q workq
#PBS -k n
#PBS -l walltime=4:00:00
#PBS -l nodes=1:ppn=32

cd $PBS_O_WORKDIR

mpirun -map-by ppr:32:socket:pe=1 ./YOUR_EXECUTABLE 1>>stdout 2>>stderr

echo "Terminating CUDA MPS server"
mpirun -map-by ppr:1:node:pe=1 kill_nvidia_MPS_local.sh

  • #PBS -N → Your job name
  • #PBS -M → Your email address
  • #PBS -m → Email notification events (a = abort, b = begin, e = end)
  • #PBS -q → Job queue (see Job Queues)
  • #PBS -k → Whether to retain standard output and/or standard error on the execution host (n = keep neither)
  • #PBS -l walltime → Total wall time
  • #PBS -l nodes=X:ppn=Y → Number of nodes (nodes=X) and processors per node (ppn=Y)
  • ppr:n → Launch n MPI processes per mapping unit (socket in the scripts above)
  • pe=m → Bind m cores to each MPI process (e.g., for m OpenMP threads)

Caution:

  • Always request the entire node with ppn=16 on Eureka and ppn=32 on Spock to avoid job conflicts.
  • Ensure n*m = 16 on Eureka and n*m = 32 on Spock (see the hybrid sketch below).
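
If your executable is hybrid MPI+OpenMP, any split with n*m equal to the cores per node fills the node. Below is a minimal sketch for Eureka with 4 MPI processes and 4 threads each (4*4 = 16), following the per-socket mapping used in the scripts above; the 4x4 split and the OMP_NUM_THREADS setting are illustrative only:

export OMP_NUM_THREADS=4    # 4 OpenMP threads per MPI process (illustrative)
mpirun -map-by ppr:4:socket:pe=4 ./YOUR_EXECUTABLE 1>>stdout 2>>stderr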

Example job scripts for running GAMER:

Job Queues

Use #PBS -q QUEUE_NAME to specify a queue (see the example after the table).

Queue Name    Property
------------  -----------------------------------------------------------------------------------------
workq         Daily usage (default queue)
stableq       Simulations requiring high stability
unstableq     Unstable nodes (use with care)
allq          All nodes
rtx3080tiq    A single node (eureka32) with an NVIDIA GeForce RTX 3080 Ti GPU (set GPU_ARCH=AMPERE in GAMER)
gtx1080tiq    A single node (eureka33) with two NVIDIA GeForce GTX 1080 Ti GPUs (set GPU_ARCH=PASCAL in GAMER)
maintenanceq  Maintenance only
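
For example, to run on the RTX 3080 Ti node, set the queue inside the job script or pass it to qsub at submission time (the command-line option normally takes precedence over the in-script directive):

#PBS -q rtx3080tiq                     # inside the job script
qsub -q rtx3080tiq YOUR_JOB_SCRIPT     # or on the command line at submission time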

Monitoring Jobs

Links