QUDA on Perlmutter

Instructions last verified on June 7, 2022. Since Perlmutter is still a pre-production system, these instructions may change at any time. Please contact us on the QUDA Slack if they do not work.

Environment

Due to the Cray compiler wrappers, some care is needed to set up a build environment and to help QUDA's CMake build (and MILC's Makefile) find MPI properly. The following commands load CUDA 11.5, gcc 11.2.0, and CMake 3.22, and set other useful environment variables:

module purge
module load PrgEnv-gnu
module load cmake
module load cudatoolkit
module load craype-accel-nvidia80
export MPICH_GPU_SUPPORT_ENABLED=1
export CRAY_CPU_TARGET=x86-64

export CC=$(which cc)
export CXX=$(which CC)

export MPI_HOME=$MPICH_DIR
export MPI_CXX_COMPILER=$(which CC)
export MPI_CXX_COMPILER_FLAGS=$(CC --cray-print-opts=all)
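
A quick sanity check that the wrappers and toolchain resolved as expected (version numbers will vary with the installed programming environment):

cc --version       # should report gcc
CC --version       # should report g++
nvcc --version     # should report the CUDA toolkit loaded above
cmake --version
echo $MPICH_DIR    # should point to the Cray MPICH install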

Building

QUDA

With the environment above in place, compiling QUDA is relatively straightforward. A reference QUDA build that automatically downloads and builds QMP and QIO, and includes the pieces needed to use it with MILC, is:

WORKING_DIRECTORY=$(pwd)
git clone --branch develop https://github.com/lattice/quda && mkdir build && cd build
cmake \
        -DCMAKE_BUILD_TYPE=RELEASE \
        -DQUDA_GPU_ARCH=sm_80 \
        -DQUDA_DIRAC_DEFAULT_OFF=ON \
        -DQUDA_DIRAC_STAGGERED=ON \
        -DQUDA_QMP=ON \
        -DQUDA_QIO=ON \
        -DQUDA_DOWNLOAD_USQCD=ON \
        ../quda
make -j install
cd $WORKING_DIRECTORY
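
After the build completes, the QUDA test executables and the downloaded QMP/QIO installs should be available under the build tree. A quick check, assuming the default layout of the QUDA build:

ls build/tests/staggered_invert_test   # QUDA test executables
ls build/usqcd/lib                     # QMP and QIO built via QUDA_DOWNLOAD_USQCD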

MILC with QUDA

The MILC+QUDA helper scripts that ship with MILC currently need to be modified to work on Perlmutter. For simplicity, we give the raw commands below; they will be replaced once the compile_* scripts have been updated.

MILC can be downloaded as

git clone --branch develop https://github.com/milc-qcd/milc_qcd

Compiling MILC with QMP, QIO, and QUDA requires providing the directories of the corresponding installs. These can be set via:

# Automated method to find the path to CUDA
PATH_TO_CUDA=$(which nvcc)
PATH_TO_CUDA=${PATH_TO_CUDA%/bin/nvcc}

# Paths to QUDA, QIO, QMP
PATH_TO_QUDA="${WORKING_DIRECTORY}/build/usqcd"
PATH_TO_QMP=$PATH_TO_QUDA
PATH_TO_QIO=$PATH_TO_QUDA
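
A quick check that the derived paths point where expected (the usqcd directory is created by the QUDA build above):

echo "CUDA: ${PATH_TO_CUDA}"
ls ${PATH_TO_QUDA}/lib       # should contain the QUDA, QMP, and QIO libraries
ls ${PATH_TO_QUDA}/include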

MILC RHMC

MILC RHMC can be compiled from the Makefile as:

> cd ${WORKING_DIRECTORY}/milc_qcd/ks_imp_rhmc
> cp ../Makefile .
> MY_CC=cc \
  MY_CXX=CC \
  CUDA_HOME=${PATH_TO_CUDA} \
  QUDA_HOME=${PATH_TO_QUDA} \
  WANTQUDA=true \
  WANT_FN_CG_GPU=true \
  WANT_FL_GPU=true \
  WANT_GF_GPU=true \
  WANT_FF_GPU=true \
  WANT_MIXED_PRECISION_GPU=2 \
  PRECISION=2 \
  MPP=true \
  OMP=true \
  WANTQIO=true \
  WANTQMP=true \
  QIOPAR=${PATH_TO_QIO} \
  QMPPAR=${PATH_TO_QMP} \
  PATH_TO_NVHPCSDK="" \
  make -j 1 su3_rhmd_hisq

Note: In principle, the cudatoolkit module should set the environment variable CUDA_HOME for you, so you may not need that line in the make command. It is included only for robustness/completeness.
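
If the build succeeds, the su3_rhmd_hisq executable should be left in the ks_imp_rhmc directory:

ls ${WORKING_DIRECTORY}/milc_qcd/ks_imp_rhmc/su3_rhmd_hisq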

MILC Spectrum Measurements

The MILC spectrum measurement executable can be built as:

> cd ${WORKING_DIRECTORY}/milc_qcd/ks_spectrum
> cp ../Makefile .
> MY_CC=cc \
  MY_CXX=CC \
  CUDA_HOME=${PATH_TO_CUDA} \
  QUDA_HOME=${PATH_TO_QUDA} \
  WANTQUDA=true \
  WANT_FN_CG_GPU=true \
  WANT_FL_GPU=true \
  WANT_GF_GPU=true \
  WANT_FF_GPU=true \
  WANT_MIXED_PRECISION_GPU=2 \
  PRECISION=2 \
  MPP=true \
  OMP=true \
  WANTQIO=true \
  WANTQMP=true \
  QIOPAR=${PATH_TO_QIO} \
  QMPPAR=${PATH_TO_QMP} \
  PATH_TO_NVHPCSDK="" \
  CGEOM="-DFIX_NODE_GEOM -DFIX_IONODE_GEOM" \
  KSCGMULTI="-DKS_MULTICG=HYBRID -DMULTISOURCE" \
  make -j 1 ks_spectrum_hisq

Note: In principle, the cudatoolkit module should set the environment variable CUDA_HOME for you, so you may not need that line in the make command. It is included only for robustness/completeness.
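
As with the RHMC build, a successful build should leave the ks_spectrum_hisq executable in the ks_spectrum directory:

ls ${WORKING_DIRECTORY}/milc_qcd/ks_spectrum/ks_spectrum_hisq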

Running

Running with Cray MPI requires setting several environment variables. These are subject to change, but for now a viable set is:

export QUDA_ENABLE_GDR=1
export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_NEMESIS_ASYNC_PROGRESS=1

export OMP_NUM_THREADS=16
export SLURM_CPU_BIND=cores
export CRAY_ACCEL_TARGET=nvidia80

Running in an interactive node

An interactive node on Perlmutter can be acquired via the command below (after filling in your account):

salloc -A m[####]_g -C gpu -t 20 -N 1 --tasks-per-node 4 --gpus 4 --qos interactive

Make sure that your environment matches the environment defined at the top of this page. In addition, the USQCD library path should be added to your LD_LIBRARY_PATH when running MILC; the USQCD libraries live in [path to QUDA build]/usqcd/lib.
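
For example, with the directory layout used above:

export LD_LIBRARY_PATH="${WORKING_DIRECTORY}/build/usqcd/lib:${LD_LIBRARY_PATH}"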

QUDA's test executables can be run from the interactive node via srun. A reference command for staggered_invert_test on a single GPU is:

srun -n 1 ./staggered_invert_test

Likewise, a 4 GPU run with a 1x1x2x2 decomposition is given by:

srun -n 4 ./staggered_invert_test --gridsize 1 1 2 2
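
The same pattern extends to multiple nodes. For example, assuming a two-node interactive allocation with 4 GPUs per node, an 8 GPU run with a 1x1x2x4 decomposition would be:

srun -N 2 -n 8 ./staggered_invert_test --gridsize 1 1 2 4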

Submitting a SLURM script

A reference SLURM script, along with its binding script, for a 2-node run with 4 GPUs per node is given below. The script assumes the environment noted at the top of the page has been properly set.

Note: this script has not been fully pipe-cleaned; the bindings and environment variables may not be ideal yet.

With the scripts, the following bindings are performed:

  • CPU NUMA node 0, NIC 0, GPU 3
  • CPU NUMA node 1, NIC 1, GPU 2
  • CPU NUMA node 2, NIC 2, GPU 1
  • CPU NUMA node 3, NIC 3, GPU 0
The SLURM batch script:

#!/bin/bash
#SBATCH -A m[####]_g      # Update for your account
#SBATCH -C gpu
### #SBATCH -q debug      # Update as appropriate
#SBATCH -t 1:00:00        # One hour runtime
#SBATCH -n 8              # This is the total number of MPI ranks, ideally 4 x the number of nodes
#SBATCH -J job_name
#SBATCH --gpus-per-task=1 # 1 GPU per MPI task
#SBATCH --gpu-bind=none   # This is necessary to let all 4 ranks on a node access all 4 GPUs.
#SBATCH -c 32             # CPUs per MPI task; to be verified

ranks=${SLURM_NTASKS:-8}  # total MPI ranks; SLURM_NTASKS is set by Slurm from -n above

### MPI flags
export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_NEMESIS_ASYNC_PROGRESS=1

### Cray/Slurm flags
export OMP_NUM_THREADS=16
export SLURM_CPU_BIND=cores
export CRAY_ACCEL_TARGET=nvidia80

### QUDA specific flags
export QUDA_RESOURCE_PATH=`pwd`/tunecache # location of QUDA autotune cache file
mkdir -p $QUDA_RESOURCE_PATH
export QUDA_ENABLE_GDR=1

export MPICH_VERSION_DISPLAY=1
export MPICH_OFI_NIC_VERBOSE=2

export MPICH_OFI_NIC_POLICY="USER"
export MPICH_OFI_NIC_MAPPING="0:3;1:2;2:1;3:0"
echo "MPICH_OFI_NIC_POLICY=${MPICH_OFI_NIC_POLICY}"
echo "MPICH_OFI_NIC_MAPPING=${MPICH_OFI_NIC_MAPPING}"

EXE="./staggered_invert_test"
ARGS="--dim 32 32 32 32 --gridsize 1 1 2 4"

APP="$EXE $ARGS"
srun -n $ranks ./bind.sh ${APP}
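
Assuming the batch script is saved as, for example, submit_quda.sh (the name is arbitrary), it is submitted in the usual way:

sbatch submit_quda.sh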

The corresponding binding script:

#!/bin/bash
LOCAL_RANK=${SLURM_LOCALID}

export CUDA_VISIBLE_DEVICES=0,1,2,3

echo "CMD=$@"

if [ $LOCAL_RANK -eq 0 ]; then
  nvidia-smi topo -m
fi

CPU_AFFINITY_MAP=(3 2 1 0)
CPU=${CPU_AFFINITY_MAP[$LOCAL_RANK]}

echo "LOCAL_RANK=${LOCAL_RANK}, RANK=${SLURM_PROCID}, CPU=${CPU}"

numactl --cpunodebind ${CPU} "$@"
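
The binding script needs to be executable and located where srun can find it (it is invoked as ./bind.sh in the job script above):

chmod +x bind.sh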