Multi GPU Support - lattice/quda GitHub Wiki

Contents

Compiling for Multi-GPU

To enable multi-GPU support, you need to set either QUDA_MPI=ON or QUDA_QMP=ON to enable either the MPI or QMP communications back end. QMP is the USQCD QCD communications layer, that provides compatibility with other USQCD software packages. To enable QUDA to use QIO directly, you need to enable QMP. cmake should detect the MPI compiler and libraries by default, but can be set manually if needed (use ccmake to configure then switch to advanced options using 't' to set all the paths by hand). If you are using OpenMPI, we recommend using version 4.0.x.

Compiling QMP and QIO

We recommend using QUDA's automated download and compile feature, documented here.

If a custom compilation is needed/desired (possibly with Cray CC), you can compile QMP and QIO + c-lime manually. We advise using the 2-5-3 tag of QMP and the extended branch of QIO.

  • QMP compilation instructions can be found here.
  • QIO compilation requires c-lime as a dependency. This is most easily obtained using recursive cloning git clone --recursive [email protected]:usqcd-software/qio.git. Navigate to qio/ and execute the command autoreconf -f -i. You can now run configure for your preferences, including the --with-qmp=[...] flag to specify a QMP install directory, then make and finally make install.

Running

Running on multiple GPUs is similar to running any other MPI application. In general, one process will be assigned to each GPU. Make sure that all of QUDA's environmental variables are propagated to all processes since these control some of QUDA's internal control flow. The run/bind scripts given below handle this automatically. Alternatively, one can set the broadcast of environment variables using the job launcher, e.g., with OpenMPI's mpirun using -x QUDA_RESOURCE_PATH=/path/to/somewhere.

Of note, if strictly necessary QUDA can be run under MPS (enabling GPU oversubscription with minimal overheads). This can be done via the QUDA_ENABLE_MPS environment variable. This is only for reference; in general you never need to do this.

Running QUDA's tests

When running QUDA through a host application, typically the host application is responsible for setting the process topology and local problem size. For QUDA's internal tests, these parameters are set using the following command-line parameters

--dim x y z t           # x y z t is the local (per process) problem size
--gridsize X Y Z T      # X Y Z T is the process topology

Multi-GPU emulation

To aid performance modelling and debugging, it is possible to switch on "emulated" communication in a given dimension (using rank-local pack/unpack kernels), even if in actuality that dimension is local to a given GPU. The command line flag --partition N facilitates this feature, where N is a 4-bit number, with bits 0,1,2,3 used to switch on/off communication in dimensions x,y,z,t (respectively). For example:

dslash_test --partition 1     ## enable x dimension communication
dslash_test --partition 6     ## enable y and z dimension communication
dslash_test --partition 15    ## enable full communication

Peer-to-peer communication

QUDA will automatically detect multiple GPUs in the same node and use direct peer-to-peer communication where available. For GPUs to be peer-to-peer capable, they need to be either on the same PCIe root complex (e.g., connected to the same CPU socket or PCIe switch) or be directly connected with an NVLink connection. While peer-to-peer communication will lead to much improved performance versus leaving MPI to handle the inter-GPU communication, it can useful for benchmarking and/or debugging to disable it. This can be done by setting the environment variable QUDA_ENABLE_P2P=0.

GPU Direct RDMA and CUDA-aware MPI

QUDA can be optionally support GPU-aware MPI and GPU Direct RDMA (GDR), i.e., where data is passed directly to MPI without first copying it to the host, or conversely data is received directly into GPU memory. By default this option is disabled since passing a GPU pointer to an MPI library that is unaware of GPUs will lead to undefined behaviour (most likely a segmentation fault). This can be enabled by setting the environment variable QUDA_ENABLE_GDR=1.

When doing so, you should ensure that the MPI library you are using is also GPU enabled and the network drivers support it. For Mellanox Infiniband, this means OFED v2.1 or v3.1 and later (which depends on which IB card is in your system). Details for Mellanox can be found here. As another note, GDR sometimes does not work if the GPUs are in exclusive mode.

On systems that do not support GDR, but are running a CUDA-aware MPI library, e.g., OpenMPI, MVAPICH2, then the MPI library can automatically stage the MPI buffers in CPU memory if provided with a GPU pointer. Typically letting the MPI library take care of this staging is slower than having QUDA do it since it introduces unnecessary synchronization. However, we note that on systems that do not have a NIC available, enabling GDR support and using this in combination with GPU-aware MPI can be beneficial for debugging, if not performance.

It should be noted that enabling GDR will never make the performance worse, since the dslash policy autotuner will automatically test all enabled policies, e.g., basic, GDR-enabled, etc., and pick the best one for each given precision, volume, etc. Details on the dslash policy tuning are given below.

OpenMPI

We recommend taking advantage of OpenMPI provided by your system administrator and working with them if you see sub-par performance. Below we give an example run script of how to use OpenMPI with GDR support and instructions for optimal process placement.

For reference, for ex. for local workstation experiments, instructions for building CUDA-aware OpenMPI can be found here and instructions for running CUDA-aware OpenMPI can be found here.

MVAPICH2

Instructions for running the current GDR-enabled MVAPICH2 can be found here. MVAPICH2-GDR is only available as a binary, but the source code for the regular CUDA-aware MVAPICH2 (with host message staging) is available.

To enable CUDA-awareness in MVAPICH2, if building from source you must set --enable-cuda when running configure. When running, you must set the environment variable MV2_USE_CUDA=1. Specific GDR-related instructions are below.

Cray MPI (legacy, will be updated for Perlmutter)

To GPU-awareness on Cray's MPI you need to set the environment variable MPICH_RDMA_ENABLED_CUDA=1. At present Cray's implementation provides no user control over which messages will be exchanged using RDMA versus using host staging. This means that MPI exchange can go through the CPU memory with no means to force enable RDMA. The end result is that only very small volumes will utilize RDMA on Cray's XC platform, which most likely means only coarse grids with multigrid.

We note that performance on Cray systems may be improved by enabling MPICH_NEMESIS_ASYNC_PROGRESS=1, which enables results in the MPI library spawning threads to ensure the forward progress of asynchronous MPI calls (which QUDA utilizes).

Maximizing GDR performance

On systems with multiple GPUs and multiple NIC, to ensure maximum GPU-NIC throughput, care must be taken to ensure that GPUs communicate with the closest NIC. This can be done by querying the topology of the machine you are running on, and then instrumenting your MPI and / or run script to ensure correct placement.

For example, when running on DGX-1, which is a system with 4x EDR NICs and 8x P100 GPUs. Each pair of GPUs shares a NIC, so we need to ensure that the local NIC to each pair is used for all non-peer-to-peer communication.

First of all, we query the node topology with nvidia-smi topo -m

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_2	mlx5_1	mlx5_3	CPU Affinity
GPU0	 X 	NV1	NV1	NV1	NV1	SOC	SOC	SOC	PIX	SOC	PHB	SOC	0-19
GPU1	NV1	 X 	NV1	NV1	SOC	NV1	SOC	SOC	PIX	SOC	PHB	SOC	0-19
GPU2	NV1	NV1	 X 	NV1	SOC	SOC	NV1	SOC	PHB	SOC	PIX	SOC	0-19
GPU3	NV1	NV1	NV1	 X 	SOC	SOC	SOC	NV1	PHB	SOC	PIX	SOC	0-19
GPU4	NV1	SOC	SOC	SOC	 X 	NV1	NV1	NV1	SOC	PIX	SOC	PHB	20-39
GPU5	SOC	NV1	SOC	SOC	NV1	 X 	NV1	NV1	SOC	PIX	SOC	PHB	20-39
GPU6	SOC	SOC	NV1	SOC	NV1	NV1	 X 	NV1	SOC	PHB	SOC	PIX	20-39
GPU7	SOC	SOC	SOC	NV1	NV1	NV1	NV1	 X 	SOC	PHB	SOC	PIX	20-39
mlx5_0	PIX	PIX	PHB	PHB	SOC	SOC	SOC	SOC	 X 	SOC	PHB	SOC	
mlx5_2	SOC	SOC	SOC	SOC	PIX	PIX	PHB	PHB	SOC	 X 	SOC	PHB	
mlx5_1	PHB	PHB	PIX	PIX	SOC	SOC	SOC	SOC	PHB	SOC	 X 	SOC	
mlx5_3	SOC	SOC	SOC	SOC	PHB	PHB	PIX	PIX	SOC	PHB	SOC	 X 	

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

We see that there are eight GPUs and four NICs as expected. The critical point is that GPU0 and GPU1 are both connected directly to mlx5_0 on the same PCIe switch, with GPU2 and GPU3 on mlx5_1, etc. So when launching our job on multiple nodes we need to ensure that processes mapped to the these GPUs are instructed to use these NICs.

Open MPI <= 3 (without UCX)

Moved to the legacy section here. Best practices is to use the most up-to-date version of OpenMPI with UCX.

OpenMPI 4 with UCX

While the general discussion given for OpenMPI 3 still applies the scriptsneed to be slightly modified to use the correct environment variables for UCX when using OpenMPI 4 with UCX. The run.sh script for UCX looks like:

#!/bin/bash

# QUDA specific-environment variables

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
export QUDA_ENABLE_TUNING=1
# enable GDR support
export QUDA_ENABLE_GDR=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5 6 7)

# This is the list of NICs we should use for each GPU
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3
NICS=(mlx5_0:1 mlx5_0:1 mlx5_1:1 mlx5_1:1 mlx5_2:1 mlx5_2:1 mlx5_3:1 mlx5_3:1)

# This is the list of CPU cores we should use for each GPU
# e.g., 2x20 core CPUs split into 4 threads per process with correct NUMA assignment
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)

# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=4

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3 4 5 6 7)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER to for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export UCX_NET_DEVICES=${NIC_REORDER[lrank]}

numactl --physcpubind=${CPU_REORDER[$lrank]} \
        $APP

In the example above, the REORDER variable tells us the order we want the GPUs to map to the local MPI process. Here we have only used the default ordering, e.g., REORDER=(0 1 2 3 4 5 6 7), which would produce an optimal mapping for a local 1x2x2x2 process topology (e.g., given the NVLink topology of DGX-1, GPU 0 can communicate with GPUs 1, 2 and 4 which are the only GPUs needed for this 3-d topology). However, if we were running with 1x1x2x4 local process topology (given that the default MPI process topology is ((pt*Nz + pz)Ny + py)Nx + px then process 0 would need to be able to communicate with processes 1 (Z +/-), 2 (T+) and 6 (T-), but GPU 0 only has connections to GPUs 1, 2, 3, and 4.** So in this case, we would want to use REORDER=(0 1 2 3 6 7 4 5) which would map GPU 4 to process 6 providing the optimal peer-to-peer connectivity matrix.

** This is the default for QUDA and MILC. BQCD on the other hand uses the inverse of this mapping ((px*Ny + py)Nz + pz)Nt + pt. In this case, BQCD mapping would actually provide the optimal peer-to-peer mapping with the default GPU order.

Given the above binding script, the corresponding MPI launch command is then (note: update with latest version of UCX):

export UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc                           # select transports
export UCX_MEMTYPE_CACHE=n                                                 # see https://github.com/openucx/ucx/wiki/NVIDIA-GPU-Support
export UCX_RNDV_SCHEME=get_zcopy                                           # improves GPUDirectRDMA performance
export UCX_RNDV_THRESH=131304                                              # your mileage may vary

mpirun
 -np 48                                                                     # total number of processes
 -npernode 6                                                                # number of processes per node
 --bind-to none                                                             # lets the user overrule binding using numactl
 -hostfile ./hostfile                                                       # list of hosts we want to run on
 -x EXE="./dslash_test"                                                     # executable
 -x ARGS="--gridsize 2 2 2 6 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
 ./run.sh

MVAPICH2

For completeness, we give the equivalent scripts from above when using MVAPICH2-GDR with QUDA. Testing with the latest version of MVAPICH2-GDR (2.3alpha installation instructions are here) has shown issues with intra-node communication where either the sender or receiver are GPU pointers, leading to MVAPICH2-GDR seg faulting. This isn't a performance issue since QUDA's low-level handling of peer-to-peer communication within the node will almost certainly be superior to an MPI's implementation, however, since the Dslash policy tuner by default will test policies that include just handing a GPU pointer to MPI, we need to explicitly disable these policies from being tested. We can do this by setting the environment variable QUDA_ENABLE_P2P=7, which will enable both copy-engine and direct store dslash communication policies but will disable the non-explicit P2P policies.

The equivalent launch and run scripts for DGX-1 are shown below. The main difference between OpenMPI and MVAPICH being that the latter relies on environment variables for setting MPI parameters.

#!/bin/bash

# MVAPICH environment variables

# Enable CUDA-aware MPI
export MV2_USE_CUDA=1

# Enable GDR
export MV2_USE_GPUDIRECT=1

# Set maximum GDR message size 
export MV2_GPUDIRECT_LIMIT=4194304

# Enable GDRCOPY library (set to 0 if not installed on the system)
export MV2_USE_GPUDIRECT_GDRCOPY=1
export MV2_GPUDIRECT_GDRCOPY_LIB=/gpfs/sw/gdrdrv/install/lib64/libgdrapi.so

# Disable MVAPICH's internal affinity setting (since we'll do it manually using numactl)
export MV2_ENABLE_AFFINITY=0
export MV2_USE_MCAST=1

# Where $MPI_HOME should be the path to the MVAPICH installation
LD_PRELOAD+=:$MPI_HOME/lib64/libmpi.so

export EXE="./dslash_test"                                                     # executable
export ARGS="--gridsize 2 2 4 8 --dim 24 24 24 24 --prec double --niter 10000"

mpirun -np 128 -f hostfile ./run.sh

The equivalent run.sh is shown below, with the only difference being the environment variable for binding the NIC.

#!/bin/bash

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.

# enable GDR support
export QUDA_ENABLE_GDR=1

# disable non-P2P communication in the node
export QUDA_ENABLE_P2P=7

export CUDA_DEVICE_MAX_CONNECTIONS=1

# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5 6 7)

# This is the list of NICs we should use for each GPU
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3
NICS=(mlx5_0 mlx5_0 mlx5_1 mlx5_1 mlx5_2 mlx5_2 mlx5_3 mlx5_3)

# This is the list of CPU cores we should use for each GPU
# e.g., 2x20 core CPUs split into 4 threads per process with correct NUMA assignment
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)

# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=4

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3 4 5 6 7)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER to for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})

APP="$EXE $ARGS"

lrank=$MV2_COMM_WORLD_LOCAL_RANK

export MV2_IBA_HCA=${NIC_REORDER[lrank]}

numactl --physcpubind=${CPU_REORDER[$lrank]} $APP

SpectrumMPI

A reference run script and binding script for Summit is given below. The script assumes that the environment variable APP has been defined as the full executable plus arguments.

Run script:

#!/bin/bash -v
#BSUB -P XXX
#BSUB -W 2:00
#BSUB -nnodes 432
#BSUB -J jobname
#BSUB -o jobOut.%J
#BSUB -e jobErr.%J
##### -cn_cu 'maxcus=48' # Set to num nodes / 18 to constrain racks; reduces throughput
#BSUB -alloc_flags "smt4"

# submit with
# bsub run.lsf

nodes=432
ranks=$[${nodes} * 6]

export QUDA_ENABLE_GDR=1
export QUDA_RESOURCE_PATH=`pwd`/tunecache
mkdir -p $QUDA_RESOURCE_PATH

# Generally HISQ MG only
#export QUDA_ENABLE_DEVICE_MEMORY_POOL=0
#export QUDA_ENABLE_MANAGED_MEMORY=1
#export QUDA_ENABLE_MANAGED_PREFETCH=1

# Prepare executable name
EXE=...
ARGS=...
export APP="${EXE} ${ARGS}"

# Setup for jsrun
export OMP_NUM_THREADS=7

echo "START_RUN: `date`"

# each resource set is one entire nodes, with 6 total MPI ranks each,
# with complete visibility of all 6 GPUs and 42 CPU cores (both sockets)
jsrun --nrs ${nodes} -a6 -g6 -c42 -dpacked -b packed:7 --latency_priority gpu-cpu --smpiargs="-gpu" ./bind-6gpu.sh

echo "FINISH_RUN: `date`"

Binding script:

#!/bin/bash

lrank=$(($PMIX_RANK % 6))

echo $APP

case ${lrank} in
 [0])
 export PAMI_IBV_DEVICE_NAME=mlx5_0:1
 numactl --physcpubind=0,4,8,12,16,20,24 --membind=0 $APP
 ;;

 [1])
 export PAMI_IBV_DEVICE_NAME=mlx5_0:1
 numactl --physcpubind=28,32,36,40,44,48,52 --membind=0 $APP
 ;;

 [2])
 export PAMI_IBV_DEVICE_NAME=mlx5_0:1
 numactl --physcpubind=56,60,64,68,72,76,80 --membind=0 $APP
 ;;

 [3])
 export PAMI_IBV_DEVICE_NAME=mlx5_3:1
 numactl --physcpubind=88,92,96,100,104,108,112 --membind=8 $APP
 ;;

 [4])
 export PAMI_IBV_DEVICE_NAME=mlx5_3:1
 numactl --physcpubind=116,120,124,128,132,136,140 --membind=8 $APP
 ;;

 [5])
 export PAMI_IBV_DEVICE_NAME=mlx5_3:1
 numactl --physcpubind=144,148,152,156,160,164,168 --membind=8 $APP
 ;;
esac

Dependence on CUDA_DEVICE_MAX_CONNECTIONS

There is an environment variable called CUDA_DEVICE_MAX_CONNECTIONS, this controls how many hardware channels the GPU should use, e.g., how much work can be launched independently from different streams without any false dependencies. However, since it has the lowest latency, QUDA gets optimum performance at CUDA_DEVICE_MAX_CONNECTIONS=1 since this gives the lower latency and you still get overlap between kernel launches and memory copies in general due to the order in which these are issued. So in general, the advice is to set this parameter equal to 1 and this will optimal scaling.

Low-level Details

Dslash Policy Tuning

Since the optimum Dslash overlapping computation and communication strategy varies depending the machine you running on, the size of the problem you running, the precision, etc., QUDA implements multiple dslash execution policies and utilizes the autotuner to identify the optimal strategy for a given parameter set and use that policy for all subsequent invocations (dslash_policy.cuh). At present the following policies are enabled in QUDA:

  • QUDA_DSLASH=0: bandwidth optimized - aim for maximum compute and comms overlap (one halo kernel per dimension)
  • QUDA_FUSED_DSLASH=1: kernel latency optimized - use a single halo update kernel for all dimensions
  • QUDA_GDR_DSLASH=2: GDR-enabled variant of QUDA_DSLASH
  • QUDA_FUSED_GDR_DSLASH=3: GDR-enabled variant of QUDA_FUSED_DSLASH
  • QUDA_GDR_RECV_DSLASH=4: variant of QUDA_DSLASH which only enables GDR for the receive
  • QUDA_FUSED_GDR_RECV_DSLASH=5: variant of QUDA_FUSED_DSLASH which only enables GDR for the receive
  • QUDA_ZERO_COPY_PACK_DSLASH=6: write the non-p2p packed halo buffers directly to CPU memory for minimum MPI_Send latency
  • QUDA_FUSED_ZERO_COPY_PACK_DSLASH=7: write non-p2p packed halo buffers directly to CPU memory and use fused halo kernel
  • QUDA_ZERO_COPY_DSLASH=8: write non-p2p halo buffer directly to CPU memory and read halos directly from CPU memory in halo update kernels
  • QUDA_FUSED_ZERO_COPY_DSLASH=9: write non-p2p halo buffer directly to CPU memory and read halos directly from CPU memory in a single halo update kernel
  • QUDA_ZERO_COPY_PACK_GDR_RECV_DSLASH=10
  • QUDA_FUSED_ZERO_COPY_PACK_GDR_RECV_DSLASH=11
  • QUDA_DSLASH_FUSED_PACK=12: fused dslash and halo packer kernel, the first thread blocks in the grid will pack the halo buffer and subsequent blocks will apply the interior dslash
  • QUDA_DSLASH_FUSED_PACK_FUSED_HALO=13: fused dslash and halo packer kernel and fused halo update kernel
  • QUDA_SHMEM_UBER_PACKINTRA_DSLASH=14: NVSHMEM policy using an Uber-kernel with packing for intra-node, interior and exterior fused into a single kernel, separate kernel for inter-node packing
  • QUDA_SHMEM_UBER_PACKFULL_DSLASH=15: NVSHMEM policy using an Uber-kernel with packing, interior and exterior fused into a single kernel
  • QUDA_SHMEM_PACKINTRA_DSLASH=16: NVSHMEM policy with intra-node packing and interior fused into a single kernel, separate inter-node packing and exterior kernels
  • QUDA_SHMEM_PACKFULL_DSLASH=17: NVSHMEM policy with packing and interior fused into a single kernel, separate exterior kernel

Note the GDR policies are only enabled if QUDA_ENABLE_GDR is set. In most instances, you will just want to let the autotuner pick the best policy for your parameter set. However, you can restrict the policy set to tune other by setting the environment argument QUDA_ENABLE_DSLASH_POLICY, e.g., setting QUDA_ENABLE_DSLASH_POLICY=1,3,5 would restrict the policy tuning to a subset of the "fused" variants only.

By default all policies will use peer-to-peer communication if available. To disable peer-to-peer, you set QUDA_ENABLE_P2P=0. Finally we note that if between a tuned run and a subsequent run, either the QUDA_ENABLE_P2P or QUDA_ENABLE_GDR environment variables change, then the autotuner will exit, forcing a retune.

Dslash Component Benchmarking

In order to benchmark the components of the Dslash in isolation, QUDA can selectively disable portions of the Dslash computation. This is useful for example to benchmarking NIC performance, or to test kernel performance in the absence of communication. The dslash computation is broken down into multiple steps:

  • packing: prepare contiguous halo buffers to be handed off to MPI / P2P communication
  • comms: p2p cudaMemcpy within the node and MPI between nodes
  • interior: apply the dslash stencil on the interior while the halo regions are being communicated
  • exterior: once the comms have finished we finish the calculation with the application of the halo on the boundary elements (copy: when GDR / P2P is not available between a set of GPUs, then we have the additional D2H/H2D memcpys for staging the MPI buffers in CPU memory)

The following set of environment variables can be used to disable the various parts of computation and/or the communication. All of the below variables default to 1 (e.g., do the full calculation), but can be disabled by setting to 0 (obviously result will be wrong).

  • QUDA_ENABLE_DSLASH_PACK - enable / disable initial packing kernel
  • QUDA_ENABLE_DSLASH_COMMS - enable/disable P2P memcpys and / or MPI exchange
  • QUDA_ENABLE_DSLASH_INTERIOR - enable/disable interior kernel computation
  • QUDA_ENABLE_DSLASH_EXTERIOR - enable/disable exterior kernel computation
  • QUDA_ENABLE_DSLASH_COPY - enable/disable host staging copies for MPI if GDR/P2P not enabled

By combining the explicit policy choice with the above variables, we can benchmark in isolation any computation or communication pattern.

For communication benchmarking, the dslash_test and staggered_dslash_test programs will report the effective bi-directional bandwidth sustained by the algorithm (just grep the output for “bi”). See the below results for an example taken between two GPUs connected using PCIe peer-to-peer. With the full computation enabled we are unable to see what the actual achieved bi-directional bandwidth is, e.g., it plateaus once all the communications are hidden by the local computation, but when you only do the communications we see the expected behavior and the bi-directional bandwidth is saturating at 19 GB/s. What we can also see is the bandwidth doesn't saturate until a relatively large local volume: this is the motivation for future work using a SHMEM-style programming model where all peer-to-peer reads and write will be done by reading and writing directly to neighboring GPUs, which has significantly lower latency.

Legacy Information

Asymmetric Topologies

On asymmetric systems where some GPUs are on one side of the QPI bus and the NIC is on the other, care must be taken since the QPI bus cannot efficiently forward the memory traffic between attached PCIe devices. For example the following system has four GPUs and a NIC one socket, with two GPUs and not NIC on the other socket. This system will only give efficient GDR support for the first four GPUs, with the other two needing to stage their inter-node memory traffic explicitly through CPU memory.

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	mlx5_0	CPU Affinity
GPU0	 X 	PIX	PHB	PHB	SOC	SOC	PHB	0-9
GPU1	PIX	 X 	PHB	PHB	SOC	SOC	PHB	0-9
GPU2	PHB	PHB	 X 	PIX	SOC	SOC	PHB	0-9
GPU3	PHB	PHB	PIX	 X 	SOC	SOC	PHB	0-9
GPU4	SOC	SOC	SOC	SOC	 X 	PHB	SOC	10-19
GPU5	SOC	SOC	SOC	SOC	PHB	 X 	SOC	10-19
mlx5_0	PHB	PHB	PHB	PHB	SOC	SOC	 X 	

To enable such a setup, the environment variable QUDA_ENABLE_GDR_BLACKLIST can be used to exclude a given number of GPUs from using GDR, and instead will fallback to using explicit staging through CPU memory. The below is an example of how to do this for the above topology using OpenMPI.

mpirun
 -np 48                                                                     # total number of processes
 -npernode 6                                                                # number of processes per node
 --bind-to none                                                             # lets the user overrule binding using numactl
 -hostfile ./hostfile                                                       # list of hosts we want to run on
 --mca btl sm,self,openib                                                   # enable intra-node, loop back to self, and IB
 --mca btl_openib_want_cuda_gdr 1                                           # enable GDR for MPI
 --mca btl_openib_cuda_rdma_limit 1000000000                                # set the largest message size for GDR
 -x EXE="./dslash_test"                                                     # executable
 -x ARGS="--gridsize 2 2 2 6 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
 ./run.sh

where run.sh would be as given below

#!/bin/bash

# QUDA specific-environment variables

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.

# enable GDR support
export QUDA_ENABLE_GDR=1

# exclude GPUs 4 and 5 from GDR since it's across QPI
export QUDA_ENABLE_GDR_BLACKLIST="4,5"

export CUDA_DEVICE_MAX_CONNECTIONS=1

# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5)

# This is the list of NICs we should use for each GPU
NICS=(mlx5_0 mlx5_0 mlx5_0 mlx5_0 mlx5_0 mlx5_0)

# This is the list of CPU cores we should use for each GPU
# e.g., 2x10 core CPUs split into 2 threads per process with correct NUMA assignment
CPUS=(1-2 3-4 5-6 7-8 10-11 15-16)

# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=2

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3 4 5)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER to for this mapping                                                                                                                                         
       
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]})

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export OMPI_MCA_btl_openib_if_include=${NIC_REORDER[lrank]}

numactl --physcpubind=${CPU_REORDER[$lrank]} $APP

Open MPI <= 3 (without UCX)

The script below (for OpenMPI) achieves that. To use this script with QUDA's dslash_test, running on 16 nodes of DGX-1, it would be launched with something like

mpirun 
 -np 128                                                                    # total number of processes
 -npernode 8                                                                # number of processes per node
 --bind-to none                                                             # lets the user overrule binding using numactl
 -hostfile ./hostfile                                                       # list of hosts we want to run on
 --mca btl sm,self,openib                                                   # enable intra-node, loop back to self, and IB
 --mca btl_openib_want_cuda_gdr 1                                           # enable GDR for MPI
 --mca btl_openib_cuda_rdma_limit 1000000000                                # set the largest message size for GDR
 -x EXE="./dslash_test"                                                     # executable
 -x ARGS="--gridsize 2 2 4 8 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
 ./run.sh                                                                   # name of the below script

In the run.sh script we set the order of CUDA devices as how they will be mapped to the local MPI ranks (the REORDER variable). Given this order, we ensure that the closest NIC for a given process is assigned, and furthermore we set the the CPU cores available for each process to obtain the correct non-overlapping NUMA mapping.

#!/bin/bash                                                                                                                                                                                                                                                

# QUDA specific-environment variables                                                                                                                                                                                                                      

# set the QUDA tunecache path                                                                                                                                                                                                                              
export QUDA_RESOURCE_PATH=.

# enable GDR support                                                                                                                                                                                                                                       
export QUDA_ENABLE_GDR=1

export CUDA_DEVICE_MAX_CONNECTIONS=1

# this is the list of GPUs we have                                                                                                                                                                                                                         
GPUS=(0 1 2 3 4 5 6 7)

# This is the list of NICs we should use for each GPU                                                                                                                                                                                                      
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3                                                                                                                                                                
NICS=(mlx5_0 mlx5_0 mlx5_1 mlx5_1 mlx5_2 mlx5_2 mlx5_3 mlx5_3)

# This is the list of CPU cores we should use for each GPU                                                                                                                                                                                                 
# e.g., 2x20 core CPUs split into 4 threads per process with correct NUMA assignment                                                                                                                                                                       
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)

# Number of physical CPU cores per GPU                                                                                                                                                                                                                     
export OMP_NUM_THREADS=4

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)                                                                                                                                                                      
REORDER=(0 1 2 3 4 5 6 7)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER to for this mapping                                                                                                                                                
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export OMPI_MCA_btl_openib_if_include=${NIC_REORDER[lrank]}

numactl --physcpubind=${CPU_REORDER[$lrank]} $APP