Multi GPU Support - lattice/quda GitHub Wiki
Contents
- Compiling for Multi-GPU
- Running
- Running QUDA's Tests
- Multi-GPU Emulation
- Peer-to-peer Communication
- GPU Direct RDMA and CUDA-aware MPI
- Maximizing Communications Performance
- Dependence on CUDA_DEVICE_MAX_CONNECTIONS
- Low-level Details
- Dslash Policy Tuning
- Dslash Component Benchmarking
- Legacy Information
Compiling for Multi-GPU
To enable multi-GPU support, you need to set either `QUDA_MPI=ON` or `QUDA_QMP=ON` to select the MPI or QMP communications back end.
QMP is the USQCD QCD communications layer, which provides compatibility with other USQCD software packages. To enable QUDA to use QIO directly, you need to enable QMP. QUDA supports automatically downloading and compiling verified versions of QMP and QIO via the cmake flag `QUDA_DOWNLOAD_USQCD=ON`.
In the case of MPI, `cmake` should detect the MPI compiler and libraries by default. If need be, the paths to an MPI installation (root directory, libraries, includes, and binaries) can be set manually. This is most easily done via the visual `ccmake` configuration; the MPI flags are under the advanced options, accessed by pressing `t`. If you are using OpenMPI, we recommend version 4.0.x.
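As a concrete sketch (the source and build directory names are illustrative assumptions), a multi-GPU build using QMP and the automatic USQCD download might look like:

```shell
# Configure an out-of-source build with the QMP back end and
# automatic download of verified QMP/QIO versions
cmake -S quda -B build \
  -DQUDA_QMP=ON \
  -DQUDA_DOWNLOAD_USQCD=ON \
  -DCMAKE_BUILD_TYPE=RELEASE
cmake --build build -j
```

For an MPI-only build, replace `-DQUDA_QMP=ON` with `-DQUDA_MPI=ON`.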
Compiling QMP and QIO
We recommend using QUDA's automated download and compile feature, documented here.
If a custom compilation is needed or desired (possibly with the Cray `CC` wrapper), you can compile QMP and QIO + c-lime manually. We advise using commit 3010fef of QMP and the qio3-0-0 version of QIO.
- QMP compilation instructions can be found here.
- QIO compilation requires c-lime as a dependency. This is most easily obtained by cloning recursively: `git clone --recursive git@github.com:usqcd-software/qio.git`. Navigate to `qio/` and execute the command `autoreconf -f -i`. You can then run `configure` with your preferred options, including the `--with-qmp=[...]` flag to specify a QMP install directory, then `make` and finally `make install`.
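Putting the steps above together, a manual QIO build might look like the following sketch (the install prefixes and the `CC=mpicc` choice are illustrative assumptions, not requirements):

```shell
# Clone QIO with its c-lime submodule, regenerate the autotools files,
# then configure against an existing QMP install
git clone --recursive git@github.com:usqcd-software/qio.git
cd qio
autoreconf -f -i
./configure CC=mpicc \
  --with-qmp=$HOME/install/qmp \
  --prefix=$HOME/install/qio
make -j
make install
```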
Running
Running on multiple GPUs is similar to running any other MPI application. In general, one process is assigned to each GPU. Make sure that all of QUDA's environment variables are propagated to all processes, since these control some of QUDA's internal control flow. The run/bind scripts given below handle this automatically. Alternatively, one can broadcast environment variables using the job launcher, e.g., with OpenMPI's mpirun using `-x QUDA_RESOURCE_PATH=/path/to/somewhere`.
Of note, if strictly necessary, QUDA can be run under MPS (enabling GPU oversubscription with minimal overheads) via the `QUDA_ENABLE_MPS` environment variable. This is noted only for reference; in general you should never need it.
Running QUDA's tests
When running QUDA through a host application, typically the host application is responsible for setting the process topology and local problem size. For QUDA's internal tests, these parameters are set using the following command-line parameters:
--dim x y z t # x y z t is the local (per process) problem size
--gridsize X Y Z T # X Y Z T is the process topology
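The local and global problem sizes are related by the process topology: each global extent is the local extent times the process-grid extent in that dimension. A small sketch (the test-binary invocation is illustrative):

```shell
# 8 processes in a 2x2x2x1 grid, each holding a 16^3 x 32 local lattice:
#   mpirun -np 8 ./dslash_test --dim 16 16 16 32 --gridsize 2 2 2 1
gx=2; gy=2; gz=2; gt=1          # process grid (X Y Z T)
lx=16; ly=16; lz=16; lt=32      # local dims (x y z t)
# the resulting global lattice is 32x32x32x32
echo "global: $((gx*lx)) $((gy*ly)) $((gz*lz)) $((gt*lt))"
```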
Multi-GPU emulation
To aid performance modelling and debugging, it is possible to switch on "emulated" communication in a given dimension (using rank-local pack/unpack kernels), even if that dimension is actually local to a given GPU. The command-line flag `--partition N` enables this feature, where N is a 4-bit number whose bits 0,1,2,3 switch on/off communication in the x,y,z,t dimensions (respectively). For example:
dslash_test --partition 1 ## enable x dimension communication
dslash_test --partition 6 ## enable y and z dimension communication
dslash_test --partition 15 ## enable full communication
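The mask is just the bitwise OR over the dimensions to partition. For instance, computing the `--partition` value for y and z communication (bits 1 and 2):

```shell
# Bit positions: x=0, y=1, z=2, t=3; OR the bits of the partitioned dimensions
mask=0
for bit in 1 2; do              # y and z
  mask=$((mask | (1 << bit)))
done
echo $mask                      # prints 6, matching "--partition 6" above
```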
Peer-to-peer communication
QUDA will automatically detect multiple GPUs in the same node and use direct peer-to-peer communication where available. For GPUs to be peer-to-peer capable, they need to be either on the same PCIe root complex (e.g., connected to the same CPU socket or PCIe switch) or directly connected with an NVLink connection. While peer-to-peer communication leads to much improved performance versus leaving MPI to handle the inter-GPU communication, it can be useful for benchmarking and/or debugging to disable it. This can be done by setting the environment variable `QUDA_ENABLE_P2P=0`.
GPU Direct RDMA and CUDA-aware MPI
QUDA can optionally support GPU-aware MPI and GPU Direct RDMA (GDR), i.e., where data is passed directly to MPI without first copying it to the host, or conversely received directly into GPU memory. By default this option is disabled, since passing a GPU pointer to an MPI library that is unaware of GPUs will lead to undefined behaviour (most likely a segmentation fault). It can be enabled by setting the environment variable `QUDA_ENABLE_GDR=1`.
When doing so, you should ensure that the MPI library you are using is also GPU enabled and that the network drivers support it. For Mellanox Infiniband, this means OFED v2.1 or v3.1 and later (depending on which IB card is in your system). Details for Mellanox can be found here. As another note, GDR sometimes does not work if the GPUs are in exclusive mode.
On systems that do not support GDR but are running a CUDA-aware MPI library (e.g., OpenMPI, MVAPICH2), the MPI library can automatically stage the MPI buffers in CPU memory when given a GPU pointer. Typically, letting the MPI library take care of this staging is slower than having QUDA do it, since it introduces unnecessary synchronization. However, we note that on systems that do not have a NIC available, enabling GDR support in combination with GPU-aware MPI can be beneficial for debugging, if not performance.
It should be noted that enabling GDR will never make performance worse, since the dslash policy autotuner will automatically test all enabled policies, e.g., basic, GDR-enabled, etc., and pick the best one for each given precision, volume, etc. Details on the dslash policy tuning are given below.
OpenMPI
We recommend taking advantage of OpenMPI provided by your system administrator and working with them if you see sub-par performance. Below we give an example run script of how to use OpenMPI with GDR support and instructions for optimal process placement.
For reference, e.g., for local workstation experiments, instructions for building CUDA-aware OpenMPI can be found here and instructions for running CUDA-aware OpenMPI can be found here.
MVAPICH2
Instructions for running the current GDR-enabled MVAPICH2 can be found here. MVAPICH2-GDR is only available as a binary, but the source code for the regular CUDA-aware MVAPICH2 (with host message staging) is available.
To enable CUDA-awareness in MVAPICH2 when building from source, you must pass `--enable-cuda` when running configure. At run time, you must set the environment variable `MV2_USE_CUDA=1`. Specific GDR-related instructions are below.
Cray MPI
The exact details of configuring QUDA for Cray MPI can vary from machine to machine. Instructions for Perlmutter are available here.
More legacy instructions that have not been tested recently are as follows:
To enable GPU-awareness in Cray's MPI you need to set the environment variable `MPICH_RDMA_ENABLED_CUDA=1`. At present Cray's implementation provides no user control over which messages will be exchanged using RDMA versus host staging, meaning MPI exchange can go through CPU memory with no means to force-enable RDMA. The end result is that only very small volumes will utilize RDMA on Cray's XC platform, which most likely means only coarse grids with multigrid.
We note that performance on Cray systems may be improved by enabling `MPICH_NEMESIS_ASYNC_PROGRESS=1`, which results in the MPI library spawning threads to ensure the forward progress of asynchronous MPI calls (which QUDA utilizes).
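Collecting these legacy Cray settings into a single job-script fragment (the `MPICH_MAX_THREAD_SAFETY` line is an assumption on our part: async progress generally requires a sufficiently high MPI thread level):

```shell
# Legacy Cray MPI settings for QUDA (untested recently; see above)
export MPICH_RDMA_ENABLED_CUDA=1          # GPU-aware MPI
export MPICH_NEMESIS_ASYNC_PROGRESS=1     # progress threads for async MPI calls
export MPICH_MAX_THREAD_SAFETY=multiple   # assumed prerequisite for async progress
```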
Maximizing Communications Performance
On systems with multiple GPUs and multiple NICs, to ensure maximum GPU-NIC throughput, care must be taken to ensure that each GPU communicates with the closest NIC. This can be done by querying the topology of the machine you are running on, and then instrumenting your MPI launch and/or run script to ensure correct placement.
General Example
Consider running on DGX-1, a system with 4x EDR NICs and 8x P100 GPUs. Each pair of GPUs shares a NIC, so we need to ensure that the NIC local to each pair is used for all non-peer-to-peer communication.
First of all, we query the node topology with `nvidia-smi topo -m`:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_2 mlx5_1 mlx5_3 CPU Affinity
GPU0 X NV1 NV1 NV1 NV1 SOC SOC SOC PIX SOC PHB SOC 0-19
GPU1 NV1 X NV1 NV1 SOC NV1 SOC SOC PIX SOC PHB SOC 0-19
GPU2 NV1 NV1 X NV1 SOC SOC NV1 SOC PHB SOC PIX SOC 0-19
GPU3 NV1 NV1 NV1 X SOC SOC SOC NV1 PHB SOC PIX SOC 0-19
GPU4 NV1 SOC SOC SOC X NV1 NV1 NV1 SOC PIX SOC PHB 20-39
GPU5 SOC NV1 SOC SOC NV1 X NV1 NV1 SOC PIX SOC PHB 20-39
GPU6 SOC SOC NV1 SOC NV1 NV1 X NV1 SOC PHB SOC PIX 20-39
GPU7 SOC SOC SOC NV1 NV1 NV1 NV1 X SOC PHB SOC PIX 20-39
mlx5_0 PIX PIX PHB PHB SOC SOC SOC SOC X SOC PHB SOC
mlx5_2 SOC SOC SOC SOC PIX PIX PHB PHB SOC X SOC PHB
mlx5_1 PHB PHB PIX PIX SOC SOC SOC SOC PHB SOC X SOC
mlx5_3 SOC SOC SOC SOC PHB PHB PIX PIX SOC PHB SOC X
Legend:
X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
We see that there are eight GPUs and four NICs, as expected. The critical point is that GPU0 and GPU1 are both connected directly to mlx5_0 on the same PCIe switch, with GPU2 and GPU3 on mlx5_1, etc. So when launching our job on multiple nodes we need to ensure that the processes mapped to these GPUs are instructed to use these NICs.
Open MPI <= 3 (without UCX)
Moved to the legacy section here. Best practice is to use the most up-to-date version of OpenMPI with UCX.
OpenMPI 4 with UCX
While the general discussion given for OpenMPI 3 still applies, the scripts need to be slightly modified to use the correct environment variables for UCX when using OpenMPI 4 with UCX. The `run.sh` script for UCX looks like:
#!/bin/bash
# QUDA specific-environment variables
# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
export QUDA_ENABLE_TUNING=1
# enable GDR support
export QUDA_ENABLE_GDR=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5 6 7)
# This is the list of NICs we should use for each GPU
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3
NICS=(mlx5_0:1 mlx5_0:1 mlx5_1:1 mlx5_1:1 mlx5_2:1 mlx5_2:1 mlx5_3:1 mlx5_3:1)
# This is the list of CPU cores we should use for each GPU
# e.g., 2x20 core CPUs split into 4 threads per process with correct NUMA assignment
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)
# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=4
# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3 4 5 6 7)
# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})
APP="$EXE $ARGS"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
export UCX_NET_DEVICES=${NIC_REORDER[lrank]}
numactl --physcpubind=${CPU_REORDER[$lrank]} \
$APP
In the example above, the `REORDER` variable tells us the order in which we want the GPUs to map to the local MPI processes. Here we have used the default ordering, `REORDER=(0 1 2 3 4 5 6 7)`, which produces an optimal mapping for a local 1x2x2x2 process topology (e.g., given the NVLink topology of DGX-1, GPU 0 can communicate with GPUs 1, 2 and 4, which are the only GPUs needed for this 3-d topology). However, if we were running with a 1x1x2x4 local process topology (given that the default MPI process topology is `((pt*Nz + pz)*Ny + py)*Nx + px`), then process 0 would need to communicate with processes 1 (Z +/-), 2 (T+) and 6 (T-), but GPU 0 only has connections to GPUs 1, 2, 3, and 4.** So in this case we would want to use `REORDER=(0 1 2 3 6 7 4 5)`, which maps GPU 4 to process 6, providing the optimal peer-to-peer connectivity matrix.
** This is the default for QUDA and MILC. BQCD, on the other hand, uses the inverse of this mapping, `((px*Ny + py)*Nz + pz)*Nt + pt`. In this case, the BQCD mapping would actually provide the optimal peer-to-peer mapping with the default GPU order.
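The neighbor ranks quoted above follow directly from the default lexicographical mapping `rank = ((pt*Nz + pz)*Ny + py)*Nx + px`; a quick sketch for the 1x1x2x4 topology:

```shell
# Default QUDA/MILC rank mapping for a 1x1x2x4 (Nx Ny Nz Nt) process grid
Nx=1; Ny=1; Nz=2; Nt=4
rank() {  # args: px py pz pt
  echo $(( (($4 * Nz + $3) * Ny + $2) * Nx + $1 ))
}
rank 0 0 1 0   # z neighbor of rank 0        -> 1
rank 0 0 0 1   # t+ neighbor of rank 0       -> 2
rank 0 0 0 3   # t- neighbor (periodic wrap) -> 6
```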
Given the above binding script, the corresponding MPI launch command is then (note: update with latest version of UCX):
export UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc # select transports
export UCX_MEMTYPE_CACHE=n # see https://github.com/openucx/ucx/wiki/NVIDIA-GPU-Support
export UCX_RNDV_SCHEME=get_zcopy # improves GPUDirectRDMA performance
export UCX_RNDV_THRESH=131304 # your mileage may vary
mpirun
-np 48 # total number of processes
-npernode 6 # number of processes per node
--bind-to none # lets the user overrule binding using numactl
-hostfile ./hostfile # list of hosts we want to run on
-x EXE="./dslash_test" # executable
-x ARGS="--gridsize 2 2 2 6 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
./run.sh
MVAPICH2
For completeness, we give the equivalent scripts from above when using MVAPICH2-GDR with QUDA. Testing with the latest version of MVAPICH2-GDR (2.3alpha; installation instructions are here) has shown issues with intra-node communication where either the sender or receiver is a GPU pointer, leading to MVAPICH2-GDR segfaulting. This isn't a performance issue, since QUDA's low-level handling of peer-to-peer communication within the node will almost certainly be superior to an MPI implementation's; however, since the dslash policy tuner by default will test policies that include just handing a GPU pointer to MPI, we need to explicitly disable these policies from being tested. We can do this by setting the environment variable `QUDA_ENABLE_P2P=7`, which enables both the copy-engine and direct-store dslash communication policies but disables the non-explicit P2P policies.
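The value 7 can be read as a bitmask (this decomposition is our reading of the behavior described above): bits 0 and 1 enable the copy-engine and direct-store P2P policies, while bit 2 disables handing raw GPU pointers to MPI within the node.

```shell
# 7 = copy-engine (1) | direct store (2) | disable non-explicit P2P (4)
echo $((1 | 2 | 4))               # prints 7
export QUDA_ENABLE_P2P=$((1 | 2 | 4))
```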
The equivalent launch and run scripts for DGX-1 are shown below. The main difference between OpenMPI and MVAPICH is that the latter relies on environment variables for setting MPI parameters.
#!/bin/bash
# MVAPICH environment variables
# Enable CUDA-aware MPI
export MV2_USE_CUDA=1
# Enable GDR
export MV2_USE_GPUDIRECT=1
# Set maximum GDR message size
export MV2_GPUDIRECT_LIMIT=4194304
# Enable GDRCOPY library (set to 0 if not installed on the system)
export MV2_USE_GPUDIRECT_GDRCOPY=1
export MV2_GPUDIRECT_GDRCOPY_LIB=/gpfs/sw/gdrdrv/install/lib64/libgdrapi.so
# Disable MVAPICH's internal affinity setting (since we'll do it manually using numactl)
export MV2_ENABLE_AFFINITY=0
export MV2_USE_MCAST=1
# Where $MPI_HOME should be the path to the MVAPICH installation
LD_PRELOAD+=:$MPI_HOME/lib64/libmpi.so
export EXE="./dslash_test" # executable
export ARGS="--gridsize 2 2 4 8 --dim 24 24 24 24 --prec double --niter 10000"
mpirun -np 128 -f hostfile ./run.sh
The equivalent run.sh is shown below, with the only difference being the environment variable for binding the NIC.
#!/bin/bash
# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
# enable GDR support
export QUDA_ENABLE_GDR=1
# disable non-P2P communication in the node
export QUDA_ENABLE_P2P=7
export CUDA_DEVICE_MAX_CONNECTIONS=1
# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5 6 7)
# This is the list of NICs we should use for each GPU
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3
NICS=(mlx5_0 mlx5_0 mlx5_1 mlx5_1 mlx5_2 mlx5_2 mlx5_3 mlx5_3)
# This is the list of CPU cores we should use for each GPU
# e.g., 2x20 core CPUs split into 4 threads per process with correct NUMA assignment
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)
# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=4
# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3 4 5 6 7)
# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})
APP="$EXE $ARGS"
lrank=$MV2_COMM_WORLD_LOCAL_RANK
export MV2_IBA_HCA=${NIC_REORDER[lrank]}
numactl --physcpubind=${CPU_REORDER[$lrank]} $APP
SpectrumMPI
A reference run script and binding script for Summit is given below. The script assumes that the environment variable `APP` has been defined as the full executable plus arguments.
Run script:
#!/bin/bash -v
#BSUB -P XXX
#BSUB -W 2:00
#BSUB -nnodes 432
#BSUB -J jobname
#BSUB -o jobOut.%J
#BSUB -e jobErr.%J
##### -cn_cu 'maxcus=48' # Set to num nodes / 18 to constrain racks; reduces throughput
#BSUB -alloc_flags "smt4"
# submit with
# bsub run.lsf
nodes=432
ranks=$((nodes * 6))
export QUDA_ENABLE_GDR=1
export QUDA_RESOURCE_PATH=`pwd`/tunecache
mkdir -p $QUDA_RESOURCE_PATH
# Generally HISQ MG only
#export QUDA_ENABLE_DEVICE_MEMORY_POOL=0
#export QUDA_ENABLE_MANAGED_MEMORY=1
#export QUDA_ENABLE_MANAGED_PREFETCH=1
# Prepare executable name
EXE=...
ARGS=...
export APP="${EXE} ${ARGS}"
# Setup for jsrun
export OMP_NUM_THREADS=7
echo "START_RUN: `date`"
# each resource set is one entire nodes, with 6 total MPI ranks each,
# with complete visibility of all 6 GPUs and 42 CPU cores (both sockets)
jsrun --nrs ${nodes} -a6 -g6 -c42 -dpacked -b packed:7 --latency_priority gpu-cpu --smpiargs="-gpu" ./bind-6gpu.sh
echo "FINISH_RUN: `date`"
Binding script:
#!/bin/bash
lrank=$(($PMIX_RANK % 6))
echo $APP
case ${lrank} in
[0])
export PAMI_IBV_DEVICE_NAME=mlx5_0:1
numactl --physcpubind=0,4,8,12,16,20,24 --membind=0 $APP
;;
[1])
export PAMI_IBV_DEVICE_NAME=mlx5_0:1
numactl --physcpubind=28,32,36,40,44,48,52 --membind=0 $APP
;;
[2])
export PAMI_IBV_DEVICE_NAME=mlx5_0:1
numactl --physcpubind=56,60,64,68,72,76,80 --membind=0 $APP
;;
[3])
export PAMI_IBV_DEVICE_NAME=mlx5_3:1
numactl --physcpubind=88,92,96,100,104,108,112 --membind=8 $APP
;;
[4])
export PAMI_IBV_DEVICE_NAME=mlx5_3:1
numactl --physcpubind=116,120,124,128,132,136,140 --membind=8 $APP
;;
[5])
export PAMI_IBV_DEVICE_NAME=mlx5_3:1
numactl --physcpubind=144,148,152,156,160,164,168 --membind=8 $APP
;;
esac
GH200 Superchip: 1x Superchip per node
In the case of a node with a single superchip and one NIC, `nvidia-smi topo -m` may return something along the lines of:
$ nvidia-smi topo -m
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PIX 0-71 0 1
NIC0 PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
This is representative and may not be exact. For this setup, an appropriate binding script is:
#!/bin/bash
# QUDA specific-environment variables
# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
export QUDA_ENABLE_TUNING=1
# enable GDR support
export QUDA_ENABLE_GDR=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
APP="$EXE $ARGS"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
export UCX_NET_DEVICES=mlx5_0:1
numactl --cpunodebind=0 --membind=0 $APP
Memory binding is critically important on 1xGH200 superchip nodes.
4xGH200 Superchip Node
OpenMPI/UCX (Jupiter)
This set of instructions is relevant for FZJ/Jupiter, though the details of the NIC output below may not be exact. The output of `nvidia-smi topo -m` may look like:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV6 NV6 NV6 SYS SYS SYS SYS SYS SYS 0-71 0 4
GPU1 NV6 X NV6 NV6 SYS SYS SYS SYS SYS SYS 72-143 1 12
GPU2 NV6 NV6 X NV6 SYS SYS SYS SYS SYS SYS 144-215 2 20
GPU3 NV6 NV6 NV6 X SYS SYS SYS SYS SYS SYS 216-287 3 28
NIC0 SYS SYS SYS SYS X SYS SYS SYS SYS SYS
NIC1 SYS SYS SYS SYS SYS X PIX SYS SYS SYS
NIC2 SYS SYS SYS SYS SYS PIX X SYS SYS SYS
NIC3 SYS SYS SYS SYS SYS SYS SYS X SYS SYS
NIC4 SYS SYS SYS SYS SYS SYS SYS SYS X SYS
NIC5 SYS SYS SYS SYS SYS SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
In this case, a representative binding script would be:
#!/bin/bash
# QUDA specific-environment variables
# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
export QUDA_ENABLE_TUNING=1
# enable GDR support
export QUDA_ENABLE_GDR=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# this is the list of GPUs we have
GPUS=(0 1 2 3)
# This is the list of NICs we should use for each GPU
# e.g., associate GPU0 with MLX0, GPU1 with MLX1, GPU2 with MLX2 and GPU3 with MLX3
# The other NICs are included for completeness but are ignored
NICS=(mlx5_0:1 mlx5_1:1 mlx5_2:1 mlx5_3:1 mlx5_4:1 mlx5_5:1)
# This is the list of NUMA regions we should use for each MPI rank <-> GPU
# e.g., 4x72 core Grace CPUs
CPUS=(0 1 2 3)
# The number of threads to use for the calling app; may vary based on the CPU app.
# Grace has 72 physical cores.
export OMP_NUM_THREADS=16
# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3)
# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]})
APP="$EXE $ARGS"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
export UCX_NET_DEVICES=${NIC_REORDER[lrank]}
numactl --cpunodebind=${CPU_REORDER[$lrank]} --membind=${CPU_REORDER[$lrank]} $APP
Memory binding is critically important on 4xGH200 superchip nodes.
Dependence on CUDA_DEVICE_MAX_CONNECTIONS
The environment variable `CUDA_DEVICE_MAX_CONNECTIONS` controls how many hardware channels the GPU uses, i.e., how much work can be launched independently from different streams without false dependencies. QUDA obtains optimum performance with `CUDA_DEVICE_MAX_CONNECTIONS=1`, since this gives the lowest latency, and in general kernel launches and memory copies still overlap due to the order in which they are issued. So the general advice is to set this parameter to 1, which will provide optimal scaling.
Low-level Details
Dslash Policy Tuning
Since the optimum strategy for overlapping dslash computation and communication varies depending on the machine you are running on, the size of the problem, the precision, etc., QUDA implements multiple dslash execution policies and utilizes the autotuner to identify the optimal strategy for a given parameter set, using that policy for all subsequent invocations (dslash_policy.cuh). At present the following policies are enabled in QUDA:
- `QUDA_DSLASH=0`: bandwidth optimized - aim for maximum compute and comms overlap (one halo kernel per dimension)
- `QUDA_FUSED_DSLASH=1`: kernel latency optimized - use a single halo update kernel for all dimensions
- `QUDA_GDR_DSLASH=2`: GDR-enabled variant of QUDA_DSLASH
- `QUDA_FUSED_GDR_DSLASH=3`: GDR-enabled variant of QUDA_FUSED_DSLASH
- `QUDA_GDR_RECV_DSLASH=4`: variant of QUDA_DSLASH which only enables GDR for the receive
- `QUDA_FUSED_GDR_RECV_DSLASH=5`: variant of QUDA_FUSED_DSLASH which only enables GDR for the receive
- `QUDA_ZERO_COPY_PACK_DSLASH=6`: write the non-p2p packed halo buffers directly to CPU memory for minimum MPI_Send latency
- `QUDA_FUSED_ZERO_COPY_PACK_DSLASH=7`: write non-p2p packed halo buffers directly to CPU memory and use a fused halo kernel
- `QUDA_ZERO_COPY_DSLASH=8`: write non-p2p halo buffers directly to CPU memory and read halos directly from CPU memory in the halo update kernels
- `QUDA_FUSED_ZERO_COPY_DSLASH=9`: write non-p2p halo buffers directly to CPU memory and read halos directly from CPU memory in a single halo update kernel
- `QUDA_ZERO_COPY_PACK_GDR_RECV_DSLASH=10`
- `QUDA_FUSED_ZERO_COPY_PACK_GDR_RECV_DSLASH=11`
- `QUDA_DSLASH_FUSED_PACK=12`: fused dslash and halo-packer kernel; the first thread blocks in the grid pack the halo buffer and subsequent blocks apply the interior dslash
- `QUDA_DSLASH_FUSED_PACK_FUSED_HALO=13`: fused dslash and halo-packer kernel, together with a fused halo update kernel
- `QUDA_SHMEM_UBER_PACKINTRA_DSLASH=14`: NVSHMEM policy using an uber-kernel with intra-node packing, interior and exterior fused into a single kernel; separate kernel for inter-node packing
- `QUDA_SHMEM_UBER_PACKFULL_DSLASH=15`: NVSHMEM policy using an uber-kernel with packing, interior and exterior fused into a single kernel
- `QUDA_SHMEM_PACKINTRA_DSLASH=16`: NVSHMEM policy with intra-node packing and interior fused into a single kernel; separate inter-node packing and exterior kernels
- `QUDA_SHMEM_PACKFULL_DSLASH=17`: NVSHMEM policy with packing and interior fused into a single kernel; separate exterior kernel
Note the GDR policies are only enabled if `QUDA_ENABLE_GDR` is set. In most instances, you will just want to let the autotuner pick the best policy for your parameter set. However, you can restrict the set of policies to tune over by setting the environment variable `QUDA_ENABLE_DSLASH_POLICY`; e.g., setting `QUDA_ENABLE_DSLASH_POLICY=1,3,5` would restrict the policy tuning to a subset of the "fused" variants only.
By default all policies will use peer-to-peer communication if available. To disable peer-to-peer, set `QUDA_ENABLE_P2P=0`. Finally, we note that if either the `QUDA_ENABLE_P2P` or `QUDA_ENABLE_GDR` environment variable changes between a tuned run and a subsequent run, the autotuner will exit, forcing a retune.
Dslash Component Benchmarking
In order to benchmark the components of the dslash in isolation, QUDA can selectively disable portions of the dslash computation. This is useful, for example, for benchmarking NIC performance, or for testing kernel performance in the absence of communication. The dslash computation is broken down into multiple steps:
- packing: prepare contiguous halo buffers to be handed off to MPI / P2P communication
- comms: p2p cudaMemcpy within the node and MPI between nodes
- interior: apply the dslash stencil on the interior while the halo regions are being communicated
- exterior: once the comms have finished, we complete the calculation by applying the halo to the boundary elements
- copy: when GDR / P2P is not available between a set of GPUs, we have additional D2H/H2D memcpys for staging the MPI buffers in CPU memory
The following set of environment variables can be used to disable the various parts of the computation and/or the communication. All of the below variables default to 1 (i.e., do the full calculation), but any can be disabled by setting it to 0 (obviously the result will then be wrong).
- `QUDA_ENABLE_DSLASH_PACK` - enable/disable the initial packing kernel
- `QUDA_ENABLE_DSLASH_COMMS` - enable/disable P2P memcpys and/or MPI exchange
- `QUDA_ENABLE_DSLASH_INTERIOR` - enable/disable interior kernel computation
- `QUDA_ENABLE_DSLASH_EXTERIOR` - enable/disable exterior kernel computation
- `QUDA_ENABLE_DSLASH_COPY` - enable/disable host staging copies for MPI if GDR/P2P is not enabled
By combining the explicit policy choice with the above variables, we can benchmark in isolation any computation or communication pattern.
For communication benchmarking, the dslash_test and staggered_dslash_test programs report the effective bi-directional bandwidth sustained by the algorithm (just grep the output for "bi"). As an example, consider results taken between two GPUs connected using PCIe peer-to-peer. With the full computation enabled we are unable to see the actual achieved bi-directional bandwidth, since it plateaus once all the communications are hidden by the local computation; but when only the communications are performed we see the expected behaviour, with the bi-directional bandwidth saturating at 19 GB/s. We can also see that the bandwidth does not saturate until a relatively large local volume: this motivates future work using a SHMEM-style programming model, where all peer-to-peer reads and writes are done directly to neighboring GPUs, which has significantly lower latency.
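For example, a comms-only run might look like the following sketch (the test arguments are illustrative, and the numerical result of such a run is intentionally wrong):

```shell
# Disable compute, keep packing and comms, to expose raw link bandwidth
export QUDA_ENABLE_DSLASH_PACK=1
export QUDA_ENABLE_DSLASH_COMMS=1
export QUDA_ENABLE_DSLASH_INTERIOR=0
export QUDA_ENABLE_DSLASH_EXTERIOR=0
export QUDA_ENABLE_DSLASH_COPY=1
mpirun -np 2 ./dslash_test --dim 24 24 24 24 --gridsize 1 1 1 2 --niter 10000 \
  | grep -i "bi"   # report the sustained bi-directional bandwidth
```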
Legacy Information
Asymmetric Topologies
On asymmetric systems where some GPUs are on one side of the QPI bus and the NIC is on the other, care must be taken, since the QPI bus cannot efficiently forward memory traffic between attached PCIe devices. For example, the following system has four GPUs and a NIC on one socket, and two GPUs and no NIC on the other socket. This system will only give efficient GDR support for the first four GPUs, with the other two needing to stage their inter-node memory traffic explicitly through CPU memory.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 mlx5_0 CPU Affinity
GPU0 X PIX PHB PHB SOC SOC PHB 0-9
GPU1 PIX X PHB PHB SOC SOC PHB 0-9
GPU2 PHB PHB X PIX SOC SOC PHB 0-9
GPU3 PHB PHB PIX X SOC SOC PHB 0-9
GPU4 SOC SOC SOC SOC X PHB SOC 10-19
GPU5 SOC SOC SOC SOC PHB X SOC 10-19
mlx5_0 PHB PHB PHB PHB SOC SOC X
To enable such a setup, the environment variable `QUDA_ENABLE_GDR_BLACKLIST` can be used to exclude a given set of GPUs from using GDR; these will instead fall back to explicit staging through CPU memory. Below is an example of how to do this for the above topology using OpenMPI.
mpirun
-np 48 # total number of processes
-npernode 6 # number of processes per node
--bind-to none # lets the user overrule binding using numactl
-hostfile ./hostfile # list of hosts we want to run on
--mca btl sm,self,openib # enable intra-node, loop back to self, and IB
--mca btl_openib_want_cuda_gdr 1 # enable GDR for MPI
--mca btl_openib_cuda_rdma_limit 1000000000 # set the largest message size for GDR
-x EXE="./dslash_test" # executable
-x ARGS="--gridsize 2 2 2 6 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
./run.sh
where `run.sh` is as given below:

```bash
#!/bin/bash

# QUDA-specific environment variables

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.

# enable GDR support
export QUDA_ENABLE_GDR=1

# exclude GPUs 4 and 5 from GDR since they are across the QPI bus
export QUDA_ENABLE_GDR_BLACKLIST="4,5"

export CUDA_DEVICE_MAX_CONNECTIONS=1

# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5)

# this is the list of NICs we should use for each GPU
NICS=(mlx5_0 mlx5_0 mlx5_0 mlx5_0 mlx5_0 mlx5_0)

# this is the list of CPU cores we should use for each GPU,
# e.g., 2x10-core CPUs split into 2 threads per process with correct NUMA assignment
CPUS=(1-2 3-4 5-6 7-8 10-11 15-16)

# number of physical CPU cores per GPU
export OMP_NUM_THREADS=2

# this is the order we want the GPUs to be assigned in (e.g., for NVLink connectivity)
REORDER=(0 1 2 3 4 5)

# given the REORDER array, set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]})

APP="$EXE $ARGS"

# bind this local rank to its NIC and CPU cores
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
export OMPI_MCA_btl_openib_if_include=${NIC_REORDER[$lrank]}
numactl --physcpubind=${CPU_REORDER[$lrank]} $APP
```
Open MPI <= 3 (without UCX)
The script below (for Open MPI) handles the per-rank GPU, NIC, and CPU-core assignment. To use this script with QUDA's dslash_test, running on 16 nodes of DGX-1, it would be launched with something like
```bash
# -np 128:                          total number of processes
# -npernode 8:                      number of processes per node
# --bind-to none:                   let the user control binding via numactl
# -hostfile ./hostfile:             list of hosts we want to run on
# --mca btl sm,self,openib:         enable intra-node, loop back to self, and IB
# --mca btl_openib_want_cuda_gdr:   enable GDR for MPI
# --mca btl_openib_cuda_rdma_limit: largest message size eligible for GDR
# -x EXE / -x ARGS:                 executable and its run-time options, exported to run.sh
# ./run.sh:                         the wrapper script given below
mpirun -np 128 -npernode 8 --bind-to none -hostfile ./hostfile \
    --mca btl sm,self,openib \
    --mca btl_openib_want_cuda_gdr 1 \
    --mca btl_openib_cuda_rdma_limit 1000000000 \
    -x EXE="./dslash_test" \
    -x ARGS="--gridsize 2 2 4 8 --dim 24 24 24 24 --prec double --niter 10000" \
    ./run.sh
```
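The process count must match the process grid: `--gridsize 2 2 4 8` decomposes the lattice over 2x2x4x8 = 128 ranks, i.e., 16 nodes of 8 GPUs each. A quick sanity check:

```python
from math import prod

gridsize = (2, 2, 4, 8)  # from the --gridsize argument above
ranks_per_node = 8       # -npernode 8
nodes = 16

# total MPI processes must equal the product of the grid dimensions
assert prod(gridsize) == nodes * ranks_per_node == 128
```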
In the `run.sh` script we set the order in which CUDA devices are mapped to the local MPI ranks (the `REORDER` variable). Given this order, we ensure that the closest NIC is assigned to each process, and furthermore we set the CPU cores available to each process to obtain the correct non-overlapping NUMA mapping.
```bash
#!/bin/bash

# QUDA-specific environment variables

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.

# enable GDR support
export QUDA_ENABLE_GDR=1

export CUDA_DEVICE_MAX_CONNECTIONS=1

# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5 6 7)

# this is the list of NICs we should use for each GPU,
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3
NICS=(mlx5_0 mlx5_0 mlx5_1 mlx5_1 mlx5_2 mlx5_2 mlx5_3 mlx5_3)

# this is the list of CPU cores we should use for each GPU,
# e.g., 2x20-core CPUs split into 4 threads per process with correct NUMA assignment
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)

# number of physical CPU cores per GPU
export OMP_NUM_THREADS=4

# this is the order we want the GPUs to be assigned in (e.g., for NVLink connectivity)
REORDER=(0 1 2 3 4 5 6 7)

# given the REORDER array, set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})

APP="$EXE $ARGS"

# bind this local rank to its NIC and CPU cores
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
export OMPI_MCA_btl_openib_if_include=${NIC_REORDER[$lrank]}
numactl --physcpubind=${CPU_REORDER[$lrank]} $APP
```
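The array shuffling in `run.sh` can also be expressed more compactly; the sketch below (illustrative, not part of QUDA) reproduces the per-rank GPU/NIC/CPU assignment that the bash arrays construct:

```python
# Per-GPU resource tables, matching the run.sh script above
GPUS = [0, 1, 2, 3, 4, 5, 6, 7]
NICS = ["mlx5_0", "mlx5_0", "mlx5_1", "mlx5_1",
        "mlx5_2", "mlx5_2", "mlx5_3", "mlx5_3"]
CPUS = ["1-4", "5-8", "10-13", "15-18", "21-24", "25-28", "30-33", "35-38"]
REORDER = [0, 1, 2, 3, 4, 5, 6, 7]  # identity here; permute for NVLink-aware placement

# value for CUDA_VISIBLE_DEVICES, and the (GPU, NIC, cores) triple per local rank
cuda_visible_devices = ",".join(str(GPUS[i]) for i in REORDER)
assignment = {rank: (GPUS[i], NICS[i], CPUS[i]) for rank, i in enumerate(REORDER)}

# e.g., local rank 2 binds GPU 2, NIC mlx5_1, and cores 10-13
```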