Multi GPU with NVSHMEM - lattice/quda GitHub Wiki

Multi-GPU support with NVSHMEM

On systems with InfiniBand networks and CUDA 11 (or later), QUDA can use NVSHMEM for communication to reduce communication overheads and improve scaling.

Note that this works on top of QMP/MPI and does not replace these.

No changes are required in applications that use QUDA.

What is NVSHMEM

NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.
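As a minimal, illustrative sketch of what GPU-initiated communication looks like with the NVSHMEM API (this is generic NVSHMEM usage, not QUDA internals; the kernel name and the simple one-PE-per-device mapping are assumptions for the example):

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Device-initiated one-sided put: each PE writes its ID into the
// symmetric buffer `dest` on its right neighbor, from inside a kernel.
__global__ void put_to_neighbor(int *dest) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dest, mype, peer);
}

int main() {
    nvshmem_init();
    // Simplistic device selection for illustration only.
    cudaSetDevice(nvshmem_my_pe());

    // Symmetric allocation: the same buffer exists on every PE.
    int *dest = (int *) nvshmem_malloc(sizeof(int));

    put_to_neighbor<<<1, 1>>>(dest);
    // Order the put and synchronize all PEs on the stream.
    nvshmemx_barrier_all_on_stream(0);
    cudaDeviceSynchronize();

    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}
```

QUDA uses this kind of device-initiated communication inside its Dslash kernels, which is what removes the CPU from the critical communication path.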

To learn more about it check out https://developer.nvidia.com/nvshmem.

QUDA and NVSHMEM

For QUDA, device-side communication can significantly reduce overheads from CPU and GPU synchronization and improve the overlap of compute and communication. This reduces latencies and improves strong scaling.

More details can be found in: https://developer.nvidia.com/gtc/2020/video/s21673-vid

Building with NVSHMEM

To build QUDA with NVSHMEM, in addition to MPI/QMP you also need NVSHMEM (version 2 or later) installed. This can either be:

  1. Already installed as a module or system-wide by your system administrator. In this case, just be sure to set the environment variable NVSHMEM_HOME to the install directory.
  2. Installed by yourself (see below).
  3. Built by QUDA itself (experimental, limited support).

Building NVSHMEM yourself

You can get NVSHMEM from NVSHMEM Download page.

Detailed instructions on how to build and install NVSHMEM are available in the installation guide.

We recommend following the instructions there and using the default build settings.

  • Make sure to leave NVSHMEM_MPI_SUPPORT=1 (default) enabled in the build process.
  • Check with your system administrator if GDRCOPY is available on your system and its installation location to set GDRCOPY_HOME.

After you have built NVSHMEM set the environment variable NVSHMEM_HOME to the installation location.

The build command for building with OpenMPI, CUDA, and GDRCOPY available in the respective directories in /usr/local looks like:

CUDA_HOME=/usr/local/cuda GDRCOPY_HOME=/usr/local/gdrcopy MPI_HOME=/usr/local/openmpi NVSHMEM_MPI_SUPPORT=1 NVSHMEM_PREFIX=/usr/local/nvshmem make -j4 install
export NVSHMEM_HOME=/usr/local/nvshmem
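A quick sanity check that the install produced the headers and libraries QUDA will look for (paths are illustrative):

```shell
# Expect nvshmem.h and the libnvshmem* libraries in the install tree.
ls "${NVSHMEM_HOME}/include/nvshmem.h" "${NVSHMEM_HOME}/lib/"libnvshmem*
```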

For Summit-specific information, check the Summit section below.

Building QUDA with NVSHMEM

To build QUDA with NVSHMEM we assume that you have already installed NVSHMEM yourself or that it has been installed by your system administrator. To enable NVSHMEM during the QUDA build, configure with

cmake -DQUDA_NVSHMEM=ON -DQUDA_MPI=ON [...]

or if you use QMP

cmake -DQUDA_NVSHMEM=ON -DQUDA_QMP=ON [...]

CMake will try to pre-populate QUDA_NVSHMEM_HOME with the value of the environment variable NVSHMEM_HOME. If you have not set NVSHMEM_HOME, or this fails for whatever reason, you can also pass

-DQUDA_NVSHMEM_HOME=/path/to/nvshmem_install
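Putting the pieces together, a full configure invocation might look like the following sketch (all paths and the GPU architecture are placeholders for your system):

```shell
# Illustrative configure line; adjust paths and architecture to your system.
export NVSHMEM_HOME=/path/to/nvshmem_install
cmake -S quda -B build \
  -DQUDA_NVSHMEM=ON \
  -DQUDA_NVSHMEM_HOME=${NVSHMEM_HOME} \
  -DQUDA_MPI=ON \
  -DQUDA_GPU_ARCH=sm_70
```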

Running with NVSHMEM

NVSHMEM communication is enabled by default and the QUDA autotuner will use it as it sees fit. No further action is required. It is, however, recommended to follow the best practices for RDMA performance described at Maximizing GDR performance to make sure you use proper binding of CPUs and HCAs.

NVSHMEM is usually smart enough to use the correct HCA without explicit binding. Note that environment variables for HCA binding from MPI/UCX are not respected by NVSHMEM; instead, use NVSHMEM_ENABLE_NIC_PE_MAPPING and NVSHMEM_HCA_PE_MAPPING. See NVSHMEM environment variables for details.
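For illustration, an explicit mapping might look like the following sketch (the HCA names are placeholders, and the `hca:port:count` mapping syntax should be verified against the NVSHMEM environment-variable documentation for your version):

```shell
# Enable explicit NIC-to-PE mapping and spread 2 PEs over each of two HCAs.
# HCA names (mlx5_0, mlx5_1) are placeholders -- check `ibstat` on your system.
export NVSHMEM_ENABLE_NIC_PE_MAPPING=1
export NVSHMEM_HCA_PE_MAPPING="mlx5_0:1:2,mlx5_1:1:2"
```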

For troubleshooting NVSHMEM issues we also recommend the FAQ.

NVSHMEM Dslash policies

TODO

At runtime, you can opt out of using NVSHMEM by setting the environment variable QUDA_ENABLE_NVSHMEM=0 (the default value is equivalent to 1). This disables all NVSHMEM Dslash policies, relying purely on CUDA IPC and MPI message exchange as supported by the system being used.
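For example, a run with NVSHMEM disabled might look like this (the launcher and binary name are placeholders for your application):

```shell
# Fall back to CUDA IPC / MPI message exchange for this run only.
QUDA_ENABLE_NVSHMEM=0 mpirun -np 4 ./my_quda_app
```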

NVSHMEM on Summit

We recommend building and installing your own version of NVSHMEM from source when running on Summit, using a standard MPI build. The challenge on Summit is finding the locations of GDRCopy, MPI, etc., to pass to the NVSHMEM make command as described above. These instructions have been tested with the following modules:

$ module list

Currently Loaded Modules:
  1) lsf-tools/2.0   3) darshan-runtime/3.3.0-lite   5) cuda/11.0.3   7) spectrum-mpi/10.4.0.3-20210112   9) cmake/3.20.2                11) nsight-compute/2021.2.1
  2) hsi/5.0.2.p5    4) DefApps                      6) gcc/9.3.0     8) git/2.31.1                      10) nsight-systems/2021.3.1.54  12) gdrcopy/2.2

The modules cuda/11.0.3, gcc/9.3.0, spectrum-mpi/10.4.0.3-20210112, and gdrcopy/2.2 are the most relevant, though there is likely freedom to choose other versions of GCC. Other versions of CUDA are not officially supported on Summit, so we do not consider them here.

With these options, the environment variables passed to make for NVSHMEM are (as of May 25, 2022):

  • CUDA_HOME=/sw/summit/cuda/11.0.3 (parsed from which nvcc)
  • GDRCOPY_HOME=/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/gdrcopy-2.2-xk2w6ftqfas57fuzgcxcc7p5pebgthth (parsed from echo $LD_LIBRARY_PATH)
  • MPI_HOME=/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi (parsed from which mpicc)

NVSHMEM can then be built and installed locally by using the make command above, for example:

export NVSHMEM_HOME=/my/path/to/nvshmem
CUDA_HOME=[..] GDRCOPY_HOME=[...] MPI_HOME=[...] NVSHMEM_MPI_SUPPORT=1 NVSHMEM_PREFIX=$NVSHMEM_HOME make -j4 install

Be sure to prepend the lib directory of the install to your LD_LIBRARY_PATH environment variable, either in your default environment and/or in your LSF submit script via

export LD_LIBRARY_PATH="${NVSHMEM_HOME}/lib:${LD_LIBRARY_PATH}"
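For instance, a minimal LSF/jsrun sketch that picks up the library path could look like the following (the binary name and resource shape are placeholders; adjust to your job):

```shell
#!/bin/bash
# Illustrative Summit submit-script fragment; paths and binary are placeholders.
export NVSHMEM_HOME=/my/path/to/nvshmem
export LD_LIBRARY_PATH="${NVSHMEM_HOME}/lib:${LD_LIBRARY_PATH}"
# One resource set per GPU: 6 per node on Summit (1 task, 1 GPU, 7 cores each).
jsrun -n 6 -a 1 -g 1 -c 7 --smpiargs="-gpu" ./my_quda_app
```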