OpenMPI with UCX - lattice/quda GitHub Wiki

Some working notes for the latest version of OpenMPI built on top of UCX.

Grab the latest version of UCX from here.

./configure --prefix=${UCX_HOME} --disable-cma --with-cuda=${CUDA_HOME} --with-gdrcopy=${GDRCOPY_HOME}
make -j
(sudo) make install

Grab OpenMPI 4.0.0

./configure --prefix=${OMPI_HOME} --enable-mpirun-prefix-by-default --with-cuda=${CUDA_HOME} --with-ucx=${UCX_HOME} --with-ucx-libdir=${UCX_HOME}/lib --enable-mpi-fortran=yes --disable-oshmem --with-pmix=internal
make -j
(sudo) make install
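
If any of the ${...} variables in the configure lines above is unset, the shell silently expands it to an empty string (for example --prefix=), so it may be worth checking them before configuring. A minimal sketch; the helper name missing_vars is hypothetical and not part of UCX or OpenMPI:

```shell
# Print the names (space-separated) of the given variables that are unset
# or empty. Prints an empty line when all of them are set.
# Note: plain POSIX sh, so the helper's variables are global.
missing_vars() {
    missing=""
    for name in "$@"; do
        # Indirect lookup; names are expected to be plain identifiers.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="${missing:+$missing }$name"
        fi
    done
    printf '%s\n' "$missing"
}

# Check the four install prefixes referenced by the configure lines.
unset_names=$(missing_vars UCX_HOME CUDA_HOME GDRCOPY_HOME OMPI_HOME)
if [ -n "$unset_names" ]; then
    echo "set these before running configure: $unset_names" >&2
fi
```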

Options for mpirun

-x UCX_TLS=rc,sm,cuda_copy,cuda_ipc,gdr_copy
-x UCX_MEMTYPE_CACHE=n # disable pointer cache

At present, using UCX with QUDA causes a conflict: QUDA opens IPC memory handles on its communication buffers, and UCX then fails when it tries to create its own handles on the same buffers. The workaround for now is to disable CUDA IPC communication for OpenMPI, either through UCX by leaving cuda_ipc out of UCX_TLS, or through QUDA (e.g., QUDA_ENABLE_P2P=7).
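
As a sketch of the UCX-side workaround, the transport list can be filtered before it is passed to mpirun. The helper name drop_transport is hypothetical; only the transport names come from the UCX_TLS line above:

```shell
# Remove one transport from a comma-separated UCX_TLS list.
# Note: plain POSIX sh, so the helper's variables are global.
drop_transport() {
    list=$1
    drop=$2
    out=""
    old_ifs=$IFS
    IFS=,
    for tl in $list; do
        if [ "$tl" != "$drop" ]; then
            out="${out:+$out,}$tl"
        fi
    done
    IFS=$old_ifs
    printf '%s\n' "$out"
}

# Build the mpirun option with cuda_ipc left out.
echo "-x UCX_TLS=$(drop_transport rc,sm,cuda_copy,cuda_ipc,gdr_copy cuda_ipc)"
# prints: -x UCX_TLS=rc,sm,cuda_copy,gdr_copy
```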

Compiling and running QUDA's regression tests to enable UCX testing

git clone https://github.com/lattice/quda.git
cd quda
mkdir ../build
cd ../build
QUDA_DIRAC_DEFAULT_OFF=1 cmake ../quda -DQUDA_DIRAC_WILSON=ON -DQUDA_MPI=ON -DQUDA_PRECISION=4 -DQUDA_RECONSTRUCT=4 -DQUDA_FAST_COMPILE_REDUCE=1 -DQUDA_INTERFACE_MILC=OFF
make -j # wait a few minutes to build

If the OpenMPI install is not visible in PATH, prefix the above cmake invocation with CXX={PATH_TO_MPICC}/mpicxx CC={PATH_TO_MPICC}/mpicc to ensure the correct MPI library is picked up.

The main test for CUDA-aware MPI with QUDA is dslash_ctest. It can be run with something like:

mpirun -np 3 tests/dslash_ctest --dim 4 6 8 10 --gridsize 1 1 1 3

This runs on 3 GPUs, assigning a local volume of 4x6x8x10 per GPU on a process grid of size 1x1x1x3. The product of the process grid dimensions must match the number of processes passed to mpirun. The test cycles through many different communication patterns and should take a minute or two to run. All tests should report PASS; any FAIL indicates a problem.
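
The relation between -np and --gridsize can be checked before launching. A minimal sketch; grid_product and check_np are hypothetical helper names, not part of QUDA or mpirun:

```shell
# Multiply the arguments together (e.g. the four --gridsize values).
grid_product() {
    product=1
    for n in "$@"; do
        product=$((product * n))
    done
    echo "$product"
}

# Return non-zero if the process count (first argument) does not equal
# the product of the grid dimensions (remaining arguments).
check_np() {
    np=$1
    shift
    needed=$(grid_product "$@")
    if [ "$needed" -ne "$np" ]; then
        echo "gridsize $* needs $needed processes, not $np" >&2
        return 1
    fi
}

# The example above: 3 processes on a 1x1x1x3 grid.
check_np 3 1 1 1 3 && echo "ok: -np 3 matches --gridsize 1 1 1 3"
```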

What works and what doesn't

Naive running - works

Running naively, without opting in to CUDA-aware MPI and letting QUDA handle all intra-node communication, just works:

mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10

Enabling CUDA-aware MPI - does not work

Enabling QUDA to use CUDA-aware MPI in its tuning policies (QUDA_ENABLE_GDR=1) while also using OpenMPI/UCX IPC communication fails. This is because UCX tries to register the memory handles that QUDA already holds.

QUDA_ENABLE_GDR=1 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10

The error message, shown below, is self-explanatory. The issue is tracked here.

cuda_ipc_cache.c:154  UCX Fatal: dest:13105: failed to open ipc mem handle. addr:0x7f5e173a5a00 len:45056 (Element already exists)

Enabling CUDA-aware MPI only for inter-node - mostly works

We can prevent QUDA from using CUDA-aware MPI for intra-node communication with the environment variable QUDA_ENABLE_P2P=7. Combined with QUDA_ENABLE_GDR=1, this allows RDMA between nodes while QUDA uses direct CUDA IPC within the node. This works:

QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=7 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10

But if we don't explicitly specify the UCX_TLS parameters then we get an error message in MPI_Finalize:

reloc.c:327  UCX  FATAL could not find address of original cudaHostUnregister(): Unknown error

Enable MPI CUDA-awareness, disable QUDA's direct IPC - Immediate crash and burn on first communication

Here we rely on MPI to handle all intra-node communication, switching off CUDA IPC for QUDA. This dies on first communication:

QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=0 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10

where the error given is as follows

[nvsocal2:12651:0:12651] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f813b3a3c40)

Enable MPI CUDA-awareness, disable QUDA's direct IPC, disable pointer cache

Same as above, just disabling the pointer cache. This works fine.

QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=0 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy -x UCX_MEMTYPE_CACHE=n tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10

That is, if using UCX for CUDA-aware communication, disable the pointer cache.

Enable MPI CUDA-awareness, disable QUDA's direct IPC, disable UCX cuda_copy

Leaving the pointer cache enabled but dropping cuda_copy from UCX_TLS also avoids the crash, which suggests that the pointer-cache issue above is related to the cuda_copy transport.

QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=0 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10

There is an initial error message on first communication, but after that, things seem to complete without issue

select.c:406  UCX  ERROR no copy across memory types transport to <no debug data>: Unsupported operation
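
For repeated testing, the six configurations above can be collected into one dry-run helper that prints each launch line. This is only a sketch: the configuration labels and the function name launch_line are hypothetical, while the settings and outcomes in the comments are copied from the notes above.

```shell
# Print the launch line for one configuration from these notes.
# Note: plain POSIX sh, so the helper's variables are global.
launch_line() {
    tls="sm,cuda_ipc,cuda_copy"
    quda_env=""
    extra=""
    case $1 in
        naive)                  # works
            ;;
        gdr)                    # fails: UCX cannot open the IPC mem handle
            quda_env="QUDA_ENABLE_GDR=1" ;;
        gdr_internode)          # works when UCX_TLS is given explicitly
            quda_env="QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=7" ;;
        gdr_mpi_ipc)            # segfault on first communication
            quda_env="QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=0" ;;
        gdr_mpi_ipc_nocache)    # works
            quda_env="QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=0"
            extra=" -x UCX_MEMTYPE_CACHE=n" ;;
        gdr_mpi_ipc_nocopy)     # one initial UCX error, then completes
            quda_env="QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=0"
            tls="sm,cuda_ipc" ;;
        *)
            echo "unknown configuration: $1" >&2
            return 1 ;;
    esac
    echo "${quda_env:+$quda_env }mpirun -np 3 -x UCX_TLS=$tls$extra tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10"
}

# Dry run: print every launch line without executing anything.
for cfg in naive gdr gdr_internode gdr_mpi_ipc gdr_mpi_ipc_nocache gdr_mpi_ipc_nocopy; do
    launch_line "$cfg"
done
```

Each printed line can be pasted into a shell from the build directory, or run directly with eval "$(launch_line naive)".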