# OpenMPI with UCX
Some working notes for the latest version of OpenMPI built on top of UCX.
Grab the latest version of UCX from the [UCX repository](https://github.com/openucx/ucx).
```
./configure --prefix=${UCX_HOME} --disable-cma --with-cuda=${CUDA_HOME} --with-gdrcopy=${GDRCOPY_HOME}
make -j
(sudo) make install
```
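After installing, it is worth confirming that UCX actually picked up CUDA and gdrcopy support. A quick sketch using the `ucx_info` tool that ships with UCX (paths assume the `${UCX_HOME}` prefix used above):

```shell
# Print the version banner, including the configure line the build used
${UCX_HOME}/bin/ucx_info -v

# List available transports/devices; cuda_copy, cuda_ipc, and gdr_copy
# should appear if CUDA and gdrcopy were detected at configure time
${UCX_HOME}/bin/ucx_info -d | grep -i -E "cuda|gdr"
```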
Grab OpenMPI 4.0.0 and build it against the UCX install:

```
./configure --prefix=${OMPI_HOME} --enable-mpirun-prefix-by-default --with-cuda=${CUDA_HOME} --with-ucx=${UCX_HOME} --with-ucx-libdir=${UCX_HOME}/lib --enable-mpi-fortran=yes --disable-oshmem --with-pmix=internal
make -j
(sudo) make install
```
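As a sanity check that the resulting Open MPI build is CUDA-aware, `ompi_info` can be queried (this is the standard check from the Open MPI documentation; assumes `${OMPI_HOME}/bin` is on `PATH`):

```shell
# A CUDA-aware build should print:
#   mca:mpi:base:param:mpi_built_with_cuda_support:value:true
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```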
## Options for mpirun
```
-x UCX_TLS=rc,sm,cuda_copy,cuda_ipc,gdr_copy
-x UCX_MEMTYPE_CACHE=n   # disable pointer cache
```
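Putting both options together, a full invocation might look something like this (the application name `my_quda_app` is a placeholder):

```shell
mpirun -np 2 \
  -x UCX_TLS=rc,sm,cuda_copy,cuda_ipc,gdr_copy \
  -x UCX_MEMTYPE_CACHE=n \
  ./my_quda_app
```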
At present, using UCX with QUDA has a conflict: QUDA opens memory handles on its buffers for IPC, which then fails because UCX tries to create its own handles on the same buffers. The workaround for now is to disable CUDA IPC communication for OpenMPI, either on the UCX side, by not including the `cuda_ipc` option in `UCX_TLS`, or on the QUDA side (e.g., `QUDA_ENABLE_P2P=7`).
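Concretely, the two workarounds look something like this (sketches only; `my_quda_app` is a placeholder binary):

```shell
# (a) UCX side: omit cuda_ipc from the transport list, so UCX never
#     opens its own IPC handles on QUDA's buffers
mpirun -np 4 -x UCX_TLS=rc,sm,cuda_copy,gdr_copy ./my_quda_app

# (b) QUDA side: QUDA_ENABLE_P2P=7 keeps QUDA's own intra-node IPC and
#     stops QUDA routing intra-node traffic through CUDA-aware MPI
QUDA_ENABLE_P2P=7 mpirun -np 4 -x UCX_TLS=rc,sm,cuda_copy,cuda_ipc,gdr_copy ./my_quda_app
```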
## Compiling and running QUDA's regression tests to enable UCX testing
```
git clone https://github.com/lattice/quda.git
cd quda
mkdir ../build
cd ../build
QUDA_DIRAC_DEFAULT_OFF=1 cmake ../quda -DQUDA_DIRAC_WILSON=ON -DQUDA_MPI=ON -DQUDA_PRECISION=4 -DQUDA_RECONSTRUCT=4 -DQUDA_FAST_COMPILE_REDUCE=1 -DQUDA_INTERFACE_MILC=OFF
make -j # takes a few minutes to build
```
If OpenMPI is not installed and visible in `PATH`, then prefix the above `cmake` invocation with `CXX={PATH_TO_MPICC}/mpicxx CC={PATH_TO_MPICC}/mpicc` to ensure the correct MPI library is picked up.
The main test to use for testing CUDA-aware MPI with QUDA is `dslash_ctest`. This can be run with something like

```
mpirun -np 3 tests/dslash_ctest --dim 4 6 8 10 --gridsize 1 1 1 3
```
This will run on 3 GPUs, assigning a local volume of 4x6x8x10 per GPU, on a process grid of size 1x1x1x3. The product of the process-grid dimensions must match the number of processes passed to `mpirun`. It will cycle through many different communication patterns and should take a minute or two to run. All tests should `PASS`; if anything shows `FAIL`, that's bad.
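One easy mistake is a `--gridsize` whose product doesn't match `-np`. A tiny sketch of the consistency check (a hypothetical helper, not part of QUDA):

```shell
# Hypothetical pre-flight check: the product of the process-grid
# dimensions must equal the number of MPI ranks passed to mpirun -np.
NP=3
GRID="1 1 1 3"

PRODUCT=1
for g in $GRID; do
  PRODUCT=$((PRODUCT * g))
done

if [ "$PRODUCT" -eq "$NP" ]; then
  echo "grid OK: $PRODUCT ranks"
else
  echo "grid mismatch: $PRODUCT != $NP"
  exit 1
fi
```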
## What works and what doesn't
### Naive running - works
Just running naively, without opting in to CUDA-aware MPI and instead letting QUDA handle all intra-node communication, just works:
```
mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10
```
### Enabling CUDA-aware MPI - does not work
Enabling QUDA to use CUDA-aware MPI in its tuning policies (`QUDA_ENABLE_GDR=1`) and using OpenMPI/UCX IPC communication fails. This is because UCX tries to register the memory handles that QUDA already has.
```
QUDA_ENABLE_GDR=1 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10
```
The error message is self-explanatory; the issue is tracked here:

```
cuda_ipc_cache.c:154 UCX Fatal: dest:13105: failed to open ipc mem handle. addr:0x7f5e173a5a00 len:45056 (Element already exists)
```
### Enabling CUDA-aware MPI only for inter-node - mostly works
We can prevent QUDA from using CUDA-aware MPI for intra-node communication with the environment variable `QUDA_ENABLE_P2P=7`. When used with `QUDA_ENABLE_GDR=1`, this allows the use of RDMA between nodes but direct CUDA IPC within the node. This works:

```
QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=7 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10
```
But if we don't explicitly specify the `UCX_TLS` parameters then we get an error message in `MPI_Finalize`:

```
reloc.c:327 UCX FATAL could not find address of original cudaHostUnregister(): Unknown error
```
### Enable MPI CUDA-awareness, disable QUDA's direct IPC - immediate crash and burn on first communication
Here we rely on MPI to handle all intra-node communication, switching off CUDA IPC for QUDA. This dies on the first communication:

```
QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=0 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10
```

where the error given is as follows:

```
[nvsocal2:12651:0:12651] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f813b3a3c40)
```
### Enable MPI CUDA-awareness, disable QUDA's direct IPC, disable pointer cache - works
Same as above, just disabling the pointer cache. This works fine.

```
QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=0 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc,cuda_copy -x UCX_MEMTYPE_CACHE=n tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10
```

In other words: if using UCX for CUDA-aware communication, then disable the pointer cache.
### Enable MPI CUDA-awareness, disable QUDA's direct IPC, disable UCX cuda_copy
This configuration suggests that the pointer-cache issue above is related to the `cuda_copy` communication protocol.

```
QUDA_ENABLE_GDR=1 QUDA_ENABLE_P2P=0 mpirun -np 3 -x UCX_TLS=sm,cuda_ipc tests/dslash_ctest --gridsize 1 1 1 3 --dim 4 6 8 10
```
There is an initial error message on first communication, but after that, things seem to complete without issue:

```
select.c:406 UCX ERROR no copy across memory types transport to <no debug data>: Unsupported operation
```