
Installing MVAPICH2-GDR

The easiest way to install MVAPICH2-GDR is to download the rpm and unpack it into your home directory.

For example, on x86 with CUDA 9.2 and OFED 4.3, this is the appropriate rpm (download links are available on the MVAPICH website):

wget http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3a/mofed4.3/mvapich2-gdr-mcast.cuda9.2.mofed4.3.gnu4.8.5-2.3a-2.el7.x86_64.rpm

You can then unpack the rpm into your home directory. Run the command from ${HOME}, since cpio extracts relative to the current directory:

rpm2cpio mvapich2-gdr-mcast.cuda9.2.mofed4.3.gnu4.8.5-2.3a-2.el7.x86_64.rpm | cpio -id

With this approach, the mpicc, mpicxx, mpic++, mpif90 and mpifort wrapper scripts must be edited by hand to point to the correct locations, because they assume that MVAPICH2 has been installed in /opt/... and not in ${HOME}/opt/.... Run the following from the bin directory of the unpacked tree. Using : as the sed delimiter avoids having to escape every /, and the single quotes mean that ${HOME} is written into the scripts as literal text:

sed -i -e 's:/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.2/mofed4.3/mpirun/gnu4.8.5:${HOME}/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.2/mofed4.3/mpirun/gnu4.8.5:g' mpicc mpicxx mpic++ mpif90 mpifort

Moreover, these wrappers expect CUDA to be installed in /usr/local/cuda-9.2, which may not be the case on all platforms, e.g., where modules are used. It is therefore best to change this to an environment variable that is easily overridden:

sed -i -e 's:/usr/local/cuda-9.2:${CUDA_HOME}:g' mpicc mpicxx mpic++ mpif90 mpifort
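If you want to convince yourself of what the two sed commands do before touching the real wrappers, the substitution can be tried on a scratch file first. This is a minimal sketch assuming GNU sed; the file contents and the `cudadir` variable name are hypothetical stand-ins, not the actual contents of the MVAPICH2 wrappers.

```shell
# Hypothetical stand-in for a wrapper script; the real mpicc is much longer
# and "cudadir" is an illustrative name only.
tmpdir=$(mktemp -d)
printf 'cudadir=/usr/local/cuda-9.2\n' > "${tmpdir}/mpicc"

# ":" is the sed delimiter, so the slashes in the path need no escaping.
# Single quotes stop the invoking shell from expanding ${CUDA_HOME}, so the
# literal text ${CUDA_HOME} is written into the file.
sed -i -e 's:/usr/local/cuda-9.2:${CUDA_HOME}:g' "${tmpdir}/mpicc"

result=$(cat "${tmpdir}/mpicc")
echo "${result}"   # prints: cudadir=${CUDA_HOME}
rm -rf "${tmpdir}"
```

The same reasoning applies to the ${HOME} substitution in the earlier command.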

To use the installation, add its bin and lib64 directories to PATH and LD_LIBRARY_PATH respectively, after which we can compile as normal.

export MPI_HOME=${HOME}/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.2/mofed4.3/mpirun/gnu4.8.5
export PATH=${MPI_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${MPI_HOME}/lib64:$LD_LIBRARY_PATH
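After setting these variables it can be worth checking that the shell really resolves mpicc to the unpacked copy and not to some other MPI on the system. The helper below is a rough sketch of such a check; `check_mpi_env` is a hypothetical name and is not part of MVAPICH2-GDR.

```shell
# check_mpi_env is a hypothetical helper, not part of MVAPICH2-GDR.
check_mpi_env() {
    # Fail if MPI_HOME is unset or has no executable mpicc wrapper.
    if [ -z "${MPI_HOME:-}" ] || [ ! -x "${MPI_HOME}/bin/mpicc" ]; then
        echo "mpicc not found under MPI_HOME=${MPI_HOME:-<unset>}" >&2
        return 1
    fi
    # Fail if a different mpicc would be picked up first from PATH.
    if [ "$(command -v mpicc)" != "${MPI_HOME}/bin/mpicc" ]; then
        echo "PATH resolves mpicc to $(command -v mpicc), not to MPI_HOME" >&2
        return 1
    fi
    echo "OK: using ${MPI_HOME}/bin/mpicc"
}
```

Calling `check_mpi_env` after the three exports should report the wrapper under ${MPI_HOME}/bin; a non-zero return suggests that PATH or MPI_HOME needs another look.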

Finally, on systems that do not automatically include the CUDA library path in the user's LD_LIBRARY_PATH (e.g., systems that use modules), a shared library loading error has been observed when using mpirun across multiple nodes:

hydra_pmi_proxy: error while loading shared libraries: libcudart.so.9.2: cannot open shared object file: No such file or directory

To fix this, add ${CUDA_HOME}/lib64 to LD_LIBRARY_PATH in .bashrc, so that every remote login has it set explicitly.
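One way to write the .bashrc entry is sketched below. It prepends the CUDA library directory only if it is not already present, so that repeated sourcing does not keep growing LD_LIBRARY_PATH. The helper name `prepend_ld_path` is hypothetical, and the fallback to /usr/local/cuda-9.2 is only an assumption for the case where CUDA_HOME is not set when .bashrc runs.

```shell
# prepend_ld_path is a hypothetical helper for use in ~/.bashrc.
prepend_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" ;;
    esac
}

# Assumption: fall back to /usr/local/cuda-9.2 if CUDA_HOME is unset.
prepend_ld_path "${CUDA_HOME:-/usr/local/cuda-9.2}/lib64"
```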

GDR Copy

While MVAPICH2 supports the GDR Copy library for extremely low latency with small messages, it is not actually needed for running. Moreover, since it only applies to very small messages, it is likely unnecessary for most LQCD runs (though it may be beneficial for multigrid). If you don't have it installed, you can disable it by setting the environment variable MV2_USE_GPUDIRECT_GDRCOPY to 0, e.g., with export MV2_USE_GPUDIRECT_GDRCOPY=0.
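If the same job script is used on machines with and without GDR Copy, the variable could be set conditionally. The sketch below assumes that a loaded gdrcopy kernel driver shows up as the device node /dev/gdrdrv; that path is an assumption to verify on your system, and `disable_gdrcopy_if_missing` is a hypothetical helper name.

```shell
# disable_gdrcopy_if_missing is a hypothetical helper.
# Assumption: the gdrcopy driver exposes /dev/gdrdrv when it is loaded.
disable_gdrcopy_if_missing() {
    dev="${1:-/dev/gdrdrv}"
    if [ ! -e "${dev}" ]; then
        # No driver device found: tell MVAPICH2-GDR not to use GDR Copy.
        export MV2_USE_GPUDIRECT_GDRCOPY=0
    fi
}
```

Call `disable_gdrcopy_if_missing` before launching the job; when the device node exists the variable is left untouched.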