MILC with QUDA - lattice/quda GitHub Wiki

These instructions are intended to be a quick start guide to getting MILC running with GPUs using the QUDA library.

These instructions assume you are using the recommended branches of QUDA and MILC, develop in both cases. For instructions on using MILC with the current stable release branch (not recommended), release/1.0.x, see the notes at the bottom of this page.

For extra Perlmutter-specific information, check the QUDA on Perlmutter page.

Obtaining and compiling QUDA

You can obtain QUDA using the following:

git clone --branch develop https://github.com/lattice/quda.git

As noted above, the current recommended branch of QUDA is the develop branch, which is the git default.

QUDA uses cmake to set compilation options. For running with HISQ fermions, e.g., the su3_rhmc_hisq test that is commonly used in MILC, the minimal suggested configuration is

mkdir build
cd build
cmake ../quda -DCMAKE_BUILD_TYPE=RELEASE -DQUDA_GPU_ARCH=sm_80 -DQUDA_DIRAC_DEFAULT_OFF=ON -DQUDA_DIRAC_STAGGERED=ON \
              -DQUDA_QMP=ON -DQUDA_QIO=ON -DQUDA_DOWNLOAD_USQCD=ON

Above, we implicitly assume that the CUDA and MPI compilers are present in the $PATH. Here we are setting the GPU architecture to sm_80 which corresponds to NVIDIA Ampere. Choices include:

sm_35 for Kepler (Tesla K20 / K40 / K80)
sm_60 for Pascal (Tesla P100, Quadro GP100)
sm_70 for Volta (Tesla V100, Quadro GV100)
sm_80 for Ampere (NVIDIA A100)
sm_90 for Hopper (NVIDIA H100)
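If you are unsure which value to pass, the table above can be applied mechanically once you know the compute capability of your GPU. The following is a minimal sketch, not part of QUDA: the helper name cap_to_arch is made up for illustration, and the commented-out nvidia-smi query assumes a driver recent enough to support the compute_cap field.

```shell
# Convert a CUDA compute capability such as "8.0" into the "sm_80" form
# expected by -DQUDA_GPU_ARCH. The function name is illustrative only.
cap_to_arch() {
    case "$1" in
        [0-9]*.[0-9]*) printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')" ;;
        *) echo "unrecognized compute capability: $1" >&2; return 1 ;;
    esac
}

cap_to_arch 8.0   # prints sm_80

# On a machine with a GPU (assumes nvidia-smi supports the compute_cap field):
# cap_to_arch "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
```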

The flags -DQUDA_DIRAC_DEFAULT_OFF=ON and -DQUDA_DIRAC_STAGGERED=ON disable the parts of QUDA that are unnecessary when used with MILC, in order to reduce compilation time. The final three arguments concern the installation of the USQCD companion libraries QMP and QIO. QUDA can automate their download and installation, and that is what we have enabled here. You can optionally specify an install directory with -DCMAKE_INSTALL_PREFIX=[path], though for MILC bindings it's sufficient to just work from the build directory.

QUDA can take a long time to compile, so you should use a parallel build,

make -j N

where N is the number of cores / threads that the compilation node has. We typically recommend setting this to the number of hardware threads (e.g., hyperthreads) in the system. If you have set an install path when running cmake (-DCMAKE_INSTALL_PREFIX=[path]), then to complete the installation run

make install

Note that this step has been required for the development version of QUDA since Nov 2021. If you don't specify -DCMAKE_INSTALL_PREFIX=[path], it will install in <build_dir>/usqcd.
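The build and install steps can be combined into one short snippet. This is only a sketch: build_jobs is a hypothetical helper that asks the system for the number of hardware threads (via nproc, falling back to getconf), and the actual make invocation is left commented out because it must be run from your QUDA build directory.

```shell
# Pick a parallel job count: use nproc where available, otherwise fall back
# to getconf, and finally to 1.
build_jobs() {
    n=$(nproc 2>/dev/null) || n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    echo "$n"
}

echo "would build with $(build_jobs) parallel jobs"

# Run from the QUDA build directory:
# make -j "$(build_jobs)" && make install
```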

Finally, note that when building with OpenMPI 4.x and above, QMP needs to be version 2.5.3 or higher, because older versions use the deprecated MPI_Type_struct. This happens automatically when using -DQUDA_DOWNLOAD_USQCD=ON. Alternatively, you can use an older version of QMP if you compile MPI with the compatibility option --enable-mpi1-compatibility, though in practice this should be unnecessary.

Obtaining and compiling MILC

For use with QUDA we recommend the present develop branch of MILC. This enables the maximum benefit of QUDA acceleration.

git clone --branch develop https://github.com/milc-qcd/milc_qcd.git

To aid compilation of MILC with QUDA support, MILC provides a helper script for the su3_rhmd_hisq application, ks_imp_rhmc/compile_su3_rhmd_hisq_quda.sh. Editing this script as appropriate and executing it from its directory should result in a full build of MILC with QUDA acceleration for the desired application. For a standard build the important settings are CUDA_HOME, QUDA_HOME, QIOPAR and QMPPAR, where QUDA_HOME can point to the build or install directory for QUDA. It is trivial to modify this script to build different executables, e.g., by replacing the su3_rhmd_hisq executable name in the script with the desired one.

Note that we need to point MILC to the QMP and QIO that were installed as part of the QUDA installation. These will be located in the usqcd directory in the QUDA build directory.
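As a rough illustration of how these settings relate to each other, the sketch below derives them from a single QUDA build directory. The default paths ($HOME/quda-build and /usr/local/cuda) are hypothetical placeholders, and whether the helper script reads these values from the environment or expects them to be edited in place should be checked against the script itself.

```shell
# Hypothetical locations; adjust to your system.
QUDA_BUILD="${QUDA_BUILD:-$HOME/quda-build}"   # directory where cmake and make were run
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"      # CUDA toolkit root

# QUDA_HOME may point at the build (or install) directory.
QUDA_HOME="$QUDA_BUILD"

# QMP and QIO were installed by QUDA under <build_dir>/usqcd.
QMPPAR="$QUDA_BUILD/usqcd"
QIOPAR="$QUDA_BUILD/usqcd"

# Print the values so they can be copied into the helper script.
printf '%s=%s\n' CUDA_HOME "$CUDA_HOME" QUDA_HOME "$QUDA_HOME" \
                 QMPPAR "$QMPPAR" QIOPAR "$QIOPAR"
```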

After modifying ks_imp_rhmc/compile_su3_rhmd_hisq_quda.sh appropriately, the MILC RHMC driver can be built via

cd ks_imp_rhmc
cp ../Makefile .
./compile_su3_rhmd_hisq_quda.sh

The build of MILC should now be complete.

Running MILC with QUDA

Typically, running MILC with QUDA is exactly like running MILC without QUDA. There is a one-to-one mapping between the number of GPUs and the number of MPI processes in the system. If you have followed the above instructions to build QUDA, then QUDA will have been built as a shared library. You do not need to update LD_LIBRARY_PATH to point to the shared library when using the current develop version of QUDA and MILC due to the use of rpath.

Typically, the CUDA Multi-Process Service (MPS) should not be enabled as this will only decrease performance. A possible exception is a system with many CPU cores on which MPI performance is superior to OpenMP performance. Otherwise just set OMP_NUM_THREADS (or equivalent) to the number of cores available per process (per GPU).
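The threads-per-process arithmetic can be sketched as follows. The helper threads_per_rank and the node shape used in the example (64 cores, 4 GPUs) are assumptions for illustration; substitute the values for your own system.

```shell
# Cores available to each MPI rank, given one rank per GPU.
# Arguments: total CPU cores on the node, number of GPUs on the node.
threads_per_rank() {
    cores=$1
    gpus=$2
    if [ "$gpus" -le 0 ] || [ "$cores" -lt "$gpus" ]; then
        echo 1          # fall back to a single thread
        return 0
    fi
    echo $((cores / gpus))
}

# Example: a hypothetical node with 64 cores and 4 GPUs.
OMP_NUM_THREADS=$(threads_per_rank 64 4)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"   # prints OMP_NUM_THREADS=16
```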

Set a location for QUDA to write out its autotuning cache: e.g.,

export QUDA_RESOURCE_PATH=/tmp

On the first run, QUDA will dump the kernel launch parameters here for use in later runs. To get optimum performance, you should therefore do a tuning run first and a benchmarking run afterwards. This path should be set to a location that is accessible from all nodes running the executable.
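A small wrapper can make sure the cache directory exists before it is exported. This is a sketch only: setup_quda_tune_dir is a hypothetical helper, and the example path is a placeholder for a directory on a filesystem shared by all nodes in the job.

```shell
# Create the autotuning cache directory if needed and export its location.
setup_quda_tune_dir() {
    dir=$1
    mkdir -p "$dir" || return 1
    QUDA_RESOURCE_PATH=$dir
    export QUDA_RESOURCE_PATH
}

# Example with a hypothetical shared location:
# setup_quda_tune_dir "$HOME/quda_tunecache"
#
# Then launch the same job twice; the first run tunes, the second benchmarks:
# mpirun -np 8 ./su3_rhmc_hisq ...(other options)
# mpirun -np 8 ./su3_rhmc_hisq ...(other options)
```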

Performance Tuning

For guidelines on how to improve strong scaling performance (fixed problem size as the number of GPUs is increased), you can refer to the quick-start and multi-gpu pages.

By default MILC will attempt to split the problem between processes in order to minimize the surface-to-volume ratio of the local problem size. This is in general a good thing to do; however, MILC favours partitioning the fastest running X dimension rather than the slowest running T dimension first. This is bad for running on modern architectures since it leads to strided memory accesses when doing the X-face halo update. The process grid topology can be set manually, making it easy to override this, using the command-line option

-qmp-geom MX MY MZ MT

to specify a partitioning of the X axis into MX equal segments, the Y axis into MY segments, etc. So, for example, with a lattice size 32x32x32x64 and 8 MPI ranks the command

mpirun -np 8 ./su3_rhmc_hisq -qmp-geom 1 1 2 4 ...(other options)

would result in local volumes of 32x32x16x16 on a 1x1x2x4 grid of virtual processors. Without this additional flag, the process topology would default to local problem size 16x16x32x32 (partitioning in T first since it has length 64, then split from the X dimension upwards) which leads to strided memory accesses.
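The local volume for a candidate process grid is just the per-dimension quotient, so it is easy to check before submitting a job. The helpers below (local_volume and grid_ranks) are hypothetical names for this sketch and assume positive integer arguments; grid_ranks gives the number of MPI ranks the grid requires, which should match the value passed to -np.

```shell
# Print the local volume for a lattice NX NY NZ NT split over a grid MX MY MZ MT.
local_volume() {
    if [ "$#" -ne 8 ]; then
        echo "usage: local_volume NX NY NZ NT MX MY MZ MT" >&2
        return 1
    fi
    nx=$1 ny=$2 nz=$3 nt=$4
    mx=$5 my=$6 mz=$7 mt=$8
    if [ $((nx % mx)) -ne 0 ] || [ $((ny % my)) -ne 0 ] ||
       [ $((nz % mz)) -ne 0 ] || [ $((nt % mt)) -ne 0 ]; then
        echo "lattice is not evenly divisible by the process grid" >&2
        return 1
    fi
    echo "$((nx / mx))x$((ny / my))x$((nz / mz))x$((nt / mt))"
}

# Number of MPI ranks implied by a process grid MX MY MZ MT.
grid_ranks() {
    echo $(( $1 * $2 * $3 * $4 ))
}

local_volume 32 32 32 64 1 1 2 4   # prints 32x32x16x16
local_volume 32 32 32 64 2 2 1 2   # prints 16x16x32x32 (the default split described above)
grid_ranks 1 1 2 4                 # prints 8
```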

With QMP-2.5.1 and above, users can control the logical topology, which helps improve the inter/intra-node layout. In addition to the regular QMP argument (-qmp-geom x y z t), one can also pass two new arguments, -qmp-logic-map and -qmp-alloc-map, which control the mapping from process coordinates to ranks. Consider the following two examples:

mpirun -np 8 ./su3_rhmc_hisq -qmp-geom 1 1 2 4 -qmp-logic-map 0 1 2 3 -qmp-alloc-map 0 1 2 3 ...(other options)
mpirun -np 8 ./su3_rhmc_hisq -qmp-geom 1 1 2 4 -qmp-logic-map 3 2 1 0 -qmp-alloc-map 3 2 1 0 ...(other options)

If we assume that the MPI launcher packs adjacent ranks onto the same node, the first invocation would result in the T process coordinate being equal to rank/2 and the Z coordinate being equal to rank%2. Conversely, the second invocation would give T = rank%4 and Z = rank/4. In general the user should ensure that the two map arguments have identical parameters.
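The mapping in these two examples can be reproduced with a few lines of shell arithmetic. The sketch below assumes, based only on the two examples above, that the map lists the dimension indices (0 = X, 1 = Y, 2 = Z, 3 = T) from fastest-varying to slowest-varying; it is not taken from the QMP source, and rank_to_coords is a hypothetical helper name.

```shell
# Usage: rank_to_coords RANK MX MY MZ MT D0 D1 D2 D3
# MX..MT is the process grid; D0..D3 is the map, fastest-varying dimension first.
# Prints the process coordinates as "x y z t".
rank_to_coords() {
    r=$1
    gx=$2 gy=$3 gz=$4 gt=$5
    shift 5
    cx=0 cy=0 cz=0 ct=0
    for d in "$@"; do
        case $d in
            0) cx=$((r % gx)); r=$((r / gx)) ;;
            1) cy=$((r % gy)); r=$((r / gy)) ;;
            2) cz=$((r % gz)); r=$((r / gz)) ;;
            3) ct=$((r % gt)); r=$((r / gt)) ;;
            *) echo "bad dimension index: $d" >&2; return 1 ;;
        esac
    done
    echo "$cx $cy $cz $ct"
}

# Rank 5 on the 1x1x2x4 grid used above:
rank_to_coords 5 1 1 2 4 0 1 2 3   # prints 0 0 1 2  (Z = 5%2, T = 5/2)
rank_to_coords 5 1 1 2 4 3 2 1 0   # prints 0 0 1 1  (T = 5%4, Z = 5/4)
```

With the first map, ranks 0 and 1 differ only in Z; with the second, ranks 0 through 3 differ only in T, which is what determines which neighbours share a node when adjacent ranks are packed together.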

See also the NERSC-MILC page for further details about launching large-scale jobs.

Compiling MILC with the QUDA 1.0.x release branch

While it is not recommended at this time, if you need to compile MILC with a 1.0.x release branch, you need to use a specific (sufficiently old) MILC commit, and a custom QUDA cmake command. In short, the requirements are:

QMP: git clone --branch qmp2-5-2 https://github.com/usqcd-software/qmp.git
QIO: git clone --branch qio2-5-0 --recurse-submodules https://github.com/usqcd-software/qio.git
QUDA: git clone --branch release/1.0.x https://github.com/lattice/quda.git
MILC: git clone https://github.com/milc-qcd/milc_qcd.git && cd milc_qcd && git checkout b0bb4c52a567c722d6d70292bb7ff60da44627b4 && cd ..

Note that, if you are taking advantage of the QUDA_DOWNLOAD_USQCD flag, you do not need to manually download QMP and QIO. The cmake configuration of QUDA requires some changes relative to the flags given for the develop branch above:

mkdir build
cd build
cmake ../quda -DCMAKE_BUILD_TYPE=RELEASE -DQUDA_GPU_ARCH=sm_70 -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_DIRAC_CLOVER=OFF -DQUDA_DIRAC_DOMAIN_WALL=OFF -DQUDA_DIRAC_TWISTED_CLOVER=OFF \
  -DQUDA_DIRAC_TWISTED_MASS=OFF -DQUDA_DIRAC_WILSON=OFF \
  -DQUDA_BUILD_SHAREDLIB=ON \
  -DQUDA_QMP=ON -DQUDA_QIO=ON -DQUDA_DOWNLOAD_USQCD=ON -DQUDA_DOWNLOAD_QIO_LEGACY=ON

The main changes are:

Manually disabling DIRAC types beyond staggered/HISQ fermions.
Overriding the version of QIO downloaded to use qio2-5-0 as described above.

The instructions for building MILC are unchanged; just be mindful of the specific git commit id above. Finally, when running MILC with this version of QUDA, you will need to include the path to the QUDA shared library in your LD_LIBRARY_PATH.
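One way to extend LD_LIBRARY_PATH without creating duplicate or empty entries is sketched below. The helper prepend_ld_path is hypothetical, and the example directory is a placeholder: point it at whichever directory actually contains libquda.so in your build or install tree.

```shell
# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
prepend_ld_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Example with a hypothetical location of libquda.so:
# prepend_ld_path "$HOME/quda-build/lib"
```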

Note that the comment above on compiling OpenMPI 4.x with --enable-mpi1-compatibility still applies, since this setup uses QMP 2.5.2. Alternatively, one can trivially edit the QMP source code to change the single occurrence of MPI_Type_struct to MPI_Type_create_struct in usqcd/src/QMP/lib/mpi/QMP_mem_mpi.c.