Running the NERSC MILC Benchmarks - lattice/quda GitHub Wiki

Obtaining the Benchmarks

For details on compiling both QUDA and MILC see MILC with QUDA. The NERSC MILC benchmarks can be obtained here, for example via

> wget https://portal.nersc.gov/project/m888/apex/MILC_160413.tgz
[...]
> tar xvzf MILC_160413.tgz
[...]

This tarball includes only an outdated version of the MILC source code and run scripts. As described in MILC with QUDA, one should use the latest develop branch of MILC. Untarring the NERSC MILC benchmark tar produces a directory MILC-apex.

Next, the input lattice files can be downloaded from here. The relevant input lattice (.chklat file) for the desired benchmark should be placed directly in the MILC-apex/benchmarks/lattices directory.

Benchmark         Volume            File                      Link
Small             18x18x18x36       18x18x18x36.chklat        here
Medium            36x36x36x72       36x36x36x72.chklat        here
Large             72x72x72x144      72x72x72x144.chklat       here
Grand Challenge   144x144x144x288   144x144x144x288.chklat    here

For reference: the Medium problem size can fit on one 32GB GPU. By extension, the Large problem size can fit on 16 such GPUs, and the Grand Challenge can fit on 256 GPUs. The Small problem is too small to effectively saturate a modern GPU and should only be used for diagnostics/debugging.
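These GPU counts follow directly from the lattice volumes, since each benchmark doubles all four dimensions of the previous one (a 16x increase in sites and hence memory footprint). A quick shell check:

```shell
# Each benchmark doubles all four dimensions, so the site count grows by
# 16x per step. Scaling relative to Medium (fits on one 32 GB GPU):
medium=$(( 36 * 36 * 36 * 72 ))
large=$((  72 * 72 * 72 * 144 ))
grand=$((  144 * 144 * 144 * 288 ))

echo "Large needs ~$(( large / medium )) Medium-sized GPUs"   # 16
echo "Grand needs ~$(( grand / medium )) Medium-sized GPUs"   # 256
```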

Background

For general running tips when using QUDA, we refer to the QUDA start guide and the multi-GPU guide. Note that on the first call to any given GPU kernel, QUDA will autotune the launch parameters for optimal performance and save them to disk for subsequent reuse. The default name for this cache file is tunecache.tsv, and it is saved in the directory specified by the environment variable QUDA_RESOURCE_PATH, as noted in the QUDA start guide.

Autotuning adds overhead that should not be counted in benchmarking. We therefore recommend an initial short tuning run, then rerunning with the original settings for the benchmarking run. This can be accomplished by, for example, modifying the line steps_per_trajectory 30 in MILC-apex/benchmarks/milc_in.sh to steps_per_trajectory 2 for the tuning run only. Please note that there are two steps_per_trajectory lines; the first one, which sets the value to 20, is not relevant for the benchmark and can be ignored.
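This edit can be scripted with sed. The snippet below operates on a stand-in file rather than the real milc_in.sh, for illustration; the pattern includes the value 30, so the first steps_per_trajectory line (which sets 20) is left untouched:

```shell
# Stand-in copy of the two relevant milc_in.sh lines (for illustration)
cat > milc_in_demo.sh <<'EOF'
steps_per_trajectory 20
steps_per_trajectory 30
EOF

# Switch the benchmarked trajectory length for a short tuning run.
# The pattern matches the value 30 only, so the first line is untouched.
sed -i.bak 's/steps_per_trajectory 30/steps_per_trajectory 2/' milc_in_demo.sh
grep steps_per_trajectory milc_in_demo.sh
# prints:
#   steps_per_trajectory 20
#   steps_per_trajectory 2

# After the tuning run, restore the original for the benchmarking run:
mv milc_in_demo.sh.bak milc_in_demo.sh
```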

The expected performance of the MILC benchmarks depends greatly on the node and network topology of the system being run. Optimal performance will be obtained on systems with first-class GPU Direct RDMA performance and, to a lesser degree, on systems with fully-connected peer-to-peer node topologies. For example, Piz Daint could complete the large problem benchmark in around 6300 seconds on 96 nodes (an older measurement). There the limiting factor is the lack of GPU Direct RDMA performance in the 100 KiB-2 MiB message size regime, which is critical for scalable performance. This compares to DGX-1 Pascal, which could complete the benchmark in around the same time on half the total number of GPUs, owing to improved GPU Direct RDMA performance.

Running the Benchmarks

The benchmarks are located in the MILC-apex/benchmarks directory. As noted above, the tarball includes the unmodified runscripts and an old version of the MILC source code. Since we will use a separate MILC installation, we need to update the soft links MILC-apex/benchmarks/*/su3_rhmd_hisq to point to the su3_rhmd_hisq binary in the QUDA-enabled build of MILC (the milc_qcd/ks_imp_rhmc directory).
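For example, assuming the QUDA-enabled MILC build lives at the placeholder path $HOME/milc_qcd/ks_imp_rhmc (substitute your own installation), the links can be updated in one loop:

```shell
# Placeholder: substitute the location of your QUDA-enabled MILC build
MILC_BUILD=$HOME/milc_qcd/ks_imp_rhmc

# Re-point every existing su3_rhmd_hisq soft link at the new binary
for link in MILC-apex/benchmarks/*/su3_rhmd_hisq; do
    ln -sf "$MILC_BUILD/su3_rhmd_hisq" "$link"
done
```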

Assuming the necessary input lattices have been fetched per the instructions above, we are now in a position to run the benchmarks. For quick testing, only the minimal changes listed below are needed. For running at scale (strong scaling, minimizing communication overheads in general), please refer to the subsequent instructions on performance tuning.

Quick Testing

As an example, to run the small benchmark using MPI on a single node with 3 GPUs and a 6-core host processor, one need only change the following lines at the top of run_small.sh:

  total_nodes=1              # one node
  cores_per_node=6           # six core processor
  numa_per_node=1            # single socket so one numa (also applies to dual-socket with optimized bindings, see below)
  threads_per_rank=2         # two threads per rank
  run_type=mpirun            # launch job using mpirun     

In total the number of MPI ranks will be $total_nodes * $cores_per_node / $threads_per_rank. Depending on your MPI distribution you may need to modify the mpirun command on line 18 of milc_in.sh. For example, the script uses -env to pass environment variables to mpirun, whereas OpenMPI expects -x. The I_MPI_PIN_DOMAIN socket setting in the script is specific to Intel MPI (an MPICH derivative), so it should be removed when running with OpenMPI; otherwise it will cause a runtime error. Comments on core/socket bindings follow in a subsequent section.
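With the quick-test values above, the rank count works out to three, one per GPU:

```shell
# Rank count for the quick-test settings above
total_nodes=1
cores_per_node=6
threads_per_rank=2

N=$(( total_nodes * cores_per_node / threads_per_rank ))
echo "MPI ranks: $N"   # 3, one per GPU in this example
```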

Last, before running, you should set the below environment variables. You can either set them on your shell (modifying export as appropriate if you aren't using bash), or modify run_[small/medium/etc].sh before the run_milc function at the bottom of the script:

export LD_LIBRARY_PATH=${PATH_TO_QUDA_INSTALL}/lib # set path to QUDA lib, only necessary for v1.0.x tags
export QUDA_RESOURCE_PATH=.                        # set path to write out tunecache.tsv, can be anything
export QUDA_MILC_HISQ_RECONSTRUCT=13               # set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9         # set QUDA-MILC solver optimization
export OMP_NUM_THREADS=2                           # here two threads per process,
                                                   # only relevant for the small parts of the benchmark offloaded to the CPU

More information on these environment variables is available on the QUDA Environment Variables page. The small benchmark can now be run via:

./run_small.sh

For reference, as of the time of writing, this benchmark should take less than 300 seconds to complete on a 3x Quadro V100 system.

Tuning notes on single GPU NERSC Medium runs

The NERSC Medium benchmark is "small" enough to run on a single NVIDIA GPU with >= 32 GB of onboard memory (Quadro V100, Tesla V100-32GB, NVIDIA A100-40GB and -80GB). However, due to the large local volume, autotuning can take from 30 minutes to upwards of an hour, during which no output is printed. It may therefore appear that the benchmark is hanging. You can verify that it is still tuning by running nvidia-smi on the same node as a proxy for GPU activity. In any case, please allow at least one hour of wall time for tuning (either interactively, or in your batch allocation) before reaching out about the benchmark appearing to hang.

The NERSC Medium benchmark can be run on single GPUs with less than 32 GB of onboard memory by disabling the device pool manager and enabling managed memory support. This is only recommended if absolutely necessary. Do not run this way "for fun". This can be accomplished by setting the following two environment variables:

export QUDA_ENABLE_DEVICE_MEMORY_POOL=0
export QUDA_ENABLE_MANAGED_MEMORY=1

More information on these environment variables is available on the QUDA Environment Variables page.

Multi-GPU systems

See the discussion here for instructions on how to use OpenMPI/UCX and/or MVAPICH MPI.

On systems with multiple GPUs and/or multiple NICs, it is important to bind each process to the correct CPU cores, and to ensure that a given GPU uses the correct NIC, to minimize on-node memory traffic and maximize inter-node bandwidth. Peer-to-peer communication and GPU Direct RDMA performance are critical for the scalability of MILC. To achieve correct NUMA placement, we have found numactl to be the most robust method on systems with multiple GPUs and CPUs.

The easiest way to accomplish this is with a wrapper script, called by mpirun, that correctly sets bindings (such as process affinity and NIC assignment) and launches the MILC RHMC executable. A sample reference script, and a description of its derivation, is given here. We will refer to such a script as run.sh.
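A minimal sketch of such a wrapper is shown below. The rank-to-GPU and rank-to-NUMA mapping is an assumption (a dual-socket node with four GPUs per socket), and the local-rank variable shown is the OpenMPI one; both must be adapted to your system, guided by nvidia-smi topo -m and the reference script linked above.

```shell
#!/bin/bash
# Minimal wrapper sketch ("run.sh"): bind each local MPI rank to one GPU
# and one NUMA domain before launching the MILC executable.
# Assumptions: dual-socket node, 4 GPUs per socket, OpenMPI local-rank
# variable (MVAPICH2 uses MV2_COMM_WORLD_LOCAL_RANK instead).
lrank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

export CUDA_VISIBLE_DEVICES=$lrank      # one GPU per rank
numa=$(( lrank / 4 ))                   # ranks 0-3 -> socket 0, 4-7 -> socket 1

echo "local rank $lrank -> GPU $lrank, NUMA node $numa"
# On a real system, replace the echo above with:
# exec numactl --cpunodebind=$numa --membind=$numa "$EXE" $ARGS
```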

First, for simplicity, we replace the milc_in.sh script, which generates the input file "on the fly" and calls mpirun/srun/etc, with a static input file which can be passed to the MILC RHMC executable directly. The input file, with diff shown between the contents of milc_in.sh and the modifications for the "medium" problem size, is given below.

    prompt 0
-   nx $nx
-   ny $ny
-   nz $nz
-   nt $nt
+   nx 36
+   ny 36
+   nz 36
+   nt 72
    iseed 4563421
    n_pseudo 5
    load_rhmc_params rationals_m001m05m5.test1
    beta 6.3
    n_dyn_masses 3
    dyn_mass .001 .05 .5
    dyn_flavors 2 1 1
    u0  0.890

-   warms $warms
+   warms 0
    trajecs 0
    traj_between_meas 40
    microcanonical_time_step .05
    steps_per_trajectory 20
    cgresid_md_fa_gr 1 1 1
    max_multicg_md_fa_gr  5  5  5
    cgprec_md_fa_gr  2 2 2
    cgresid_md_fa_gr 1 1 1
    max_multicg_md_fa_gr  5  5  5
    cgprec_md_fa_gr  2 2 2
    cgresid_md_fa_gr 1 1 1
    max_multicg_md_fa_gr  5  5  5
    cgprec_md_fa_gr  2 2 2
    cgresid_md_fa_gr 1 1 1
    max_multicg_md_fa_gr  5  5  5
    cgprec_md_fa_gr  2 2 2
    cgresid_md_fa_gr 1 1 1
    max_multicg_md_fa_gr  5  5  5
    cgprec_md_fa_gr  2 2 2
    prec_ff 2
    number_of_pbp_masses 0
    fresh
-   $save_cmd
+   forget

    warms 0
-   trajecs $trajecs
+   trajecs 2
    traj_between_meas 2
    microcanonical_time_step .0333
    steps_per_trajectory 30
    cgresid_md_fa_gr 1e-5 1e-5 1e-5
    max_multicg_md_fa_gr  2500  2500  2500
    cgprec_md_fa_gr  2 2 2
    cgresid_md_fa_gr 1e-7 1e-7 1e-7
    max_multicg_md_fa_gr  2500  2500  2500
    cgprec_md_fa_gr  2 2 2
    cgresid_md_fa_gr 1e-7 1e-7 1e-7
    max_multicg_md_fa_gr  2500  2500  2500
    cgprec_md_fa_gr  2 2 2
    cgresid_md_fa_gr 1e-7 1e-7 1e-7
    max_multicg_md_fa_gr  2500  2500  2500
    cgprec_md_fa_gr  2 2 2
    cgresid_md_fa_gr 1e-7 1e-7 1e-7
    max_multicg_md_fa_gr  2500  2500  2500
    cgprec_md_fa_gr  2 2 2
    prec_ff 2
    number_of_pbp_masses 3
    max_cg_prop 500
    max_cg_prop_restarts 1
    npbp_reps 1
    prec_pbp 2
    mass 0.01
    naik_term_epsilon 0
    error_for_propagator 1e-8
    rel_error_for_propagator 0
    mass 0.05
    naik_term_epsilon 0
    error_for_propagator 1e-8
    rel_error_for_propagator 0
    mass 0.5
    naik_term_epsilon 0
    error_for_propagator 1e-8
    rel_error_for_propagator 0
-   $reload_cmd
+   reload_parallel 36x36x36x72.chklat
    forget

The values for $nx, $ny, $nz, $nt, $warms, $save_cmd, $trajecs, and $reload_cmd are all based on the values set in run_medium.sh. We note that everything except the problem size and reload command is fixed across the different problem sizes. The values for the different problem sizes are as follows:

Benchmark         $nx, $ny, $nz   $nt   $reload_cmd
Small             18              36    reload_parallel 18x18x18x36.chklat
Medium            36              72    reload_parallel 36x36x36x72.chklat
Large             72              144   reload_parallel 72x72x72x144.chklat
Grand Challenge   144             288   reload_parallel 144x144x144x288.chklat

With this abstraction in place, the next step is determining a problem decomposition. As an example, we consider running the medium benchmark using 8 GPUs. With a global problem size of 36x36x36x72, we decompose the problem onto 8 processes. In general, we seek a local volume that minimizes surface area (e.g., symmetric decompositions), while also minimizing the splitting of the fastest-running X dimension. For example, 1x1x2x4 is superior to 2x2x1x2: even though they give equal local volumes, the former does not split the X dimension, so there are no strided memory access patterns when updating the halos. With a topology of 1x1x2x4, each GPU gets a local problem of size 36x36x18x18.
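A quick check makes the comparison concrete: the two decompositions give equal local volumes, but 1x1x2x4 leaves X (and Y) intact and, in this case, also exchanges fewer halo sites:

```shell
# Medium benchmark, global volume 36x36x36x72, on 8 processes.
# Each split dimension contributes two halo faces of (volume / extent) sites.

# 1x1x2x4 -> local 36x36x18x18 (X, Y intact; halo faces in Z and T)
v1=$(( 36 * 36 * 18 * 18 ))
s1=$(( 2*v1/18 + 2*v1/18 ))

# 2x2x1x2 -> local 18x18x36x36 (X split; halo faces in X, Y and T)
v2=$(( 18 * 18 * 36 * 36 ))
s2=$(( 2*v2/18 + 2*v2/18 + 2*v2/36 ))

echo "local volumes: $v1 vs $v2"   # equal: 419904
echo "halo sites:    $s1 vs $s2"   # 93312 vs 116640
```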

When MILC is built with QMP for communications (recommended), the node topology is specified by the command line flag -qmp-geom 1 1 2 4 for the example above. In addition, it can help to change the decomposition order, i.e., making T the fastest-varying dimension instead of X. This can be specified by the command line flags -qmp-logic-map 3 2 1 0 -qmp-alloc-map 3 2 1 0, where "3 2 1 0" corresponds to reversing the order of "x y z t".
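The map is simply a permutation of the axis indices (0=x, 1=y, 2=z, 3=t), so "3 2 1 0" lists the axes in reverse order:

```shell
# -qmp-logic-map "3 2 1 0" permutes the axis indices 0..3
axes=(x y z t)
for i in 3 2 1 0; do printf '%s ' "${axes[$i]}"; done; echo   # t z y x
```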

When building MILC without QMP (not recommended), the node topology instead needs to be set explicitly in the input file, by adding the line node_geometry 1 1 2 4 below nt (the fifth line) for the example above.

Another optimization point to be aware of is the node topology: QUDA (and therefore MILC) takes advantage of peer-to-peer connections between GPUs in a node. On systems that are not fully connected, e.g., with GPUs on either side of QPI, there can be a large performance impact compared to having the GPUs directly connected (via NVLink or PCIe). At startup QUDA will list which process pairs have peer-to-peer connectivity, allowing visual confirmation of the node topology. This is described on the Multi-GPU Support page, along with environment flags for ideal OpenMPI performance, including enabling GPU Direct RDMA.

The following is an example script that was used to launch the medium benchmark. It assumes that the custom input file described above is saved as in.medium in the same directory. The example here is for an 8 GPU job as described above, and is modified from the example on the Multi-GPU Support page:

#!/bin/bash
export UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc                           # select transports
export UCX_MEMTYPE_CACHE=n                                                 # see https://github.com/openucx/ucx/wiki/NVIDIA-GPU-Support
export UCX_RNDV_SCHEME=get_zcopy                                           # improves GPUDirectRDMA performance
export UCX_RNDV_THRESH=131304                                              # your mileage may vary
export CUDA_DEVICE_MAX_CONNECTIONS=1

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.

# enable GDR support
export QUDA_ENABLE_GDR=1

# set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT=13       # set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9 # set QUDA-MILC solver optimization

# set executable and command line arguments
export EXE="./su3_rhmd_hisq"
export ARGS="in.medium -qmp-geom 1 1 2 4 -qmp-alloc-map 3 2 1 0 -qmp-logic-map 3 2 1 0"

# call mpirun
# most should be self explanatory; `--bind-to none` is there
# because `numactl` in the wrapper script handles the necessary bindings.
mpirun -np 8 -npernode 8 --bind-to none ./run.sh

LEGACY: Single GPU / single CPU systems

For running on systems with one GPU and one CPU per node, e.g., Piz Daint, we can use the run scripts as provided, with minor modifications. We set the process geometry with an additional command line parameter to the binary, parametrized by new variables px, py, pz, and pt, which are set in the appropriate run_[small/medium/large].sh scripts.

The listings below show the modified milc_in.sh (in the MILC-apex/benchmarks directory) and run_large.sh (in MILC-apex/benchmarks/large) scripts required for running the large problem MILC benchmark on 96 nodes of Piz Daint using QUDA. The required environment variables for running with QUDA are set in run_large.sh.

Listing 1: Modified milc_in.sh script that parametrizes the process topology. Additions are marked in green.

#!/bin/bash
#
# The following need to be defined before calling run_milc
# nx, ny, nz, nt - problem dimensions
# warms - # of warmups to perform
# trajecs - # of trajectories to run
# checklat - lattice checkpoint file name
# save_cmd - set to "forget" or "save_serial $checklat"
# reload_cmd - set to "reload_parallel $checklat" or "continue"
#
# choose which command to execute by setting run_type before calling
function run_milc() {
    if [ "$run_type" == "srun" ]; then
-     command="srun -n $N --ntasks-per-socket=$S ./su3_rhmd_hisq"
+     command="srun -n $N --ntasks-per-socket=$S ./su3_rhmd_hisq -qmp-geom $px $py $pz $pt"
    elif [ "$run_type" == "aprun" ]; then
-     command="aprun -n $N -S $S -d $threads_per_rank -cc numa_node ./su3_rhmd_hisq"
+     command="aprun -n $N -S $S -d $threads_per_rank -cc numa_node ./su3_rhmd_hisq -qmp-geom $px $py $pz $pt"
    else
-     command="mpirun -n $N -env I_MPI_PIN_DOMAIN socket -env OMP_NUM_THREADS=$threads_per_rank ./su3_rhmd_hisq"
+     command="mpirun -n $N -env I_MPI_PIN_DOMAIN socket -env OMP_NUM_THREADS=$threads_per_rank ./su3_rhmd_hisq -qmp-geom $px $py $pz $pt"
    fi
    echo "$0: Running \"${command}\" "
    if [ "$debug" == "true" ]; then
      command="cat ";
      echo "Debugging script with cat"
    fi
    [...]

Listing 2: Modified run_large.sh script, suitable for 96 nodes of Piz Daint.

#!/bin/bash -l
#SBATCH --partition=regular
#SBATCH --nodes=96
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=12
#SBATCH --time=04:00:00
#SBATCH --job-name=milc-large
#SBATCH --partition=normal
#SBATCH --constraint=gpu

+export QUDA_ENABLE_DEVICE_MEMORY_POOL=1 # to fit large problem on 32 nodes set this to 0
+export QUDA_MILC_HISQ_RECONSTRUCT=13 # set QUDA-MILC solver optimization
+export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9 # set QUDA-MILC solver optimization
+export QUDA_RESOURCE_PATH=. # set path for storing tunecache file
+export QUDA_ENABLE_GDR=1 # enable GPU Direct RDMA (no benefit on Cray)
+export CUDA_DEVICE_MAX_CONNECTIONS=1

+export MPICH_RDMA_ENABLED_CUDA=1 # enable GPU Direct RDMA (no benefit on Cray)
+export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
+module load daint-gpu

  # load the run_milc() function
  . milc_in.sh

  #----- START USER DEFINED PARAMETERS -----#
  #build_lattice=true        # Set only if you want to create a lattice file
                            # Comment out otherwise
                            # You really only want to do this once

  total_nodes=96             # match above allocation request!
  cores_per_node=12
  numa_per_node=1
  threads_per_rank=12
  run_type=srun            # comment out to debug script

  # Parse command line parameters
  debug="false"
  while [ "$#" -ge 1 ] ; do
    case "$1" in
      "--build" )           build_lattice=true; shift 1;;
      "--debug" )           debug="true"; shift 1;;
      "--cores" | "-c" )    cores_per_node=$2; shift 2;;
      "--threads" | "-t" )  threads_per_rank=$2; shift 2;;
      *)                    break;;
    esac
  done

  #----- END USER DEFINED PARAMETERS -----#

  # Set problem size
  nx=72
  ny=72
  nz=72
  nt=144

+ # set process topology: custom addition
+ px=2
+ py=2
+ pz=3
+ pt=8

 [...]