Twisted clover deflated multigrid

For this case study we are using a 64x64x64x128 configuration as provided by the TMC collaboration.

| Parameter | Value |
| --- | --- |
| Volume | 64x64x64x128 |
| Gauge action | Iwasaki |
| beta | 1.778 |
| Fermion action | Twisted Clover |
| kappa | 0.139427 |
| mu | 0.000720 |
| CSW | 1.69 |

We will start with a plain Twisted Clover Multigrid solve and use this as a yardstick against which all subsequent improvements shall be compared. Said improvements are:

  1. Using the communication-avoiding (CA) variants of the CG/GCR solvers on the coarse levels
  2. Deflating the coarsest level using a Hermitian eigensolver

Each of these steps requires some judicious tuning, which we will demonstrate explicitly with example commands. Please note that throughout this exercise we will be working with the QUDA develop branch and the following performance-critical CMake options:

QUDA_MULTIBLAS_N=4
QUDA_DYNAMIC_CLOVER=ON
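
For reference, a minimal configure invocation with these options might look as follows (a sketch; the source and build paths are placeholders):

# Sketch: configuring QUDA with the performance-critical options above
cmake -S quda -B build \
      -DQUDA_MULTIBLAS_N=4 \
      -DQUDA_DYNAMIC_CLOVER=ON
cmake --build build -j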

The computation will be performed on 16, 32, and 64 nodes of SUMMIT. We will utilise only 4 of the 6 available GPUs per node, for a total of 64, 128, and 256 NVIDIA V100 GPUs respectively.
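
The jsrun resource sets scale accordingly; a sketch of the three launch lines is given below (only the 16-node invocation appears verbatim in this page, and the --gridsize flag must be adjusted to match the rank count):

# Sketch: one resource set per node, each with 4 GPUs and 4 MPI ranks
jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test ...   # 16 nodes,  64 GPUs
jsrun -n32 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test ...   # 32 nodes, 128 GPUs
jsrun -n64 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test ...   # 64 nodes, 256 GPUs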

BASELINE: Twisted Clover Multigrid

To begin, we will use the QUDA Twisted Clover Multigrid inverter with previously tuned parameters. This represents the 'best case' scenario in which the user has scaled the value of mu on the coarse levels as best they can. Standard (non-communication-avoiding) CG and GCR will be used for the setup and the coarse level solves respectively. The following incantation was used:

command="jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test \
       --prec double --prec-sloppy single --prec-null half --prec-precondition half \
       --dim 64 32 16 16 --gridsize 1 2 4 8 --load-gauge /gpfs/alpine/scratch/howarth/lgt104/conf_latest.0500 \
       --kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 \
       --recon 12 --recon-sloppy 8 --recon-precondition 8 \
       --dslash-type twisted-clover --compute-clover true --tol 1e-10 \
       --mg-levels 3 --mg-block-size 1 4 4 4 8 --mg-block-size 0 2 2 2 2 \
       --mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg \
       --mg-mu-factor 2 70.0 \
       --nsrc 16 \
       --niter 250 \
       --verbosity verbose \
       --mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4 \
       --mg-coarse-solver 1 gcr --mg-coarse-solver-tol 1 0.35 --mg-coarse-solver-maxiter 1 10 \
       --mg-coarse-solver 2 gcr --mg-coarse-solver-tol 2 0.01 --mg-coarse-solver-maxiter 2 20 \
       --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose "

which we shall break down in full.

jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test \

SUMMIT-specific launch flags and the executable.

       --prec double --prec-sloppy single --prec-null half --prec-precondition half \

These are the precisions used by the various stages of the solver. The final result (prec) is computed in double precision. The sloppy precision is used for global sums and computations in which only a good estimate is sufficient. The null and precondition precisions are used by the multigrid solver, for the null-space vectors and the preconditioner respectively. This already represents a significant part of QUDA's total acceleration strategy, as is documented here.

       --dim 64 32 16 16 --gridsize 1 2 4 8 --load-gauge /gpfs/alpine/scratch/howarth/lgt104/conf_latest.0500 \
       --kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 \
       --dslash-type twisted-clover --compute-clover true --tol 1e-10 \

These are simply the problem parameters and the GPU partitioning. Notice that the --clover-coeff parameter is the product of the CSW and kappa parameters: 1.69 × 0.1394265 = 0.235630785.
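
A quick sanity check of that product (a one-liner; bc is assumed to be available):

# Verify --clover-coeff = CSW * kappa
echo "1.69 * 0.1394265" | bc -l    # -> .235630785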

       --recon 12 --recon-sloppy 8 --recon-precondition 8 \

The recon parameters govern how the SU(3) gauge field elements are treated. 18 instructs QUDA to load all 18 real numbers of each SU(3) matrix into the registers. 12 and 8 use reconstruction strategies that communicate fewer real numbers and rebuild the full 18 reals of SU(3) on the fly; for example, the 12-reconstruct stores only the first two rows of each matrix and recovers the third as the complex conjugate of the cross product of the first two.

       --mg-levels 3 --mg-block-size 1 4 4 4 8 --mg-block-size 0 2 2 2 2 \
       --mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg \
       --mg-mu-factor 2 70.0 \
       --mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 0 --mg-nu-post 1 4 \
       --mg-coarse-solver 1 gcr --mg-coarse-solver-tol 1 0.35 --mg-coarse-solver-maxiter 1 10 \
       --mg-coarse-solver 2 gcr --mg-coarse-solver-tol 2 0.01 --mg-coarse-solver-maxiter 2 20 \
       --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose 

These are the multigrid-specific parameters. Notice that we are using standard CG in the setup phase, and that we increase the value of mu on the coarsest level by a factor of 70. On the coarse levels we use standard GCR and, as can be seen, the coarsest level solve must be performed to a relatively tight tolerance compared to the intermediate level.
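
The mu scaling factor itself must be found empirically; a simple scan over candidate values is one way to do so (a sketch; $COMMON_ARGS is a placeholder for the unchanged flags above, and the candidate values are illustrative):

# Sketch: scanning the coarsest-level mu scaling factor
for MF in 10.0 30.0 50.0 70.0 90.0; do
  jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test \
        $COMMON_ARGS --mg-mu-factor 2 $MF
done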

       --nsrc 16 \
       --niter 250 \
       --verbosity verbose \

Here we simply instruct QUDA to perform 16 inversions so that we can get a good average, to error out after 250 outer iterations of the solver, and to give verbose output. Data from this command is shown in comparison to the improvements gained from using CA solver variants in the next section. For the time being, we show the spectrum of the coarse grid operator in its natural state and after mu scaling. We see that the mu scaling does indeed add a constant shift to the vast majority of the spectrum, but at the very low end there are some irregularities.

[Plots: spectrum of the coarse grid operator, unscaled and after mu scaling]

IMPROVEMENT 1: Using CA Solvers

QUDA's CA solvers are based on the work in this paper.

We shall now utilise the communication-avoiding variants of CG and GCR throughout the multigrid solve. This replacement is straightforward in QUDA: simply switch cg and gcr for ca-cg and ca-gcr. It does, however, come with some extra parameters that we must be aware of. The most important is the number of 'inner iterations' to perform in the communication-avoiding routines before initiating a communication to perform a correction, i.e. the number of s-steps.

In the case of standard CG, one requires a new, orthogonal search direction at each iteration of the solver. This requires a matrix-vector product and, on a multi-GPU partition, global reductions that involve data communication, which can pose a serious bottleneck to wall-clock time to solution. To combat this, one can construct good approximations to the new, orthogonal directions using only the data local to each node. One can do this s times before the orthogonality of the space (and therefore the effectiveness of the algorithm) is lost. When employing these communication-avoiding algorithms, the size s of the basis is an important parameter to tune.
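
Schematically, each communication phase of an s-step method builds a local polynomial Krylov basis before a single global orthogonalization and correction; a sketch of the simplest (power) basis is

$$ K_s(A, p) = \mathrm{span}\left\{ p,\ Ap,\ A^2 p,\ \ldots,\ A^{s-1} p \right\} $$

where A is the coarse operator and p the current direction.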

In QUDA, one controls this parameter via the QudaInvertParam structure element inv_param.gcrNkrylov, which is passed via the command line flag --ngcrkrylov. However, when using ca-cg or ca-gcr as the coarse solver in a multigrid solve, one uses the dedicated command line flag --mg-coarse-solver-ca-basis-size <level> N (the analogous flag for the setup phase is --mg-setup-ca-basis-size <level> N). In each case, level is the coarse level index and N is the basis size. Below we give a command and two plots. The command shows how to invoke the communication-avoiding variants of CG and GCR. The first plot shows the wall-clock convergence times when using various values of N. The second plot shows the overall effect of using CA solvers with a well-chosen value of N; in this particular case that value was 10. Remember, it is important to tune this number not only for your problem but also for each different MPI partition size you use.

command="jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test --recon 12 --recon-sloppy 12 \
       --prec double --nsrc 16 --dslash-type twisted-clover --compute-clover true --niter 30000 --verify true \
       --dim 64 32 16 16 --gridsize 1 2 4 8 --load-gauge /ccs/proj/lgt104/howarth/configs/conf_latest.0500 \
       --kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 --rank-order row --verbosity verbose --tol 1e-9 \
       --prec-sloppy single --prec-precondition half --prec-null half --recon-precondition 12 \
       --pipeline 16 --reliable-delta 1e-5 --ngcrkrylov 30 \
       --mg-levels 3 --mg-block-size 0 4 4 4 4 --mg-block-size 1 2 2 2 2 \
       --mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg \
       --mg-setup-inv 1 cg --mg-nvec 0 24 --mg-nvec 1 24 --mg-coarse-solver 1 gcr \
       --mg-verbosity 0 summarize --mg-verbosity 1 summarize --mg-verbosity 2 summarize --mg-mu-factor 2 70.0 \
       --mg-smoother 0 ca-gcr --mg-smoother 1 ca-gcr \
       --mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 2 --mg-nu-post 1 2 \
       --mg-coarse-solver 2 ca-gcr --mg-coarse-solver-ca-basis-size 2 4 \
       --mg-coarse-solver-maxiter 1 8 --mg-coarse-solver-maxiter 2 8 \
       --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-tol 2 0.1 "
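
The first plot can be reproduced by sweeping the coarsest-level basis size while holding everything else fixed (a sketch; $COMMON_ARGS is a placeholder for the unchanged flags above, and the sweep values are illustrative):

# Sketch: tuning the CA basis size on the coarsest level
for N in 4 6 8 10 12 16 20; do
  jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test \
        $COMMON_ARGS --mg-coarse-solver 2 ca-gcr --mg-coarse-solver-ca-basis-size 2 $N
done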

In this first plot we see that good convergence starts to appear when we use a basis size of 10-16. Outside of these values, convergence suffers.

[Plot CA1: wall-clock convergence time for various CA basis sizes]

Here we see the effect of using a CA solver on the coarse grid. The speed-up is significant, at around 1.5x for the 16-node partition and 1.8x for the 64-node partition. Evidently, when using large partitions, it pays to avoid communication!

[Plot CA2: overall speed-up from CA solvers across partition sizes]

IMPROVEMENT 2: Using Coarse Level Deflation

The reason for using a large mu value on the coarsest level is to push the real, low-lying eigenvalues away from zero. This improves the condition number of the coarse grid operator to some degree, but at the same time the artificially transformed operator loses coherence with the finer grid. A simple solution to this problem is to explicitly deflate the low eigenmodes on the coarse grid. This is a surprisingly inexpensive strategy: the coarse grid solves are always done in low precision, hence the FLOP rates are high, and the tolerance of the eigenmodes need not be very tight.
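
Concretely, given N approximate low eigenpairs (lambda_i, v_i) of a Hermitian operator A, deflation removes the low-mode content from the Krylov solve; a standard sketch (here, with --mg-eig-use-normop 2 true, the eigensolver works with the normal operator) is

$$ x_0 = \sum_{i=1}^{N} \frac{v_i^\dagger b}{\lambda_i}\, v_i, \qquad A\,e = b - A\,x_0, \qquad x = x_0 + e $$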

There are two parts to this strategy to be aware of. The first is the setup time of the solver, which is largely dependent on the number of eigenmodes one requests. It is also highly dependent on the Chebyshev parameters one chooses. A worked example on Chebyshev acceleration of restarted Lanczos is given so that the user can familiarise themselves if needed. The second is the amount of time it takes to deflate the coarse solve, which is wholly dependent on the requested number of eigenmodes. With this in mind, we shall deflate the coarse grid with several values of requested eigenmodes, which will demonstrate the diminishing returns. Throughout we shall employ the CA-CG and CA-GCR solvers and use as few nodes (16) as possible to alleviate strong scaling violations.

command="jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test --recon 12 --recon-sloppy 12 \
       --prec double --nsrc 16 --dslash-type twisted-clover --compute-clover true --niter 30000 --verify true \
       --dim 64 32 16 16 --gridsize 1 2 4 8 --load-gauge /ccs/proj/lgt104/howarth/configs/conf_latest.0500 \
       --kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 --rank-order row --verbosity verbose --tol 1e-9 \
       --prec-sloppy single --prec-precondition half --prec-null half --prec-eigensolver single --recon-precondition 12 \
       --pipeline 16 --reliable-delta 1e-5 --ngcrkrylov 30 \
       --mg-levels 3 --mg-block-size 0 4 4 4 4 --mg-block-size 1 2 2 2 2 \
       --mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg \
       --mg-setup-inv 1 cg --mg-nvec 0 24 --mg-nvec 1 24 --mg-coarse-solver 1 gcr \
       --mg-verbosity 0 summarize --mg-verbosity 1 summarize --mg-verbosity 2 summarize --mg-mu-factor 2 70.0 \
       --mg-smoother 0 ca-gcr --mg-smoother 1 ca-gcr \
       --mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 2 --mg-nu-post 1 2 \
       --mg-coarse-solver 2 ca-gcr --mg-coarse-solver-ca-basis-size 2 4 \
       --mg-coarse-solver-maxiter 1 8 --mg-coarse-solver-maxiter 2 8 \
       --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-tol 2 0.1 \
       --mg-eig 2 true --mg-eig-type 2 trlm --mg-nvec 2 1536 --mg-eig-n-ev 2 1536 --mg-eig-n-kr 2 3072 \
       --mg-eig-batched-rotate 2 100 --mg-eig-tol 2 1e-4 \
       --mg-eig-poly-deg 2 50 --mg-eig-amin 2 1e-0 --mg-eig-amax 2 8.0 \
       --mg-eig-max-restarts 2 25 --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 true "
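
The scan over deflation-space sizes can be scripted in the same fashion (a sketch; $COMMON_ARGS is a placeholder for the unchanged flags above, and n_kr = 2 n_ev follows the ratio used in the command):

# Sketch: scanning the number of deflation vectors on the coarsest level
for NV in 512 1024 1536 2048; do
  jsrun -n16 -r1 -g4 -a4 -c4 -l gpu-gpu ./multigrid_invert_test \
        $COMMON_ARGS --mg-eig 2 true --mg-nvec 2 $NV \
        --mg-eig-n-ev 2 $NV --mg-eig-n-kr 2 $((2*NV))
done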

In the plot we see a clear winner: 512 deflation vectors. When we use more than 512 vectors we deflate more of the low modes, which helps algorithmically (the coarse solver iteration count decreases). However, as we increase the size of the deflation space, it takes more and more time to perform the deflation. As a result, using 1024 and 2048 vectors decreases the overall performance of the solver.

[Plot: time to solution for various numbers of deflation vectors]
