Staggered Multigrid Solver - lattice/quda GitHub Wiki

This page is a work-in-progress guide to compiling, testing, and tuning the staggered_invert_test routine when the --inv-multigrid true flag is passed.

The content on this page generally assumes you've read the page on Wilson-type multigrid and are familiar with the vocabulary used there.

Compiling

This compile command assumes a machine with NVIDIA Ampere GPUs. If you're running on a different architecture, modify the -DQUDA_GPU_ARCH and other flags appropriately. This command also assumes you want to take advantage of the automated USQCD software download and compile.

This command assumes the source directory lives in the environment variable ${QUDA_SRC_DIRECTORY}.

cmake -DCMAKE_BUILD_TYPE=RELEASE -DQUDA_DIRAC_DEFAULT_OFF=ON -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_GPU_ARCH=sm_80 -DQUDA_DOWNLOAD_USQCD=ON -DQUDA_QIO=ON -DQUDA_QMP=ON \
  -DQUDA_MULTIGRID=ON -DQUDA_MULTIGRID_NVEC_LIST="24,64,96" ${QUDA_SRC_DIRECTORY}
make -j
make install

Please note that make install does not attempt to install to /usr/local or anywhere of the sort; it installs into the build directory.

Quick, context-free example solve command

For copy-and-paste convenience, the commands below generate a configuration and run staggered MG; a fuller description of the command is given further down the page.

mpirun -np 1 ./heatbath_test --dim 16 16 16 16 --save-gauge l16t16b7p0 \
  --heatbath-beta 7.0 --heatbath-coldstart true --heatbath-num-steps 10 --heatbath-warmup-steps 1000

mpirun -np 1 ./staggered_invert_test \
  --prec double --prec-sloppy single --prec-null half --prec-precondition half \
  --mass 0.01 --recon 13 --recon-sloppy 9 --recon-precondition 9 \
  --dim 16 16 16 16 --gridsize 1 1 1 1 --load-gauge l16t16b7p0 \
  --dslash-type asqtad --compute-fat-long true --tadpole-coeff 0.905160183 --tol 1e-10 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 4 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-block-size 1 4 4 4 4 --mg-nvec 1 64 \
  --mg-block-size 2 2 2 2 2 --mg-nvec 2 96 \
  --mg-setup-tol 1 1e-5 --mg-setup-tol 2 1e-5 --mg-setup-inv 1 cgnr --mg-setup-inv 2 cgnr \
  --nsrc 1 --niter 25 \
  --mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true --mg-setup-use-mma 3 true \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct  --mg-nu-pre 0 0 --mg-nu-post 0 4 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
  --mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc  --mg-nu-pre 2 0 --mg-nu-post 2 4 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-coarse-solver 2 gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 \
  --mg-coarse-solver 3 ca-gcr --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 0.25 --mg-coarse-solver-maxiter 3 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose

Optimized Kahler-Dirac Operator

The default option for staggered aggregation is the optimized Kahler-Dirac preconditioned operator, which is implemented by applying the staggered/asqtad fine stencil, then applying the Kahler-Dirac inverse to the output spinor. This can be verbosely set via the flag --mg-staggered-coarsen-type kd-optimized. It requires solving the level 1 operator (the Kahler-Dirac preconditioned operator) with --mg-coarse-solve-type 1 direct, as the operator has a "direct" implementation, that is, it is applied to full spinors (as opposed to single parity). The operator also requires --mg-block-size 0 1 1 1 1, since the KD operator still acts on the full volume of a fine spinor. Likewise, it requires --mg-nvec 0 3, again corresponding to how it still acts on Nc = 3 fine spinors.

In practice, there are plenty of values that are best kept fixed. Values that need to be tuned are exposed as environment variables in the command and discussed below.

./staggered_invert_test --inv-multigrid true --dim ${LOCAL_DIM} --gridsize ${GRIDSIZE} --verbosity verbose --nsrc 1 \
  --dslash-type asqtad --solve-type ${OUTER_OPERATOR} --solution-type mat --compute-fat-long true \
  --load-gauge ${GAUGE_PATH} --mass ${MASS} --tadpole-coeff ${TADPOLE_COEFF} --tol ${TOL} \
  --prec double --prec-sloppy single --prec-precondition half --prec-null half \
  --recon 13 --recon-sloppy 13 --recon-precondition 9 \
  --mg-levels 4 --mg-coarse-solve-type 0 ${OUTER_OPERATOR} --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-smoother-solve-type 0 ${OUTER_OPERATOR} --mg-smoother 0 ca-gcr --mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-smoother-tol 0 1e-10  \
  --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 5e-2 \
  --mg-coarse-solver-maxiter 1 16 --mg-coarse-solver 1 gcr \
  --mg-setup-inv 1 cgnr --mg-setup-maxiter 1 1000 --mg-setup-tol 1 1e-6 \
  --mg-block-size 1 ${FIRST_AGGREGATE} --mg-nvec 1 64 --mg-n-block-ortho 1 2 \
  --mg-smoother-solve-type 1 direct --mg-smoother 1 ca-gcr --mg-nu-pre 1 0 --mg-nu-post 1 2 --mg-smoother-tol 1 1e-10 \
  --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 \
  --mg-coarse-solver-maxiter 2 16   --mg-coarse-solver 2 gcr \
  --mg-setup-inv 2 cgnr --mg-setup-maxiter 2 1000 --mg-setup-tol 2 1e-6 \
  --mg-block-size 2 ${SECOND_AGGREGATE} --mg-nvec 2 96 --mg-n-block-ortho 2 2 \
  --mg-smoother-solve-type 2 direct-pc --mg-smoother 2 ca-gcr --mg-nu-pre 2 0 --mg-nu-post 2 2 --mg-smoother-tol 2 1e-10 \
  --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 2e-2 \
  --mg-coarse-solver-maxiter 3 1024   --mg-coarse-solver 3 cgnr

This setup takes advantage of recursive preconditioning on each subsequent coarse level; in contrast to the findings of the 2-d paper, this was found to be ideal here.

Notes and Advanced Parameters

At this time, each aggregation length for the optimized Kahler-Dirac operator (both naive staggered and asqtad) must be even.

--mg-staggered-coarsen-type kd-optimized-drop-long

By default, the Kahler-Dirac preconditioned operator corresponding to the asqtad operator will contain both the fat links and the long links; likewise, both types of links will be aggregated when coarsening the HISQ Kahler-Dirac operator. You can opt into using the Kahler-Dirac preconditioned naive staggered (i.e., one-hop only) instead by using the flag --mg-staggered-coarsen-type kd-optimized-drop-long.

--mg-staggered-allow-drop-long true

When an aggregation dimension is smaller than 3, the long links cannot be safely aggregated because they would generate a two-hop term on the coarse operator, which breaks the Schur decomposition. This results in a runtime error. You can opt into dropping the long links during the aggregation with the explicit flag --mg-staggered-allow-drop-long true.

As a worked example: the flags --mg-block-size 1 4 4 4 2 --mg-staggered-allow-drop-long true correspond to aggregating both the one-hop links and long links in the x, y, and z dimensions, then only aggregating the one-hop term in the t direction. Without the --mg-staggered-allow-drop-long true flag, this will lead to a runtime error.

Note that --mg-staggered-coarsen-type kd-optimized-drop-long drops the long links by construction, so it implies --mg-staggered-allow-drop-long true.

--mg-staggered-kd-dagger-approximation true

In the free field, the Kahler-Dirac inverse block Xinv is proportional to the dagger of the Kahler-Dirac block X. While this form has not been rigorously investigated, it can be tested by setting --mg-staggered-kd-dagger-approximation true.

Environment variables:

${LOCAL_DIM} and ${GRIDSIZE}

On a 16 GiB GPU, with managed memory and prefetching enabled (export QUDA_ENABLE_MANAGED_MEMORY=1; export QUDA_ENABLE_MANAGED_PREFETCH=1), a good goal is to target a local volume of O(32x32x32x32), with a rough upper bound of 36^2x32^3. With managed memory support off, a local volume of 24x24x24x24 or 32x32x24x16 is about as large as you can go. With prefetching enabled there's generally no need to use such a small volume.

Due to constraints in implementation and the nature of the staggered MG algorithm, each local dimension must be divisible by 6 for HISQ MG or 4 for naive staggered. In addition, the local dimensions on all coarse levels need to be even.
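The divisibility constraint above can be sanity-checked before launching a job. A minimal sketch (the `check_local_dims` helper is illustrative, not part of QUDA): HISQ MG needs each local dimension divisible by 6, naive staggered by 4.

```shell
# Hypothetical helper: verify each local dimension is divisible by the
# required factor (6 for HISQ MG, 4 for naive staggered).
check_local_dims() {
  div=$1; shift
  for d in "$@"; do
    if [ $(( d % div )) -ne 0 ]; then
      echo "dimension $d is not divisible by $div"
      return 1
    fi
  done
  echo "local dims OK"
}

check_local_dims 6 24 24 24 24   # HISQ: passes
check_local_dims 4 16 32 32 24   # naive staggered: passes
```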

${TOL}, ${GAUGE_PATH}, ${MASS}, ${TADPOLE_COEFF}

Self-explanatory.

${OUTER_OPERATOR}

The optimized operator supports using both the direct operator, that is, the full parity staggered/asqtad operator, and the direct-pc operator, that is, the Schur operator. The latter case works by constructing a full parity linear system from the Schur system before restricting the fine residual, and prolongating the error correction only to the fine sites.

For the time being, it does not seem like there is a benefit to using the direct-pc operator unless absolutely necessary. This is because a solve of the full system to a tolerance of ${TOL} requires a solve to a tolerance of ${MASS} * ${TOL} for the Schur system, which requires more iterations.
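The tolerance relation above can be written out in shell; this is just the arithmetic from the paragraph, with illustrative values for the variables.

```shell
# A full-system solve to tolerance TOL corresponds to a Schur-system solve
# to tolerance MASS * TOL (example values, not a recommendation).
MASS=0.01
TOL=1e-10
SCHUR_TOL=$(awk -v m="$MASS" -v t="$TOL" 'BEGIN { printf "%g", m * t }')
echo "$SCHUR_TOL"   # 1e-12
```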

${FIRST_AGGREGATE} and ${SECOND_AGGREGATE}

These two variables inform the aggregation sizes going from level 2 to 3 (pseudo-fine to intermediate) and level 3 to 4 (intermediate to coarsest). As a note, due to the nature of the staggered MG algorithm, the level 1 to level 2 aggregate (fine to pseudo-fine) is fixed to 1 1 1 1, that is, a lack of aggregation, owing to the optimized KD operator still acting on the fine lattice. Note that all numbers in FIRST_AGGREGATE must be even.

Some worked examples (up to permutations):

Fine local volume 24x24x24x24: FIRST_AGGREGATE="6 6 6 4", SECOND_AGGREGATE="2 2 2 3"

Fine local volume 16x32x32x24: FIRST_AGGREGATE="4 8 8 6", SECOND_AGGREGATE="2 2 2 2"

Fine local volume 48x48x12x12: FIRST_AGGREGATE="4 8 6 6", SECOND_AGGREGATE="6 3 1 1"

Fine local volume 24x48x24x36: FIRST_AGGREGATE="6 8 6 6", SECOND_AGGREGATE="2 3 2 3"

Fine local volume 32x32x32x32: FIRST_AGGREGATE="8 8 8 8" or "8 8 8 4", SECOND_AGGREGATE="2 2 2 2"

Some intuition to work with when setting your own:

It's more instructive to first set the intermediate to coarsest aggregate, then work backwards to find the pseudo-fine to intermediate aggregate. The coarsest level should have a local volume of roughly 2x2x2x2 (odd dimensions are not allowed), and the intermediate to coarsest aggregate should be a site thinning of roughly a factor of 16. For this reason, ${SECOND_AGGREGATE} should be (up to a permutation) something along the lines of 2 2 2 2, 4 2 2 1, 2 3 2 2, etc.

Once ${SECOND_AGGREGATE} has been determined, ${FIRST_AGGREGATE} follows by working out what takes you from the fine to the intermediate level, consistent with what the intermediate-level local volume should be per the process above.
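The bookkeeping above is just element-wise division; a minimal sketch (the `coarsen` helper is illustrative, not part of QUDA) using the 24x24x24x24 worked example:

```shell
# Illustrative helper: divide local dims by an aggregate, element-wise.
coarsen() {
  set -- $1 $2   # split "X Y Z T" "bx by bz bt" into 8 positional parameters
  echo "$(( $1 / $5 )) $(( $2 / $6 )) $(( $3 / $7 )) $(( $4 / $8 ))"
}

FINE="24 24 24 24"
FIRST_AGGREGATE="6 6 6 4"
SECOND_AGGREGATE="2 2 2 3"

INTERMEDIATE=$(coarsen "$FINE" "$FIRST_AGGREGATE")
COARSEST=$(coarsen "$INTERMEDIATE" "$SECOND_AGGREGATE")
echo "intermediate: $INTERMEDIATE"   # intermediate: 4 4 4 6
echo "coarsest:     $COARSEST"       # coarsest:     2 2 2 2
```

The coarsest volume lands on the recommended 2x2x2x2, confirming the aggregate choice.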

Running with CG

As a point of comparison, it's useful to also perform a solve on the staggered operator {{m, D_eo},{D_oe,m}} via a CG solve on the Schur preconditioned operator m^2 - D_eo D_oe (covered via --test 1 in the command below). Since the staggered MG algorithm is a solve on the full operator, an MG solve to a relative residual tolerance ${TOL} needs to be compared to a CG solve to a relative residual tolerance of ${TOL} times ${MASS}. In bash, this can be computed by CGTOL=$(echo "$MASS $TOL" | awk ' { print $1*$2; } '), which we include below.

CGTOL=$(echo "$MASS $TOL" | awk ' { print $1*$2; } ')
./staggered_invert_test --dim ${LOCALSIZE} --gridsize ${GRIDSIZE} --verbosity verbose --nsrc 1 \
--dslash-type asqtad --test 1 --compute-fat-long true \
--load-gauge ${GAUGE_PATH} --mass ${MASS} --tadpole-coeff ${TADPOLE_COEFF} --tol ${CGTOL} \
--prec double --recon 13 --prec-sloppy half --recon-sloppy 9

Running with deflation on the coarsest level

When running with deflation on the coarsest level, most of the command line arguments remain unchanged. The primary difference is the CGNR solve on the lowest level is replaced with an SVD-deflated CA-GCR solve, which is described in more detail below.

./staggered_invert_test --inv-multigrid true --dim ${LOCAL_DIM} --gridsize ${GRIDSIZE} --verbosity verbose --nsrc 1 \
  --dslash-type asqtad --solve-type ${OUTER_OPERATOR} --solution-type mat --compute-fat-long true \
  --load-gauge ${GAUGE_PATH} --mass ${MASS} --tadpole-coeff ${TADPOLE_COEFF} --tol ${TOL} \
  --prec double --prec-sloppy single --prec-precondition half --prec-null half \
  --recon 13 --recon-sloppy 13 --recon-precondition 9 \
  --mg-levels 4 --mg-coarse-solve-type 0 ${OUTER_OPERATOR} --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-smoother-solve-type 0 ${OUTER_OPERATOR} --mg-smoother 0 ca-gcr --mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-smoother-tol 0 1e-10  \
  --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 5e-2 \
  --mg-coarse-solver-maxiter 1 16 --mg-coarse-solver 1 gcr \
  --mg-setup-inv 1 cgnr --mg-setup-maxiter 1 1000 --mg-setup-tol 1 1e-5 \
  --mg-block-size 1 ${FIRST_AGGREGATE} --mg-nvec 1 64 --mg-n-block-ortho 1 2 \
  --mg-smoother-solve-type 1 direct --mg-smoother 1 ca-gcr --mg-nu-pre 1 0 --mg-nu-post 1 2 --mg-smoother-tol 1 1e-10 \
  --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 \
  --mg-coarse-solver-maxiter 2 16   --mg-coarse-solver 2 gcr \
  --mg-setup-inv 2 cgnr --mg-setup-maxiter 2 1000 --mg-setup-tol 2 1e-5 \
  --mg-block-size 2 ${SECOND_AGGREGATE} --mg-nvec 2 96 --mg-n-block-ortho 2 2 \
  --mg-smoother-solve-type 2 direct-pc --mg-smoother 2 ca-gcr --mg-nu-pre 2 0 --mg-nu-post 2 2 --mg-smoother-tol 2 1e-10 \
  --mg-coarse-solver 3 ca-gcr --mg-coarse-solver-ca-basis-size 3 16 --mg-coarse-solver-maxiter 3 16 \
  --mg-eig 3 true --mg-eig-type 3 trlm --mg-eig-use-dagger 3 false --mg-eig-use-normop 3 true \
  --mg-nvec 3 ${COARSEST_NVEC} --mg-eig-n-ev 3 ${COARSEST_NEV} --mg-eig-n-kr 3 ${COARSEST_NKR} --mg-eig-tol 3 1e-4 \
  --mg-eig-poly-deg 3 ${COARSEST_POLY_DEG} --mg-eig-amin 3 ${COARSEST_AMIN} --mg-eig-amax 3 ${COARSEST_AMAX} \
  --mg-eig-max-restarts 3 1000

Environment variables:

${COARSEST_NVEC}, ${COARSEST_NEV}, ${COARSEST_NKR}, ${COARSEST_POLY_DEG}, ${COARSEST_AMIN}, ${COARSEST_AMAX}

Before digging into the details of coarsest level deflation, I've found that a good goal is to deflate enough singular vectors such that CA-GCR hits a residual tolerance of 0.25 in one pass, i.e., a single hit of CA-GCR(16). This can ultimately be verified by changing --mg-coarse-solver-maxiter 3 16 to --mg-coarse-solver-maxiter 3 1024 and confirming the coarsest solve generally converges in one pass.

From there, assuming severe ill-conditioning, I've found that a good heuristic for determining the number of eigenvectors to deflate is using COARSEST_NVEC=128 for a 48^3 x 96 volume, and taking advantage of naive density-of-states scaling to set that number for other volumes (i.e., scale by the volume: use roughly COARSEST_NVEC=384 for a 64^3 x 128 volume, etc.). Your mileage may vary, though, especially if the operator isn't painfully ill-conditioned.

Once you've determined COARSEST_NVEC, setting COARSEST_NEV to COARSEST_NVEC+2 and COARSEST_NKR to 1.5*COARSEST_NVEC tends to work empirically.
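The heuristic above is easy to script; this is only the arithmetic from the preceding sentence, with an example COARSEST_NVEC.

```shell
# Derive the eigensolver sizes from COARSEST_NVEC per the heuristic above.
COARSEST_NVEC=128
COARSEST_NEV=$(( COARSEST_NVEC + 2 ))
COARSEST_NKR=$(awk -v n="$COARSEST_NVEC" 'BEGIN { print int(1.5 * n) }')
echo "n-ev=$COARSEST_NEV n-kr=$COARSEST_NKR"   # n-ev=130 n-kr=192
```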

The last step is determining parameters for Chebyshev polynomial acceleration. Pending the merge of #933, a good way to determine COARSEST_AMAX is performing a non-accelerated Lanczos, looking for the largest singular value, then rescaling that by a factor of 1.1 to be safe. This constitutes replacing the bottom 3 lines of the command above with --mg-nvec 3 64 --mg-eig-n-ev 3 66 --mg-eig-n-kr 3 198 --mg-eig-tol 3 1e-4 --mg-eig-use-poly-acc 3 false --mg-eig-max-restarts 3 1000 --mg-eig-spectrum 3 LR.

A good value for COARSEST_AMIN depends on the number of eigenvalues you intend to deflate. The eigensolver will not converge if the largest of the COARSEST_NVEC smallest singular values of the operator in question is larger than COARSEST_AMIN. In general, COARSEST_AMIN=1e-1 is a good conservative estimate, though 1e-2 is a bit more ideal.

The last step is setting the order of the Chebyshev polynomial via COARSEST_POLY_DEG. This parameter has two effects: first, it reduces the number of restarts in the TRLM (at the expense of increasing the number of linear operator applications per restart). Second, it increases the ability of the TRLM to resolve near-degenerate eigenvalues. As a rough heuristic, a TRLM will struggle to resolve two eigenvalues of the polynomial accelerated operator which differ by less than the target residual of the eigensolve. The larger the order of the polynomial, the more near-degenerate eigenvalues get "pulled apart". This has an implication with the volume dependence of the eigensolver: due to an increase of the density of states, larger volumes generally require a higher order polynomial.

A further indicator that you should increase the order of your Chebyshev polynomial is monitoring the rate of convergence of the TRLM. If the number of converged eigenvalues monotonically increases per restart, your Chebyshev polynomial is high enough order. If it stalls, or does not increase monotonically, you should increase the order of the polynomial.

Explicit Coarse Kahler-Dirac

The original implementation of the Kahler-Dirac preconditioned operator explicitly built an (Nc = 24, Ns = 2) operator, reflecting how a 2^4 hypercube of fine sites corresponds to a single Kahler-Dirac block with 2^4 x Nc = 48 degrees of freedom (with Nc = 3 on the fine lattice). While this greatly inflates memory requirements, there are some cases where it is still useful because the operator supports a Schur decomposition on super-sites.
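The degree-of-freedom counting above can be spelled out explicitly (simple arithmetic only):

```shell
# A 2^4 hypercube of fine sites, each carrying Nc = 3 colors, gives 48
# degrees of freedom per Kahler-Dirac block, repackaged as Nc = 24 x Ns = 2.
SITES=$(( 2**4 ))       # 16 fine sites per hypercube
DOF=$(( SITES * 3 ))    # x Nc = 3 -> 48
echo "$DOF = 24 x 2"    # 48 = 24 x 2
```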

The key option that enables the explicit coarse operator is --mg-staggered-coarsen-type kd-coarse. It requires solving the level 1 operator (the Kahler-Dirac preconditioned operator) with --mg-coarse-solve-type 1 direct-pc, as the preconditioning implies the block-Jacobi preconditioning of the staggered operator. The operator (intuitively) requires --mg-block-size 0 2 2 2 2, the super-site aggregate, and --mg-nvec 0 24, corresponding to the Nc = 24 degrees of freedom per super-site (times 2 for fine even/odd => coarse chirality).

Note that the coarse KD operator drops the long links in the asqtad/hisq case, as the long links would give 2-hop contributions to the super-site operator, breaking the Schur decomposition. This is also related to why we use rather aggressive smoothing on the outer level, 8 iterations instead of 2, to correct for the perturbative effect of dropping the long links.

A reference command is given by:

./staggered_invert_test --inv-multigrid true --dim ${LOCAL_DIM} --gridsize ${GRIDSIZE} --verbosity verbose --nsrc 1 \
  --dslash-type asqtad --solve-type direct --compute-fat-long true \
  --load-gauge ${GAUGE_PATH} --mass ${MASS} --tadpole-coeff ${TADPOLE_COEFF} --tol ${TOL} \
  --prec double --prec-sloppy single --prec-precondition half --prec-null half \
  --recon 13 --recon-sloppy 13 --recon-precondition 9 \
  --mg-levels 4 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-coarse \
  --mg-block-size 0 2 2 2 2 --mg-nvec 0 24 \
  --mg-smoother-solve-type 0 direct --mg-smoother 0 ca-gcr --mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-smoother-tol 0 1e-10  \
  --mg-coarse-solve-type 1 direct-pc --mg-coarse-solver-tol 1 5e-2 \
  --mg-coarse-solver-maxiter 1 16 --mg-coarse-solver 1 gcr \
  --mg-setup-inv 1 cgnr --mg-setup-maxiter 1 1000 --mg-setup-tol 1 1e-6 \
  --mg-block-size 1 ${FIRST_AGGREGATE} --mg-nvec 1 64 --mg-n-block-ortho 1 2 \
  --mg-smoother-solve-type 1 direct-pc --mg-smoother 1 ca-gcr --mg-nu-pre 1 0 --mg-nu-post 1 2 --mg-smoother-tol 1 1e-10 \
  --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 \
  --mg-coarse-solver-maxiter 2 16   --mg-coarse-solver 2 gcr \
  --mg-setup-inv 2 cgnr --mg-setup-maxiter 2 1000 --mg-setup-tol 2 1e-6 \
  --mg-block-size 2 ${SECOND_AGGREGATE} --mg-nvec 2 96 --mg-n-block-ortho 2 2 \
  --mg-smoother-solve-type 2 direct-pc --mg-smoother 2 ca-gcr --mg-nu-pre 2 0 --mg-nu-post 2 2 --mg-smoother-tol 2 1e-10 \
  --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 2e-2 \
  --mg-coarse-solver-maxiter 3 1024   --mg-coarse-solver 3 cgnr

Notes on Environment Variables for the Explicit Coarse Operator

${FIRST_AGGREGATE} and ${SECOND_AGGREGATE}

These two variables inform the aggregation sizes going from level 2 to 3 (pseudo-fine to intermediate) and level 3 to 4 (intermediate to coarsest). As noted above, due to the nature of the staggered MG algorithm, the level 1 to level 2 aggregate (fine to pseudo-fine) is fixed to 2 2 2 2, that is, a 2x2x2x2 aggregate. Because the first "coarsening" is already a 2x2x2x2, the next coarsening should be roughly half the size, per dimension, of coarsenings for the optimized operator.

Some worked examples (up to permutations):

Fine local volume 24x24x24x24 -> pseudo-fine volume 12x12x12x12: FIRST_AGGREGATE="3 3 3 2", SECOND_AGGREGATE="2 2 2 3"

Fine local volume 16x32x32x24 -> pseudo-fine volume 8x16x16x12: FIRST_AGGREGATE="2 4 4 3", SECOND_AGGREGATE="2 2 2 2"

Fine local volume 48x48x12x12 -> pseudo-fine volume 24x24x6x6: FIRST_AGGREGATE="2 4 3 3", SECOND_AGGREGATE="6 3 1 1"

Fine local volume 24x48x24x36 -> pseudo-fine volume 12x24x12x18: FIRST_AGGREGATE="3 4 3 3", SECOND_AGGREGATE="2 3 2 3"

Fine local volume 32x32x32x32 -> pseudo-fine volume 16x16x16x16: FIRST_AGGREGATE="4 4 4 4" or "4 4 4 2", SECOND_AGGREGATE="2 2 2 2"

Some intuition to work with when setting your own:

It's more instructive to first set the intermediate to coarsest aggregate, then work backwards to find the pseudo-fine to intermediate aggregate. On a 16 GB GPU, the coarsest level should have a local volume of 2x2x2x2 (odd dimensions are not allowed), and the intermediate to coarsest aggregate should be a site thinning of roughly a factor of 16. For this reason, ${SECOND_AGGREGATE} should be (up to a permutation) something along the lines of 2 2 2 2, 4 2 2 1, 2 3 2 2, etc.

Aggregation-based Staggered Solve

Note: While QUDA supports a solve based on directly coarsening the staggered operator, it is not recommended due to extreme ill-conditioning of the coarsened operators. The origin of these small eigenvalues is described in arxiv:1801.07823.

Direct Galerkin coarsening of the staggered operator is enabled by the command-line flag --mg-staggered-coarsen-type aggregate, or by setting (QudaMultigridParam).transfer_type[0] = QUDA_TRANSFER_AGGREGATE; via the interface.

Unlike the Kahler-Dirac preconditioned operator, aggregation lengths can be odd.

An example command:

./staggered_invert_test --inv-multigrid true --dim ${LOCAL_DIM} --gridsize ${GRIDSIZE} --verbosity verbose --nsrc 1 \
  --dslash-type asqtad --solve-type direct --compute-fat-long true \
  --load-gauge ${GAUGE_PATH} --mass ${MASS} --tadpole-coeff ${TADPOLE_COEFF} --tol ${TOL} \
  --prec double --prec-sloppy single --prec-precondition half --prec-null half \
  --recon 13 --recon-sloppy 13 --recon-precondition 9 \
  --mg-levels 3 --mg-staggered-coarsen-type aggregate \
  --mg-setup-inv 0 cgnr --mg-setup-maxiter 0 1000 --mg-setup-tol 0 1e-6 \
  --mg-block-size 0 ${FIRST_AGGREGATE} --mg-nvec 0 64 --mg-n-block-ortho 0 2 \
  --mg-smoother-solve-type 0 direct --mg-smoother 0 ca-gcr --mg-nu-pre 0 0 --mg-nu-post 0 2 --mg-smoother-tol 0 1e-10 \
  --mg-coarse-solve-type 1 direct-pc --mg-coarse-solver-tol 1 0.25 \
  --mg-coarse-solver-maxiter 1 16   --mg-coarse-solver 1 gcr \
  --mg-setup-inv 1 cgnr --mg-setup-maxiter 1 1000 --mg-setup-tol 1 1e-6 \
  --mg-block-size 1 ${SECOND_AGGREGATE} --mg-nvec 1 96 --mg-n-block-ortho 1 2 \
  --mg-smoother-solve-type 1 direct-pc --mg-smoother 1 ca-gcr --mg-nu-pre 1 0 --mg-nu-post 1 2 --mg-smoother-tol 1 1e-10 \
  --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 2e-2 \
  --mg-coarse-solver-maxiter 2 1024   --mg-coarse-solver 2 cgnr