Multigrid Solver - lattice/quda GitHub Wiki

The feature/multigrid branch contains the present work in progress on implementing the adaptive multigrid algorithm in QUDA. Once you have checked out this branch, you should enable it with the QUDA_MULTIGRID cmake option. These instructions are primarily intended for those who use the QUDA multigrid interface directly; an alternative approach is to use Chroma, which includes a QUDA-multigrid interface for solving both Wilson and Wilson-clover systems.

Note that the QUDA multigrid interface is experimental and evolving, and is thus subject to change. For the most up-to-date interface semantics we refer to the canonical invert_test example in the tests directory. Note that in order to utilise the multigrid preconditioner, one must pass the command-line flag --inv-multigrid true.

Contents

  1. Setting up the Multigrid solver
  2. Running the Multigrid solver
  3. Example solve using Multigrid
  4. Tuning the Multigrid solver
  5. Multigrid with coarse grid deflation

Setting up the Multigrid Solver

Before we can use the multigrid solver, we first must run the adaptive setup. Here we describe the various parameters involved.

The interface functions that are responsible for managing an instance of the multigrid solver are defined in quda.h.

  /**
   * Setup the multigrid solver, according to the parameters set in param.  It
   * is assumed that the gauge field has already been loaded via
   * loadGaugeQuda().
   * @param param  Contains all metadata regarding host and device
   *               storage and solver parameters
   */
  void* newMultigridQuda(QudaMultigridParam *param);

  /**
   * @brief Free resources allocated by the multigrid solver
   * @param mg_instance Pointer to instance of multigrid_solver
   */
  void destroyMultigridQuda(void *mg_instance);

  /**
   * @brief Updates the multigrid preconditioner for the new gauge / clover field
   * @param mg_instance Pointer to instance of multigrid_solver
   * @param param       Contains all metadata regarding host and device
   *                    storage and solver parameters
   */
  void updateMultigridQuda(void *mg_instance, QudaMultigridParam *param);

We create the multigrid instance with newMultigridQuda, which is driven by the QudaMultigridParam struct. This struct holds all the metadata governing the multigrid solver parameters. newMultigridQuda returns a pointer to the instance of the multigrid solver, which we will subsequently pass as a preconditioner to invertQuda.

When newMultigridQuda is called, the adaptive multigrid setup is invoked, creating the multigrid preconditioner. On first invocation this can take some time due to the autotuning overhead of each individual kernel. The setup phase essentially consists of the following steps on each level (excluding the coarsest level):

  • Compute a set of null-space vectors
  • Block orthonormalize the null-space vectors to define the prolongation and restriction operators
  • Compute the coarse-grid operator from the Galerkin product R A P

There are variations on this exact process depending on the nature of the setup parameters, e.g., whether we perform multiple cycles of the setup process, etc.
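
In the usual aggregation-based construction the restrictor is the adjoint of the prolongator, R = P^dagger, so the coarse-grid operator takes the form Ac = R A P; this is the Galerkin condition that the verification step described below checks explicitly.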

Once the setup has completed, if the run_verify option is enabled, the algorithm will run various verification tests to ensure basic sanity. E.g.,

  • (1 - P R) v = 0, where v is a null-space vector (orthonormality)
  • (1 - R P) eta = 0, where eta is a random vector
  • (Ac - R A P) eta = 0, where eta is a random vector (Galerkin condition)

If these tests do not pass then the parameters are incorrect or the user has come across a bug. In either case, please file an issue on GitHub.

In the table below we describe each parameter and note the matching command-line parameter that sets it in invert_test (where applicable). For the parameters that appear in an array of size QUDA_MAX_MG_LEVEL, each entry can be set for each level independently. Some parameters only make sense on levels [0, QUDA_MAX_MG_LEVEL-1), e.g., geo_block_size, since they concern the connection to the next coarsest level, whereas others potentially need to be set on every level [0, QUDA_MAX_MG_LEVEL-1], e.g., --mg-schwarz-type.

General Parameters

| Name | Description | Comment | Command-line parameter |
|------|-------------|---------|------------------------|
| QudaInvertParam *invert_param | Set the MG solver metadata such as precision, quark mass, Dirac operator type, etc. | | |
| int n_level | Number of multigrid levels | 3 is usually optimal | --mg-levels <2+> |
| int geo_block_size[QUDA_MAX_MG_LEVEL][QUDA_MAX_DIM] | Geometric block sizes to use on each level | Generally favor more aggressive coarsening for the first level; a rule of thumb is 4^4 | --mg-block-size <level x y z t> |
| int spin_block_size[QUDA_MAX_MG_LEVEL] | Spin block sizes to use on each level | 2 for level 0, and 1 thereafter | N/A |
| int n_vec[QUDA_MAX_MG_LEVEL] | Number of null-space vectors to use on each level | 24 or 32 is supported presently | --mg-nvec <level nvec> |
| QudaPrecision precision_null[QUDA_MAX_MG_LEVEL] | Precision used to store the null-space vectors and preconditioned coarse-link variables | Use QUDA_HALF_PRECISION for optimal performance | --prec-null <double/single/half> |
| QudaVerbosity verbosity[QUDA_MAX_MG_LEVEL] | Verbosity on each level of the multigrid | | --mg-verbosity <level verb> |
| QudaInverterType coarse_solver[QUDA_MAX_MG_LEVEL] | The solver that wraps around the coarse-grid correction and smoother | Set for all levels except 0. Suggest using QUDA_GCR_INVERTER on all intermediate grids and QUDA_CA_GCR_INVERTER on the bottom | --mg-coarse-solver <level gcr/etc.> |
| double coarse_solver_tol[QUDA_MAX_MG_LEVEL] | Tolerance for the solver that wraps around the coarse-grid correction and smoother | Suggest setting each level to 0.25 | --mg-coarse-solver-tol <level tol> |
| int coarse_solver_maxiter[QUDA_MAX_MG_LEVEL] | Maximum number of iterations for the solver that wraps around the coarse-grid correction and smoother | Suggest setting in the range 8-100 | --mg-coarse-solver-maxiter <level n> |
| QudaInverterType smoother[QUDA_MAX_MG_LEVEL] | Smoother to use on each level | Set to QUDA_CA_GCR_INVERTER for each level | --mg-smoother <level mr/ca-gcr/etc.> |
| double smoother_tol[QUDA_MAX_MG_LEVEL] | Tolerance to use for the smoother / solver on each level | Suggest setting each level to 0.25 | --mg-smoother-tol <level resid_tol> |
| int nu_pre[QUDA_MAX_MG_LEVEL] | Number of pre-smoother applications on each level | Suggest setting to 0 | --mg-nu-pre <level 1-20> |
| int nu_post[QUDA_MAX_MG_LEVEL] | Number of post-smoother applications on each level | Suggest setting to 8 | --mg-nu-post <level 1-20> |
| double omega[QUDA_MAX_MG_LEVEL] | Over/under relaxation factor for the smoother at each level | Set to 0.8-1.0 | --mg-omega |
| QudaSchwarzType smoother_schwarz_type[QUDA_MAX_MG_LEVEL] | Whether to use additive or multiplicative Schwarz preconditioning in the smoother | Experimental; set to QUDA_INVALID_SCHWARZ for each level unless you know what you're doing | --mg-schwarz-type <level false/add/mul> |
| int smoother_schwarz_cycle[QUDA_MAX_MG_LEVEL] | Number of Schwarz cycles to apply | Experimental; set to 1 for each level | --mg-schwarz-cycle <level cycle> |
| QudaSolutionType coarse_grid_solution_type[QUDA_MAX_MG_LEVEL] | The type of residual to send to the next coarse grid, and thus the type of solution to receive back from this coarse grid | Use QUDA_MATPC_SOLUTION for all levels; if solving the full unpreconditioned system on level 0 then set coarse_grid_solution_type[0]=QUDA_MAT_SOLUTION | Implicitly depends on --mg-solve-type and --solve-type |
| QudaSolveType smoother_solve_type[QUDA_MAX_MG_LEVEL] | The type of smoother solve to do on each grid (e/o preconditioning or not) | Suggest setting to QUDA_DIRECT_PC_SOLVE for all levels | --mg-solve-type <level solve> |
| QudaMultigridCycleType cycle_type[QUDA_MAX_MG_LEVEL] | The type of multigrid cycle to perform at each level | Always set to QUDA_MG_CYCLE_RECURSIVE (this sets the MG cycles to be a K-cycle, which is generally superior to a V-cycle for non-Hermitian systems) | |
| QudaBoolean global_reduction[QUDA_MAX_MG_LEVEL] | Whether to use global reductions or not for the smoother / solver at each level | Experimental; keep set to QUDA_BOOLEAN_YES for all levels unless using a Schwarz smoother | |
| QudaFieldLocation location[QUDA_MAX_MG_LEVEL] | Location where each level should be run | Set to QUDA_CUDA_FIELD_LOCATION for all levels | N/A |
| QudaComputeNullVector compute_null_vector | Whether to compute the null vectors or reload them (or use the free-field vectors, see * below) | Typically set to QUDA_COMPUTE_NULL_VECTOR_YES | --mg-generate-nullspace <true/false> |
| QudaBoolean generate_all_levels | Whether to generate on all levels or just on level 0 | Set to QUDA_BOOLEAN_TRUE for all levels (required if nvec is changed between levels) | --mg-generate-all-levels <true/false> |
| double mu_factor[QUDA_MAX_MG_LEVEL] | Multiplicative factor for the mu parameter | Only applicable for twisted-mass and twisted-clover fermions | --mg-mu-factor <level factor> |
| QudaBoolean run_verify | Whether to run the verification checks once setup is complete | Keep set to QUDA_BOOLEAN_TRUE unless in final production | --verify <true/false> |
| double gflops | The Gflops rate of the multigrid solver setup | | |
| double sec | The time taken by the multigrid solver setup | | |

The instance of QudaInvertParam sets the general metadata that needs to be provided when driving any QUDA linear solver, namely the precision, Dirac operator parameters, etc. Note that regardless of the final solve being done, e.g., whether that of the regular linear system or the even-odd preconditioned system, QudaMultigridParam::invert_param.solve_type should be set to QUDA_DIRECT_SOLVE, since the preconditioner is set up based on the full linear operator. With respect to precision, the general advice is to set cuda_prec = QUDA_DOUBLE_PRECISION, cuda_prec_sloppy = QUDA_SINGLE_PRECISION and cuda_prec_precondition = QUDA_HALF_PRECISION (in addition to setting QudaMultigridParam::precision_null = QUDA_HALF_PRECISION).
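
As a concrete illustration, here is a minimal sketch of these precision settings; the field names are as given above, and the values are merely the suggestions from this section, not requirements:

QudaInvertParam inv_param = newQudaInvertParam();
inv_param.solve_type = QUDA_DIRECT_SOLVE;                 // always the full operator for the MG setup
inv_param.cuda_prec = QUDA_DOUBLE_PRECISION;              // outer solver precision
inv_param.cuda_prec_sloppy = QUDA_SINGLE_PRECISION;       // sloppy inner precision
inv_param.cuda_prec_precondition = QUDA_HALF_PRECISION;   // preconditioner precision

QudaMultigridParam mg_param = newQudaMultigridParam();
mg_param.invert_param = &inv_param;
mg_param.n_level = 3;
for (int i = 0; i < mg_param.n_level; i++)
  mg_param.precision_null[i] = QUDA_HALF_PRECISION;       // storage precision for null-space vectors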

Near-null vector generation

There are many options for generating near-null vectors in QUDA:

  • Inverse iterations (default)
  • Chebyshev filter (a la arXiv:2103.05034, P. Boyle and A. Yamaguchi)
  • Eigenvectors
  • Free field vectors
  • Restricting as many near-null vectors as possible from the finer level, then generating the remainder with another method (for example, if level 1 has Nc = 24 and level 2 has Nc = 32, then for level 2 all 24 near-null vectors from level 1 are restricted and 8 more are generated)
  • WIP: Test vector generation, currently disabled

There is also support for "refining" near-null vectors generated by other methods with additional inverse iterations.

The setup type is controlled by QudaNullVectorSetupType setup_type[QUDA_MAX_MG_LEVEL], which can be set to QUDA_SETUP_NULL_VECTOR_INVERSE_ITERATIONS, QUDA_SETUP_NULL_VECTOR_CHEBYSHEV_FILTER, QUDA_SETUP_NULL_VECTOR_EIGENVECTORS, QUDA_SETUP_NULL_VECTOR_TEST_VECTORS, QUDA_SETUP_NULL_VECTOR_RESTRICT_FINE, or QUDA_SETUP_NULL_VECTOR_FREE_FIELD. The equivalent command line parameter is --mg-setup-type [level] [method], where [method] can be one of inverse-iterations (default), chebyshev-filter, eigenvectors, test-vectors, restrict-fine, and free-field.
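
For example, to keep the default inverse iterations for the first coarsening but use the Chebyshev filter for the second, one might set (a sketch using the field and enums named above):

mg_param.setup_type[0] = QUDA_SETUP_NULL_VECTOR_INVERSE_ITERATIONS;
mg_param.setup_type[1] = QUDA_SETUP_NULL_VECTOR_CHEBYSHEV_FILTER;

or, equivalently, on the command line:

--mg-setup-type 0 inverse-iterations --mg-setup-type 1 chebyshev-filter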

Of note, pre-existing near-null vectors can be loaded, and generated near-null vectors saved, via the variables char vec_infile[256] / char vec_outfile[256], or the command-line arguments --mg-load-vec <level file> / --mg-save-vec <level file>. A suffix specifying the level and the number of near-nulls is appended to each filename. When an input file is specified, it takes precedence and all other generation parameters are ignored. For performance reasons, near-null vectors can be stored in the QIO PARTFILE format using --mg-save-partfile <level boolean>, as described in QIO and QMP.

Two other flags relevant for setup are QudaBoolean pre_orthonormalize and QudaBoolean post_orthonormalize, which control global orthonormalization of the initial random vectors and of the final near-null vectors before block-orthonormalization, respectively. The corresponding command-line flags are --mg-pre-orth <true/false> and --mg-post-orth <true/false>; we recommend true for both.
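
For example, to enable both:

--mg-pre-orth true --mg-post-orth true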

Inverse iteration setup

Inverse iteration generation works by solving the homogeneous equation A x = 0 with a random initial guess x_0. The motivation for this method is that iterative solvers converge the low modes most slowly, so after some number of iterations the solution is "rich" in low modes.
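
Schematically, for a Krylov solver applied to A x = 0 the k-th iterate can be written x_k = p_k(A) x_0 with p_k(0) = 1 (exactly so for CG-type solvers such as CGNR, acting on the normal operator). The solver drives |p_k(lambda)| towards zero fastest over the well-conditioned part of the spectrum, so the components of x_0 that survive in x_k are those along the low modes.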

| Name | Description | Comment | Command-line parameter |
|------|-------------|---------|------------------------|
| QudaInverterType setup_inv_type[QUDA_MAX_MG_LEVEL] | Inverter to use in the setup phase | QUDA_BICGSTAB_INVERTER or QUDA_CGNR_INVERTER generally preferred | --mg-setup-inv <level bicgstab/cgnr/...> |
| double setup_tol[QUDA_MAX_MG_LEVEL] | Tolerance to use for inverse iteration setups | 1e-6 for CGNR/CGNE, 1e-5 otherwise | --mg-setup-tol <level tol> |
| int setup_maxiter[QUDA_MAX_MG_LEVEL] | Maximum number of iterations for each setup solver | 500-1000 should work for most systems | --mg-setup-maxiter <level iter> |
| int setup_maxiter_refresh[QUDA_MAX_MG_LEVEL] | Maximum number of iterations for setup refreshes | 100 as a default | --mg-setup-maxiter-refresh <level iter> |
| QudaCABasis setup_ca_basis[QUDA_MAX_MG_LEVEL] | Basis to use for CA solver setup | QUDA_CHEBYSHEV_BASIS for stability | --mg-setup-ca-basis-type <level power/chebyshev> |
| int setup_ca_basis_size[QUDA_MAX_MG_LEVEL] | Basis size for CA solver setup | 4; higher requires the Chebyshev basis | --mg-setup-ca-basis-size <level size> |
| double setup_ca_lambda_min[QUDA_MAX_MG_LEVEL] | Minimum eigenvalue for the Chebyshev CA basis | 0.0 to be safe | --mg-setup-cheby-basis-eig-min <level value> |
| double setup_ca_lambda_max[QUDA_MAX_MG_LEVEL] | Maximum eigenvalue for the Chebyshev CA basis | -1 triggers power iterations | --mg-setup-cheby-basis-eig-max <level value> |

An example set of parameters is

--mg-setup-type 1 inverse-iterations --mg-setup-tol 1 5e-7 --mg-setup-inv 1 cgnr

Chebyshev filter setup

Chebyshev filter setup, as described in arXiv:2103.05034, P. Boyle and A. Yamaguchi, first uses a Chebyshev low pass filter on a random vector to make it rich in low modes, then performs additional iterations of the Chebyshev polynomial over the entire eigenspace to generate additional near-null vectors.
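
Concretely, a degree-n Chebyshev polynomial T_n of the (suitably shifted and rescaled) operator is applied to a random vector via the three-term recursion T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x), with the window [filter_lambda_min, filter_lambda_max] mapped onto [-1, 1]. Eigenmode components below the window grow rapidly with n relative to those inside it, which is what leaves the filtered vector rich in low modes.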

The Chebyshev filter has a flexible set of parameters describing the initial low-pass filter, the number of starting vectors, and the subsequent generation of near-null vectors from each low-passed starting vector.

| Name | Description | Comment | Command-line parameter |
|------|-------------|---------|------------------------|
| int filter_startup_vectors[QUDA_MAX_MG_LEVEL] | Number of random starting vectors | Enough s.t. 16 near-nulls are generated per start | --mg-setup-filter-startup-vectors <level count> |
| int filter_startup_iterations[QUDA_MAX_MG_LEVEL] | Number of iterations for the initial low-pass | 1000-order polynomial | --mg-setup-filter-startup-iterations <level iters> |
| int filter_startup_rescale_frequency[QUDA_MAX_MG_LEVEL] | Rescaling frequency for the low-pass | 50 iterations | --mg-setup-filter-startup-rescale-frequency <level iters> |
| double filter_lambda_min[QUDA_MAX_MG_LEVEL] | Bottom of the filter window; eigenvalues below are enhanced | 1.0 for sufficient enhancement | --mg-setup-filter-lambda-min <level value> |
| double filter_lambda_max[QUDA_MAX_MG_LEVEL] | Top of the filter window | -1 to calculate it via power iterations | --mg-setup-filter-lambda-max <level value> |
| int filter_iterations_between_vectors[QUDA_MAX_MG_LEVEL] | Number of iterations between generated near-nulls | 150 iterations | --mg-setup-filter-iterations-between-vectors <level iters> |

For the setup vectors, if the number of near-null vectors for level 1 is 24, --mg-setup-filter-startup-vectors 1 1 corresponds to one starting vector with 24 near-nulls generated from it; [...] 1 3 corresponds to three starting vectors with 8 near-nulls generated from each, etc. In cases like [...] 1 5, where 5 does not divide 24, 5 near-null vectors are generated from each of the first four starting vectors and 4 from the last: 4 * 5 + 4 = 24. The "rescale frequency" is an empirical feature: since the norm of a vector (or individual values thereof) could overflow, the vector can be renormalized with some frequency in a way that preserves the Chebyshev recursion.

A reference setup, where 4 base vectors are used (so 8 near-null vectors are generated from each base vector), the minimum of the low-pass filter window is 1.0, a 500-iteration low-pass filter with rescaling every 50 iterations is used, and there are 50 iterations between subsequent near-nulls, is:

--mg-setup-type 1 chebyshev-filter --mg-setup-filter-startup-vectors 1 4 --mg-setup-filter-startup-iterations 1 500 --mg-setup-filter-startup-rescale-frequency 1 50 --mg-setup-filter-lambda-min 1 1.0 --mg-setup-filter-iterations-between-vectors 1 50

Eigenvectors

The variables and flags for eigenvectors are as described below. We don't necessarily recommend using eigenvectors as near-null vectors, since by construction they capture only a small number of exact low modes rather than a rich sampling of the near-null space.

A reference setup without polynomial acceleration is

--mg-setup-type 1 eigenvectors --mg-eig-type 1 trlm --mg-eig-use-dagger 1 false --mg-eig-use-normop 1 true --mg-eig-use-poly-acc 1 false --mg-eig-n-ev 1 36 --mg-eig-n-kr 1 64 --mg-eig-tol 1 1e-5

Restricting fine-level near-null vectors

When restricting fine-level vectors, there is one additional flag to specify what to do when the number of fine near-nulls does not equal the number of coarse near-nulls. As a concrete example: what to do for the remaining 8 vectors when --mg-nvec 0 24 and --mg-nvec 1 32.

This is controlled by the variable QudaNullVectorSetupType setup_restrict_remaining_type[QUDA_MAX_MG_LEVEL], which can be set to inverse iterations, Chebyshev filter, or eigenvectors with the enums described above. The corresponding command line flag is --mg-setup-restrict-remaining-type [level] [method].

The previously mentioned variables/command line flags for various setup types are reused for the setup of remaining near-null vectors.

A reference setup with restriction, then using inverse iterations for the remaining 8 vectors (the 32 requested via --mg-nvec 1 32 minus the 24 restricted from --mg-nvec 0 24), is:

--mg-setup-type 1 restrict-fine --mg-setup-restrict-remaining-type 1 inverse-iterations --mg-setup-tol 1 5e-7 --mg-setup-inv 1 cgnr

--mg-setup-restrict-remaining-type can be changed appropriately, grabbing other reference flags as appropriate.

Refining near-null vectors

"Refining" near-null vectors with inverse iterations is enabled by specifying a non-zero number of refinement iterations via int setup_maxiter_inverse_iterations_refinement[QUDA_MAX_MG_LEVEL]; a recommended value is 50. Refinement is done after near-null vectors have been generated with the Cheyshev filter, eigenvectors, or restriction. The corresponding command line flag is --mg-setup-maxiter-inverse-iterations-refinement [level] [numbers]. Parameters for refinement are taken from the flags for inverse iteration setup.

As an example, the parameters for a Chebyshev filter can be included, and the resulting vectors then refined for 50 iterations by adding:

--mg-setup-maxiter-inverse-iterations-refinement 1 50 --mg-setup-tol 1 5e-7 --mg-setup-inv 1 cgnr

Running the Multigrid Solver

With the adaptive setup complete we can run the multigrid-accelerated solver. Instrumenting the invertQuda solver interface to use the multigrid preconditioner is straightforward. At present, the only Krylov solver that can be used with the multigrid preconditioner is the GCR solver, and the precision settings of the final solver and preconditioner should match.

The following parameters in QudaInvertParam should be set:

QudaInvertParam inv_param;
//...set general solver parameters
inv_param.inv_type_precondition = QUDA_MG_INVERTER; // set the preconditioner type to multigrid

QudaMultigridParam mg_param = newQudaMultigridParam();
//...fill out the MG parameters...
void *mg_preconditioner = newMultigridQuda(&mg_param); // create the MG instance
inv_param.preconditioner = mg_preconditioner; // set pointer to the relevant MG preconditioner instance

Suggested parameters for the GCR solver that wraps the MG preconditioner include

inv_param.inv_type = QUDA_GCR_INVERTER;
inv_param.gcrNKrylov = 20;
inv_param.reliable_delta = 1e-5;
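
Putting the pieces together, a minimal end-to-end sketch looks as follows (continuing from the snippets above; gauge_param, h_gauge, h_x and h_b are hypothetical host-side objects, and error handling is omitted):

loadGaugeQuda((void *)h_gauge, &gauge_param);           // gauge field must be resident before setup

void *mg_preconditioner = newMultigridQuda(&mg_param);  // run the adaptive setup
inv_param.inv_type_precondition = QUDA_MG_INVERTER;
inv_param.preconditioner = mg_preconditioner;

invertQuda(h_x, h_b, &inv_param);                       // MG-preconditioned GCR solve

destroyMultigridQuda(mg_preconditioner);                // free the MG instance when done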

Example

The tests/invert_test.cpp file contains a simple example of how to use the multigrid solver, and is intended to be instructional on how to interface the multigrid solver with other applications. This test supports loading gauge fields (requires QIO support to be enabled), loading previously generated null-space vectors, and dumping these to disk for future use. A list of the relevant command-line options can be gleaned with the --help option. The help output is categorised into different groups for ease of reading. The general invert options come first, then there are options for using the eigensolver to deflate, using Eig-CG deflation, and finally the multigrid options. Here is the output:

$ ./invert_test --help
QUDA internal test
Usage: ./invert_test [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  --alternative-reliable BOOL=0
                              use alternative reliable updates
  --anisotropy FLOAT=1        Temporal anisotropy factor (default 1.0)
  --ca-basis-type ENUM:value in {power,chebyshev}=0
                              The basis to use for CA-CG (default power)
  --cheby-basis-eig-max FLOAT=-1
                              Conservative estimate of largest eigenvalue for Chebyshev basis CA-CG (default is to guess with power iterations)
  --cheby-basis-eig-min FLOAT=0
                              Conservative estimate of smallest eigenvalue for Chebyshev basis CA-CG (default 0)
  --clover-coeff FLOAT=0.1    Clover coefficient
  --compute-clover BOOL=0     Compute the clover field or use random numbers (default false)
  --compute-fat-long BOOL=0   Compute the fat/long field or use random numbers (default false)
  --contraction-type ENUM:value in {open,dr}=0
                              Whether to leave spin elemental open, or use a gamma basis and contract on spin (default open)
  --dagger                    Set the dagger to 1 (default 0)
  --device INT:INT in [0 - 16]=-1
                              Set the CUDA device to use (default 0, single GPU only)
  --dslash-type ENUM:value in {wilson,clover,twisted-mass,twisted-clover,clover-hasenbusch-twist,staggered,asqtad,domain-wall,domain-wall-4d,mobius,laplace}=0
                              Set the dslash type
  --epsilon FLOAT=0.01        Twisted-Mass flavor twist of Dirac operator (default 0.01)
  --epsilon-naik FLOAT=0      Epsilon factor on Naik term (default 0.0, suggested non-zero -0.1)
  --flavor ENUM:value in {singlet,deg-doublet,nondeg-doublet,no}=1
                              Set the twisted mass flavor type (singlet (default), deg-doublet, nondeg-doublet)
  --gaussian-sigma FLOAT=0.2  Width of the Gaussian noise used for random gauge field contruction (default 0.2)
  --heatbath-beta FLOAT=6.2   Beta value used in heatbath test (default 6.2)
  --heatbath-coldstart BOOL=0 Whether to use a cold or hot start in heatbath test (default false)
  --heatbath-num-hb-per-step INT=5
                              Number of heatbath hits per heatbath step (default 5)
  --heatbath-num-or-per-step INT=5
                              Number of overrelaxation hits per heatbath step (default 5)
  --heatbath-num-steps INT=10 Number of measurement steps in heatbath test (default 10)
  --heatbath-warmup-steps INT=10
                              Number of warmup steps in heatbath test (default 10)
  --inv-type ENUM:value in {cg,bicgstab,gcr,pcg,mpcg,mpbicgstab,mr,sd,eigcg,inc-eigcg,gmresdr,gmresdr-proj,gmresdr-sh,fgmresdr,mg,bicgstab-l,cgne,cgnr,cg3,cg3ne,cg3nr,ca-cg,ca-cgne,ca-cgnr,ca-gcr}=0
                              The type of solver to use (default cg)
  --inv-deflate BOOL=0        Deflate the inverter using the eigensolver
  --inv-multigrid BOOL=0      Precondition the inverter using multigrid
  --kappa FLOAT=-1            Kappa of Dirac operator (default 0.12195122... [equiv to mass])
  --laplace3D INT=4           Restrict laplace operator to omit the t dimension (n=3), or include all dims (n=4) (default 4)
  --load-gauge TEXT           Load gauge field " file " for the test (requires QIO)
  --Lsdim INT=16              Set Ls dimension size(default 16)
  --mass FLOAT=0.1            Mass of Dirac operator (default 0.1)
  --mass-normalization ENUM:value in {kappa,mass,asym-mass}=0
                              Mass normalization (kappa (default) / mass / asym-mass)
  --matpc ENUM:value in {even-even,odd-odd,even-even-asym,odd-odd-asym}=0
                              Matrix preconditioning type (even-even, odd-odd, even-even-asym, odd-odd-asym)
  --msrc INT=1                Used for testing non-square block blas routines where nsrc defines the other dimension
  --mu FLOAT=0.1              Twisted-Mass chiral twist of Dirac operator (default 0.1)
  --m5 FLOAT=-1.5             Mass of shift of five-dimensional Dirac operators (default -1.5)
  --b5 FLOAT=1.5              Mobius b5 parameter (default 1.5)
  --c5 FLOAT=0.5              Mobius c5 parameter (default 0.5)
  --multishift INT=1          Whether to do a multi-shift solver test or not. Default is 1 (single mass)If a value N > 1 is passed, heavier masses will be constructed and the multi-shift solver will be called
  --ngcrkrylov INT=10         The number of inner iterations to use for GCR, BiCGstab-l, CA-CG (default 10)
  --niter INT=100             The number of iterations to perform (default 10)
  --nsrc INT=1                How many spinors to apply the dslash to simultaneusly (experimental for staggered only)
  --pipeline INT=0            The pipeline length for fused operations in GCR, BiCGstab-l (default 0, no pipelining)
  --prec ENUM:value in {double,single,half,quarter}=4
                              Precision in GPU
  --prec-precondition ENUM:value in {double,single,half,quarter}=-2147483648
                              Preconditioner precision in GPU
  --prec-refine ENUM:value in {double,single,half,quarter}=-2147483648
                              Sloppy precision for refinement in GPU
  --prec-ritz ENUM:value in {double,single,half,quarter}=-2147483648
                              Eigenvector precision in GPU
  --prec-sloppy ENUM:value in {double,single,half,quarter}=-2147483648
                              Sloppy precision in GPU
  --prec-null ENUM:value in {double,single,half,quarter}=-2147483648
                              Precison TODO
  --precon-type ENUM:value in {cg,bicgstab,gcr,pcg,mpcg,mpbicgstab,mr,sd,eigcg,inc-eigcg,gmresdr,gmresdr-proj,gmresdr-sh,fgmresdr,mg,bicgstab-l,cgne,cgnr,cg3,cg3ne,cg3nr,ca-cg,ca-cgne,ca-cgnr,ca-gcr}=-2147483648
                              The type of solver to use (default none (=unspecified)).
  --rank-order INT:value in {col,row}=0
                              Set the [t][z][y][x] rank order as either column major (t fastest, default) or row major (x fastest)
  --recon ENUM:value in {18,13,12,9,8}=18
                              Link reconstruction type
  --recon-precondition ENUM:value in {18,13,12,9,8}=-2147483648
                              Preconditioner link reconstruction type
  --recon-sloppy ENUM:value in {18,13,12,9,8}=-2147483648
                              Sloppy link reconstruction type
  --reliable-delta FLOAT=0.1  Set reliable update delta factor
  --save-gauge TEXT           Save gauge field " file " for the test (requires QIO, heatbath test only)
  --solution-pipeline INT=0   The pipeline length for fused solution accumulation (default 0, no pipelining)
  --solution-type ENUM:value in {mat,mat-dag-mat,mat-pc,mat-pc-dag,mat-pc-dag-mat-pc}=0
                              The solution we desire (mat (default), mat-dag-mat, mat-pc, mat-pc-dag-mat-pc (default for multi-shift))
  --fermion-t-boundary ENUM:value in {periodic,anti-periodic}=-1
                              The fermoinic temporal boundary conditions (anti-periodic (default), periodic
  --solve-type ENUM:value in {direct,direct-pc,normop,normop-pc,normerr,normerr-pc}=3
                              The type of solve to do (direct, direct-pc, normop, normop-pc, normerr, normerr-pc)
  --solver-ext-lib-type ENUM:value in {eigen,magma}=1
                              Set external library for the solvers  (default Eigen library)
  --tadpole-coeff FLOAT=1     Tadpole coefficient for HISQ fermions (default 1.0, recommended [Plaq]^1/4)
  --tol FLOAT=1e-07           Set L2 residual tolerance
  --tolhq FLOAT=0             Set heavy-quark residual tolerance
  --unit-gauge BOOL=0         Generate a unit valued gauge field in the tests. If false, a random gauge is generated (default false)
  --verbosity ENUM:value in {silent,summarize,verbose,debug}=1
                              The the verbosity on the top level of QUDA( default summarize)
  --verify BOOL=1             Verify the GPU results using CPU results (default true)
  --dim INT:INT in [1 - 512]=[24,24,24,24] ... Excludes: --xdim --ydim --zdim --tdim
                              Set space-time dimension (X Y Z T)
  --sdim INT:INT in [1 - 512] Excludes: --xdim --ydim --zdim
                              Set space dimension(X/Y/Z) size
  --xdim INT:INT in [1 - 512]=24 Excludes: --dim --sdim
                              Set X dimension size(default 24)
  --ydim INT:INT in [1 - 512]=24 Excludes: --dim --sdim
                              Set X dimension size(default 24)
  --zdim INT:INT in [1 - 512]=24 Excludes: --dim --sdim
                              Set X dimension size(default 24)
  --tdim INT:INT in [1 - 512]=24 Excludes: --dim
                              Set T dimension size(default 24)
  --partition                 Set the communication topology (X=1, Y=2, Z=4, T=8, and combinations of these)
  --gridsize INT=[1,1,1,1] x 4 Excludes: --xgridsize --ygridsize --zgridsize --tgridsize
                              Set the grid size in all four dimension (default 1 1 1 1)
  --xgridsize INT=1 Excludes: --gridsize
                              Set grid size in X dimension (default 1)
  --ygridsize INT=1 Excludes: --gridsize
                              Set grid size in Y dimension (default 1)
  --zgridsize INT=1 Excludes: --gridsize
                              Set grid size in Z dimension (default 1)
  --tgridsize INT=1 Excludes: --gridsize
                              Set grid size in T dimension (default 1)
  --mg-eig-coarse-guess BOOL=0
                              If deflating on the coarse grid, optionally use an initial guess (default = false)
  --mg-eig-preserve-deflation BOOL=0
                              If the multigrid operator is updated, preserve generated deflation space (default = false)
[Option Group: Eigensolver]
  Options controlling eigensolver
  Options:
    --eig-amax FLOAT:POSITIVE   The maximum in the polynomial acceleration
    --eig-amin FLOAT:POSITIVE   The minimum in the polynomial acceleration
    --eig-ARPACK-logfile TEXT   The filename storing the log from arpack
    --eig-QUDA-logfile TEXT     The filename storing the stdout from the QUDA eigensolver
    --eig-arpack-check BOOL     Cross check the device data against ARPACK (requires ARPACK, default false)
    --eig-check-interval INT    Perform a convergence check every nth restart/iteration in the IRAM,IRLM/lanczos,arnoldi eigensolver
    --eig-compute-svd BOOL      Solve the MdagM problem, use to compute SVD of M (default false)
    --eig-max-restarts INT      Perform n iterations of the restart in the eigensolver
    --eig-n-conv INT             The number of converged eigenvalues requested
    --eig-n-ev INT               The size of eigenvector search space in the eigensolver
    --eig-n-kr INT               The size of the Krylov subspace to use in the eigensolver
    --eig-batched-rotate INT    The maximum number of extra eigenvectors the solver may allocate to perform a Ritz rotation.
    --eig-poly-deg INT          TODO
    --eig-require-convergence BOOL
                                If true, the solver will error out if convergence is not attained. If false, a warning will be given (default true)
    --eig-save-vec TEXT         Save eigenvectors to <file> (requires QIO)
    --eig-load-vec TEXT:FILE    Load eigenvectors to <file> (requires QIO)
    --eig-spectrum ENUM:value in {SR,LR,SM,LM,SI,LI}
                                The spectrum part to be calulated. S=smallest L=largest R=real M=modulus I=imaginary
    --eig-tol FLOAT             The tolerance to use in the eigensolver
    --eig-type ENUM:value in {trlm,irlm,iram}
                                The type of eigensolver to use (default trlm)
    --eig-use-dagger BOOL       Solve the Mdag  problem instead of M (MMdag if eig-use-normop == true) (default false)
    --eig-use-normop BOOL       Solve the MdagM problem instead of M (MMdag if eig-use-dagger == true) (default false)
    --eig-use-poly-acc BOOL     Use Chebyshev polynomial acceleration in the eigensolver
[Option Group: Deflation]
  Options controlling deflation
  Options:
    --df-deflation-grid INT:POSITIVE
                                Set maximum number of cycles needed to compute eigenvectors(default 1)
    --df-eigcg-max-restarts INT:POSITIVE
                                Set how many iterative refinement cycles will be solved with eigCG within a single physical right hand site solve (default 4)
    --df-ext-lib-type ENUM      Set external library for the deflation methods  (default Eigen library)
    --df-location-ritz ENUM     Set memory location for the ritz vectors  (default cuda memory location)
    --df-max-restart-num INT    Set maximum number of the initCG restarts in the deflation stage (default 3)
    --df-max-search-dim INT     Set the size of eigenvector search space (default 64)
    --df-mem-type-ritz ENUM     Set memory type for the ritz vectors  (default device memory type)
    --df-nev INT                Set number of eigenvectors computed within a single solve cycle (default 8)
    --df-tol-eigenval FLOAT     Set maximum eigenvalue residual norm (default 1e-1)
    --df-tol-inc FLOAT          Set tolerance for the subsequent restarts in the initCG solver  (default 1e-2)
    --df-tol-restart FLOAT      Set tolerance for the first restart in the initCG solver(default 5e-5)
[Option Group: MultiGrid]
  Options controlling deflation
  Options:
    --mg-block-size LEVEL INT ...
                                Set the geometric block size for the each multigrid levels transfer operator (default 4 4 4 4)
    --mg-coarse-solve-type LEVEL ENUM:value in {direct,direct-pc,normop,normop-pc,normerr,normerr-pc} ...
                                The type of solve to do on each level (direct, direct-pc) (default = solve_type)
    --mg-coarse-solver LEVEL ENUM:value in {cg,bicgstab,gcr,pcg,mpcg,mpbicgstab,mr,sd,eigcg,inc-eigcg,gmresdr,gmresdr-proj,gmresdr-sh,fgmresdr,mg,bicgstab-l,cgne,cgnr,cg3,cg3ne,cg3nr,ca-cg,ca-cgne,ca-cgnr,ca-gcr} ...
                                The solver to wrap the V cycle on each level (default gcr, only for levels 1+)
    --mg-coarse-solver-ca-basis-size LEVEL INT:POSITIVE ...
                                The basis size to use for CA-CG setup of multigrid (default 4)
    --mg-coarse-solver-ca-basis-type LEVEL ENUM:value in {power,chebyshev} ...
                                The basis to use for CA-CG setup of multigrid(default power)
    --mg-coarse-solver-cheby-basis-eig-max LEVEL FLOAT:POSITIVE ...
                                Conservative estimate of largest eigenvalue for Chebyshev basis CA-CG in setup of multigrid (default is to guess with power iterations)
    --mg-coarse-solver-cheby-basis-eig-min LEVEL FLOAT:POSITIVE ...
                                Conservative estimate of smallest eigenvalue for Chebyshev basis CA-CG in setup of multigrid (default 0)
    --mg-coarse-solver-maxiter LEVEL INT:POSITIVE ...
                                The coarse solver maxiter for each level (default 100)
    --mg-coarse-solver-tol LEVEL FLOAT:POSITIVE ...
                                The coarse solver tolerance for each level (default 0.25, only for levels 1+)
    --mg-eig LEVEL BOOL ...     Use the eigensolver on this level (default false)
    --mg-eig-amax LEVEL FLOAT:POSITIVE ...
                                The maximum in the polynomial acceleration (default 4.0)
    --mg-eig-amin LEVEL FLOAT:POSITIVE ...
                                The minimum in the polynomial acceleration (default 0.1)
    --mg-eig-check-interval LEVEL INT ...
                                Perform a convergence check every nth restart/iteration (only used in Implicit Restart types)
    --mg-eig-max-restarts LEVEL INT:POSITIVE ...
                                Perform a maximun of n restarts in eigensolver (default 100)
    --mg-eig-n-ev LEVEL INT ...  The size of eigenvector search space in the eigensolver
    --mg-eig-n-kr LEVEL INT ...  The size of the Krylov subspace to use in the eigensolver
    --mg-eig-batched-rotate LEVEL INT ...
                                The maximum number of extra eigenvectors the solver may allocate to perform a Ritz rotation.
    --mg-eig-poly-deg LEVEL INT:POSITIVE ...
                                Set the degree of the Chebyshev polynomial (default 100)
    --mg-eig-require-convergence LEVEL BOOL ...
                                If true, the solver will error out if convergence is not attained. If false, a warning will be given (default true)
    --mg-eig-spectrum LEVEL ENUM:value in {SR,LR,SM,LM,SI,LI} ...
                                The spectrum part to be calulated. S=smallest L=largest R=real M=modulus I=imaginary (default SR)
    --mg-eig-tol LEVEL FLOAT:POSITIVE ...
                                The tolerance to use in the eigensolver (default 1e-6)
    --mg-eig-type LEVEL ENUM:value in {trlm,irlm,iram} ...
                                The type of eigensolver to use (default trlm)
    --mg-eig-use-dagger LEVEL BOOL ...
                                Solve the MMdag problem instead of M (MMdag if eig-use-normop == true) (default false)
    --mg-eig-use-normop LEVEL BOOL ...
                                Solve the MdagM problem instead of M (MMdag if eig-use-dagger == true) (default false)
    --mg-eig-use-poly-acc LEVEL BOOL ...
                                Use Chebyshev polynomial acceleration in the eigensolver (default true)
    --mg-generate-all-levels BOOL
                                true=generate null-space on all levels, false=generate on level 0 and create other levels from that (default true)
    --mg-generate-nullspace BOOL
                                Generate the null-space vector dynamically (default true, if set false and mg-load-vec isn't set, creates free-field null vectors)
    --mg-levels INT             The number of multigrid levels to do (default 2)
    --mg-load-vec LEVEL TEXT ...
                                Load the vectors <file> for the multigrid_test (requires QIO)
    --mg-save-vec LEVEL TEXT ...
                                Save the generated null-space vectors <file> from the multigrid_test (requires QIO)
    --mg-low-mode-check BOOL    Measure how well the null vector subspace overlaps with the low eigenmode subspace (default false)
    --mg-mu-factor LEVEL FLOAT ...
                                Set the multiplicative factor for the twisted mass mu parameter on each level (default 1)
    --mg-n-block-ortho LEVEL INT:POSITIVE ...
                                The number of times to run Gram-Schmidt during block orthonormalization (default 1)
    --mg-nu-post LEVEL INT:POSITIVE ...
                                The number of post-smoother applications to do at a given multigrid level (default 2)
    --mg-nu-pre LEVEL INT:POSITIVE ...
                                The number of pre-smoother applications to do at a given multigrid level (default 2)
    --mg-nvec LEVEL INT:POSITIVE ...
                                Number of null-space vectors to define the multigrid transfer operator on a given level
    --mg-oblique-proj-check BOOL
                                Measure how well the null vector subspace adjusts the low eigenmode subspace (default false)
    --mg-omega FLOAT            The over/under relaxation factor for the smoother of multigrid (default 0.85)
    --mg-post-orth BOOL         If orthonormalize the vector after inverting in the setup of multigrid (default true)
    --mg-pre-orth BOOL          If orthonormalize the vector before inverting in the setup of multigrid (default false)
    --mg-schwarz-cycle LEVEL INT:POSITIVE ...
                                The number of Schwarz cycles to apply per smoother application (default=1)
    --mg-schwarz-type LEVEL ENUM ...
                                Whether to use Schwarz preconditioning (requires MR smoother and GCR setup solver) (default false)
    --mg-setup-ca-basis-size LEVEL INT:POSITIVE ...
                                The basis size to use for CA-CG setup of multigrid (default 4)
    --mg-setup-ca-basis-type LEVEL ENUM:value in {power,chebyshev} ...
                                The basis to use for CA-CG setup of multigrid(default power)
    --mg-setup-cheby-basis-eig-max LEVEL FLOAT:POSITIVE ...
                                Conservative estimate of largest eigenvalue for Chebyshev basis CA-CG in setup of multigrid (default is to guess with power iterations)
    --mg-setup-cheby-basis-eig-min LEVEL FLOAT:POSITIVE ...
                                Conservative estimate of smallest eigenvalue for Chebyshev basis CA-CG in setup of multigrid (default 0)
    --mg-setup-inv LEVEL ENUM:value in {cg,bicgstab,gcr,pcg,mpcg,mpbicgstab,mr,sd,eigcg,inc-eigcg,gmresdr,gmresdr-proj,gmresdr-sh,fgmresdr,mg,bicgstab-l,cgne,cgnr,cg3,cg3ne,cg3nr,ca-cg,ca-cgne,ca-cgnr,ca-gcr} ...
                                The inverter to use for the setup of multigrid (default bicgstab)
    --mg-setup-iters LEVEL INT:POSITIVE ...
                                The number of setup iterations to use for the multigrid (default 1)
    --mg-setup-location LEVEL ENUM:value in {cpu,host,gpu,device} ...
                                The location where the multigrid setup will be computed (default cuda)
    --mg-setup-maxiter LEVEL INT ...
                                The maximum number of solver iterations to use when relaxing on a null space vector (default 500)
    --mg-setup-maxiter-refresh LEVEL INT ...
                                The maximum number of solver iterations to use when refreshing the pre-existing null space vectors (default 100)
    --mg-setup-tol LEVEL FLOAT ...
                                The tolerance to use for the setup of multigrid (default 5e-6)
    --mg-setup-type ENUM:value in {test,null}
                                The type of setup to use for the multigrid (default null)
    --mg-smoother LEVEL ENUM:value in {cg,bicgstab,gcr,pcg,mpcg,mpbicgstab,mr,sd,eigcg,inc-eigcg,gmresdr,gmresdr-proj,gmresdr-sh,fgmresdr,mg,bicgstab-l,cgne,cgnr,cg3,cg3ne,cg3nr,ca-cg,ca-cgne,ca-cgnr,ca-gcr} ...
                                The smoother to use for multigrid (default mr)
    --mg-smoother-halo-prec ENUM:value in {double,single,half,quarter}
                                The smoother halo precision (applies to all levels - defaults to null_precision)
    --mg-smoother-solve-type LEVEL ENUM:value in {direct,direct-pc,normop,normop-pc,normerr,normerr-pc} ...
                                The type of solve to do in smoother (direct, direct-pc (default) )
    --mg-smoother-tol LEVEL FLOAT ...
                                The smoother tolerance to use for each multigrid (default 0.25)
    --mg-solve-location LEVEL ENUM:value in {cpu,host,gpu,device} ...
                                The location where the multigrid solver will run (default cuda)
    --mg-verbosity LEVEL ENUM:value in {silent,summarize,verbose,debug} ...
                                The verbosity to use on each level of the multigrid (default summarize)

For example, say we have a lattice stored in the file ~/lattices/wl_5p5_x2p38_um0p4086_cfg_1000.lime, of size 16^3x64, that we wish to use, and we want to run the multigrid setup to generate the null-space vectors, dump them to disk, and then solve a linear system with mass parameter mass=-0.42. To run this on two GPUs (two processes) we might do something like this:

export META="--dim 16 16 16 32 --gridsize 1 1 1 2 --mass -0.42 --anisotropy 2.38 --load-gauge wl_5p5_x2p38_um0p4086_cfg_1000.lime --inv-multigrid true"
export SOLVER_PARAM="--prec double --prec-sloppy single --prec-precondition single --tol 1e-8 --inv-type gcr --solve-type direct-pc"
export MG_PARAM="--mg-levels 3 --mg-generate-nullspace true --mg-save-vec /tmp/null"
export LEVEL0_PARAM="--mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-block-size 0 4 4 4 4 --mg-nvec 0 24 --mg-setup-inv 0 cg --mg-smoother 0 ca-gcr"
export LEVEL1_PARAM="--mg-nu-pre 1 0 --mg-nu-post 1 8 --mg-block-size 1 2 2 2 2 --mg-nvec 1 32 --mg-setup-inv 1 cg --mg-smoother 1 ca-gcr"
export LEVEL2_PARAM="--mg-coarse-solver 2 ca-gcr --mg-coarse-solver-maxiter 2 8"
mpirun -np 2 tests/invert_test $META $SOLVER_PARAM $MG_PARAM $LEVEL0_PARAM $LEVEL1_PARAM $LEVEL2_PARAM

This will run a 3-level multigrid process, with a 16^3x32 volume on each GPU at the fine grid. The first coarse grid will have 4^3x8 geometry per GPU, and the second will have 2^3x4. The parameters for the different multigrid levels are set in the LEVEL0_PARAM and LEVEL1_PARAM variables, e.g., the number of null-space vectors per level, the block size, the solver to use in the dynamic setup, and the solver to use as a smoother. The --mg-smoother-tol parameter is the tolerance used for each level of the solver (excluding the fine grid) when a multigrid K-cycle is used (QudaMultigridParam::cycle_type = QUDA_MG_CYCLE_RECURSIVE), as well as determining the tolerance of the coarsest-grid solve.

If we want to reuse these same null-space vectors in a subsequent run, and not rerun the null-space generation, we would use

export META="--dim 16 16 16 32 --gridsize 1 1 1 2 --mass -0.42 --anisotropy 2.38 --load-gauge wl_5p5_x2p38_um0p4086_cfg_1000.lime --inv-multigrid true"
export SOLVER_PARAM="--prec double --prec-sloppy single --prec-precondition single --tol 1e-8 --inv-type gcr --solve-type direct-pc"
export MG_PARAM="--mg-levels 3 --mg-generate-nullspace false --mg-load-vec /tmp/null"
export LEVEL0_PARAM="--mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-block-size 0 4 4 4 4 --mg-nvec 0 24 --mg-setup-inv 0 cg --mg-smoother 0 ca-gcr"
export LEVEL1_PARAM="--mg-nu-pre 1 0 --mg-nu-post 1 8 --mg-block-size 1 2 2 2 2 --mg-nvec 1 32 --mg-setup-inv 1 cg --mg-smoother 1 ca-gcr"
export LEVEL2_PARAM="--mg-coarse-solver 2 ca-gcr --mg-coarse-solver-maxiter 2 8"
mpirun -np 2 tests/invert_test $META $SOLVER_PARAM $MG_PARAM $LEVEL0_PARAM $LEVEL1_PARAM $LEVEL2_PARAM

Combining Schwarz Preconditioning with Multigrid

QUDA has the ability to use Schwarz preconditioning as a smoother, as well as a preconditioner for the coarsest-grid solver. There are two parameters that control this:

QudaMultigridParam::smoother_schwarz_type[QUDA_MAX_MG_LEVEL];
QudaMultigridParam::smoother_schwarz_cycle[QUDA_MAX_MG_LEVEL];

The first parameter is used to enable Schwarz preconditioning on a given level; valid values are QUDA_INVALID_SCHWARZ (default), QUDA_ADDITIVE_SCHWARZ and QUDA_MULTIPLICATIVE_SCHWARZ. These equate to the --mg-schwarz-type <level false/add/mul> and --mg-schwarz-cycle <level cycle> command-line parameters for invert_test.

QUDA's Schwarz preconditioning applies the domain decomposition at the process interfaces, meaning that each Schwarz block is exactly the local volume and hence depends on the number of processes and the process topology. The second parameter determines the number of cycles of the preconditioner that are applied per step (default 1). When doing multiplicative preconditioning, the number of cycles must be even so that the entire lattice is updated. The number of iterations of each Schwarz cycle is set by the nu_pre and nu_post parameters; hence for every cycle iteration there are nu iterations of the domain solver.
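
Through the interface, the equivalent settings for, say, multiplicative Schwarz with two cycles on the coarsest of three levels would be (a sketch using the two fields above):

mg_param.smoother_schwarz_type[2] = QUDA_MULTIPLICATIVE_SCHWARZ;
mg_param.smoother_schwarz_cycle[2] = 2; // even, so both block checkerboards get updated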

At present, the Schwarz preconditioner is applied both to the setup solver and to the actual smoother in the final multigrid algorithm. This has the implication that the setup solver must support Schwarz preconditioning: in practice (at present) this means that GCR must be used as the setup solver. Similarly, the actual solver used for solving the Schwarz system must be MR.

For example, below we use additive Schwarz as a smoother on the fine and intermediate levels, and multiplicative Schwarz preconditioning on the coarsest grid.

export META="--dim 16 16 16 32 --gridsize 1 1 1 2 --mass -0.42 --anisotropy 2.38 --load-gauge gauge.conf --inv-multigrid true"
export SOLVER_PARAM="--prec double --prec-sloppy single --prec-precondition single --tol 1e-8 --inv-type gcr --solve-type direct-pc"
export MG_PARAM="--mg-levels 3 --mg-generate-nullspace true --mg-save-vec /tmp/null"
export LEVEL0_PARAM="--mg-nu-pre 0 0 --mg-nu-post 0 8 --mg-block-size 0 4 4 4 4 --mg-nvec 0 24 --mg-setup-inv 0 gcr --mg-schwarz-type 0 add --mg-smoother 0 mr"
export LEVEL1_PARAM="--mg-nu-pre 1 0 --mg-nu-post 1 8 --mg-block-size 1 2 2 2 2 --mg-nvec 1 32 --mg-setup-inv 1 gcr --mg-schwarz-type 1 add --mg-smoother 1 mr"
export LEVEL2_PARAM="--mg-schwarz-type 2 mul --mg-schwarz-cycle 2 2 --mg-smoother 2 mr --mg-coarse-solver 2 gcr"
mpirun -np 2 tests/invert_test $META $SOLVER_PARAM $MG_PARAM $LEVEL0_PARAM $LEVEL1_PARAM $LEVEL2_PARAM

Tuning the Multigrid Solver

The multigrid solver has many important elements that you need to consider for your problem. You may find that many of them have little to no effect when solving heavy-quark or otherwise unphysical problems, but for large and physical lattices they must be given proper attention. The tuning is done in two stages.

Levels, Null Vectors, and Blocking Schemes.

Typical problems will not require more than a 3-level solve, and 3-level solves offer good consistency over different configurations, but you may find that a 2-level solve is better. On each level you may choose to use either 24 or 32 null vectors to generate the null space that creates the coarse grid. On each level there is also a blocking scheme (or aggregation scheme) that dictates the coarseness of the coarse-grid problem. For example, you might wish to perform a 2-level solve on an L^3xT = 16^3x32 lattice with aggregates of size 4^3x8, giving a coarse-grid problem of size 4^4.

In general, QUDA's multigrid parameters can all be requested at runtime. The main exception is the aggregate size: the restrictor requires that the aggregate size be declared at compile time in order to generate efficient code. If you receive an error of the form "block size not instantiated", simply edit lib/restrictor.cu to instantiate the correct template accordingly: the CUDA thread block size should be set to half the value of the geometric aggregate size (this is the checkerboarded aggregate size; the parity is handled separately in the thread y dimension). Hardware constraints mean that only 1024 threads can be launched per CUDA block, so the largest aggregation scheme allowed on any MG level is 4x4x8x8.

Once you have decided on the range of null vectors, levels, and aggregation schemes to test, launch tests/invert_test (or your own test code) for each combination. QUDA's internal hardware tuning routines will then profile these combinations so that subsequent jobs run as quickly as possible. This will take some time, especially for large lattices, but need be done only once. If you're testing 2- and 3-level solves, and your coarse 2-level set-up is the same as the intermediate set-up of the 3-level solve (say 2LVL MG=4444 Null=24, and 3LVL MG=4444-2222 Null=24-24), then do all the 2-level tuning first. The profiles for the intermediate level will be picked up and reused, saving a lot of time.

Algorithmic and Physical Parameters.

Now that QUDA has profiled each MG combination, you can move on to the numerical parameters. One of the most important is the tolerance of the inner GCR smoother. We have found that setting this to a value of 0.25 usually achieves good results, but you may wish to fine-tune this number. Variations of up to 100% in outer solve times can be seen for smoother tolerance values in the range 0.1-0.3.
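
For example, one might compare outer solve times across a small scan such as

--mg-smoother-tol 1 0.1 --mg-smoother-tol 2 0.1

versus the same flags with 0.25 or 0.3.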

When using twisted-mass or twisted-clover fermions, changing the value of mu on the coarsest grid can have a significant effect on the outer solve time. Use the command line option --mg-mu-factor <level> <mu-factor>, where <mu-factor> is the multiplicative factor by which mu is increased on the coarse levels. The number of (pre- and post-) smoother iterations is another variable to consider.
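
For example (the value here is purely illustrative; the optimal factor depends on the ensemble):

--mg-mu-factor 2 8.0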

Generalities

It is best to use the CG solver to generate the null space (--mg-setup-inv <level> cg) for twisted fermions, as it is more stable. For precisions, make sure you pass --prec-sloppy single, --prec-precondition half and --prec-null half. Note that with twisted-clover fermions, half precision is only recommended if DYNAMIC_CLOVER is enabled.

Note that all MG command line options refer to the Nth multigrid level with the integer N-1. For example, for a 3-level solve one would pass --mg-block-size 0 4 4 4 8 to set the blocking scheme on the first level, and --mg-setup-inv 1 gcr to set the inverter type on the second level.

Multigrid inverter + Lanczos

There are two ways in which one can use Lanczos to accelerate the multigrid inverter. In both cases we construct a QudaEigParam for each level; an exhaustive run-down of the options follows:

    --mg-eig LEVEL BOOL ...     Use the eigensolver on this level (default false)
    --mg-eig-amax LEVEL FLOAT:POSITIVE ...
                                The maximum in the polynomial acceleration (default 4.0)
    --mg-eig-amin LEVEL FLOAT:POSITIVE ...
                                The minimum in the polynomial acceleration (default 0.1)
    --mg-eig-check-interval LEVEL INT ...
                                Perform a convergence check every nth restart/iteration (only used in Implicit Restart types)
    --mg-eig-coarse-guess BOOL=0
                              If deflating on the coarse grid, optionally use an initial guess (default = false)    
    --mg-eig-max-restarts LEVEL INT:POSITIVE ...
                                Perform a maximun of n restarts in eigensolver (default 100)
    --mg-eig-n-ev LEVEL INT ...  The size of eigenvector search space in the eigensolver
    --mg-eig-n-kr LEVEL INT ...  The size of the Krylov subspace to use in the eigensolver
    --mg-eig-batched-rotate LEVEL INT ...
                                The maximum number of extra eigenvectors the solver may allocate to perform a Ritz rotation.
    --mg-eig-preserve-deflation BOOL=0
                              If the multigrid operator is updated, preserve generated deflation space (default = false)
    --mg-eig-poly-deg LEVEL INT:POSITIVE ...
                                Set the degree of the Chebyshev polynomial (default 100)
    --mg-eig-require-convergence LEVEL BOOL ...
                                If true, the solver will error out if convergence is not attained. If false, a warning will be given (default true)
    --mg-eig-spectrum LEVEL ENUM:value in {SR,LR,SM,LM,SI,LI} ...
                                The spectrum part to be calculated. S=smallest L=largest R=real M=modulus I=imaginary (default SR)
    --mg-eig-tol LEVEL FLOAT:POSITIVE ...
                                The tolerance to use in the eigensolver (default 1e-6)
    --mg-eig-type LEVEL ENUM:value in {trlm,irlm,iram} ...
                                The type of eigensolver to use (default trlm)
    --mg-eig-use-dagger LEVEL BOOL ...
                                Solve the Mdag problem instead of M (MMdag if eig-use-normop == true) (default false)
    --mg-eig-use-normop LEVEL BOOL ...
                                Solve the MdagM problem instead of M (MMdag if eig-use-dagger == true) (default false)
    --mg-eig-use-poly-acc LEVEL BOOL ...
                                Use Chebyshev polynomial acceleration in the eigensolver (default true)
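As a usage sketch, the fragment below enables a thick restarted Lanczos solve of MdagM on level 1. The sizes and tolerance are illustrative, chosen to follow the pattern of the Lanczos command later on this page (n_ev slightly above --mg-nvec, n_kr three times n_ev):

    ARGSTR=$ARGSTR"--mg-eig 1 true --mg-eig-type 1 trlm --mg-eig-use-normop 1 true --mg-eig-use-dagger 1 false "
    ARGSTR=$ARGSTR"--mg-eig-n-ev 1 34 --mg-eig-n-kr 1 102 --mg-eig-tol 1 1e-4 "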

We always use the regular coarse operator for constructing the null space, rather than the preconditioned operator, regardless of what operator is being coarsened.

You should experiment with the Chebyshev acceleration parameters as documented here and here. As a rule of thumb, one can optimise the eigensolver for the fine operator and use those values on each level of the multigrid as a good starting set. You will likely need minor adjustments for optimal performance, though.
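Roughly speaking, the polynomial suppresses the part of the spectrum inside [a_min, a_max] and amplifies eigenvalues below a_min, so a_min should sit just above the largest eigenvalue you wish to compute, and a_max safely above the largest eigenvalue of the operator. A typical per-level configuration (the window and degree shown are the defaults from the option list above):

    ARGSTR=$ARGSTR"--mg-eig-use-poly-acc 1 true --mg-eig-poly-deg 1 100 "
    ARGSTR=$ARGSTR"--mg-eig-amin 1 0.1 --mg-eig-amax 1 4.0 "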

Eigenvectors as Prolongation and Restriction Vectors

One may use the eigenvectors of the normal operator MdagM as the null vectors for the prolongation and restriction of multigrid levels. To enable this feature, use the flag:

--mg-eig <level> <true/false>                     # Use the eigensolver on this level (default false)

This will instruct the QUDA test routine to create a set of eigensolver parameters for the specified level, and will call the eigensolver during that level's construction. Be sure to select appropriate eigensolver parameters for your problem. If calling the MG routine from inside your application code, see invert_test.cpp for an example of how QUDA parameters are populated.
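As a sketch, the fragment below replaces the iterative null-space setup on level 0 with eigenvectors of MdagM. The values are illustrative, again following the n_ev/n_kr pattern of the Lanczos command later on this page:

    ARGSTR=$ARGSTR"--mg-eig 0 true --mg-eig-use-normop 0 true --mg-eig-use-dagger 0 false "
    ARGSTR=$ARGSTR"--mg-nvec 0 24 --mg-eig-n-ev 0 26 --mg-eig-n-kr 0 78 --mg-eig-tol 0 1e-4 "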

Coarse Grid Deflation

One may also deflate the coarsest grid solve. This is enabled via the same flag:

--mg-eig <coarsest level> <true/false>                     # Use the eigensolver on this level (default false)

The same rules apply when it comes to passing eigensolver parameters; see the Lanczos MG command below for a complete example.

Example Commands

Here we give a series of example commands that demonstrate the usage of the two MG+Lanczos features. Throughout we work with a random 16^4 lattice (matching the --dim 16 16 16 16 in the commands below), use a three-level solve with 2222 blocking on both levels 0 and 1, and use the appropriate solver types and parameters for the given feature. We first give a reference command that performs the solve using MG only.

Reference MG command

#!/bin/bash

export QUDA_RESOURCE_PATH=.

ARGSTR="--dim 16 16 16 16 --prec double --prec-sloppy single --prec-precondition half --prec-null half --mg-smoother-halo-prec half --dslash-type clover --inv-type gcr --solve-type direct-pc --verbosity verbose --nsrc 17 "
ARGSTR=$ARGSTR"--kappa 0.1 --compute-clover true --clover-coeff 0.1 "
ARGSTR=$ARGSTR"--inv-multigrid true --mg-levels 3 --mg-coarse-solve-type 0 direct-pc --mg-verbosity 0 verbose "
ARGSTR=$ARGSTR"--mg-setup-inv 0 cgnr --mg-setup-maxiter 0 1000 --mg-setup-tol 0 1e-5 "
ARGSTR=$ARGSTR"--mg-setup-inv 1 cgnr --mg-setup-maxiter 1 1000 --mg-setup-tol 1 1e-5 "
ARGSTR=$ARGSTR"--mg-coarse-solve-type 0 direct-pc --mg-smoother-solve-type 0 direct-pc "
ARGSTR=$ARGSTR"--mg-block-size 0 2 2 2 2 --mg-nvec 0 24 --mg-n-block-ortho 0 2 --mg-coarse-solve-type 1 direct-pc --mg-smoother-solve-type 0 direct-pc --mg-smoother 0 ca-gcr --mg-nu-pre 0 0 --mg-nu-post 0 2 --mg-smoother-tol 0 1e-10 --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter \
1 16   --mg-coarse-solver 1 gcr --mg-verbosity 1 verbose "
ARGSTR=$ARGSTR"--mg-block-size 1 2 2 2 2 --mg-nvec 1 32 --mg-n-block-ortho 1 2 --mg-coarse-solve-type 2 direct-pc --mg-smoother-solve-type 1 direct-pc --mg-smoother 1 ca-gcr --mg-nu-pre 1 0 --mg-nu-post 1 2 --mg-smoother-tol 1 1e-10 --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter \
2 16   --mg-coarse-solver 2 ca-gcr --mg-verbosity 2 verbose "

COMMAND="./invert_test $ARGSTR"

echo $COMMAND
$COMMAND

This is a fairly typical MG solve. Note that ca-gcr is the favoured solver on the coarse level, and that its tolerance is very loose. For Wilson-type solves this is the recommended set-up.

Lanczos MG command

Here we give the same command, but now with Lanczos parameters added:

#!/bin/bash

export QUDA_RESOURCE_PATH=.

ARGSTR="--dim 16 16 16 16 --prec double --prec-sloppy single --prec-precondition half --prec-null half --mg-smoother-halo-prec half --dslash-type clover --inv-type gcr --solve-type direct-pc --verbosity verbose --nsrc 17 "
ARGSTR=$ARGSTR"--kappa 0.1 --compute-clover true --clover-coeff 0.1 "
ARGSTR=$ARGSTR"--inv-multigrid true --mg-levels 3 --mg-coarse-solve-type 0 direct-pc --mg-verbosity 0 verbose "
ARGSTR=$ARGSTR"--mg-setup-inv 0 cgnr --mg-setup-maxiter 0 1000 --mg-setup-tol 0 1e-5 "
ARGSTR=$ARGSTR"--mg-setup-inv 1 cgnr --mg-setup-maxiter 1 1000 --mg-setup-tol 1 1e-5 "
ARGSTR=$ARGSTR"--mg-coarse-solve-type 0 direct-pc --mg-smoother-solve-type 0 direct-pc "
ARGSTR=$ARGSTR"--mg-block-size 0 2 2 2 2 --mg-nvec 0 24 --mg-n-block-ortho 0 2 --mg-coarse-solve-type 1 direct-pc --mg-smoother-solve-type 0 direct-pc --mg-smoother 0 ca-gcr --mg-nu-pre 0 0 --mg-nu-post 0 2 --mg-smoother-tol 0 1e-10 --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter \
1 16   --mg-coarse-solver 1 gcr --mg-verbosity 1 verbose "
ARGSTR=$ARGSTR"--mg-block-size 1 2 2 2 2 --mg-nvec 1 32 --mg-n-block-ortho 1 2 --mg-coarse-solve-type 2 direct-pc --mg-smoother-solve-type 1 direct-pc --mg-smoother 1 ca-gcr --mg-nu-pre 1 0 --mg-nu-post 1 2 --mg-smoother-tol 1 1e-10 --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter \
2 16   --mg-coarse-solver 2 ca-gcr --mg-verbosity 2 verbose "
ARGSTR=$ARGSTR"--mg-eig 2 true --mg-eig-type 2 trlm --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 true "
ARGSTR=$ARGSTR"--mg-nvec 2 64 --mg-eig-n-ev 2 66 --mg-eig-n-kr 2 198 --mg-eig-tol 2 1e-4 --mg-eig-use-poly-acc 2 false "
COMMAND="./invert_test $ARGSTR"

echo $COMMAND
$COMMAND

This solves the same problem with the same MG blocking scheme, but with the addition of Lanczos deflation on the coarsest level. Good luck!

MG+Lanczos To Do

  • Amalgamate the eigensolver code with the EigCG code as much as possible. E.g., use shared parameters, use a single deflation code path, check EigCG vectors with low-precision Lanczos solves, and make the deflation.cpp host routines faster with template-instantiated sizes of MatrixXcd, such as 24, 32, 48, 64.
  • Ensure the solver variables n_ev, n_kr, and n_conv are used sensibly throughout the eigensolver. E.g., only save/load n_conv vectors, trim vector arrays where appropriate.
  • Make a staggered/Laplace eigensolve test script.

To Do

The following is an incomplete list of what needs to be done:

  • Optimize the CPU coarse grid operator - at present no attempt has been made to optimize the CPU coarse grid operator, despite it being on the critical path in a multi-level implementation, since we expect the coarsest grid to reside on the CPU. Optimization here will almost certainly include OpenMP parallelisation and vectorization, and potentially extending the QUDA autotuner to encompass CPU kernels.
  • Optimize CPU BLAS kernels
  • Implement reduced-precision (8-bit and 16-bit) halo exchange
  • Improve how precision is exposed in the interface
  • Understand James Osborn's set-up process: "For the setup I ran several passes of ~100 bicgstab iterations on a random vector until I determined that the rayleigh quotient wasn't being reduced much, then used that vector. I then started with the vector I had before the previous bicgstab pass and orthogonalized it against the current set of null vectors, then ran another set of bicgstab passes on that, and so on until I have enough vectors."
  • Block solvers as smoothers
  • Multi-RHS interface
  • Add support to overlap halo exchange with interior compute
  • Fix the profiling to correctly report the GCR wrapper solver
  • Merge the multigrid/generic ghost zone with the dslash ghost zones - this will enable peer-to-peer for MG
  • Look at how BiCGstab works as an outer solver and coarse-grid solver

Memory reduction strategies

Possible strategies for further memory reductions

  • There is redundant information in keeping both the B and V arrays, since V is just the block-orthogonalized version of B. This is complicated by the fact that B is an array of vectors while V is a block-packed vector. Note we could store only the coarsened vectors B_c, since we can reconstruct the B vectors from B[i] = P * B_c[i].
  • There are multiple places throughout the solver where we have persistent fields allocated that are never used simultaneously, e.g., temporaries in MG vs. Transfer vs. temporaries used in the solvers. If we could share these temporaries, we would reduce the overall memory footprint.