# HISQ mixed precision deflation (lattice/quda GitHub wiki)
For this case study we are using a 48x48x48x12 configuration as provided by the HotQCD collaboration.
Parameter | Value |
---|---|
Volume | 48x48x48x12 |
Gauge action | Improved Symanzik |
beta | 6.794 |
Fermion action | HISQ fermions |
light quark mass | 0.00167 |
strange quark mass | 0.0450 |
Here we are going to examine the quark mass dependence of the solve time as we scale from the light to the strange quark mass, progressively optimizing the solver with mixed precision and deflation. For this study we use the `staggered_invert_test` example code included with QUDA, run on a workstation with 2x Quadro GV100 GPUs. All runs use a launch syntax of the form
```
export QUDA_RESOURCE_PATH=.
ARGS="--dim 48 48 24 12 --gridsize 1 1 2 1 --load-gauge l4812f21b6794m00167m0450c_130.quda --compute-fat-long true --test 1"
mpirun -np 2 tests/staggered_invert_test $ARGS $RECON $PREC $SOLVER $EIG --mass 0.00167 --verbosity verbose
```
where we adjust the variables `SOLVER`, `PREC`, and `RECON` according to the desired solver parameters.
## Starting point - Pure Double CG solver
Our initial starting point is using a standard double precision CG solver. This uses the parameters:
```
SOLVER="--inv-type cg --tol 1e-10 --reliable-delta 0.001 --niter 20000"
PREC="--prec double"
```
which means we run a CG solver to a relative residual tolerance of 1e-10, with a maximum of 20000 iterations, recomputing the true residual (a 'reliable update') every time the iterated residual drops by three orders of magnitude, using double precision throughout.
With HISQ fermions we can also use compression on the long-link field to reduce the memory traffic. We do so with these parameters:

```
RECON="--recon 13 --recon-sloppy 9"
```
where we restrict the maximal reconstruct-9 compression to the sloppy updates to ensure stability. In doing so the iteration count is unchanged, and the overall solve time improves by around 1.2x. From now on we assume that gauge compression is always used.
| | no-recon | | | recon 13/9 | | |
|---|---|---|---|---|---|---|
| **Mass** | **Iterations** | **Time (s)** | **GFLOPS** | **Iterations** | **Time (s)** | **GFLOPS** |
0.00167 | 6178 | 20.9 | 468 | 6178 | 17.5 | 556 |
0.00334 | 4551 | 15.4 | 467 | 4551 | 12.9 | 557 |
0.00668 | 2621 | 8.89 | 467 | 2621 | 7.46 | 556 |
0.01336 | 1431 | 4.87 | 467 | 1431 | 4.09 | 556 |
0.02672 | 759 | 2.60 | 467 | 759 | 2.18 | 556 |
0.05344 | 400 | 1.39 | 467 | 400 | 1.16 | 556 |
## Mixed-Precision CG
The first significant performance boost comes from enabling mixed-precision CG. To do so, we simply set the sloppy precision to a lower precision than the outer solver precision. Valid values are `single`, `half`, and `quarter`, the latter two being QUDA's custom block-fixed-point formats. E.g.,
```
SOLVER="--inv-type cg --tol 1e-10 --reliable-delta 0.1 --niter 10000"
PREC="--prec double --prec-sloppy single"
```
Note we also change the `--reliable-delta` parameter so that the true residual is recomputed every time the iterated residual drops by an order of magnitude. This minimizes the divergence between the iterated and true residuals caused by the reduced precision.
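To make the reliable-update mechanism concrete, here is a minimal numpy sketch, not QUDA's implementation: the iterated residual and search direction are carried in float32 (standing in for the sloppy precision), the solution is accumulated in float64 (the outer precision), and the true residual is recomputed in float64 whenever the iterated residual has dropped by a factor `delta`. The matrix and sizes are arbitrary illustrative choices.

```python
import numpy as np

def mixed_cg(A, b, tol=1e-10, delta=0.1, maxiter=10000):
    """Mixed-precision CG with reliable updates (illustrative sketch)."""
    x = np.zeros_like(b)
    r = b - A @ x                     # true residual, double precision
    As = A.astype(np.float32)         # sloppy copy of the operator
    rs = r.astype(np.float32)         # sloppy iterated residual
    p = rs.copy()
    rnorm0 = np.linalg.norm(r)
    last_reliable = rnorm0
    for k in range(1, maxiter + 1):
        Ap = As @ p                   # matrix-vector product in sloppy precision
        rr = rs @ rs
        alpha = rr / (p @ Ap)
        x = x + np.float64(alpha) * p.astype(np.float64)  # accumulate solution in double
        rs = rs - alpha * Ap
        rnorm = np.linalg.norm(rs)
        if rnorm < delta * last_reliable:
            r = b - A @ x             # reliable update: recompute true residual in double
            rs = r.astype(np.float32)
            rnorm = last_reliable = np.linalg.norm(r)
        if rnorm < tol * rnorm0:
            return x, k
        beta = (rs @ rs) / rr
        p = rs + beta * p
    return x, maxiter

rng = np.random.default_rng(0)
n = 200
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # SPD test matrix with modest condition number
b = rng.standard_normal(n)
x, iters = mixed_cg(A, b)
rel = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
```

Even though the iteration runs almost entirely in float32, the periodic double-precision corrections let the solver reach a 1e-10 relative residual, which is the point of the reliable-update strategy.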
| | double-single | | | double-half | | | double-quarter | | |
|---|---|---|---|---|---|---|---|---|---|
| **Mass** | **Iterations** | **Time (s)** | **GFLOPS** | **Iterations** | **Time (s)** | **GFLOPS** | **Iterations** | **Time (s)** | **GFLOPS** |
0.00167 | 6179 | 8.73 | 1120 | 7710 | 6.65 | 1830 | 14973 | 8.59 | 2760 |
0.00334 | 4552 | 6.43 | 1120 | 5118 | 4.40 | 1830 | 7786 | 4.51 | 2730 |
0.00668 | 2621 | 3.72 | 1120 | 2884 | 2.51 | 1820 | 3790 | 2.24 | 2690 |
0.01336 | 1431 | 2.04 | 1110 | 1456 | 1.29 | 1800 | 1869 | 1.13 | 2640 |
0.02672 | 759 | 1.01 | 1100 | 759 | 0.678 | 1790 | 900 | 0.570 | 2540 |
0.05344 | 400 | 0.594 | 1090 | 400 | 0.374 | 1730 | 487 | 0.325 | 2440 |
These results demonstrate that more GFLOPS does not necessarily mean reduced time to solution. In particular we see that
- the double-single solver has an almost identical iteration count to the pure double solver, and as a result achieves the expected ideal 2x speedup;
- while double-half shows an increased iteration count at small quark mass, overall it is stable and represents the sweet spot, with a 2.6x speedup over the pure double solver at the light quark mass and a 3.1x speedup at the heavy quark mass;
- the double-quarter solver is not numerically stable as we reduce the quark mass: at the light quark mass the iteration count more than doubles, giving a solve time similar to the double-single solver, though at heavy quark masses we achieve a 3.6x speedup.
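The quoted speedups follow directly from the solve times in the tables above, taking the recon-13/9 pure double solver as the baseline:

```python
# Solve times (s) at the light (0.00167) and heavy (0.05344) quark masses,
# taken from the tables above.
pure_double = {"light": 17.5, "heavy": 1.16}       # double CG, recon 13/9
double_half = {"light": 6.65, "heavy": 0.374}
double_quarter = {"light": 8.59, "heavy": 0.325}

speedup_half_light = pure_double["light"] / double_half["light"]        # ~2.6x
speedup_half_heavy = pure_double["heavy"] / double_half["heavy"]        # ~3.1x
speedup_quarter_heavy = pure_double["heavy"] / double_quarter["heavy"]  # ~3.6x
```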
## Deflation
### Introduction
We now consider adding deflation to accelerate the solver, with the goal of removing its critical slowing down with decreasing quark mass. For this we use the thick-restarted Lanczos (TRLM) eigensolver implemented in QUDA, whose use requires some thought about the parameters to choose:
- The number of eigenvectors to deflate with. This is of course problem dependent: a larger deflation space gives a better-conditioned system to solve, but increases the setup time, memory cost, and deflation overhead. There are actually two parameters exposed in QUDA: the number of eigenvectors we attempt to converge and the number of eigenvectors required to converge. The number of deflation modes used corresponds to the latter.
- The size of the Krylov space the eigensolver should construct before triggering a restart. Typically this is 1.2-2x larger than the number of desired eigenvalues.
- The polynomial acceleration parameters: when seeking the low eigenvalues of an operator it is usually optimal to use polynomial acceleration to isolate the part of the eigenspectrum of interest.
- The precision of the eigensolver. In principle we can run the eigensolver in any precision QUDA supports; however, as we shall see below, in practice it makes most sense to consider single or half precision.
- How often to re-deflate the residual. In infinite precision the deflation need only be applied once, prior to the solve; in finite precision, however, and especially with mixed-precision solvers, we need to re-deflate the residual vector to ensure optimal convergence.
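Why deflation removes the low-mode slowing down can be illustrated with a small self-contained numpy sketch, not QUDA code: we build an SPD matrix with 16 very small eigenvalues, solve it exactly within the span of those 16 (exact) eigenvectors, and hand the remainder to plain CG. The sizes and spectrum here are arbitrary stand-ins.

```python
import numpy as np

def cg(A, b, tol=1e-10, maxiter=5000, x0=None):
    """Plain double-precision CG with an optional (deflated) initial guess."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rnorm0 = np.linalg.norm(b)
    for k in range(1, maxiter + 1):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol * rnorm0:
            return x, k
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x, maxiter

rng = np.random.default_rng(1)
n = 400
# SPD test matrix with 16 tiny eigenvalues, mimicking the low modes that
# cause critical slowing down at light quark mass
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
evals = np.concatenate([np.linspace(1e-4, 1e-3, 16), np.linspace(0.5, 2.0, n - 16)])
A = (Q * evals) @ Q.T
b = rng.standard_normal(n)

_, it_plain = cg(A, b)                 # undeflated solve
V, lam = Q[:, :16], evals[:16]         # the 16 lowest eigenpairs
x0 = V @ ((V.T @ b) / lam)             # exact solve in the deflation subspace
x_d, it_defl = cg(A, b, x0=x0)         # deflated solve
```

Deflating the 16 low modes leaves CG facing only the well-conditioned bulk of the spectrum, so the iteration count drops sharply, which is the effect the tables below show at scale.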
Before we consider the full mass sweep, we initially focus on obtaining the eigenvectors for the light quark mass only. To enable deflation in QUDA's solver we use the following options:
```
--prec-eigensolver   # precision of the eigensolver and subsequent deflation
--inv-deflate        # enable initial eigensolve and deflation
--eig-poly-deg       # use polynomial acceleration of degree n
--eig-amax           # sets the upper bound for the polynomial acceleration (optional)
--eig-amin           # sets the lower bound of the polynomial acceleration
--eig-n-ev           # number of eigenvectors to find
--eig-n-conv         # number of converged eigenvectors required
--eig-n-kr           # size of the Krylov search space
--eig-tol            # tolerance of the eigensolver
--df-tol-restart     # how often to re-deflate the residual
```
### Parameter Tuning
Tuning an eigensolver can require per-problem tuning, or at least per-fermion-action and per-physics-parameter tuning. While we cover the basics of the tuning here, for more thorough details we refer the reader here.
Needing to start somewhere, we pick a Chebyshev polynomial of degree 50. Next we set the part of the spectrum we wish to exclude in order to accelerate the eigensolver toward the low modes. The `--eig-amax` parameter should be greater than the largest eigenvalue of the operator; if the user does not specify this upper bound, the eigensolver first estimates the largest eigenvalue using a Rayleigh quotient. The lower bound (`--eig-amin`) requires a bit more tuning, which we do iteratively: starting from a larger-than-optimal value, we decrease it until we find fast convergence, being careful not to decrease too far. In the TRLM solver output, the metric of interest is the line
```
TRLM computed the requested 16 vectors in 63 restart steps and 949 OP*x operations.
```
which tells us the number of solver restarts and the number of polynomial applications required. The cost scales roughly as the product of the polynomial degree and the number of OP*x operations.
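The role of the two bounds can be seen from the Chebyshev map the accelerated operator effectively applies. The sketch below (illustrative, not the QUDA implementation) maps `[amin, amax]` onto `[-1, 1]`, where `T_deg` stays bounded by 1, while eigenvalues below `amin` land outside the interval and are amplified exponentially in the degree; the values of `amin`, `amax`, and `deg` are taken from the tuning run above.

```python
import numpy as np

def cheb_accel(lam, amin, amax, deg):
    """|T_deg| applied after the affine map sending [amin, amax] -> [-1, 1]."""
    t = (2.0 * lam - amax - amin) / (amax - amin)
    if abs(t) <= 1.0:
        return abs(np.cos(deg * np.arccos(t)))       # bounded by 1 inside the window
    return float(np.cosh(deg * np.arccosh(abs(t))))  # grows exponentially with deg outside

# a wanted low mode below amin vs. an unwanted bulk mode inside [amin, amax]
low  = cheb_accel(1e-3, amin=0.01, amax=22.0, deg=50)
bulk = cheb_accel(5.0,  amin=0.01, amax=22.0, deg=50)
# raising the degree sharpens the separation
low_hi = cheb_accel(1e-3, amin=0.01, amax=22.0, deg=80)
```

This is why lowering `--eig-amin` toward the wanted modes reduces the restart count in the table below: the separation between the low modes and the suppressed bulk grows. Push `amin` below the wanted eigenvalues, however, and those modes fall inside the suppression window themselves, at which point convergence is lost.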
As an example, we run with the following parameters
```
PREC="--prec double --prec-sloppy half --prec-precondition single"
EIG="--inv-deflate true --eig-amax 22 --eig-amin 1.0 --eig-poly-deg 50 --eig-n-ev 16 --eig-n-conv 16 --eig-n-kr 32 --eig-tol 1e-6"
```
to obtain the lowest 16 eigenvectors of the operator in single precision, and then use these to deflate the solver. Here we set the tolerance to 1e-6, which is more than sufficient to achieve a good deflation.
| poly-deg | eig-amin | eig-amax | restarts | OP*x | setup time (s) |
|---|---|---|---|---|---|
50 | 1.0 | 22 | 63 | 949 | 60.9 |
50 | 0.1 | 22 | 34 | 518 | 33.3 |
50 | 0.01 | 22 | 23 | 350 | 22.6 |
80 | 0.01 | 22 | 15 | 238 | 24.6 |
80 | 0.001 | 22 | 12 | 191 | 19.6 |
80 | 0.0001 | 22 | - | - | - |
Pushing the lower bound down to 0.0001 resulted in a lack of convergence, indicating that we had encroached on the desired part of the eigenspectrum. At this point we conclude that we have reasonable parameters for obtaining the lowest 16 eigenvectors.
### Optimizing the Solver
The goal now is to find the number of eigenvectors that gives an optimal deflation, minimizing the time to solution of the final solver. Note that as we increase the number of eigenvectors, the upper end of the deflated spectrum increases; this in turn requires increasing the lower bound of the polynomial acceleration.
Below we give the parameters and results from generating up to 512 eigenvectors. We can see that while doubling the number of eigenvectors comes with increased cost, initially the cost grows more slowly than linearly; however, as the Krylov space grows and fewer restarts are required, the cost scaling trends toward linear and beyond.
| n_conv | n_ev | n_kr | eig-amin | eig-amax | restarts | OP*x | setup time (s) | solver iterations | solve time (s) |
|---|---|---|---|---|---|---|---|---|---|
16 | 16 | 32 | 0.001 | 22 | 12 | 191 | 19.6 | 2671 | 2.33 |
32 | 32 | 64 | 0.005 | 22 | 8 | 240 | 24.8 | 1812 | 1.59 |
64 | 64 | 128 | 0.005 | 22 | 4 | 271 | 28.3 | 1153 | 1.03 |
128 | 128 | 256 | 0.01 | 22 | 2 | 353 | 37.6 | 691 | 0.663 |
256 | 256 | 512 | 0.01 | 22 | 1 | 512 | 55.8 | 448 | 0.457 |
512 | 512 | 1024 | 0.05 | 22 | 1 | 1024 | 126 | 295 | 0.343 |
The final column gives the solve time for the resulting deflated solve. Since the solver subsequently run is double-half, we expect to need to re-deflate the residual each time it drops by three orders of magnitude, which is around the accuracy achievable in half precision; hence we set `--df-tol-restart 1e-3`. Increasing the size of the deflation space clearly reduces the final solve time, though the actual sweet spot will depend on the number of right-hand sides, i.e., on whether the cost of the initial eigensolve can be amortized. With a deflation space of size 512, the light quark solve time of 0.343 seconds represents a speedup of around 19x over the regular double-half solver.
We now consider running the eigensolver in half precision, with all other parameters identical. Running the eigensolver in half precision necessarily loses precision in the generated eigenvectors, which can significantly degrade the quality of the deflation. As a result, to maintain an optimal convergence rate we need to re-deflate the residual with increasing frequency. We show results for three values of `--df-tol-restart`: with a small deflation space we see a significant dependence on the re-deflation frequency, and re-deflating on every order-of-magnitude drop in the residual is optimal. However, as the deflation space grows, this dependence weakens, and we see convergence identical to single-precision deflation for both 0.1 and 0.01 restart tolerances. Hence, when properly tuned, there is no disadvantage to running the eigensolver in half precision.
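The restart schedule implied by `--df-tol-restart` can be sketched as follows (illustrative logic only; QUDA's exact restart criterion may differ): the deflation projector is re-applied each time the residual has fallen by another factor of the restart tolerance, until the final tolerance is within reach.

```python
def redeflation_schedule(tol, df_tol_restart):
    """Residual levels (relative to the initial residual) at which the
    deflation projector is re-applied during the solve."""
    levels, r = [], 1.0
    while r * df_tol_restart > tol:
        r *= df_tol_restart
        levels.append(r)
    return levels

# a 1e-10 solve with --df-tol-restart 1e-3 re-deflates at 1e-3, 1e-6, 1e-9
sched = redeflation_schedule(1e-10, 1e-3)
```

A smaller restart tolerance therefore means fewer, later re-deflations, which is why it becomes inadequate when the half-precision eigenvectors let the low-mode components grow back between projections.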
| | | 0.001 | | 0.01 | | 0.1 | |
|---|---|---|---|---|---|---|---|
| **n_conv** | **setup time (s)** | **solver iterations** | **solve time (s)** | **solver iterations** | **solve time (s)** | **solver iterations** | **solve time (s)** |
16 | 8.90 | 5935 | 2.33 | 5045 | 4.47 | 3604 | 3.23 |
32 | 12.9 | 2806 | 1.59 | 2181 | 1.97 | 1890 | 1.69 |
64 | 15.9 | 2359 | 1.03 | 1679 | 1.53 | 1225 | 1.11 |
128 | 21.7 | 791 | 0.663 | 691 | 0.843 | 691 | 0.676 |
256 | 33.2 | 507 | 0.457 | 448 | 0.458 | 448 | 0.491 |
512 | 72.2 | 334 | 0.343 | 295 | 0.346 | 295 | 0.407 |
1024 | 167 | 228 | 0.300 | 199 | 0.300 | 199 | 0.421 |
If we compare the eigensolver setup times, we see a significant reduction relative to running in single precision. This is one of the main motivations for running the eigensolver in half precision (reduced memory overhead being the other): for a fixed setup cost and memory overhead, we can generate a more effective deflation space in half precision than in single precision.
Finally we consider the combination of a half precision eigensolver together with a double-quarter CG solver.
| | | 0.001 | | 0.01 | | 0.1 | |
|---|---|---|---|---|---|---|---|
| **n_conv** | **setup time (s)** | **solver iterations** | **solve time (s)** | **solver iterations** | **solve time (s)** | **solver iterations** | **solve time (s)** |
16 | 8.90 | 10509 | 6.11 | 9883 | 5.91 | 7595 | 4.66 |
32 | 12.9 | 4163 | 2.52 | 2629 | 1.62 | 2416 | 1.54 |
64 | 15.9 | 4936 | 2.97 | 2635 | 1.63 | 1425 | 0.937 |
128 | 21.7 | 809 | 0.549 | 814 | 0.564 | 805 | 0.574 |
256 | 33.2 | 515 | 0.381 | 515 | 0.390 | 517 | 0.425 |
512 | 72.2 | 381 | 0.300 | 405 | 0.342 | 359 | 0.372 |
1024 | 167 | 286 | 0.283 | 241 | 0.284 | 254 | 0.356 |