# Split Grid - lattice/quda GitHub Wiki
Split grid is one answer to the multi-right-hand-side (multi-rhs) problem for domain wall/Möbius fermions, for which multi-rhs performance is typically limited by communication bandwidth. It redistributes the right-hand sides across sub-partitions of the node/GPU grid, within which the communication bandwidth is usually higher. Each sub-partition of the grid performs the inversion for one or several of the rhs. The redistribution also reduces the surface-to-volume ratio, since the local volume on each node/GPU is smaller, and thus reduces the total amount of communication needed.
Split grid typically fits scenarios in which part of the computation consumes a large chunk of memory, e.g. Lanczos/deflation with a large number of eigenvectors. Large memory usage typically forces strong scaling, which in turn tends to leave the problem bottlenecked by communication bandwidth.
The split grid interface looks like
```cpp
void invertMultiSrcQuda(void **_hp_x, void **_hp_b, QudaInvertParam *param, void *h_gauge,
                        QudaGaugeParam *gauge_param);
```
There are two variants of this for staggered and clover-type fermions:
```cpp
void invertMultiSrcStaggeredQuda(void **_hp_x, void **_hp_b, QudaInvertParam *param, void *milc_fatlinks,
                                 void *milc_longlinks, QudaGaugeParam *gauge_param);
void invertMultiSrcCloverQuda(void **_hp_x, void **_hp_b, QudaInvertParam *param, void *h_gauge,
                              QudaGaugeParam *gauge_param, void *h_clover, void *h_clovinv);
```
Two new options are added to `QudaInvertParam`:
```cpp
int num_src_per_sub_partition; /**< Number of sources in the multiple source solver, but per sub-partition */

/**< The grid of sub-partitions according to which the processor grid will be partitioned.
     Should have:
     split_grid[0] * split_grid[1] * split_grid[2] * split_grid[3] * num_src_per_sub_partition == num_src. */
int split_grid[QUDA_MAX_DIM];
```
There is also a command line option `--grid-partition` for `invert`/`dslash_test` and `staggered_invert`/`dslash_test`. The environment variable `QUDA_TEST_GRID_PARTITION` can be set to override the effect of `--grid-partition`, similar to `QUDA_TEST_GRID_SIZE`.
The split grid routine:
- forward-distributes the gauge, color spinor, and clover fields onto the sub-partitions,
- performs the inversions,
- backward-distributes the solution color spinor fields.
The routines used to split the fields (single partition to multiple partitions) and join the fields (multiple partitions to single partition) are in `include/split_grid.h`:
```cpp
template <class Field>
void inline split_field(Field &collect_field, std::vector<Field *> &v_base_field, const CommKey &comm_key) {...}

template <class Field>
void inline join_field(std::vector<Field *> &v_base_field, const Field &collect_field, const CommKey &comm_key) {...}
```
As an example, running with the following options (note the `--grid-partition` option)
```
--dslash-type mobius --dim 16 16 16 12 \
--nsrc 8 \
--Lsdim 12 --b5 1.5 --c5 0.5 --mass 0.0152 \
--verbosity verbose --alternative-reliable true \
--solution-type mat --matpc even-even --solve-type normop-pc \
--prec double --prec-sloppy half --reliable-delta 0.1 \
--recon 12 \
--inv-type cg \
--tol 1e-10 --niter 14000 \
--gridsize 2 2 2 6 \
--grid-partition 2 2 2 1
```
on Summit gives a 2.5x (= 26.0/(3.5 + 6.9)) speedup:
- with split grid: `invertMultiSrcQuda` totals the time to re-assemble the rhs, set up the communicators, etc., and `invertQuda` gives the time of the actual inversions on each sub-partition.
```
invertQuda Total time = 6.86142 secs
  download = 0.280194 secs ( 4.08%), with 1 calls at 2.801940e+05 us per call
  upload = 0.223691 secs ( 3.26%), with 1 calls at 2.236910e+05 us per call
  init = 0.193237 secs ( 2.82%), with 1 calls at 1.932370e+05 us per call
  preamble = 0.307307 secs ( 4.48%), with 2 calls at 1.536535e+05 us per call
  compute = 5.553183 secs ( 80.9%), with 1 calls at 5.553183e+06 us per call
  epilogue = 0.232987 secs ( 3.4%), with 3 calls at 7.766233e+04 us per call
  free = 0.000269 secs (0.00392%), with 2 calls at 1.345000e+02 us per call
  total accounted = 6.790868 secs ( 99%)
  total missing = 0.070554 secs ( 1.03%)

invertMultiSrcQuda Total time = 3.48972 secs
  init = 0.448977 secs ( 12.9%), with 1 calls at 4.489770e+05 us per call
  preamble = 1.691328 secs ( 48.5%), with 1 calls at 1.691328e+06 us per call
  epilogue = 1.349407 secs ( 38.7%), with 1 calls at 1.349407e+06 us per call
  total accounted = 3.489712 secs ( 100%)
  total missing = 0.000004 secs (0.000115%)
```
- without split grid:
```
invertQuda Total time = 25.9826 secs
  download = 0.101608 secs ( 0.391%), with 8 calls at 1.270100e+04 us per call
  upload = 0.143148 secs ( 0.551%), with 8 calls at 1.789350e+04 us per call
  init = 0.179461 secs ( 0.691%), with 8 calls at 2.243262e+04 us per call
  preamble = 0.269723 secs ( 1.04%), with 16 calls at 1.685769e+04 us per call
  compute = 24.838047 secs ( 95.6%), with 8 calls at 3.104756e+06 us per call
  epilogue = 0.363463 secs ( 1.4%), with 24 calls at 1.514429e+04 us per call
  free = 0.005908 secs (0.0227%), with 16 calls at 3.692500e+02 us per call
  total accounted = 25.901358 secs ( 99.7%)
  total missing = 0.081212 secs ( 0.313%)
```