Split Grid

Split grid is one answer to the multiple right-hand-side (multi-rhs) problem for domain wall/Möbius fermions, for which multi-rhs performance is typically limited by communication bandwidth. It redistributes the right-hand sides onto sub-partitions of the node/GPU grid, within which the communication bandwidth is usually higher. Each sub-partition of the grid then performs the inversion for one or several of the rhs. The redistribution also reduces the surface-to-volume ratio, since the local volume on each node/GPU becomes larger, and thus reduces the total amount of communication needed.
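
As a concrete illustration, the following standalone snippet (not part of QUDA) computes the local volume and surface-to-volume ratio with and without splitting, using the lattice, processor grid, and partitioning from the example run later on this page:

  // Standalone sketch (not QUDA code): compare the local volume and
  // surface-to-volume ratio with and without the split grid, using the
  // 16^3 x 12 lattice, 2 2 2 6 processor grid, and 2 2 2 1 partitioning
  // from the example run below.
  #include <cstdio>

  int main()
  {
    const int lattice[4] = {16, 16, 16, 12}; // global lattice (--dim)
    const int grid[4] = {2, 2, 2, 6};        // processor grid (--gridsize)
    const int split[4] = {2, 2, 2, 1};       // sub-partition grid (--grid-partition)

    for (int pass = 0; pass < 2; pass++) {
      long volume = 1;
      int local[4];
      for (int d = 0; d < 4; d++) {
        // Each sub-partition keeps grid[d] / split[d] ranks in dimension d.
        int ranks = (pass == 0 ? grid[d] : grid[d] / split[d]);
        local[d] = lattice[d] / ranks;
        volume *= local[d];
      }
      long surface = 0;
      for (int d = 0; d < 4; d++) surface += 2 * volume / local[d]; // two faces per dimension
      printf("%s split grid: local volume = %ld sites, surface/volume = %.3f\n",
             pass == 0 ? "without" : "with", volume, (double)surface / volume);
    }
    return 0;
  }

Here the local volume grows from 1024 to 8192 sites while the surface-to-volume ratio drops from 1.75 to about 1.38.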

Split grid typically suits scenarios in which part of the workload consumes a large chunk of memory, e.g. Lanczos/deflation with a large number of eigenvectors. Large memory usage typically forces strong scaling, which in turn renders the problem bottlenecked by communication bandwidth.

The split grid interface looks like

  void invertMultiSrcQuda(void **_hp_x, void **_hp_b, QudaInvertParam *param, void *h_gauge,
                           QudaGaugeParam *gauge_param);

(There are two variants of this, for staggered and clover-type fermions.)

  void invertMultiSrcStaggeredQuda(void **_hp_x, void **_hp_b, QudaInvertParam *param, void *milc_fatlinks,
                                   void *milc_longlinks, QudaGaugeParam *gauge_param);

  void invertMultiSrcCloverQuda(void **_hp_x, void **_hp_b, QudaInvertParam *param, void *h_gauge,
                                QudaGaugeParam *gauge_param, void *h_clover, void *h_clovinv);

and two new options are added to QudaInvertParam,

    int num_src_per_sub_partition; /**< Number of sources in the multiple source solver, but per sub-partition */

    /**< The grid of sub-partition according to which the processor grid will be partitioned.
    Should have:
      split_grid[0] * split_grid[1] * split_grid[2] * split_grid[3] * num_src_per_sub_partition == num_src. **/
    int split_grid[QUDA_MAX_DIM];

as well as the command-line option --grid-partition for invert_test/dslash_test and staggered_invert_test/staggered_dslash_test. The environment variable QUDA_TEST_GRID_PARTITION can be set to override the effect of --grid-partition, similar to QUDA_TEST_GRID_SIZE.
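
Putting these pieces together, a minimal host-side sketch might look as follows. Only invertMultiSrcQuda, num_src, num_src_per_sub_partition, and split_grid are the interface described above; the surrounding setup (field pointers, the usual parameter initialization) is assumed and elided:

  // Hypothetical usage sketch; gauge/spinor allocation and the usual
  // QudaInvertParam/QudaGaugeParam setup are omitted.
  #include <vector>
  #include <quda.h>

  void solve_multi_src(std::vector<void *> &h_x, std::vector<void *> &h_b, void *h_gauge,
                       QudaInvertParam &inv_param, QudaGaugeParam &gauge_param)
  {
    const int num_src = 8; // total number of right-hand sides (--nsrc 8)

    // Partition the 2x2x2x6 processor grid into 2*2*2*1 = 8 sub-partitions
    // of 6 GPUs each, so each sub-partition inverts 8 / 8 = 1 source.
    inv_param.split_grid[0] = 2;
    inv_param.split_grid[1] = 2;
    inv_param.split_grid[2] = 2;
    inv_param.split_grid[3] = 1;
    inv_param.num_src = num_src;
    inv_param.num_src_per_sub_partition = num_src / (2 * 2 * 2 * 1);

    // Constraint from the comment above:
    // split_grid[0] * split_grid[1] * split_grid[2] * split_grid[3]
    //   * num_src_per_sub_partition == num_src

    invertMultiSrcQuda(h_x.data(), h_b.data(), &inv_param, h_gauge, &gauge_param);
  }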

The split grid routine:

  1. forward-distributes the gauge, color spinor, and clover fields onto the sub-partitions,
  2. performs the inversions, and
  3. backward-distributes the solution color spinor fields.

The routines used to split the fields (single partition to multiple partitions) and join the fields (multiple partitions to single partition) are in include/split_grid.h:

  template <class Field>
  void inline split_field(Field &collect_field, std::vector<Field *> &v_base_field, const CommKey &comm_key) {...}

  template <class Field>
  void inline join_field(std::vector<Field *> &v_base_field, const Field &collect_field, const CommKey &comm_key) {...}

As an example, the following run (note the --grid-partition option)

 --dslash-type mobius --dim 16 16 16 12 \
 --nsrc 8 \
 --Lsdim 12 --b5 1.5 --c5 0.5 --mass 0.0152 \
 --verbosity verbose --alternative-reliable true \
 --solution-type mat --matpc even-even --solve-type normop-pc \
 --prec double --prec-sloppy half --reliable-delta 0.1 \
 --recon 12 \
 --inv-type cg \
 --tol 1e-10 --niter 14000 \
 --gridsize 2 2 2 6 \
 --grid-partition 2 2 2 1

on Summit gives a 2.5x (= 26.0/(3.5+6.9)) speed-up, where the denominator sums the invertMultiSrcQuda overhead and the invertQuda time with split grid, and the numerator is the total invertQuda time without it:

  • with split grid: invertMultiSrcQuda totals the time to re-assemble the rhs, set up communicators, etc., and invertQuda performs the actual inversion on each sub-partition.
             invertQuda Total time = 6.86142 secs
                 download     = 0.280194 secs (  4.08%),   with        1 calls at 2.801940e+05 us per call
                   upload     = 0.223691 secs (  3.26%),   with        1 calls at 2.236910e+05 us per call
                     init     = 0.193237 secs (  2.82%),   with        1 calls at 1.932370e+05 us per call
                 preamble     = 0.307307 secs (  4.48%),   with        2 calls at 1.536535e+05 us per call
                  compute     = 5.553183 secs (  80.9%),   with        1 calls at 5.553183e+06 us per call
                 epilogue     = 0.232987 secs (   3.4%),   with        3 calls at 7.766233e+04 us per call
                     free     = 0.000269 secs (0.00392%),  with        2 calls at 1.345000e+02 us per call
        total accounted       = 6.790868 secs (    99%)
        total missing         = 0.070554 secs (  1.03%)

    invertMultiSrcQuda Total time = 3.48972 secs
                     init     = 0.448977 secs (  12.9%),   with        1 calls at 4.489770e+05 us per call
                 preamble     = 1.691328 secs (  48.5%),   with        1 calls at 1.691328e+06 us per call
                 epilogue     = 1.349407 secs (  38.7%),   with        1 calls at 1.349407e+06 us per call
        total accounted       = 3.489712 secs (   100%)
        total missing         = 0.000004 secs (0.000115%)

  • without split grid:
             invertQuda Total time = 25.9826 secs
                 download     = 0.101608 secs ( 0.391%),   with        8 calls at 1.270100e+04 us per call
                   upload     = 0.143148 secs ( 0.551%),   with        8 calls at 1.789350e+04 us per call
                     init     = 0.179461 secs ( 0.691%),   with        8 calls at 2.243262e+04 us per call
                 preamble     = 0.269723 secs (  1.04%),   with       16 calls at 1.685769e+04 us per call
                  compute     = 24.838047 secs (  95.6%),  with        8 calls at 3.104756e+06 us per call
                 epilogue     = 0.363463 secs (   1.4%),   with       24 calls at 1.514429e+04 us per call
                     free     = 0.005908 secs (0.0227%),   with       16 calls at 3.692500e+02 us per call
        total accounted       = 25.901358 secs (  99.7%)
        total missing         = 0.081212 secs ( 0.313%)
