Chroma ECP - lattice/quda GitHub Wiki

The Chroma ECP HMC benchmark concerns running a 2+1 flavor Stout-improved clover fermion simulation. All solves are offloaded from Chroma and run in QUDA, with everything else run on GPUs using qdpjit.

Monte Carlo Algorithm Details

  • The two-flavor determinant contribution is preconditioned using three levels of Hasenbusch mass preconditioning
  • The one-flavor determinant contribution is evaluated using RHMC
  • A two-level time integration is used, with a fourth-order force-gradient integrator deployed*. The pure-gauge contribution and the heaviest two-flavor fermionic contribution are on the fine timescale, with all other fermionic contributions on the coarse timescale.

* The original Titan-baseline and Summit-baseline results used a minimum-norm Omelyan second-order integrator.
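
The two-level time integration described above can be illustrated with a minimal sketch on a toy one-dimensional system. It is not Chroma's implementation: for brevity it nests the second-order minimum-norm (Omelyan) scheme mentioned in the footnote rather than the fourth-order force-gradient integrator, and the names `make_level`, `force_fine`, `force_coarse`, the quadratic toy actions, and the step counts are all assumptions made for illustration. The value of `LAMBDA` is the commonly quoted minimum-norm parameter.

```python
# Toy sketch of a nested (two-timescale) molecular-dynamics integrator.
# Assumption: second-order minimum-norm (2MN) steps at both levels.
LAMBDA = 0.1931833275037836  # commonly quoted 2MN parameter


def drift(q, p, dt):
    """Innermost update: advance the coordinate with fixed momentum."""
    return q + dt * p, p


def make_level(force, inner, n_steps):
    """Return evolve(q, p, dt) that covers dt with n_steps 2MN steps.

    force(q) is the force from this level's action; inner(q, p, dt)
    advances the system by dt using the next finer level.
    """
    def evolve(q, p, dt):
        h = dt / n_steps
        for _ in range(n_steps):
            p = p + LAMBDA * h * force(q)
            q, p = inner(q, p, 0.5 * h)
            p = p + (1.0 - 2.0 * LAMBDA) * h * force(q)
            q, p = inner(q, p, 0.5 * h)
            p = p + LAMBDA * h * force(q)
        return q, p
    return evolve


# Hypothetical toy actions: a stiff term (stand-in for the fine-timescale
# contributions) and a soft term (stand-in for the coarse-timescale ones).
K_FINE, K_COARSE = 4.0, 1.0


def force_fine(q):
    return -K_FINE * q


def force_coarse(q):
    return -K_COARSE * q


def hamiltonian(q, p):
    return 0.5 * p * p + 0.5 * (K_FINE + K_COARSE) * q * q


fine = make_level(force_fine, drift, n_steps=4)           # fine timescale
trajectory = make_level(force_coarse, fine, n_steps=10)   # coarse timescale

q0, p0 = 1.0, 0.5
q1, p1 = trajectory(q0, p0, 1.0)
delta_h = hamiltonian(q1, p1) - hamiltonian(q0, p0)
print(f"energy violation over one trajectory: {delta_h:.3e}")
```

In this arrangement the stiff force is evaluated many times per coarse step while the soft force is evaluated only three times per coarse step, which is the motivation for placing expensive, slowly varying terms on the coarse timescale.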

Solver Details

  • The two-flavor solves all utilize QUDA's adaptive multigrid algorithm**, where the null-space is computed using the light mass and applied to all heavier solves. The outer solver is single-precision GCR, with double-precision defect correction employed. The multigrid preconditioner is mostly run in half precision, with strategic use of fixed-point int32 precision to ensure determinism. On architectures that support it, tensor-core acceleration is applied in the multigrid setup phase.
  • The one-flavor solve utilizes a mixed-precision multi-shift CG algorithm, where the multi-shift solver is run in double-single precision, with per-shift refinement applied in double-half precision.

** The original Titan-baseline and Summit-baseline results used an additive Schwarz preconditioner instead of adaptive multigrid.
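
The idea behind double-precision defect correction around a lower-precision solver can be sketched on a tiny dense system. This is only an illustration under stated assumptions: a few Jacobi sweeps stand in for the single-precision GCR and multigrid preconditioner that QUDA actually uses, single precision is emulated by rounding through the standard-library `struct` module, and the helper names (`to_single`, `low_precision_solve`, `defect_correction`) and the 3x3 matrix are hypothetical.

```python
import struct


def to_single(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def low_precision_solve(A, r, sweeps=5):
    """Approximate A e = r with Jacobi sweeps, storing values in single precision."""
    n = len(r)
    e = [0.0] * n
    for _ in range(sweeps):
        new = []
        for i in range(n):
            s = sum(A[i][j] * e[j] for j in range(n) if j != i)
            new.append(to_single((to_single(r[i]) - to_single(s)) / A[i][i]))
        e = new
    return e


def defect_correction(A, b, tol=1e-12, max_outer=50):
    """Outer loop in double precision, inner solve in (emulated) single precision."""
    x = [0.0] * len(b)
    bnorm = sum(bi * bi for bi in b) ** 0.5
    for it in range(max_outer):
        # The true residual (the "defect") is recomputed in double precision.
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        rnorm = sum(ri * ri for ri in r) ** 0.5
        if rnorm <= tol * bnorm:
            return x, it
        e = low_precision_solve(A, r)          # cheap, inexact correction
        x = [xi + ei for xi, ei in zip(x, e)]  # accumulate in double precision
    return x, max_outer


A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x, outer_iters = defect_correction(A, b)
print("outer iterations:", outer_iters)
```

Because the residual is always recomputed in double precision, the accuracy of the final solution is set by the outer loop, while most of the arithmetic happens in the cheaper inner solve.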

Results

| Machine | Algorithm | GPU | #GPU | Time (s) |
|---|---|---|---|---|
| Titan | baseline | NVIDIA Tesla K20X | 1024 | 4006 |
| Titan | MG + FG | NVIDIA Tesla K20X | 512 | 974 |
| Summit | baseline | NVIDIA Tesla V100 | 128 | 1878 |
| Summit | MG + FG | NVIDIA Tesla V100 | 128 | 329 |
| Juelich booster | MG + FG | NVIDIA A100 SXM | 64 | 285 |
| Juelich booster | MG + FG | NVIDIA A100 SXM | 128 | 166 |
| Selene | MG + FG | NVIDIA A100 SXM | 64 | 241 |
| Selene | MG + FG | NVIDIA A100 SXM | 128 | 150 |
| Spock | MG + FG | AMD MI100 | 64 | 973 |
| Spock | MG + FG | AMD MI100 | 128 | 640 |
| Borg | MG + FG | AMD MI250 | 64 (128x GCD) | 386 |

Credits

  • Chroma-QUDA multigrid HMC developed jointly by Kate Clark (NVIDIA) and Bálint Joó (ORNL)
  • Titan, Summit, Spock and Borg results: Bálint Joó (ORNL)
  • Juelich Booster and Selene results: Mathias Wagner (NVIDIA)
  • qdpjit: Frank Winter (JLab)
  • Chroma: Robert Edwards (JLab) and Bálint Joó (ORNL)
  • Chroma's force-gradient integrator implemented by Boram Yoon (Los Alamos)

Spock and Borg results are computed from the speedup numbers here, relative to the Titan baseline, accounting for the reduction in the number of GPUs. For example, the Borg number is computed as (4006/166)*(1024/64) = 386 seconds.
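
The conversion above can be written as a short calculation. The function name `time_from_speedup` and its parameter names are hypothetical, and the reading of 166 as the quoted speedup factor relative to the Titan baseline is inferred from the formula rather than stated explicitly.

```python
def time_from_speedup(baseline_time_s, baseline_gpus, speedup, gpus):
    """Estimate wall-clock time from a speedup quoted relative to a baseline.

    The baseline time is divided by the speedup, then scaled up by the
    ratio of GPU counts to account for running on fewer GPUs.
    """
    return (baseline_time_s / speedup) * (baseline_gpus / gpus)


# Borg example from the text: Titan baseline of 4006 s on 1024 GPUs,
# speedup factor 166 (assumed reading), 64 GPUs on Borg.
borg_time = time_from_speedup(4006, 1024, 166, 64)
print(round(borg_time))  # 386
```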