Chroma ECP - lattice/quda GitHub Wiki

The Chroma ECP HMC benchmark runs a 2+1 flavor simulation with Stout-improved clover fermions. All linear solves are offloaded from Chroma to QUDA, and everything else runs on the GPU through qdpjit.

Monte Carlo Algorithm Details

The two-flavor determinant contribution is preconditioned using three levels of Hasenbusch mass preconditioning.
The one-flavor determinant contribution is evaluated using RHMC.
A two-level time integration scheme is used, with a fourth-order force-gradient integrator*. The pure-gauge contribution and the heaviest two-flavor fermionic contributions are on the fine timescale, with all remaining fermionic contributions on the coarse timescale.

* The original Titan-baseline and Summit-baseline results used a minimum-norm Omelyan second-order integrator.
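
As an illustration of the two-level scheme, the sketch below nests one integrator inside another: the coarse force is applied with the outer step size, and each outer "drift" is replaced by several inner steps that apply the fine force. For brevity it uses the second-order minimum-norm Omelyan scheme of the baseline runs rather than the force-gradient integrator, and it acts on a toy one-dimensional Hamiltonian. The force functions, spring constants and step counts are all hypothetical and are not taken from Chroma.

```python
import math

LAMBDA = 0.1931833275037836  # standard second-order minimum-norm parameter


def omelyan_step(q, p, h, force, drift):
    """One second-order minimum-norm step: kick, drift, kick, drift, kick."""
    p = p + LAMBDA * h * force(q)
    q, p = drift(q, p, 0.5 * h)
    p = p + (1.0 - 2.0 * LAMBDA) * h * force(q)
    q, p = drift(q, p, 0.5 * h)
    p = p + LAMBDA * h * force(q)
    return q, p


def free_drift(q, p, h):
    """Innermost update: move the coordinate, leave the momentum alone."""
    return q + h * p, p


def make_inner_drift(fine_force, n_inner):
    """Build the 'drift' of the outer level from n_inner fine-timescale steps."""
    def drift(q, p, h):
        dt = h / n_inner
        for _ in range(n_inner):
            q, p = omelyan_step(q, p, dt, fine_force, free_drift)
        return q, p
    return drift


def trajectory(q, p, tau, n_outer, n_inner, coarse_force, fine_force):
    """Integrate for time tau with n_outer coarse steps, each nesting fine steps."""
    inner = make_inner_drift(fine_force, n_inner)
    h = tau / n_outer
    for _ in range(n_outer):
        q, p = omelyan_step(q, p, h, coarse_force, inner)
    return q, p


# Toy model (assumption): H = p^2/2 + (K_FINE + K_COARSE) * q^2 / 2.
K_FINE, K_COARSE = 4.0, 0.5


def fine_force(q):
    return -K_FINE * q      # stands in for the cheap, stiff force


def coarse_force(q):
    return -K_COARSE * q    # stands in for the expensive, soft force


def energy(q, p):
    return 0.5 * p * p + 0.5 * (K_FINE + K_COARSE) * q * q


q1, p1 = trajectory(1.0, 0.0, 1.0, 10, 4, coarse_force, fine_force)
print("energy violation:", energy(q1, p1) - energy(1.0, 0.0))
print("exact q(1):", math.cos(math.sqrt(K_FINE + K_COARSE)), "integrated:", q1)
```

The nesting keeps the scheme symmetric, so flipping the momentum and integrating again returns to the starting point up to rounding, which is the reversibility property HMC relies on.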

Solver Details

The two-flavor solves all utilize QUDA's adaptive multigrid algorithm**, where the null-space is computed using the light mass and applied to all heavier solves. The outer solver is single-precision GCR, with double-precision defect correction employed. The multigrid preconditioner is mostly run in half precision, with strategic use of fixed-point int32 precision to ensure determinism. On architectures that support it, tensor-core acceleration is applied in the multigrid setup phase.
The one-flavor solve utilizes a mixed-precision multi-shift CG algorithm, where the multi-shift solver is run in double-single precision, with per-shift refinement applied in double-half precision.

** The original Titan-baseline and Summit-baseline results used an additive Schwarz preconditioner instead of adaptive multigrid.
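
The idea behind double-precision defect correction around a lower-precision solver can be sketched as follows. The true residual is always recomputed in double precision, while the correction equation is solved in single precision. This is a minimal NumPy sketch under stated assumptions: a dense direct solve (`np.linalg.solve`) stands in for the inner GCR/multigrid solver, and the function name, tolerance and test matrix are hypothetical. It is not QUDA's implementation.

```python
import numpy as np


def defect_correction_solve(A, b, tol=1e-12, max_outer=20):
    """Solve A x = b to double-precision accuracy using single-precision inner solves."""
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    A32 = A64.astype(np.float32)          # low-precision copy used by the inner solver
    x = np.zeros_like(b64)
    b_norm = np.linalg.norm(b64)
    for outer in range(max_outer):
        r = b64 - A64 @ x                 # true residual, evaluated in double precision
        if np.linalg.norm(r) <= tol * b_norm:
            return x, outer
        # Inner solve of A e = r in single precision (stand-in for the real solver).
        e = np.linalg.solve(A32, r.astype(np.float32))
        x = x + e.astype(np.float64)      # accumulate the correction in double precision
    return x, max_outer


# Hypothetical well-conditioned test problem.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50.0 * np.eye(50)
b = rng.standard_normal(50)

x, n_outer = defect_correction_solve(A, b)
rel_res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
print("outer iterations:", n_outer, "relative residual:", rel_res)
```

Each outer iteration can only gain roughly single-precision accuracy, so a few passes are needed to reach a double-precision residual. In exchange, most of the arithmetic runs in the cheaper precision.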

Results

Machine	Algorithm	GPU	# GPUs	Time (s)
Titan	baseline	NVIDIA Tesla K20X	1024	4006
Titan	MG + FG	NVIDIA Tesla K20X	512	974
Summit	baseline	NVIDIA Tesla V100	128	1878
Summit	MG + FG	NVIDIA Tesla V100	128	329
Juelich Booster	MG + FG	NVIDIA A100 SXM	64	285
Juelich Booster	MG + FG	NVIDIA A100 SXM	128	166
Selene	MG + FG	NVIDIA A100 SXM	64	241
Selene	MG + FG	NVIDIA A100 SXM	128	150
Spock	MG + FG	AMD MI100	64	973
Spock	MG + FG	AMD MI100	128	640
Borg	MG + FG	AMD MI250	64 (128 GCDs)	386

Credits

Chroma-QUDA multigrid HMC developed jointly by Kate Clark (NVIDIA) and Bálint Joó (ORNL)
Titan, Summit, Spock and Borg results: Bálint Joó (ORNL)
Juelich Booster and Selene results: Mathias Wagner (NVIDIA)
qdpjit: Frank Winter (JLab)
Chroma: Robert Edwards (JLab) and Bálint Joó (ORNL)
Chroma's force-gradient integrator implemented by Boram Yoon (Los Alamos)

The Spock and Borg results are computed from the speedup numbers here, relative to the Titan baseline, accounting for the reduction in the number of GPUs. For example, the Borg number is computed as (4006/166)*(1024/64) ≈ 386 seconds.
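
The normalization in the Borg example can be written out as a short calculation. The sketch below assumes that the 166 in the formula is the quoted speedup factor relative to the Titan baseline; the function and parameter names are hypothetical.

```python
def normalized_time(baseline_time_s, baseline_gpus, speedup, target_gpus):
    """Scale a baseline wall-clock time by a speedup factor and a change in GPU count."""
    return (baseline_time_s / speedup) * (baseline_gpus / target_gpus)


# Titan baseline: 4006 s on 1024 GPUs. Borg: 64 GPUs, factor 166 (assumed speedup).
borg_time = normalized_time(4006, 1024, 166, 64)
print(round(borg_time))  # 386
```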