ATPESC Minutes, 29 July Day 4 Late Morning - ahmadia/atpesc-2013 GitHub Wiki
Performance in the HACC cosmology framework, Hal Finkel
- Recipe for designing application that helps you apply to your own codes
- Address issues that arose, team effort
Recipe 1 - Separation of Scales
-
Huge n-body problem, trillions of particles
-
Even with modern supercomputers, would be impossible
-
Use a FFT-based particle-mesh technique, but still too hard due to massive dynamic range
-
Solution: Separation of scales. Use FFT-based particle-mesh technique for as much as possible, use some less memory-hungry technique for the finer scales. Plus, longer spatial scales have longer characteristic time scales, so we can "subcycle" the smaller scale computations relative to the long-range force computations. The short scale computations are now rank-local.
We can write ( f(r_1-r_2) ) as ( f_{long}(r_1-r_2) + f_{short}(r_1-r_2) )
General Introduction to Fast Multipole Goes Here
- long-range forces are sampled empirically, then fit to a polynomial
- short-range forces are computed directly, and are computationally very expensive
Particle-Particle Particle-Mesh algorithm on accelerated systems (RCB tree) Tree Particle-Particle Particle-Mesh method on CPU-only systmes.
Force Splitting
- The gravitational force calculation is split into long-rate part and a short-range part
- A grid is responsible for largest 4 orders of magnitude of dynamic range
- particle methods handle the critical 2 orders of magnitude at the shortest scales
Long-Range Algorithm:
- The long/medium range algorithm is based on a fast, spectrally filtered PM method
- The density field is generated from the particles using a Cloud-in-Cell scheme
- The "Poisson-solve" is the composition of all the kernels above in one single Fourier transform
- Each component of the potential field gradient then requires an independent FFT
- Distributed FFTs use a pencil decomposition
- To obtain the short-range force, the filtered grid force is subtracted from the Newtonian force
Mixed precision:
- single precision is adequate for the short/close-range particle force evaluations and particle-time stepping
- double precision is used for the spectral component
Mixed precision doesn't help you much on BG/Q because FPUs are double-precision, but still helps with bandwidth.
- Peak performance on Titan is 3x higher for single-precision over double-precision.
Overloading
-
The spatial domain decomposition is in regular 3-D blocks, but unlike the guard zones of a typical PM method, full particle replication - termed 'particle overloading' - is employed across domain boundaries.
-
Structures don't move quickly on simulated time scales, allows you to avoid refreshing boundary conditions every time step. Physics-driven.
Time-Stepping
2nd-order symplectic time stepping (SKS)
RCB Tree
The short-range force is computed using recursive coordinate bisection (RCB) tree in conjunction with a highly-tuned short-range polynomial force kernel.
Tree is rebuilt on every time step, roughly ~2% of overall time is spent on force calculations.
You get a certain amount of flop/s for free, you should always use those.