ATPESC Minutes, 29 July Day 4 Late Morning - ahmadia/atpesc-2013 GitHub Wiki

Performance in the HACC cosmology framework, Hal Finkel

  • Recipe for designing application that helps you apply to your own codes
  • Address issues that arose, team effort

Recipe 1 - Separation of Scales

  • Huge n-body problem, trillions of particles

  • Even with modern supercomputers, would be impossible

  • Use a FFT-based particle-mesh technique, but still too hard due to massive dynamic range

  • Solution: Separation of scales. Use FFT-based particle-mesh technique for as much as possible, use some less memory-hungry technique for the finer scales. Plus, longer spatial scales have longer characteristic time scales, so we can "subcycle" the smaller scale computations relative to the long-range force computations. The short scale computations are now rank-local.

We can write ( f(r_1-r_2) ) as ( f_{long}(r_1-r_2) + f_{short}(r_1-r_2) )

General Introduction to Fast Multipole Goes Here

  • long-range forces are sampled empirically, then fit to a polynomial
  • short-range forces are computed directly, and are computationally very expensive

Particle-Particle Particle-Mesh algorithm on accelerated systems (RCB tree) Tree Particle-Particle Particle-Mesh method on CPU-only systmes.

Force Splitting

  • The gravitational force calculation is split into long-rate part and a short-range part
  • A grid is responsible for largest 4 orders of magnitude of dynamic range
  • particle methods handle the critical 2 orders of magnitude at the shortest scales

Long-Range Algorithm:

  • The long/medium range algorithm is based on a fast, spectrally filtered PM method
  • The density field is generated from the particles using a Cloud-in-Cell scheme
  • The "Poisson-solve" is the composition of all the kernels above in one single Fourier transform
  • Each component of the potential field gradient then requires an independent FFT
  • Distributed FFTs use a pencil decomposition
  • To obtain the short-range force, the filtered grid force is subtracted from the Newtonian force

Mixed precision:

  • single precision is adequate for the short/close-range particle force evaluations and particle-time stepping
  • double precision is used for the spectral component

Mixed precision doesn't help you much on BG/Q because FPUs are double-precision, but still helps with bandwidth.

  • Peak performance on Titan is 3x higher for single-precision over double-precision.

Overloading

  • The spatial domain decomposition is in regular 3-D blocks, but unlike the guard zones of a typical PM method, full particle replication - termed 'particle overloading' - is employed across domain boundaries.

  • Structures don't move quickly on simulated time scales, allows you to avoid refreshing boundary conditions every time step. Physics-driven.

Time-Stepping

2nd-order symplectic time stepping (SKS)

RCB Tree

The short-range force is computed using recursive coordinate bisection (RCB) tree in conjunction with a highly-tuned short-range polynomial force kernel.

Tree is rebuilt on every time step, roughly ~2% of overall time is spent on force calculations.

You get a certain amount of flop/s for free, you should always use those.