ATPESC Minutes, 29 July Day 4 Mid Morning - ahmadia/atpesc-2013 GitHub Wiki
Performance and Portability Lessons from HACC, Salman Habib
- High Performance Computing is not a solved problem.
- You are trying to solve a new problem, not to port an existing code
- First question: Do you really need it?
- Places a supercomputer (allegedly) can't help: e.g., an ODE with billions of time steps
- Make sure you understand the global science problems being addressed. There's no replacement for domain knowledge. Crucial issue for software developers.
- Given human constraints (and Moore's law), it is not usually worth it to go for the last factor of two, but there are exceptions - HACC is one.
- Obtaining performance is painful, so design for the future: what can you rely on, what can disappear, what can change, what can break? The more parameters you can control, the better. HPC systems are not your laptop: learn from experience.
- Many things can change under you, and you can't control them.
- Vectorize everything
- Don't try to be the performance guru for every machine; instead, badger the performance gurus for your machine
Portability (assuming you are developing new code)
- Three scales of code development
  - individual
  - small team
  - big team to open source
- Compute environment (diversity of environments and software)
  - small-scale - individual PI, low diversity hardware
  - medium-scale
  - large-scale - multiple projects, very diverse hardware
Step 1 - consider which categories your situation falls into, this will help set the portability constraints.
It is important to be ruthless. Whatever doesn't work, throw away as fast as you possibly can.
Simplicity is good. Google's programming paradigm is based on simplicity, avoiding anything non-functional.
- Design for the future. Software lifecycles should be long, but often are not
Think like UNIX. Small modules working together.
- Performance and portability are often in opposition, but they can be co-aligned, as in HACC.
What is HACC?
HACC (Hardware/Hybrid Accelerated Cosmology Code) Framework
- HACC does very large cosmological simulations
- Fastest scientific application on BlueGene/Q
Why HACC?
Sky surveys are generating an incredible amount of information; more cosmological forward models are needed to interpret it.
- Highly non-linear problems
Simulating the Universe
Vlasov-Poisson equation with gravity
- 6-dimensional PDE, nasty problems, being solved for an instability
- Needs high-resolution everywhere in the problem domain
- Uses approximate N-body technique
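For reference, the Vlasov-Poisson system in its plain Newtonian form can be written as below. This is a textbook sketch, not a transcription of the speaker's slides: the symbols $f$ (phase-space distribution), $\phi$ (gravitational potential), $\rho$ (mass density) and $m$ (particle mass) are notation assumed here, and cosmological codes normally work in comoving coordinates, which adds expansion factors omitted from this form.

$$
\frac{\partial f}{\partial t} + \mathbf{v}\cdot\nabla_{\mathbf{x}} f - \nabla_{\mathbf{x}}\phi\cdot\nabla_{\mathbf{v}} f = 0,
\qquad
\nabla^2\phi = 4\pi G\,\rho,
\qquad
\rho(\mathbf{x},t) = m\int f(\mathbf{x},\mathbf{v},t)\,d^3v
$$

Here $f$ depends on three position and three velocity coordinates, which is where the "6-dimensional" in the notes comes from. An N-body method samples $f$ with tracer particles instead of discretizing all six dimensions on a grid.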
N-Body Problem: Central Issues
- Naive P-P ($O(n^2)$) is hopeless
- Particle-Mesh: designed in the 50s, moved to other codes in 60s, now other fields
- Next-Generation Architectures; Pile of PCs vs. Pile of Cell phones
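To make the $O(n^2)$ cost of naive particle-particle (P-P) summation concrete, here is a minimal sketch of a direct-sum force loop. It is illustrative only and is not HACC code: the function name `direct_sum_accelerations`, the `softening` parameter and the unit choice `G=1.0` are assumptions made for this example.

```python
import math


def direct_sum_accelerations(positions, masses, G=1.0, softening=1e-3):
    """Return (accelerations, pair_count) from naive P-P summation.

    Every particle interacts with every other particle, so the number of
    pair interactions is n * (n - 1) / 2, i.e. O(n^2).
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    pair_count = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Separation vector from particle i to particle j.
            dx = [positions[j][k] - positions[i][k] for k in range(3)]
            # Softened squared distance avoids division by zero at r = 0.
            r2 = sum(d * d for d in dx) + softening ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                # Equal and opposite contributions (Newton's third law).
                acc[i][k] += G * masses[j] * dx[k] * inv_r3
                acc[j][k] -= G * masses[i] * dx[k] * inv_r3
            pair_count += 1
    return acc, pair_count
```

For $n = 10^{12}$ particles the pair count $n(n-1)/2$ is about $5 \times 10^{23}$ per force evaluation, which is why the notes call the naive approach hopeless and why grid-based methods such as Particle-Mesh are used for the bulk of the work.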
HACC's Domain: The "Bleeding Edge"
- Can the entire observable Universe be "stuffed" inside a supercomputer?
- Can the Universe be run as a short computational "experiment"?
- Computing Boundary Conditions
  - Total memory in the PB+ class
  - Performance in the 10+ PFLOP/s class
Outer Rim Run: Trillion+ Particles in a 'Box'
Insert cool video here.
Meeting the challenge: HACC on BG/Q and CPU/GPU Systems
Runs at 70% of peak, 90% parallel efficiency.
Co-Design vs. Code Design
HPC Myths
- There will be compilers to solve your problem
- There will be a DSL to solve your problem
- Special-purpose hardware
Dealing with Current HPC Reality
- Follow the architecture
- Know the boundary conditions
- There is no such thing as a 'code port'
- Think out of the box
- Get the best team
- Work together
Opening the HACC Black Box: Design Principles
- Optimize Next-Generation Code 'Ecology'
- Framework design
- Performance
- Assume you are 'on your own' for software support, but hook into tools as available
- Optimal Splitting of Gravitational Forces
- Compute to Communication Balance
- Time-stepping
- Force kernel
- Production Readiness
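As an illustration of what "splitting" a gravitational force can mean, the following is an Ewald-style split of the kind used in TreePM codes. It is a swapped-in example, not necessarily the splitting HACC uses (the notes do not give HACC's functional form); the split scale $r_s$ is a parameter assumed here.

$$
\phi(r) = -\frac{Gm}{r}
= \underbrace{-\frac{Gm}{r}\,\mathrm{erf}\!\left(\frac{r}{2r_s}\right)}_{\text{long range, smooth}}
\;+\;
\underbrace{-\frac{Gm}{r}\,\mathrm{erfc}\!\left(\frac{r}{2r_s}\right)}_{\text{short range, decays quickly}}
$$

Because $\mathrm{erf}(x) + \mathrm{erfc}(x) = 1$, the two pieces sum exactly to the Newtonian potential. The smooth piece can be resolved on a grid, while the short-range piece becomes negligible beyond a few $r_s$ and can be evaluated from nearby particles only.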
Splitting the Force: The Long-Range Solver
- Spectral Particle-Mesh Solver
- Short-range Force
- Pencil-decomposed Parallel 3-D FFT
- Time-stepping uses Symplectic Sub-cycling
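The sub-cycling idea in the last bullet can be sketched as a split-operator leapfrog: the slowly varying long-range force kicks once per long step, while the rapidly varying short-range force is integrated with several smaller kick-drift-kick sub-steps in between. This is a 1-D toy under stated assumptions, not HACC's integrator: the function name `subcycled_step` and the callables `accel_long` and `accel_short` are hypothetical.

```python
def subcycled_step(x, v, dt, n_sub, accel_long, accel_short):
    """Advance (x, v) by one long step dt using n_sub short-range sub-steps."""
    # Opening half kick from the long-range force.
    v += 0.5 * dt * accel_long(x)
    h = dt / n_sub
    for _ in range(n_sub):
        # Kick-drift-kick leapfrog driven by the short-range force only.
        v += 0.5 * h * accel_short(x)
        x += h * v
        v += 0.5 * h * accel_short(x)
    # Closing half kick from the long-range force at the updated position.
    v += 0.5 * dt * accel_long(x)
    return x, v


if __name__ == "__main__":
    # Toy forces: a weak "long-range" spring plus a stiff "short-range" spring.
    x, v = 1.0, 0.0
    for _ in range(1000):
        x, v = subcycled_step(x, v, 0.01, 4, lambda q: -q, lambda q: -10.0 * q)
    print(x, v)
```

The step is a symmetric composition of kicks and drifts, so it is time-reversible up to round-off: running the same number of steps with `-dt` returns to the starting state. The expensive long-range solve is also called only twice per long step, however many sub-steps the short-range force needs.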
Particle Overloading and Short-Range Solvers
- Particle Overloading allows short-range solver to be completely local
- Short-range Force
- Error tests
- Can directly compare different short-range solver algorithms
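A minimal sketch of the overloading idea, under simplifying assumptions: a 1-D periodic box is cut into equal slabs, and each slab stores its own particles plus copies of neighbouring particles that lie within an overload width of its edges. The names `overload_domains`, `owned` and `copies` are hypothetical, and the sketch assumes at least three domains with an overload width smaller than half a slab.

```python
def overload_domains(positions, box_size, n_domains, overload_width):
    """Assign 1-D periodic positions to slab domains with overloaded copies.

    Each domain gets the particles it owns plus copies of particles owned by
    its neighbours that lie within overload_width of the shared boundary.
    """
    width = box_size / n_domains
    domains = [{"owned": [], "copies": []} for _ in range(n_domains)]
    for x in positions:
        x = x % box_size  # wrap into the periodic box
        owner = min(int(x // width), n_domains - 1)
        domains[owner]["owned"].append(x)
        offset = x - owner * width  # distance from the owner's left edge
        if offset < overload_width:
            # Close to the left edge: the left neighbour also needs a copy.
            domains[(owner - 1) % n_domains]["copies"].append(x)
        if width - offset <= overload_width:
            # Close to the right edge: the right neighbour also needs a copy.
            domains[(owner + 1) % n_domains]["copies"].append(x)
    return domains
```

With the copies in place, a short-range force whose cutoff is no larger than the overload width can be evaluated for all owned particles without communicating with other domains, which is the sense in which the short-range solver becomes local. The cost is extra memory for the duplicated particles.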