Notes on Communication Patterns
From the meeting on this topic:
There are multiple kinds of communication performed:
- Ghost halo exchange, which is expensive, requires lots of data, and is performed every cycle
- Flux correction, which is less expensive, requires less data, and is also performed every cycle
- Re-meshing and load balancing, performed only every re-mesh cycle
Our GPU-performant ghost halo exchange machinery is mostly in `src/bvals/cc/bvals_cc_in_one.*`. `MPI_Irecv` is called first, at the beginning of a cycle; then `MPI_Start` and `MPI_Test` are used. Buffers are packed and unpacked in a single kernel.
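Below is a minimal, standalone sketch of that receive-early, `MPI_Start`/`MPI_Test` pattern using persistent MPI requests. The buffer size, tag, and ring-neighbor choice are purely illustrative, not Parthenon's actual setup.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int nghost = 1024;                   // hypothetical buffer size
  std::vector<double> recv_buf(nghost), send_buf(nghost, rank);
  const int next = (rank + 1) % size;        // hypothetical ring neighbors
  const int prev = (rank + size - 1) % size;

  // Persistent requests are set up once, then restarted every cycle.
  MPI_Request recv_req, send_req;
  MPI_Recv_init(recv_buf.data(), nghost, MPI_DOUBLE, prev, 0,
                MPI_COMM_WORLD, &recv_req);
  MPI_Send_init(send_buf.data(), nghost, MPI_DOUBLE, next, 0,
                MPI_COMM_WORLD, &send_req);

  for (int cycle = 0; cycle < 10; ++cycle) {
    MPI_Start(&recv_req);  // post the receive at the beginning of the cycle
    // ... pack send_buf in a (single) kernel here ...
    MPI_Start(&send_req);

    // Poll with MPI_Test so other work can overlap the communication.
    int done = 0;
    while (!done) {
      MPI_Test(&recv_req, &done, MPI_STATUS_IGNORE);
      // ... do other per-cycle work while waiting ...
    }
    // ... unpack recv_buf in a (single) kernel here ...
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
  }

  MPI_Request_free(&recv_req);
  MPI_Request_free(&send_req);
  MPI_Finalize();
  return 0;
}
```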
The most relevant place to look is `src/bvals/cc/flux_correction_cc.cpp`, which implements the functions of `BoundaryBuffer` from `src/bvals/bvals_interfaces.hpp`. We don't do anything special for GPUs here, other than put the calls in Kokkos kernels; it hasn't been needed.
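To give a feel for "put the calls in Kokkos kernels", here is a hypothetical pack-and-send routine: a Kokkos kernel packs face fluxes into a contiguous device buffer, which is then handed to plain MPI. The function name, arguments, and layout are assumptions for illustration, not the actual `BoundaryBuffer` implementation.

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

// Hypothetical sketch, not Parthenon's actual flux-correction code.
// flux holds (j,i) face fluxes; buf is a persistent, caller-owned buffer.
void SendFluxCorrection(const Kokkos::View<double **> &flux,
                        const Kokkos::View<double *> &buf,
                        int nbr_rank, int tag, MPI_Comm comm,
                        MPI_Request *req) {
  const int nj = flux.extent(0), ni = flux.extent(1);
  // Pack the face fluxes into the contiguous send buffer on device.
  Kokkos::parallel_for(
      "PackFluxCorrection",
      Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {nj, ni}),
      KOKKOS_LAMBDA(const int j, const int i) { buf(j * ni + i) = flux(j, i); });
  Kokkos::fence();  // the pack kernel must finish before MPI reads buf
  // With a GPU-aware MPI, the device pointer can be sent directly.
  MPI_Isend(buf.data(), nj * ni, MPI_DOUBLE, nbr_rank, tag, comm, req);
}
```

The caller owns `buf` and must keep it alive (and eventually `MPI_Wait` on `req`) until the send completes.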
The load balancing calculation is done locally. Each rank recomputes the entire global AMR tree locally. Then the meshblocks are assigned to ranks by walking through a space-filling curve (a sketch of this walk follows below). The cost of the blocks assigned to each rank is intended to be the same for every rank, which is effected via indexing logic. The only communication required is when moving blocks, as everything else is local. Moving blocks is, of course, very expensive. But re-meshing is only done once every several cycles, unlike the other operations, which are per-cycle.
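A minimal sketch of that cost-saturation walk, assuming a `costlist` already ordered along the space-filling curve (names mirror the text above, but the code is illustrative; edge cases such as more ranks than blocks are elided):

```cpp
#include <cstddef>
#include <vector>

// Walk the meshblock list in space-filling-curve order and advance to
// the next rank once the accumulated cost reaches that rank's share.
// Deterministic: every rank computes the same ranklist locally.
std::vector<int> AssignBlocksSketch(const std::vector<double> &costlist,
                                    int nranks) {
  double total = 0.0;
  for (double c : costlist) total += c;
  const double target = total / nranks;  // ideal cost per rank

  std::vector<int> ranklist(costlist.size());
  int rank = 0;
  double accum = 0.0;
  for (std::size_t gid = 0; gid < costlist.size(); ++gid) {
    ranklist[gid] = rank;
    accum += costlist[gid];
    // Move on once this rank's cumulative share of the cost is "full".
    if (rank < nranks - 1 && accum >= target * (rank + 1)) ++rank;
  }
  return ranklist;
}
```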
The call stack is something like this. Most of this machinery lives in `src/mesh/amr_loadbalance.cpp`. However, a little bit of it lives in `src/bvals/bvals_base.cpp`.
- `LoadBalancingAndAdaptiveMeshRefinement` wraps the two most important functions:
  - `UpdateMeshBlockTree` recomputes the AMR octree based on refinement criteria.
  - `GatherCostListAndCheckBalance` has an `MPI_Allgather` for the total load costs (see the sketch after this list).
  - `RedistributeAndRefineMeshBlocks`: if refinement happened or a load imbalance is detected, move meshblocks, assign neighbors, etc. All done deterministically.
    - `CalculateLoadBalance`
      - `AssignBlocks` walks through the space-filling curve and assigns blocks to MPI ranks as the cost saturates per rank. Fills a `std::vector<int>`, `ranklist`, which sets the rank of each meshblock and which the meshblocks will later use to find the ranks of their neighbors.
      - `UpdateBlockList` sets `nslist` and `nblist`, which map MPI rank to the global ID of the first meshblock on that rank and to the total number of meshblocks on that rank, respectively.
    - A bunch of machinery is called to generate buffers, copy meshblock data as needed, prolongate, restrict, etc. MPI communication happens here, on a potentially large amount of data, to, e.g., move meshblock data to new ranks.
    - The neighbor list is finally generated/saved in `pmb->pbval->SearchAndSetNeighbors`, which uses the AMR octree, the `ranklist`, and `nslist` to walk through the tree locally. No MPI communication is needed.
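On the `MPI_Allgather` in `GatherCostListAndCheckBalance`: the idea is that every rank contributes its local block costs, so each rank ends up with the full global cost list and can check balance locally. A sketch, assuming for simplicity an equal number of blocks per rank (an assumption the real code need not make):

```cpp
#include <mpi.h>
#include <vector>

// Gather every rank's local block costs into one global cost list,
// replicated on all ranks, so the balance check can be done locally.
std::vector<double> GatherCosts(const std::vector<double> &my_costs,
                                MPI_Comm comm) {
  int nranks;
  MPI_Comm_size(comm, &nranks);
  const int nb = static_cast<int>(my_costs.size());  // blocks per rank
  std::vector<double> all_costs(static_cast<std::size_t>(nb) * nranks);
  MPI_Allgather(my_costs.data(), nb, MPI_DOUBLE,
                all_costs.data(), nb, MPI_DOUBLE, comm);
  return all_costs;
}
```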