Notes on Communication Patterns
From the meeting on this topic:
There are multiple kinds of communication performed:
- Ghost halo exchange, which is expensive, requires lots of data, and is performed every cycle
- Flux correction, which is less expensive, requires less data, and is also performed every cycle
- Re-meshing and load balancing, performed only every re-mesh cycle
Our GPU-performant ghost halo exchange machinery is mostly in `src/bvals/cc/bvals_cc_in_one.*`. `MPI_Irecv` is called first, at the beginning of a cycle; then `MPI_Start` and `MPI_Test` are used. Buffers are packed and unpacked in a single kernel.
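Below is a minimal, standalone sketch of that receive-early, `MPI_Start`/`MPI_Test` pattern using persistent MPI requests. The buffer size, tag, and ring-neighbor choice are purely illustrative, not Parthenon's actual setup.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int nghost = 1024;                   // hypothetical buffer size
  std::vector<double> recv_buf(nghost), send_buf(nghost, rank);
  const int next = (rank + 1) % size;        // hypothetical ring neighbors
  const int prev = (rank + size - 1) % size;

  // Persistent requests are set up once, then restarted every cycle.
  MPI_Request recv_req, send_req;
  MPI_Recv_init(recv_buf.data(), nghost, MPI_DOUBLE, prev, 0,
                MPI_COMM_WORLD, &recv_req);
  MPI_Send_init(send_buf.data(), nghost, MPI_DOUBLE, next, 0,
                MPI_COMM_WORLD, &send_req);

  for (int cycle = 0; cycle < 10; ++cycle) {
    MPI_Start(&recv_req);  // post the receive at the beginning of the cycle
    // ... pack send_buf in a (single) kernel here ...
    MPI_Start(&send_req);

    // Poll with MPI_Test so other work can overlap the communication.
    int done = 0;
    while (!done) {
      MPI_Test(&recv_req, &done, MPI_STATUS_IGNORE);
      // ... do other per-cycle work while waiting ...
    }
    // ... unpack recv_buf in a (single) kernel here ...
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
  }

  MPI_Request_free(&recv_req);
  MPI_Request_free(&send_req);
  MPI_Finalize();
  return 0;
}
```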
The most relevant place to look is `src/bvals/cc/flux_correction_cc.cpp`, which implements the functions of `BoundaryBuffer` from `src/bvals/bvals_interfaces.hpp`. We don't do anything special for GPUs here, other than put the calls in Kokkos kernels; it hasn't been needed.
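To give a feel for "put the calls in Kokkos kernels", here is a hypothetical pack-and-send routine: a Kokkos kernel packs face fluxes into a contiguous device buffer, which is then handed to plain MPI. The function name, arguments, and layout are assumptions for illustration, not the actual `BoundaryBuffer` implementation.

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

// Hypothetical sketch, not Parthenon's actual flux-correction code.
// flux holds (j,i) face fluxes; buf is a persistent, caller-owned buffer.
void SendFluxCorrection(const Kokkos::View<double **> &flux,
                        const Kokkos::View<double *> &buf,
                        int nbr_rank, int tag, MPI_Comm comm,
                        MPI_Request *req) {
  const int nj = flux.extent(0), ni = flux.extent(1);
  // Pack the face fluxes into the contiguous send buffer on device.
  Kokkos::parallel_for(
      "PackFluxCorrection",
      Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {nj, ni}),
      KOKKOS_LAMBDA(const int j, const int i) { buf(j * ni + i) = flux(j, i); });
  Kokkos::fence();  // the pack kernel must finish before MPI reads buf
  // With a GPU-aware MPI, the device pointer can be sent directly.
  MPI_Isend(buf.data(), nj * ni, MPI_DOUBLE, nbr_rank, tag, comm, req);
}
```

The caller owns `buf` and must keep it alive (and eventually `MPI_Wait` on `req`) until the send completes.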
The load balancing calculation is done locally. Each rank recomputes the entire global AMR tree locally. Then the meshblocks are assigned to ranks by walking through a space-filling curve (a sketch of this walk follows below). The cost of the blocks assigned to each rank is intended to be the same for every rank, which is effected via indexing logic. The only communication required is when moving blocks, as everything else is local. Moving blocks is, of course, very expensive. But re-meshing is only done once every several cycles, unlike the other operations, which are per-cycle.
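A minimal sketch of that cost-saturation walk, assuming a `costlist` already ordered along the space-filling curve (names mirror the text above, but the code is illustrative; edge cases such as more ranks than blocks are elided):

```cpp
#include <cstddef>
#include <vector>

// Walk the meshblock list in space-filling-curve order and advance to
// the next rank once the accumulated cost reaches that rank's share.
// Deterministic: every rank computes the same ranklist locally.
std::vector<int> AssignBlocksSketch(const std::vector<double> &costlist,
                                    int nranks) {
  double total = 0.0;
  for (double c : costlist) total += c;
  const double target = total / nranks;  // ideal cost per rank

  std::vector<int> ranklist(costlist.size());
  int rank = 0;
  double accum = 0.0;
  for (std::size_t gid = 0; gid < costlist.size(); ++gid) {
    ranklist[gid] = rank;
    accum += costlist[gid];
    // Move on once this rank's cumulative share of the cost is "full".
    if (rank < nranks - 1 && accum >= target * (rank + 1)) ++rank;
  }
  return ranklist;
}
```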
The call stack is something like this. Most of this machinery lives in `src/mesh/amr_loadbalance.cpp`. However, a little bit of it lives in `src/bvals/bvals_base.cpp`.
- `LoadBalancingAndAdaptiveMeshRefinement` wraps the two most important functions:
  - `UpdateMeshBlockTree` recomputes the AMR octree based on refinement criteria.
  - `GatherCostListAndCheckBalance` has an `MPI_Allgather` for the total load costs (see the sketch after this list).
  - `RedistributeAndRefineMeshBlocks`: if refinement happened or a load imbalance is detected, move meshblocks, assign neighbors, etc. All done deterministically.
    - `CalculateLoadBalance`
      - `AssignBlocks` walks through the space-filling curve and assigns blocks to MPI ranks as the cost saturates per rank. Fills a `std::vector<int>`, `ranklist`, which sets the rank of each meshblock and which the meshblocks will later use to find the ranks of their neighbors.
      - `UpdateBlockList` sets `nslist` and `nblist`, which map MPI rank to the global ID of the first meshblock on that rank and to the total number of meshblocks on that rank, respectively.
    - A bunch of machinery is called to generate buffers, copy meshblock data as needed, prolongate, restrict, etc. MPI communication happens here, on a potentially large amount of data, to, e.g., move meshblock data to new ranks.
    - The neighbor list is finally generated/saved in `pmb->pbval->SearchAndSetNeighbors`, which uses the AMR octree, the `ranklist`, and `nslist` to walk through the tree locally. No MPI communication is needed.
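On the `MPI_Allgather` in `GatherCostListAndCheckBalance`: the idea is that every rank contributes its local block costs, so each rank ends up with the full global cost list and can check balance locally. A sketch, assuming for simplicity an equal number of blocks per rank (an assumption the real code need not make):

```cpp
#include <mpi.h>
#include <vector>

// Gather every rank's local block costs into one global cost list,
// replicated on all ranks, so the balance check can be done locally.
std::vector<double> GatherCosts(const std::vector<double> &my_costs,
                                MPI_Comm comm) {
  int nranks;
  MPI_Comm_size(comm, &nranks);
  const int nb = static_cast<int>(my_costs.size());  // blocks per rank
  std::vector<double> all_costs(static_cast<std::size_t>(nb) * nranks);
  MPI_Allgather(my_costs.data(), nb, MPI_DOUBLE,
                all_costs.data(), nb, MPI_DOUBLE, comm);
  return all_costs;
}
```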