GPU Load-Balancing Abstraction for Irregular Computations - gunrock/essentials GitHub Wiki

An open-source GPU load-balancing framework for applications that exhibit irregular parallelism. The applications and algorithms we consider are fundamental to computing tasks ranging from sparse machine learning and large numerical simulations to graph analytics. The data and data structures that drive these applications exhibit access patterns that do not naturally map well to a GPU architecture designed with dense, regular access patterns in mind. Prior to the work we present here, the only way to unleash the GPU's full power on these problems was to balance the workload through tightly coupled load-balancing techniques. Our proposed load-balancing abstraction decouples load balancing from work processing. It aims to support both static and dynamic schedules, with a programmable interface for implementing new load-balancing schedules in the future. With our open-source framework, we hope to improve programmers' productivity when developing irregular-parallel algorithms on the GPU, and to improve the overall performance of such applications by providing a quick path to experimentation with a variety of existing load-balancing techniques. We also hope that by separating the concerns of load balancing and work processing within our abstraction, managing and extending existing code to future architectures becomes easier.


Load-balanced Execution APIs

We provide two APIs for our library, one that focuses on a beginner-friendly approach to load-balancing irregular sparse computations and another that allows advanced programmers to retain control of the GPU kernels and express load-balanced execution as ranged loops. Both approaches are highlighted below.

Beginner API: Load-balanced Transformations and Primitives

Our load-balanced execution API builds on the approach defined in gunrock/essentials, where we identify the key primitives used in sparse linear algebra, graph analytics, and other irregular computations. We then implement load-balanced versions of these primitives, such that the user gets access to the atom, tile, and processor id they are working on through the C++ lambda's signature.

Users define their computation within the C++ lambda, which gets called by the load-balanced primitive for every instance of the work atom.

(1) Sparse Layout

In this simple example, we use the Compressed Sparse Row (CSR) format and simply return the number of atoms (nonzeros) in each row as our layout.

auto layout = [=] __device__ (std::size_t tile_id) {
  return offsets[tile_id + 1] - offsets[tile_id];
};
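For intuition, the layout above simply reports each row's nonzero count from the CSR offsets. A minimal host-side sketch of the same computation, using a hypothetical 3×3 matrix:

```cpp
#include <cstddef>
#include <vector>

// CSR offsets for a hypothetical 3x3 matrix whose rows hold
// 2, 0, and 3 nonzeros respectively.
std::vector<std::size_t> offsets = {0, 2, 2, 5};

// Same computation as the device-side layout lambda: the number of
// atoms (nonzeros) in a given tile (row).
std::size_t atoms_in_tile(std::size_t tile_id) {
  return offsets[tile_id + 1] - offsets[tile_id];
}
```

Here a "tile" is a CSR row and an "atom" is a single nonzero; note that row 1 is empty, which is exactly the kind of imbalance a load-balancing schedule has to handle.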

(2) User-defined Compute (C++ Lambda)

// user-defined compute: y = Ax
auto spmv = [=] __device__ (std::size_t atom_id, std::size_t tile_id, std::size_t proc_id) {
  return values[atom_id] * x[column_indices[atom_id]];
};

(3) Load-balanced primitive (Transform Segmented Reduce)

Requires the load-balancing schedule (work_oriented in this example) as a template parameter, and takes the transformation (the C++ lambda expression spmv) and the layout as inputs to perform a segmented reduction. The output of the C++ lambda expression is reduced by segments defined using A.offsets.

lb::transform_segreduce<lb::work_oriented>
                       (spmv, layout, A.nonzeros, A.offsets,
                        A.rows, y, lb::plus_t(),
                        0.0f, stream);
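Semantically, the primitive applies the user's lambda to every atom and reduces the results within each segment (row). The serial reference below sketches that semantics for SpMV; the function name and signature are illustrative, not the library's internals:

```cpp
#include <cstddef>
#include <vector>

// Serial reference for the semantics of a transform-segmented-reduce
// with a plus reduction: y[row] = init + sum of f(atom) over the
// row's segment, where f is the user-defined spmv lambda.
std::vector<float> segreduce_spmv(const std::vector<std::size_t>& offsets,
                                  const std::vector<std::size_t>& column_indices,
                                  const std::vector<float>& values,
                                  const std::vector<float>& x,
                                  float init = 0.0f) {
  std::size_t rows = offsets.size() - 1;
  std::vector<float> y(rows, init);
  for (std::size_t row = 0; row < rows; ++row)              // each segment
    for (std::size_t nz = offsets[row]; nz < offsets[row + 1]; ++nz)
      y[row] += values[nz] * x[column_indices[nz]];         // the spmv lambda
  return y;
}
```

The load-balanced primitive produces the same result, but distributes the atoms across GPU processors according to the chosen schedule instead of walking rows one at a time.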

Advantages:

  • Requires no knowledge of how to implement segmented reduction,
  • Very simple API if the computation can be defined using C++ lambda expressions.

Disadvantages:

  • No control over kernel execution and dispatch configuration,
  • No composability; cannot implement more complicated computations that may have cooperative properties among processors.

Advanced API: Load-balanced Loops

We begin with the SpMV problem-specific kernel parameters.

template <std::size_t threads_per_block,
          typename index_t,
          typename offset_t,
          typename type_t>
__global__ void __launch_bounds__(threads_per_block, 2)
          spmv(std::size_t rows,
               std::size_t cols,
               std::size_t nnz,
               offset_t* offsets,
               index_t* indices,
               const type_t* values,
               const type_t* x,
               type_t* y) {

(1) Define and Configure Load-balancing Schedule

Allocates any temporary memory required for load balancing and constructs a schedule per processor partition (defined using cooperative groups).

  using setup_t = schedule::setup<schedule::algorithms_t::tile_mapped,
                                  threads_per_block, 32, index_t, offset_t>;

  /// Allocate temporary storage for the schedule.
  using storage_t = typename setup_t::storage_t;
  __shared__ storage_t temporary_storage;

  /// Construct the schedule.
  setup_t config(temporary_storage, offsets, rows, nnz);
  auto p = config.partition();

(2) Load-balanced Ranged Loops (see: C++ ranges)

In this example, we define two iteration spaces: virtual and real. Virtual spaces allow us to balance atoms and tiles onto the processor ids and link directly to the real iteration space, which returns the exact atom or tile being processed. The code below loops over the balanced range of atoms, fetches the tile corresponding to the atom being processed, and allows the user to define their computation.

  for (auto virtual_atom : config.atom_accessor(p)) {
    auto virtual_tile = config.tile_accessor(virtual_atom, p);

    if (!(config.is_valid_accessor(virtual_tile, p)))
      continue;

    auto row = config.tile_id(virtual_tile, p);

    auto nz_idx = config.atom_id(virtual_atom, row, virtual_tile, p);

(3) User-defined Computation

Once the user has access to the atom, tile, and processor id, they implement the desired computation on the given tuple. In this example, we use a simple atomicAdd to perform SpMV (this can be improved, e.g., with a warp- or block-level reduction).

    atomicAdd(&(y[row]), values[nz_idx] * x[indices[nz_idx]]);
  }
}
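The key step the schedule performs is mapping a balanced, flat atom index back to its tile (row). One way to do this is a search over the CSR offsets; the host-side sketch below illustrates the idea (a simplification — the actual schedules use cooperative, shared-memory variants of this mapping):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Map a flat atom (nonzero) index back to its tile (row) by searching
// the CSR offsets. upper_bound returns the first offset strictly
// greater than atom_id; the owning row is one position before it.
std::size_t tile_of_atom(const std::vector<std::size_t>& offsets,
                         std::size_t atom_id) {
  auto it = std::upper_bound(offsets.begin(), offsets.end(), atom_id);
  return static_cast<std::size_t>(it - offsets.begin()) - 1;
}
```

With offsets {0, 2, 2, 5}, atoms 0-1 map to row 0 and atoms 2-4 map to row 2; the empty row 1 is skipped automatically, so every processor receives an equal share of atoms regardless of how skewed the rows are.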

Another example, using ranged loops over tiles and their atoms:

template <typename index_t,
          typename offset_t,
          typename type_t>
__global__ void spmv(const std::size_t rows,
                     const std::size_t cols,
                     const std::size_t nnz,
                     const offset_t* offsets,
                     const index_t* indices,
                     const type_t* values,
                     const type_t* x,
                     type_t* y) {
  /// Note: construction of the schedule (`config`) is set up as in the
  /// previous example and elided here for brevity.
  /// config.tiles() is equivalent to a grid-stride loop:
  /// row = blockIdx.x * blockDim.x + threadIdx.x; (init)
  /// row < rows; (boundary condition)
  /// row += gridDim.x * blockDim.x. (step)
  for (auto row : config.tiles()) {
    type_t sum = 0;

    /// Equivalent to:
    /// for (offset_t nz = offset; nz < end; ++nz)
    for (auto nz : config.atoms(row)) {
      sum += values[nz] * x[indices[nz]];
    }

    // Output
    y[row] = sum;
  }
}
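The comments above note that config.tiles() behaves like a grid-stride loop. The host-side sketch below serially emulates such a loop with hypothetical grid and block sizes, showing that every tile is visited exactly once no matter how the thread count relates to the row count:

```cpp
#include <cstddef>
#include <vector>

// Serially emulate a grid-stride loop over `rows` tiles: each virtual
// thread starts at its global id and advances by the total number of
// threads. Returns how many times each row was visited.
std::vector<int> grid_stride_visits(std::size_t rows,
                                    std::size_t grid_dim,
                                    std::size_t block_dim) {
  std::vector<int> visits(rows, 0);
  std::size_t stride = grid_dim * block_dim;      // gridDim.x * blockDim.x
  for (std::size_t tid = 0; tid < stride; ++tid)  // each virtual thread
    for (std::size_t row = tid; row < rows; row += stride)
      ++visits[row];
  return visits;
}
```

Because each row is owned by exactly one (virtual) thread, the inner atoms loop can accumulate into a plain local sum and write y[row] without atomics, unlike the tile-mapped example above.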