setup_conserve_interp_gpu

Setup_conserve_interp_gpu populates the Interp struct that holds the exchange grid information. The Interp struct is then used in do_scalar_conserve_interp_gpu for data remapping. If the user has provided a remap file, setup_conserve_interp_gpu reads in the remap file and transfers the data to the GPU before continuing. If a remap file is not provided, setup_conserve_interp_gpu transfers the grids to the GPU, creates the exchange grid, populates the Interp struct, and writes out the remap file for future use.
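In outline, the control flow looks roughly like the sketch below. This is a minimal illustration, not the actual source: the opcode flag name and the function signature are placeholders.

```c
/* Minimal control-flow sketch of the branch described above (illustrative,
 * not the FRE-NCtools source).  READ_REMAP is a hypothetical flag. */
#define READ_REMAP 0x1   /* hypothetical: a remap file was supplied */

void setup_conserve_interp_gpu_sketch(unsigned int opcode)
{
  if (opcode & READ_REMAP) {
    /* Read the remap file and transfer the exchange-grid data to the GPU,
       then continue on to remapping. */
  } else {
    /* Transfer the grids to the GPU, create the exchange grid, populate
       the Interp struct, and write out a remap file for future use. */
  }
}
```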

In both the CPU and GPU versions of setup_conserve_interp, create_xgrid_2dx2d_order1(2) and create_xgrid_2dx2d_order1(2)_gpu are the key functions invoked for exchange grid generation. For GPU offloading, OpenACC directives are applied to manage data transfer and to parallelize the for loops. Parallelizing the double loop structure that iterates over all output grid cells for each input grid cell posed challenges due to race conditions. The number of exchange grid cells is not known beforehand; cells are counted as they are formed. The variable nxgrid keeps track of the count and is incremented for each newly formed cell. Its current value is also used as the array index to store that exchange grid cell's information, such as the parent cell indices and the exchange grid cell area. At the end, nxgrid equals the total number of exchange grid cells. In the CPU version, the necessary arrays are malloc-ed in advance to be of size MAXXGRID, where MAXXGRID is a macro set to a default value of 1e6 with MPI parallelization and 5e6 for serial and OpenMP runs without MPI. During create_xgrid, if the number of exchange grid cells exceeds MAXXGRID, fregrid exits with an error.
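A simplified sketch of this CPU-side counting pattern follows. It is not the actual create_xgrid_2dx2d_order1 source: the clipping computation is reduced to a placeholder routine (get_overlap_area) and the array names are illustrative.

```c
/* Simplified sketch of the counting pattern described above (illustrative). */
#include <stdio.h>
#include <stdlib.h>

#define MAXXGRID 1000000   /* default upper limit with MPI parallelization */

double get_overlap_area(int ij1, int ij2);   /* placeholder clipping routine */

int create_xgrid_sketch(int ncells_in, int ncells_out,
                        int *i_in, int *i_out, double *xgrid_area)
{
  int nxgrid = 0;   /* counts exchange cells AND indexes the storage arrays */

  for (int ij1 = 0; ij1 < ncells_in; ij1++) {        /* input grid cells  */
    for (int ij2 = 0; ij2 < ncells_out; ij2++) {     /* output grid cells */
      double area = get_overlap_area(ij1, ij2);
      if (area > 0.0) {
        if (nxgrid >= MAXXGRID) {                    /* arrays were sized to MAXXGRID */
          fprintf(stderr, "Error: nxgrid exceeds MAXXGRID\n");
          exit(EXIT_FAILURE);
        }
        i_in[nxgrid]       = ij1;    /* parent input-cell index  */
        i_out[nxgrid]      = ij2;    /* parent output-cell index */
        xgrid_area[nxgrid] = area;   /* exchange-grid cell area  */
        nxgrid++;                    /* the count doubles as the next store index */
      }
    }
  }
  return nxgrid;   /* total number of exchange grid cells */
}
```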

Simply adding OpenACC loop directives to create_xgrid for the GPU version of the algorithm (create_xgrid_gpu) leads to inter-thread dependency issues, because multiple threads increment nxgrid simultaneously. OpenACC provides the reduction clause to overcome this issue: with reduction(+:nxgrid), each thread increments its own private copy of nxgrid, initialized to zero, and at the end of the OpenACC kernel nxgrid holds the sum over all threads. This strategy computes the correct number of exchange grid cells. However, nxgrid is also used as the array index for storing exchange grid cell information as cells are formed, and with the reduction clause threads holding the same private nxgrid value overwrite each other's elements in the shared arrays. Alternatively, the atomic directive can be applied to resolve the race condition, but this implementation results in significant performance loss.
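The two approaches can be sketched roughly as follows. These are illustrative fragments, not the FRE-NCtools source: get_overlap_area is a placeholder device-callable clipping routine, and the data clauses are simplified.

```c
/* Placeholder overlap test, assumed device-callable. */
#pragma acc routine seq
double get_overlap_area(int ij1, int ij2);

/* Version 1: reduction(+:nxgrid) gives the correct total, but each thread's
   private nxgrid restarts from zero, so it cannot serve as a store index. */
int count_with_reduction(int ncells_in, int ncells_out)
{
  int nxgrid = 0;
  #pragma acc parallel loop collapse(2) reduction(+:nxgrid)
  for (int ij1 = 0; ij1 < ncells_in; ij1++)
    for (int ij2 = 0; ij2 < ncells_out; ij2++)
      if (get_overlap_area(ij1, ij2) > 0.0) nxgrid++;  /* private copies summed at kernel end */
  return nxgrid;
}

/* Version 2: atomic capture hands each exchange cell a unique index, so the
   stores are correct, but serializing the counter update hurts performance. */
int store_with_atomic(int ncells_in, int ncells_out, int max_nxgrid,
                      int *i_in, int *i_out, double *xgrid_area)
{
  int nxgrid = 0;
  #pragma acc parallel loop collapse(2) copy(nxgrid) \
              copyout(i_in[0:max_nxgrid], i_out[0:max_nxgrid], xgrid_area[0:max_nxgrid])
  for (int ij1 = 0; ij1 < ncells_in; ij1++) {
    for (int ij2 = 0; ij2 < ncells_out; ij2++) {
      double area = get_overlap_area(ij1, ij2);
      if (area > 0.0) {
        int idx;
        #pragma acc atomic capture
        { idx = nxgrid; nxgrid++; }       /* serialized counter update */
        i_in[idx]       = ij1;
        i_out[idx]      = ij2;
        xgrid_area[idx] = area;
      }
    }
  }
  return nxgrid;
}
```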

Thus, a new algorithm was developed for create_xgrid_2dx2d. In this new algorithm, get_upbound_nxgrid_2dx2d is called first to get the approximate number of exchange grid cells formed from input cell $$ij1$$ (approx_nxcells_per_ij1), the index of the first output grid cell that overlaps with $$ij1$$ (ij2_start), and the index of the last output grid cell that overlaps with $$ij1$$ (ij2_end). These values can be computed correctly with the reduction clause specified for the nested loop iterating over all output cells for each input cell. From approx_nxcells_per_ij1, an estimate of nxgrid (which turns out to be an upper bound on nxgrid) can be deduced, and this value is used to malloc the arrays that will hold the exchange grid information; as a result, fregrid_gpu will not terminate prematurely with the MAXXGRID error described above. The cumulative sum of approx_nxcells_per_ij1 over the input cells also estimates the index of the first exchange grid cell formed from each cell $$ij1$$. This information allows parallelization of the outer loop over all input cells, but not the inner loop over the output cells, in create_xgrid_2dx2d_gpu. However, the inner loop iteration is limited to the small portion of the output grid from cell index ij2_start[ij1] to cell index ij2_end[ij1], which reduces computational time. Nxgrid is incremented sequentially within each thread, and exchange grid information is safely stored using the estimated indices obtained from the cumulative sum mentioned above.
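The sketch below illustrates this two-pass strategy under simplified assumptions: cells_may_overlap and get_overlap_area are placeholder routines, the arrays are assumed pre-allocated by the caller, and data movement is reduced to simple copyin/copyout clauses. It is not the actual get_upbound_nxgrid_2dx2d / create_xgrid_2dx2d_gpu source.

```c
/* Placeholder device-callable helpers. */
#pragma acc routine seq
int cells_may_overlap(int ij1, int ij2);     /* cheap lat/lon bounding-box test */
#pragma acc routine seq
double get_overlap_area(int ij1, int ij2);   /* exact clipping; may still reject */

int create_xgrid_2dx2d_sketch(int nin, int nout,
                              int *approx_nxcells_per_ij1, int *ij2_start, int *ij2_end,
                              int *xcells_start,
                              int *i_in, int *i_out, double *xgrid_area)
{
  /* Pass 1: per input cell, an overestimate of how many exchange cells it
     creates and the first/last overlapping output cells.  Reductions make
     these quantities safe to compute in parallel. */
  #pragma acc parallel loop copyout(approx_nxcells_per_ij1[0:nin], \
                                    ij2_start[0:nin], ij2_end[0:nin])
  for (int ij1 = 0; ij1 < nin; ij1++) {
    int napprox = 0, start = nout, end = -1;
    #pragma acc loop reduction(+:napprox) reduction(min:start) reduction(max:end)
    for (int ij2 = 0; ij2 < nout; ij2++) {
      if (cells_may_overlap(ij1, ij2)) {
        napprox++;
        if (ij2 < start) start = ij2;
        if (ij2 > end)   end   = ij2;
      }
    }
    approx_nxcells_per_ij1[ij1] = napprox;
    ij2_start[ij1] = start;
    ij2_end[ij1]   = end;
  }

  /* Cumulative sum: starting slot for each ij1's exchange cells, plus an
     upper bound on nxgrid used to allocate the storage arrays. */
  int upbound_nxgrid = 0;
  for (int ij1 = 0; ij1 < nin; ij1++) {
    xcells_start[ij1] = upbound_nxgrid;
    upbound_nxgrid += approx_nxcells_per_ij1[ij1];
  }

  /* Pass 2: only the outer loop is parallel; each ij1 writes into its own
     reserved slice starting at xcells_start[ij1], so no two threads touch
     the same slot, and the inner loop covers only ij2_start..ij2_end. */
  int nxgrid = 0;
  #pragma acc parallel loop reduction(+:nxgrid) \
              copyin(ij2_start[0:nin], ij2_end[0:nin], xcells_start[0:nin]) \
              copyout(i_in[0:upbound_nxgrid], i_out[0:upbound_nxgrid], \
                      xgrid_area[0:upbound_nxgrid])
  for (int ij1 = 0; ij1 < nin; ij1++) {
    int ix = xcells_start[ij1];
    #pragma acc loop seq
    for (int ij2 = ij2_start[ij1]; ij2 <= ij2_end[ij1]; ij2++) {
      double area = get_overlap_area(ij1, ij2);
      if (area > 0.0) {
        i_in[ix] = ij1;  i_out[ix] = ij2;  xgrid_area[ix] = area;
        ix++;            /* sequential store index within this thread */
        nxgrid++;        /* true exchange-cell count via the reduction */
      }
    }
  }
  return nxgrid;         /* <= upbound_nxgrid */
}
```

Because the upper bound comes from the cheap overlap test, some reserved slots may remain unused when the exact clipping rejects a candidate pair; the estimated indices are an over-allocation rather than a tight packing.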

The overall flow in setup_conserve_interp_gpu is outlined below; a schematic sketch follows the list.

for each output grid tile (otile):

- copy_grid_to_device_gpu: copy the longitudinal and latitudinal grid points for otile to the GPU.

- get_grid_cell_struct_gpu: compute the output grid cell vertices, area, longitudinal "center", and the leftmost (lon_min), rightmost (lon_max), bottommost (lat_min), and topmost (lat_max) vertices, and store this information in the output_grid_cells struct. These values are used to check whether input and output grid cells overlap and are computed in advance for computational efficiency; the corresponding information for the input grid is generated on the fly.

- for each input grid tile (itile)

- delete output array pointers on device: only the Interp struct remains on the device at this point.

- write_remap_file: update the Interp struct on the CPU and write the remap file.

- check_area_conservation: check whether the exchange grid covers the entire output grid to within a tolerance. Conservation is checked only if the user specifies --check_conserve as an input argument.
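As referenced above, the per-tile flow can be summarized schematically as follows. The real routine calls are shown only as comments to avoid guessing their argument lists; the tile counts and the conservation-check flag are placeholders.

```c
/* Schematic of the per-tile flow above (not the actual source). */
void setup_conserve_interp_gpu_flow(int ntiles_in, int ntiles_out, int check_conserve)
{
  for (int otile = 0; otile < ntiles_out; otile++) {
    /* copy_grid_to_device_gpu:   otile lon/lat grid points -> GPU        */
    /* get_grid_cell_struct_gpu:  precompute output_grid_cells on the GPU */

    for (int itile = 0; itile < ntiles_in; itile++) {
      /* build the exchange grid for the (itile, otile) pair with the      */
      /* two-pass algorithm sketched earlier and append it to Interp       */
    }

    /* delete the output-grid device pointers; only Interp stays on the GPU */
    /* write_remap_file:          update Interp on the CPU, write the file  */

    if (check_conserve) {
      /* check_area_conservation: exchange-grid area vs. output-grid area   */
    }
  }
}
```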
