setup conserve interp gpu - NOAA-GFDL/FRE-NCtools GitHub Wiki
Setup_conserve_interp_gpu populates the Interp struct that holds the exchange grid information. The Interp struct is used in do_scalar_conserve_interp_gpu for data remapping. If a remap file has been provided by the user, setup_conserve_interp_gpu reads in the remap file and transfers the data to GPU before continuing on to . If a remap is not provided, setup_conserve_interp_gpu transfers the grid to GPU, creates the exchange grid, populates the Interp struct, and writes out the remap file for future use.
In the CPU and GPU version of setup_conserve_interp, create_xgrid_2dx2d_order1(2) and create_xgrid_2dx2d_order1(2)_gpu are the key functions that are invoked for exchange grid generation. For GPU offloading, OpenACC directives need to be applied to manage data transfer and to parallelize the for loops. Parallelizing the double loop structure that iterates over all output grid cell for each input grid cells posed challenges due to race conditions. The number of exchange grid cells is not known beforehand and cells are counted as they are formed. The variable nxgrid keeps track of the count and is incremented for each formation as can be seen here. The value is also used as the array index to store that particular exchange grid cell information such as the parent cell indices and exchange grid cell area. At the end, nxgrid represents the total number of exchange grid cells. In the CPU version, the necessary arrays are malloc-ed in advance to be the size of MAXXGRID, where MAXXGRID is a macro set to a default value of 1e6 for MPI parallelization and 5e6 for serial and OpenMP without MPI paralleization. During create_xgrid, if the number of exchange grid cells exceed MAXXGRID, the fregrid exits with an error.
Simply adding OpenACC loop directives to create_xgrid for the GPU version of the algorithm (create_xgrid_gpu) leads to inter-thread dependency issues when multiple threads will increment nxgrid simultaneously. OpenACC provides the reduction clause to overcome this issue, and with reduction(+:nxgrid), each thread increments its private copy of nxgrid that is initially set to be zero. At the end of the OpenACC kernel, nxgrid is the sum over all nxgrid values from each thread. This strategy will compute the correct number of exchange grid cells. However, nxgrid is also used as the array index to store exchange grid cell information as they are formed. With the reduction clause, information will not be stored correctly in the shared arrays when threads overwrite array elements when threads have the same nxgrid value. Alternatively, the atomic directive can be applied to resolve the race condition. However this implementation results in significant performance loss.
Thus, a new algorithm was developed for create_xgid_2dx2d. In this new algorithm, get_upbound_nxgrid_2dx2d is called first to get the approximate number of exchange grid cells that is formed from input cell
for each output grid tile (otile)
-
copy_grid_to_device_gpu: copy longitudinal and latitudinal grid points for otile to GPU
-
get_grid_cell_struct_gpu: compute output grid cell vertices, area, longitudinal "center" leftmost (lon_min), rightmost (lon_max), bottommost (lat_min), and topmost (lat_max) vertices; and store information in the output_grid_cells struct. These values are used to check if the input and output grid cells overlap and are determined in advance for computational efficiency. Such information is generated on-the-fly for the input grid.
for each input grid tile (itile)
- copy_grid_to_device_gpu: copy longitudinal and latitudinal grid points for itile to GPU
- get_input_grid_mask_gpu: compute input grid cell mask on GPU. Input cell with mask value of 0 will be skipped during exchange grid generation.
- create_upbound_nxcells_arrays_on_device_gpu: create arrays approx_nxcells_per_ij1, ij2_start, and ij2_end on device. These arrays will respectively hold the approximate number of exchange grid cells generated from parent input cell ij1, the first output cell index that overlaps with parent cell ij1, and the last output cell index that overlaps with parent cell ij1.
- get_upbound_nxcells_2dx2d_gpu : see link
- create_xgrid_2dx2d_order1(2)_gpu : see link
- delete input array pointers on device
-
delete output array pointers on device: only the Interp struct remains on the device now.
-
write_remap_file : update Interp struct on CPU and write remap file.
-
check_area_conservation: check to see if the exchange grid covers the entire output grid to within a tolerance. Conservation is checked if the user specifies the --check_conserve as an input argument.