Notes on GPU programming - ProkopHapala/FireCore GitHub Wiki

Optimize (local) kernel sizes to efficiently use registers and local memory

  • How many registers does a GPU have?
  • CUDA Compute Capability Specification (number of registers and local memory)
    • From the documentation, an A100 SM has 64K = 65536 registers, supports up to 2048 resident threads, and limits each thread to 255 registers. source

  • AMD_OpenCL_Programming_Optimization_Guide
    • 2.6.2.2 Specifying the Default Work-Group Size at Compile-Time: The number of registers used by a work-item is determined by the compiler at compile time. The user later specifies the size of the work-group. Ideally, the OpenCL compiler knows the size of the work-group at compile time, so it can make optimal register-allocation decisions. Without knowing the work-group size, the compiler must assume an upper-bound size to avoid allocating more registers in the work-item than the hardware actually contains. OpenCL provides a mechanism to specify a work-group size that the compiler can use to optimize the register allocation. In particular, specifying a smaller work-group size at compile time allows the compiler to allocate more registers for each kernel, which can avoid spill code and improve performance. The kernel attribute syntax is: __attribute__((reqd_work_group_size(X, Y, Z))). Section 6.7.2 of the OpenCL specification explains the attribute in more detail.
    • In other words: to keep 2048 threads resident per SM, each thread may use at most 32 registers (65536 / 2048 = 32). With only 256 threads per SM, each thread can use up to the full 255-register limit.
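In OpenCL C the attribute is placed directly on the kernel; a small illustrative fragment (the kernel name and body are made up here):

```c
// Telling the compiler the exact work-group size (here 256x1x1) lets it
// allocate registers aggressively instead of assuming a worst-case size.
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void scale(__global float* data, const float factor) {
    const size_t i = get_global_id(0);
    data[i] *= factor;
}
```

Note that the launch must then use exactly this local size; clEnqueueNDRangeKernel returns CL_INVALID_WORK_GROUP_SIZE otherwise.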

  • FPGA SDK for OpenCL: 6.1. Specifying a Maximum Work-Group Size or a Required Work-Group Size
    • Specify the max_work_group_size or reqd_work_group_size attribute for your kernels whenever possible. These attributes allow the offline compiler to perform aggressive optimizations to match the kernel to hardware resources without any excess logic. The offline compiler assumes a default work-group size for your kernel depending on certain constraints imposed during compilation time and runtime. The offline compiler imposes the following constraints at compilation time:

      • If you specify a value for the reqd_work_group_size attribute, the work-group size must match this value.
      • If you specify a value for the max_work_group_size attribute, the work-group size must not exceed this value.
      • If you do not specify values for reqd_work_group_size and max_work_group_size, and the kernel contains a barrier, the offline compiler defaults to a maximum work-group size of 256 work-items.
      • If you do not specify values for both attributes and the kernel does not contain any barrier, the offline compiler does not impose any constraint on the work-group size at compilation time.

Other kernel attributes

__attribute__((vec_type_hint(<type>)))
__attribute__((work_group_size_hint(X, Y, Z)))
__attribute__((reqd_work_group_size(X, Y, Z)))
__attribute__((nosvm))
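An illustrative combination of the hint attributes on one kernel (the kernel itself is a made-up example). Unlike reqd_work_group_size, the *_hint attributes are non-binding: the runtime may still launch with a different local size, and vec_type_hint only suggests the computation width to the autovectorizer:

```c
__attribute__((vec_type_hint(float4)))
__attribute__((work_group_size_hint(64, 1, 1)))
__kernel void saxpy4(__global float4* y, __global const float4* x,
                     const float a) {
    const size_t i = get_global_id(0);
    y[i] += a * x[i];
}
```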