CUDA C Best Practices Guide

The profiling example in the guide is based on gprof, 
an open-source profiler for Linux platforms from the GNU Binutils collection.

" Strong Scaling and Amdahl’s Law describes strong scaling, which allows us to set an upper bound for the speedup with a fixed problem size. Weak Scaling and Gustafson’s Law describes weak scaling, where the speedup is attained by growing the problem size. In many applications, a combination of strong and weak scaling is desirable. "

Strong scaling is a measure of how, for a fixed overall problem size, 
the time to solution decreases as more processors are added to a system.
Weak scaling is a measure of how the time to solution changes 
as more processors are added to a system with a fixed problem size per processor; 
i.e., where the overall problem size increases as the number of processors is increased.
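
For reference, the speedup bounds behind those two regimes, with P the parallelizable fraction of the program, N the number of processors, and s the serial fraction:

$$S_{\text{Amdahl}} = \frac{1}{(1 - P) + \frac{P}{N}} \qquad\qquad S_{\text{Gustafson}} = N + (1 - N)\,s$$
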
Thrust provides a rich collection of data parallel primitives 
such as scan, sort, and reduce, which can be composed together 
to implement complex algorithms with concise, readable source code. 
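
A minimal sketch of composing Thrust primitives back to back (the data and sizes are illustrative, not from the guide):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    // Illustrative data: ~1M elements initialized to 1 on the device.
    thrust::device_vector<int> d_vals(1 << 20, 1);

    // Data-parallel primitives composed one after another; both run on the GPU.
    thrust::sort(d_vals.begin(), d_vals.end());
    int sum = thrust::reduce(d_vals.begin(), d_vals.end(), 0);

    std::printf("sum = %d\n", sum);
    return 0;
}
```
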
The OpenACC standard provides a set of compiler directives to specify loops 
and regions of code in standard C, C++ and Fortran 
that should be offloaded from a host CPU to an attached accelerator such as a CUDA GPU. 
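
A minimal OpenACC sketch in plain C (the saxpy routine is only illustrative; build with an OpenACC-capable compiler, e.g. nvc with -acc):

```c
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    // OpenACC directive: ask the compiler to offload this loop to the accelerator.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```
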
If most functions are defined as __host__ __device__ rather than just __device__ functions, 
then these functions can be tested on both the CPU and the GPU, thereby increasing our confidence that the function is correct.
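
For example, a routine qualified this way can be exercised by ordinary host-side unit tests and reused unchanged inside kernels; a minimal sketch:

```cuda
// Compiled for both host and device, so the same logic can be unit-tested
// on the CPU and called from device code.
__host__ __device__ float clampf(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

__global__ void clampKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clampf(data[i], 0.0f, 1.0f);
}
```
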
Because many CUDA API functions and all kernel launches are asynchronous with respect to the host thread, 
to accurately measure the elapsed time for a particular call or sequence of CUDA calls 
it is necessary to synchronize the CPU thread with the GPU by calling cudaDeviceSynchronize() 
immediately before starting and stopping the CPU timer.
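
A minimal sketch of that CPU-timer pattern (the kernel, data, and launch configuration are placeholders):

```cuda
#include <chrono>
#include <cstdio>

// Placeholder kernel for the timing sketch.
__global__ void myKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(float));

    cudaDeviceSynchronize();                        // drain previously issued work
    auto t0 = std::chrono::steady_clock::now();

    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);  // launch is asynchronous

    cudaDeviceSynchronize();                        // wait for the kernel to finish
    auto t1 = std::chrono::steady_clock::now();

    std::printf("elapsed: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    cudaFree(d_data);
    return 0;
}
```
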

cudaDeviceSynchronize() blocks the calling CPU thread until all CUDA calls previously issued by the thread are completed.

cudaStreamSynchronize() blocks the CPU thread until all CUDA calls previously issued into the given stream have completed. 

cudaEventSynchronize() blocks until a given event in a particular stream has been recorded by the GPU. 
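
cudaEventSynchronize() is what event-based GPU timing relies on; a sketch, reusing the placeholder kernel and data from the snippet above:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                      // enqueue 'start' in the default stream
myKernel<<<(N + 255) / 256, 256>>>(d_data, N);
cudaEventRecord(stop, 0);                       // enqueue 'stop' after the kernel

cudaEventSynchronize(stop);                     // block until 'stop' has been recorded
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);         // GPU-measured time between the events

cudaEventDestroy(start);
cudaEventDestroy(stop);
```
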
Memory throughput metrics reported by the profiler:

- Requested Global Load Throughput
- Requested Global Store Throughput
- Global Load Throughput
- Global Store Throughput
- DRAM Read Throughput
- DRAM Write Throughput

It’s important to note that both numbers are useful. 
The actual memory throughput shows how close the code is to the hardware limit, 
and a comparison of the effective or requested bandwidth to the actual bandwidth 
presents a good estimate of how much bandwidth is wasted by suboptimal coalescing of memory accesses.
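
The effective (requested) bandwidth itself is computed from the bytes read (B_r) and written (B_w) by the kernel and the kernel execution time:

$$\text{Effective bandwidth (GB/s)} = \frac{(B_r + B_w)/10^9}{\text{time (s)}}$$
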
The --ptxas-options=-v option of nvcc details the number of registers used per thread for each kernel.
See Hardware Multithreading of the CUDA C++ Programming Guide for the register allocation formulas 
for devices of various compute capabilities, 
and Features and Technical Specifications of the CUDA C++ Programming Guide for the total number of registers available on those devices. 
Medium Priority: The number of threads per block should be a multiple of 32 threads, 
because this provides optimal computing efficiency and facilitates coalescing.

- Threads per block should be a multiple of warp size to avoid wasting computation 
  on under-populated warps and to facilitate coalescing.

- A minimum of 64 threads per block should be used, 
  and only if there are multiple concurrent blocks per multiprocessor.

- Between 128 and 256 threads per block is a good initial range 
  for experimentation with different block sizes.

- Use several smaller thread blocks rather than one large thread block per multiprocessor 
  if latency affects performance. 
  This is particularly beneficial to kernels that frequently call __syncthreads().
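
A minimal launch-configuration sketch following these recommendations (the kernel, data, and problem size are placeholders; 256 threads per block is just the suggested starting point):

```cuda
const int N = 1 << 22;                              // placeholder problem size
const int threadsPerBlock = 256;                    // multiple of the 32-thread warp size
const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

myKernel<<<blocks, threadsPerBlock>>>(d_data, N);   // myKernel/d_data are placeholders
```
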

The CUDA Toolkit Samples provide several helper functions for error checking with the various CUDA APIs; 
these helper functions are located in the samples/common/inc/helper_cuda.h file in the CUDA Toolkit.
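
If the samples are not at hand, a minimal wrapper in the same spirit (a sketch, not the helper_cuda.h implementation) might look like:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Minimal stand-in for the checkCudaErrors-style helpers in helper_cuda.h.
#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Usage:
// CHECK_CUDA(cudaMalloc(&d_ptr, bytes));
// CHECK_CUDA(cudaMemcpy(d_ptr, h_ptr, bytes, cudaMemcpyHostToDevice));
```
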
The NVIDIA Management Library (NVML) is a C-based interface 
that provides direct access to the queries and commands exposed via nvidia-smi, 
and is intended as a platform for building 3rd-party system management applications. 
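
A minimal NVML sketch, assuming the NVML header and library are installed (link with -lnvidia-ml):

```c
#include <nvml.h>
#include <stdio.h>

int main(void) {
    if (nvmlInit() != NVML_SUCCESS) return 1;   // load and initialize NVML

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        unsigned int tempC = 0;

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
        printf("GPU %u: %s, %u C\n", i, name, tempC);
    }

    nvmlShutdown();
    return 0;
}
```
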
Performance optimization revolves around three basic strategies:

- Maximizing parallel execution
- Optimizing memory usage to achieve maximum memory bandwidth
- Optimizing instruction usage to achieve maximum instruction throughput
NVCC Options

nvcc supports a number of command-line parameters, 
of which the following are especially useful for optimization and related best practices:

-maxrregcount=N specifies the maximum number of registers kernels can use at a per-file level. 
 See [Register Pressure](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#register-pressure). 
(See also the __launch_bounds__ qualifier discussed in Execution Configuration 
 of the CUDA C++ Programming Guide to control the number of registers used on a per-kernel basis.)
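
A sketch of that per-kernel alternative (the bounds and the kernel are illustrative only, not recommendations):

```cuda
// Tells ptxas this kernel never launches more than 256 threads per block and
// that at least 2 resident blocks per multiprocessor are desired, which
// constrains register allocation for this kernel only.
__global__ void __launch_bounds__(256, 2)
scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}
```
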

- --ptxas-options=-v or -Xptxas=-v lists per-kernel register, shared, and constant memory usage.
- -ftz=true (denormalized numbers are flushed to zero)
- -prec-div=false (less precise division)
- -prec-sqrt=false (less precise square root)
- -use_fast_math coerces every functionName() call to the equivalent __functionName() call. 
  This makes the code run faster at the cost of diminished precision and accuracy. 
  See [Math Libraries](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#math-libraries).
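
For example, sinf() is one of the affected calls: under -use_fast_math it is treated as the faster, less accurate intrinsic __sinf(), which can also be called explicitly so that only selected call sites make the trade (a sketch):

```cuda
__global__ void sineKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Standard math-library call: full single-precision accuracy.
        float accurate = sinf(in[i]);
        // Fast intrinsic: what -use_fast_math would substitute everywhere.
        float fast = __sinf(in[i]);
        out[i] = accurate - fast;   // illustrative; shows the two variants side by side
    }
}
```
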