CUDA C Best Practices Guide
One profiling option is gprof, an open-source profiler for Linux platforms
from the GNU Binutils collection.
" Strong Scaling and Amdahl’s Law describes strong scaling, which allows us to set an upper bound for the speedup with a fixed problem size. Weak Scaling and Gustafson’s Law describes weak scaling, where the speedup is attained by growing the problem size. In many applications, a combination of strong and weak scaling is desirable. "
Strong scaling is a measure of how, for a fixed overall problem size,
the time to solution decreases as more processors are added to a system.
Weak scaling is a measure of how the time to solution changes
as more processors are added to a system with a fixed problem size per processor;
i.e., where the overall problem size increases as the number of processors is increased.
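For reference, both laws are stated in the guide: with $P$ the fraction of execution time that can be parallelized and $N$ the number of processors, Amdahl's law bounds the strong-scaling speedup and Gustafson's law gives the weak-scaling speedup.

$$S_{\text{Amdahl}} = \frac{1}{(1-P) + \frac{P}{N}}, \qquad S_{\text{Gustafson}} = N + (1-N)(1-P)$$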
Thrust provides a rich collection of data parallel primitives
such as scan, sort, and reduce, which can be composed together
to implement complex algorithms with concise, readable source code.
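A minimal sketch of that composition, sorting and summing arbitrary data on the GPU:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <vector>
#include <iostream>

int main() {
    // Arbitrary host data copied into device memory.
    std::vector<int> h{3, 1, 4, 1, 5, 9, 2, 6};
    thrust::device_vector<int> d_vec(h.begin(), h.end());

    thrust::sort(d_vec.begin(), d_vec.end());                 // parallel sort
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);  // parallel reduction

    std::cout << "sum = " << sum << '\n';
    return 0;
}
```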
The OpenACC standard provides a set of compiler directives to specify loops
and regions of code in standard C, C++ and Fortran
that should be offloaded from a host CPU to an attached accelerator such as a CUDA GPU.
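As a sketch, a SAXPY loop can be offloaded with a single directive; an OpenACC-capable compiler (e.g. nvc++ with -acc) is assumed, and without one the pragma is simply ignored:

```cpp
// The standard OpenACC directive asks the compiler to parallelize
// this loop on the attached accelerator.
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```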
If most functions are defined as __host__ __device__ rather than just __device__ functions,
then these functions can be tested on both the CPU and the GPU, thereby increasing our confidence
that the function is correct and that there will not be any unexpected differences in the results.
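A minimal sketch of the idea (axpy is a hypothetical helper, not a library function):

```cpp
// Compiled for both host and device, so the same code path can be
// unit-tested on the CPU and called from GPU kernels unchanged.
__host__ __device__ inline float axpy(float a, float x, float y) {
    return a * x + y;
}
```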
Many CUDA API calls, including kernel launches, are asynchronous with respect to the host.
Therefore, to accurately measure the elapsed time for a particular call or sequence of CUDA calls,
it is necessary to synchronize the CPU thread with the GPU by calling cudaDeviceSynchronize()
immediately before starting and stopping the CPU timer.
cudaDeviceSynchronize() blocks the calling CPU thread until all CUDA calls previously issued by the thread are completed.
cudaStreamSynchronize() blocks the CPU thread until all CUDA calls previously issued into the given stream have completed.
cudaEventSynchronize() blocks until a given event in a particular stream has been recorded by the GPU.
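A minimal host-timing sketch following that rule (myKernel and its launch configuration are placeholders):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void myKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main() {
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    cudaDeviceSynchronize();                        // drain previously issued work
    auto start = std::chrono::steady_clock::now();  // start CPU timer

    myKernel<<<4, 256>>>(d_data);                   // asynchronous launch

    cudaDeviceSynchronize();                        // wait until the kernel finishes
    auto stop = std::chrono::steady_clock::now();   // stop CPU timer

    printf("elapsed: %.3f ms\n",
           std::chrono::duration<double, std::milli>(stop - start).count());
    cudaFree(d_data);
    return 0;
}
```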
Memory throughput metrics reported by the profiler include:
- Requested Global Load Throughput
- Requested Global Store Throughput
- Global Load Throughput
- Global Store Throughput
- DRAM Read Throughput
- DRAM Write Throughput
Both the requested and the actual throughput numbers are useful:
the actual memory throughput shows how close the code is to the hardware limit,
and comparing the effective or requested bandwidth to the actual bandwidth
gives a good estimate of how much bandwidth is wasted by suboptimal coalescing of memory accesses.
The --ptxas-options=-v option of nvcc details the number of registers
used per thread for each kernel.
See *Hardware Multithreading* in the CUDA C++ Programming Guide for the register allocation formulas
for devices of various compute capabilities, and *Features and Technical Specifications*
for the total number of registers available on those devices.
Medium Priority: The number of threads per block should be a multiple of 32 threads,
because this provides optimal computing efficiency and facilitates coalescing.
- Threads per block should be a multiple of warp size to avoid wasting computation
on under-populated warps and to facilitate coalescing.
- A minimum of 64 threads per block should be used,
and only if there are multiple concurrent blocks per multiprocessor.
- Between 128 and 256 threads per block is a good initial range
for experimentation with different block sizes.
- Use several smaller thread blocks rather than one large thread block per multiprocessor
if latency affects performance.
This is particularly beneficial to kernels that frequently call __syncthreads().
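Putting the guidelines above together, a sketch of a typical launch configuration (vecAdd, n, and the device pointers are hypothetical):

```cpp
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void launchVecAdd(const float *d_a, const float *d_b, float *d_c, int n) {
    const int threadsPerBlock = 256;  // a multiple of the 32-thread warp size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
}
```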
The CUDA Toolkit Samples provide several helper functions for error checking with the various CUDA APIs;
these helper functions are located in the samples/common/inc/helper_cuda.h file in the CUDA Toolkit.
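In the same spirit, a minimal self-contained checking macro (CUDA_CHECK is my own name here, not one of the helper file's functions):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every runtime API call; abort with file/line on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
```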
The NVIDIA Management Library (NVML) is a C-based interface
that provides direct access to the queries and commands exposed via nvidia-smi,
and is intended as a platform for building 3rd-party system management applications.
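A minimal NVML sketch (link with -lnvidia-ml; most error handling omitted for brevity):

```cpp
#include <nvml.h>
#include <cstdio>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;   // load and initialize NVML

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        unsigned int tempC = 0;
        // Same data nvidia-smi reports as the GPU core temperature.
        if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC) == NVML_SUCCESS)
            printf("GPU 0 temperature: %u C\n", tempC);
    }

    nvmlShutdown();
    return 0;
}
```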
Performance optimization revolves around three basic strategies:
- Maximizing parallel execution
- Optimizing memory usage to achieve maximum memory bandwidth
- Optimizing instruction usage to achieve maximum instruction throughput
NVCC Options
nvcc supports a number of command-line parameters,
of which the following are especially useful for optimization and related best practices:
-maxrregcount=N specifies the maximum number of registers kernels can use at a per-file level.
See [Register Pressure](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#register-pressure).
(See also the __launch_bounds__ qualifier discussed in Execution Configuration
of the CUDA C++ Programming Guide to control the number of registers used on a per-kernel basis.)
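A minimal sketch of the per-kernel form; the bounds 256 and 2 are arbitrary example values:

```cpp
// Promise at most 256 threads per block and ask for at least 2 resident
// blocks per multiprocessor; the compiler limits register use to match.
__global__ void __launch_bounds__(256, 2)
scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}
```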
--ptxas-options=-v or -Xptxas=-v lists per-kernel register, shared, and constant memory usage.
- -ftz=true (denormalized numbers are flushed to zero)
- -prec-div=false (less precise division)
- -prec-sqrt=false (less precise square root)
The -use_fast_math compiler option of nvcc coerces every functionName() call to the equivalent __functionName() call.
This makes the code run faster at the cost of diminished precision and accuracy.
See [Math Libraries](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#math-libraries).
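To illustrate the difference, a sketch contrasting the intrinsic with the standard function (fastSin is a hypothetical kernel):

```cpp
__global__ void fastSin(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __sinf(in[i]);   // explicit fast intrinsic: quicker, less accurate
        // A plain sinf(in[i]) would compile to __sinf() under -use_fast_math.
    }
}
```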