CUDA Reading List - yszheda/wiki GitHub Wiki
nvidia devblogs reading list
- https://devblogs.nvidia.com/parallelforall/tag/linear-algebra/
- https://devblogs.nvidia.com/parallelforall/tag/parallel-programming/
- https://github.com/parallel-forall/code-samples
Performance Measure
A problem with using host-device synchronization points, such as cudaDeviceSynchronize(), is that they stall the GPU pipeline.
Memory Bandwidth
THEORETICAL BANDWIDTH
Theoretical bandwidth can be calculated using hardware specifications available in the product literature. For example, the NVIDIA Tesla M2050 GPU uses DDR (double data rate) RAM with a memory clock rate of 1,546 MHz and a 384-bit wide memory interface. Using these data items, the peak theoretical memory bandwidth of the NVIDIA Tesla M2050 is 148 GB/sec, as computed in the following.
BW_Theoretical = 1546 * 10^6 * (384/8) * 2 / 10^9 = 148 GB/s
In this calculation, we convert the memory clock rate to Hz, multiply it by the interface width (divided by 8, to convert bits to bytes), and multiply by 2 due to the double data rate. Finally, we divide by 10^9 to convert the result to GB/s.
EFFECTIVE BANDWIDTH
We calculate effective bandwidth by timing specific program activities and by knowing how our program accesses data. We use the following equation.
BW_Effective = (R_B + W_B) / (t * 10^9)
Here, BW_Effective is the effective bandwidth in units of GB/s, R_B is the number of bytes read per kernel, W_B is the number of bytes written per kernel, and t is the elapsed time given in seconds.
Computational Throughput
A common measure of computational throughput is GFLOP/s, which stands for “Giga-FLoating-point OPerations per second”, where Giga is the prefix for 10^9.
Dynamic Parallelism
- Adaptive Parallel Computation with CUDA Dynamic Parallelism
- CUDA Dynamic Parallelism API and Principles
A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size.
Grids launched with dynamic parallelism are fully nested. This means that child grids always complete before the parent grids that launch them, even if there is no explicit synchronization.
// Block-wide synchronization with child grids: only one thread per block
// calls cudaDeviceSynchronize(), bracketed by barriers so every thread in
// the block waits for the children launched by the block to complete.
__device__ void threadBlockDeviceSynchronize(void) {
    __syncthreads();
    if (threadIdx.x == 0) {
        cudaDeviceSynchronize();
    }
    __syncthreads();
}
This means that if the parent writes to a location, and then launches a child grid, the child is guaranteed to see the value actually written by the parent. Similarly, if the child writes a memory location, and the parent performs synchronization, the parent is guaranteed to see the value written by the child. This also means that if several child grids are executed sequentially (for example in the same stream), then any writes performed by earlier child grids are seen by child grids started later, even if no synchronization has occurred between them.
Note that the view of global memory is not consistent when the kernel launch construct is executed. To avoid race conditions, memory which can be read by the child should not be written by the parent after kernel launch but before explicit synchronization.
Passing Pointers to Child Grids
| Can be passed | Cannot be passed |
|---|---|
| global memory (incl. __device__ variables and malloc’ed memory) | shared memory (__shared__ variables) |
| zero-copy host memory | local memory (incl. stack variables) |
| constant memory (inherited and not writeable) | |
Recursion Depth and Device Limits
- nesting depth, which is the deepest nesting level of recursive grid launches, with kernels launched from the host having depth 0;
- synchronization depth, which is the deepest nesting level at which cudaDeviceSynchronize() is called.
Warp
Reduce Shared Memory Bank Conflict
- Randomized mapping: Rau, “Pseudo-randomly interleaved memory,” ISCA 1991
- Hash functions: V.d.Braak+, “Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs,” IEEE TC, 2016
C++11
thrust
review
- http://stackoverflow.com/tags/cuda/info
- http://stackoverflow.com/questions/1958320/interview-questions-on-cuda-programming
warp
memory
bank conflict
- http://stackoverflow.com/questions/3843032/why-arent-there-bank-conflicts-in-global-memory-for-cuda-opencl
- http://stackoverflow.com/questions/3841877/what-is-a-bank-conflict-doing-cuda-opencl-programming
- CUDA SHARED MEMORY
- GPU shared memory bank conflicts
sgemm
Examples
Gaussian Blur
- Gaussian Filter / Gaussian Smooth / Gaussian Blur: implementation and fast algorithms
- GPU Recursive Filtering
error-checking
- http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
- http://cuda.it168.com/a2011/1221/1292/000001292129.shtml
matrix multiplication
- http://www.es.ele.tue.nl/~mwijtvliet/5KK73/?page=mmcuda
- http://stackoverflow.com/questions/9244747/cuda-matrix-multiplication-write-to-wrong-memory-location
- http://stackoverflow.com/questions/9250897/dynamic-matrix-multiplication-with-cuda
CUDA and OpenGL
- http://stackoverflow.com/questions/6481123/cuda-and-opengl-interop
- https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf