CUDA Reading List

NVIDIA devblogs reading list

Performance Measure

A problem with using host-device synchronization points, such as cudaDeviceSynchronize(), is that they stall the GPU pipeline.
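One alternative is to time with CUDA events, which record GPU-side timestamps and only require waiting on the stop event. A minimal sketch (myKernel, its arguments, and the launch configuration are placeholders):

// Sketch: timing a kernel with CUDA events (myKernel is a placeholder).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<<gridSize, blockSize>>>(d_data, n);   // placeholder kernel launch
cudaEventRecord(stop);

cudaEventSynchronize(stop);                     // wait only for the stop event
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);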

Memory Bandwidth

THEORETICAL BANDWIDTH

Theoretical bandwidth can be calculated using hardware specifications available in the product literature. For example, the NVIDIA Tesla M2050 GPU uses DDR (double data rate) RAM with a memory clock rate of 1,546 MHz and a 384-bit wide memory interface. Using these data items, the peak theoretical memory bandwidth of the NVIDIA Tesla M2050 is 148 GB/sec, as computed in the following.

BW_Theoretical = 1546 * 10^6 * (384/8) * 2 / 10^9 = 148 GB/s

In this calculation, we convert the memory clock rate to Hz, multiply it by the interface width (divided by 8, to convert bits to bytes), and multiply by 2 due to the double data rate. Finally, we divide by 10^9 to convert the result to GB/s.
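The same numbers can also be read from the device properties at run time; a minimal sketch for device 0 (memoryClockRate is reported in kHz and memoryBusWidth in bits):

// Sketch: theoretical peak bandwidth computed from the device properties.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
double bwTheoretical =
    2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6;  // GB/s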

EFFECTIVE BANDWIDTH

We calculate effective bandwidth by timing specific program activities and by knowing how our program accesses data. We use the following equation.

BW_Effective = (R_B + W_B) / (t * 10^9)

Here, BW_Effective is the effective bandwidth in units of GB/s, R_B is the number of bytes read per kernel, W_B is the number of bytes written per kernel, and t is the elapsed time given in seconds.
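As a rough illustration, for a SAXPY kernel (y = a*x + y) over N single-precision elements, x and y are each read once and y is written once. A minimal sketch, assuming ms holds the measured kernel time in milliseconds (e.g. from cudaEventElapsedTime above):

// Sketch: effective bandwidth of SAXPY over N floats.
int N = 1 << 20;
double bytesRead    = 2.0 * N * sizeof(float);  // x and y are each read once
double bytesWritten = 1.0 * N * sizeof(float);  // y is written once
double bwEffective  = (bytesRead + bytesWritten) / (ms / 1000.0) / 1.0e9;  // GB/s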

Computational Throughput

A common measure of computational throughput is GFLOP/s, which stands for “Giga-FLoating-point OPerations per second”, where Giga is the prefix for 10^9.
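Continuing the SAXPY illustration above (an assumption, not from the original page): each element costs one multiply and one add, i.e. 2 floating-point operations.

// Sketch: computational throughput of SAXPY over N floats,
// again assuming ms holds the measured kernel time in milliseconds.
double gflops = 2.0 * N / (ms / 1000.0) / 1.0e9;  // 2 FLOPs per element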


Dynamic Parallelism

A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size.

Grids launched with dynamic parallelism are fully nested. This means that child grids always complete before the parent grids that launch them, even if there is no explicit synchronization.

// Block-wide synchronization with all child grids launched by threads of this block.
__device__ void threadBlockDeviceSynchronize(void) {
  __syncthreads();              // make sure every thread has launched its child grids
  if (threadIdx.x == 0) {
    cudaDeviceSynchronize();    // one thread waits for all child grids of the block
  }
  __syncthreads();              // the rest of the block waits for the synchronizing thread
}

A child grid's view of global memory is fully consistent with the parent thread at two points: when the child grid is launched, and when the child grid completes as observed by the parent after synchronization. This means that if the parent writes to a location and then launches a child grid, the child is guaranteed to see the value actually written by the parent. Similarly, if the child writes a memory location and the parent performs synchronization, the parent is guaranteed to see the value written by the child. This also means that if several child grids are executed sequentially (for example, in the same stream), then any writes performed by earlier child grids are seen by child grids started later, even if no synchronization has occurred between them.

Note that the view of global memory is not consistent when the kernel launch construct is executed. To avoid race conditions, memory which can be read by the child should not be written by the parent after kernel launch but before explicit synchronization.
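A minimal sketch of this guarantee (kernel names are placeholders; dynamic parallelism requires compiling with relocatable device code, -rdc=true):

__global__ void childKernel(int *data) {
  // Guaranteed to see the parent's write performed before the launch.
  data[0] += 1;
}

__global__ void parentKernel(int *data) {
  if (threadIdx.x == 0) {
    data[0] = 42;                    // parent writes global memory
    childKernel<<<1, 1>>>(data);     // child sees data[0] == 42
    cudaDeviceSynchronize();         // after this, the parent sees data[0] == 43
  }
}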

Passing Pointers to Child Grids

Can be passed:

* global memory (incl. __device__ variables and malloc'ed memory)
* zero-copy host memory
* constant memory (inherited and not writeable)

Cannot be passed:

* shared memory (__shared__ variables)
* local memory (incl. stack variables)
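A hedged sketch of these rules, reusing the placeholder childKernel declared above:

__global__ void launcher(int *g_data) {          // g_data points to global memory
  __shared__ int s_data[256];
  int local = 0;

  childKernel<<<1, 1>>>(g_data);                 // OK: global memory pointer
  // childKernel<<<1, 1>>>(s_data);              // illegal: shared memory pointer
  // childKernel<<<1, 1>>>(&local);              // illegal: local (stack) memory pointer
}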

Recursion Depth and Device Limits

  • nesting depth, which is the deepest nesting level of recursive grid launches, with kernels launched from the host having depth 0;
  • synchronization depth, which is the deepest nesting level at which cudaDeviceSynchronize() is called.
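Both device-runtime limits can be raised from the host with cudaDeviceSetLimit() before launching; a minimal sketch (the values are arbitrary examples):

// Sketch: raising the device-runtime limits from the host.
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);              // deepest level allowed to call cudaDeviceSynchronize()
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);  // buffer for outstanding child launches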

Warp

Reduce Shared Memory Bank Conflict

C++11

thrust

review

warp

memory

bank conflict

sgemm

Examples

Gaussian Blur

error-checking

matrix multiplication

CUDA and OpenGL

Fermi Architecture

cuDNN

restrict