Chapter 02 CUDA programming model

CUDA programming structure

  • allocate GPU memory
  • copy data from the CPU (host) to the GPU (device)
  • invoke the CUDA kernel
  • copy data from the GPU back to the CPU
  • free GPU memory
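
A minimal host-side sketch of this flow (the kernel name myKernel and the problem size N are made up for illustration; the real kernels appear later in this chapter):

#include <cstdlib>

// hypothetical kernel: doubles every element
__global__ void myKernel(float *data, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) data[i] *= 2.0f;
}

int main(void) {
    int N = 1 << 20;
    size_t nBytes = N * sizeof(float);
    float *h_data = (float *)malloc(nBytes);
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    // 1. allocate GPU memory
    float *d_data;
    cudaMalloc((void **)&d_data, nBytes);

    // 2. copy data from CPU to GPU
    cudaMemcpy(d_data, h_data, nBytes, cudaMemcpyHostToDevice);

    // 3. invoke the CUDA kernel
    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);

    // 4. copy data from GPU back to CPU
    cudaMemcpy(h_data, d_data, nBytes, cudaMemcpyDeviceToHost);

    // 5. free GPU memory
    cudaFree(d_data);
    free(h_data);
    return 0;
}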

1. GPU memory and threads

1.1 API for GPU memory operations
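
The basic device-memory operations are handled by four runtime functions, whose signatures mirror the standard C functions malloc, memcpy, memset, and free:

cudaError_t cudaMalloc(void **devPtr, size_t size);            // allocate device memory
cudaError_t cudaMemcpy(void *dst, const void *src,
                       size_t count, cudaMemcpyKind kind);     // transfer data; kind is e.g. cudaMemcpyHostToDevice
cudaError_t cudaMemset(void *devPtr, int value, size_t count); // initialize device memory
cudaError_t cudaFree(void *devPtr);                            // release device memory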

1.2 GPU thread hierarchy abstraction and memory structure

  • Thread abstraction: a grid of blocks & a block of threads

    • Grid: contains all threads spawned by a single kernel launch.
    • Block: contains a group of threads that can cooperate with each other. (Threads in different blocks cannot communicate.)
  • Memory structure

    • Shared memory: each block has its own shared memory, which can be accessed by the threads within that block.
    • Global memory: all threads in a grid share the same global memory.
  • Accessing block/thread indices within a kernel: use the built-in variables threadIdx, blockIdx, blockDim, and gridDim (a small sketch follows this list).
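
A minimal sketch of these built-in variables (the kernel name checkIndex is just for illustration; device-side printf requires compute capability 2.0 or later):

#include <cstdio>

__global__ void checkIndex(void) {
    printf("threadIdx (%d,%d,%d)  blockIdx (%d,%d,%d)  blockDim (%d,%d,%d)  gridDim (%d,%d,%d)\n",
           threadIdx.x, threadIdx.y, threadIdx.z,
           blockIdx.x,  blockIdx.y,  blockIdx.z,
           blockDim.x,  blockDim.y,  blockDim.z,
           gridDim.x,   gridDim.y,   gridDim.z);
}

// e.g. launch 2 blocks of 3 threads and wait for the output:
// checkIndex<<<2, 3>>>();
// cudaDeviceSynchronize();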

2. CUDA kernel

2.1 Writing CUDA kernel

  • Function type qualifiers in CUDA
    • The __device__ and __host__ qualifiers can be used together, in which case the function is compiled for both the host and the device (a small example follows this list).
    • __global__ functions can also be launched from the device on GPUs that support dynamic parallelism (compute capability 3.5 and higher).
    • A kernel is declared with __global__.

  • A CUDA kernel must
    • access device memory only
    • have a void return type
    • not support a variable number of arguments
    • not support function pointers
    • exhibit asynchronous behavior
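
A small example of combining qualifiers (the function square is made up for illustration):

// Compiled for both the host and the device, so it can be called from
// ordinary host code as well as from inside a kernel.
__host__ __device__ float square(float x) {
    return x * x;
}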

Example: adding two vectors together

C function:

void sumArraysOnHost(float *A, float *B, float *C, const int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

CUDA kernel (assuming an execution configuration of grid(1), block(N), i.e. a single block of N threads):

__global__ void sumArraysOnDevice(float *A, float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

Launching this kernel (A, B, and C must be device pointers):

sumArraysOnDevice<<<1, N>>>(A, B, C);

2.2 Launching a CUDA kernel

kernel_name<<<grid, block>>>(args...)
  • A kernel call is asynchronous with respect to the host thread, but we can force the host to wait for the kernel to complete by calling cudaDeviceSynchronize() (see the sketch below).
  • Some CUDA runtime APIs are implicitly synchronizing. For example, with cudaMemcpy() the data copy starts only after all previous kernel calls have completed, and the host must wait for the copy to finish before continuing.
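
A minimal sketch of launching, synchronizing, and checking for errors (using the sumArraysOnDevice kernel and the device pointers A, B, C from the example above):

sumArraysOnDevice<<<1, N>>>(A, B, C);

// block the host until the kernel has finished;
// the return value also surfaces errors from the kernel execution
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
}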

2.3 Debugging tricks
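
One widely used trick (a sketch, not specific to these notes) is to wrap every CUDA runtime call in an error-checking macro so that failures are reported with the file and line where they occurred:

#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                              \
do {                                                             \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
        fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                __FILE__, __LINE__, cudaGetErrorString(err));    \
        exit(EXIT_FAILURE);                                      \
    }                                                            \
} while (0)

// usage:
// CHECK(cudaMalloc((void **)&d_A, nBytes));
// CHECK(cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice));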

2.4 Timing the kernel with nvprof

nvprof [nvprof_args] <application> [application_args]

For example, running nvprof ./sumVectorsOnDevice gives the following output:

==5846== Profiling application: ./sumVectorsOnDevice
==5846== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   63.85%  174.39ms         2  87.194ms  87.103ms  87.285ms  [CUDA memcpy HtoD]
                   32.90%  89.840ms         1  89.840ms  89.840ms  89.840ms  [CUDA memcpy DtoH]
                    3.25%  8.8738ms         1  8.8738ms  8.8738ms  8.8738ms  sumArraysOnDevice(float*, float*, float*, int)
      API calls:   51.08%  264.39ms         3  88.129ms  87.172ms  89.922ms  cudaMemcpy
                   27.18%  140.66ms         3  46.886ms  681.16us  139.27ms  cudaMalloc
                   16.90%  87.469ms         1  87.469ms  87.469ms  87.469ms  cudaDeviceReset
                    2.68%  13.886ms         3  4.6288ms  677.44us  6.9015ms  cudaFree
                    1.73%  8.9369ms         1  8.9369ms  8.9369ms  8.9369ms  cudaDeviceSynchronize
                    0.17%  872.99us        94  9.2870us     508ns  376.16us  cuDeviceGetAttribute
                    0.16%  814.02us         1  814.02us  814.02us  814.02us  cudaGetDeviceProperties
                    0.08%  402.53us         1  402.53us  402.53us  402.53us  cuDeviceTotalMem
                    0.02%  115.16us         1  115.16us  115.16us  115.16us  cuDeviceGetName
                    0.00%  20.226us         1  20.226us  20.226us  20.226us  cudaLaunch
                    0.00%  15.084us         1  15.084us  15.084us  15.084us  cudaSetDevice
                    0.00%  6.6440us         3  2.2140us     495ns  3.3310us  cuDeviceGetCount
                    0.00%  4.3270us         2  2.1630us     992ns  3.3350us  cuDeviceGet
                    0.00%     827ns         4     206ns     108ns     378ns  cudaSetupArgument
                    0.00%     688ns         1     688ns     688ns     688ns  cudaConfigureCall

For HPC workloads, it is important to understand the compute to communication ratio in a program.

  • If your application spends more time computing than transferring data, then it may be possible to overlap computation with communication and completely hide the latency associated with transferring data (more details in Chapter 6).
  • If your application spends less time computing than transferring data, it is important to minimize the transfer between the host and device.

2.5 Understanding theoretical limits

While performing application optimization, it is important to determine how your application compares to theoretical limits. Counters collected from nvprof can help you derive instruction and memory throughput for your application. If you compare application measured values to theoretical peak values, you can determine if your application is limited by arithmetic or by memory bandwidth. Theoretical ratios can be derived as follows using Tesla K10 as an example:

  • Tesla K10 Peak Single Precision FLOPS:

745 MHz core clock * 2 GPUs/board * (8 multiprocessors * 192 fp32 cores/multiprocessor) * 2 ops/cycle = 4.58 TFLOPS

  • Tesla K10 Peak Memory Bandwidth:

2 GPUs/board * 256-bit bus * 2500 MHz memory clock * 2 (DDR) / 8 bits/byte = 320 GB/s

  • Ratio of instruction/bytes:

4.58 TFLOPS / 320 GB/s yields 13.6 instructions/1 byte

This means: for Tesla K10, if your application issues more than 13.6 instructions for every byte accessed, then your application is bound by arithmetic performance.
(Note: Most HPC workloads are bound by memory bandwidth.)
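
A rough sketch of deriving these peak numbers for the GPU in your own system from cudaDeviceProp (this is per GPU, not per board; the number of fp32 cores per multiprocessor is architecture specific and is assumed here, e.g. 192 for Kepler):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

const int coresPerSM = 192;   // assumption: fp32 cores per multiprocessor (192 on Kepler)

// clockRate and memoryClockRate are reported in kHz, memoryBusWidth in bits
double peakGFLOPS    = 2.0 * prop.multiProcessorCount * coresPerSM * (prop.clockRate * 1e3) / 1e9;
double peakBandwidth = 2.0 * (prop.memoryClockRate * 1e3) * (prop.memoryBusWidth / 8.0) / 1e9;  // GB/s, x2 for DDR

printf("peak ~%.0f GFLOPS, ~%.0f GB/s, ratio ~%.1f FLOP : 1 byte\n",
       peakGFLOPS, peakBandwidth, peakGFLOPS / peakBandwidth);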

3. Organizing Parallel Threads

3.1 Indexing matrices with blocks and threads

Thread and block index --> Coordinate in the matrix --> Offset in linear global memory

  • Thread and block index --> Coordinate in the matrix
int ix = threadIdx.x + blockIdx.x * blockDim.x;
int iy = threadIdx.y + blockIdx.y * blockDim.y;
  • Coordinate in the matrix --> Offset in linear global memory
int idx = iy * nx + ix;
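
For example (values chosen arbitrarily): with nx = 16, blockDim = (4, 2), blockIdx = (2, 1), and threadIdx = (1, 0), we get ix = 1 + 2 * 4 = 9, iy = 0 + 1 * 2 = 2, and idx = 2 * 16 + 9 = 41.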


3.2 Case study: matrix addition

  • 2D Grid and 2D Blocks
dim3 block(16, 16);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int nx, const int ny) {
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    if (ix < nx && iy < ny)  {
        int idx = iy * nx + ix;
        C[idx] = A[idx] + B[idx];
    }
}
  • 1D Grid and 1D Blocks
dim3 block(32);
dim3 grid((nx + block.x - 1) / block.x, 1);
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int nx, const int ny) {
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    if (ix < nx) {
        for (int iy = 0; iy < ny; iy++) {
            int idx = iy * nx + ix;
            C[idx] = A[idx] + B[idx];
        }
    }
}
  • 2D Grid and 1D Blocks
dim3 block(32);
dim3 grid((nx + block.x - 1) / block.x, ny);
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int nx, const int ny) {
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = blockIdx.y;
    if (ix < nx && iy < ny)  {
        int idx = iy * nx + ix;
        C[idx] = A[idx] + B[idx];
    }
}
  • Changing the execution configuration affects performance.
  • For a given kernel, trying different grid and block dimensions may yield better performance.
    (Chapter 3 explains, from a hardware perspective, why different configurations perform differently.)

4. Managing devices

4.1 Using the runtime API to query GPU information

  • Query GPU info

Refer to the code here
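
A minimal sketch of such a query (not the linked code, just an illustration; all fields used are standard members of cudaDeviceProp):

int dev = 0;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, dev);

printf("Device %d: %s\n", dev, prop.name);
printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
printf("  Multiprocessors:         %d\n", prop.multiProcessorCount);
printf("  Global memory:           %.2f GB\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
printf("  Warp size:               %d\n", prop.warpSize);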

  • Determining the best GPU
    Some systems contain multiple GPUs. When the GPUs differ from each other, it may be important to select the best one to run your kernel on. One way to identify the most computationally capable GPU is by the number of multiprocessors it contains:
int numDevices = 0;
cudaGetDeviceCount(&numDevices);
if (numDevices > 1) {
    int maxMultiprocessors = 0, maxDevice = 0;
    // pick the device with the largest multiprocessor count
    for (int device = 0; device < numDevices; device++) {
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, device);
        if (maxMultiprocessors < props.multiProcessorCount) {
            maxMultiprocessors = props.multiProcessorCount;
            maxDevice = device;
        }
    }
    cudaSetDevice(maxDevice);
}

4.2 Using nvidia-smi to query GPU information
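
For example (standard nvidia-smi options):

nvidia-smi -L                    # list all GPUs in the system
nvidia-smi -q -i 0               # detailed query of device 0
nvidia-smi -q -i 0 -d MEMORY     # restrict the query to the memory section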

4.3 Setting devices at runtime

Set the CUDA_VISIBLE_DEVICES environment variable (on the command line):

  • Set the variable for the lifespan of the current shell
export CUDA_VISIBLE_DEVICES=0,1
  • Set the variable only for a single execution
CUDA_VISIBLE_DEVICES=0,1 ./checkDeviceInfor