Chapter 02 CUDA programming model
CUDA programming structure
- allocate GPU memory
- copy data from the CPU to the GPU
- invoke the CUDA kernel
- copy the results from the GPU back to the CPU
- free the GPU memory (see the sketch below)
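A minimal end-to-end sketch of these five steps (the kernel, array size, and launch configuration are illustrative assumptions; error checking is omitted for brevity):

```c
#include <cuda_runtime.h>
#include <stdlib.h>

// Illustrative kernel: doubles every element of the array.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    size_t nBytes = n * sizeof(float);
    float *h_data = (float *)malloc(nBytes);
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void **)&d_data, nBytes);                        // 1. allocate GPU memory
    cudaMemcpy(d_data, h_data, nBytes, cudaMemcpyHostToDevice);  // 2. copy data from CPU to GPU
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);            // 3. invoke the CUDA kernel
    cudaMemcpy(h_data, d_data, nBytes, cudaMemcpyDeviceToHost);  // 4. copy results from GPU back to CPU
    cudaFree(d_data);                                            // 5. free GPU memory
    free(h_data);
    return 0;
}
```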
Thread abstraction: grid of blocks & block of threads
- Grid: contains all threads spawned by a single kernel launch.
- Block: contains a group of threads that can cooperate with each other. (Threads in different blocks cannot communicate.)
Memory structure
- Shared memory: each block has its own shared memory, which can be accessed by the threads of that block.
- Global memory: all threads in a grid share the same global memory.
Accessing block/thread information within a kernel
- Dimensions of the grid and block: two built-in variables, gridDim and blockDim (data type dim3)
- IDs of the block and thread: two built-in variables, blockIdx and threadIdx (data type uint3)
- How do we compute the global thread index within a kernel? (See the sketch after this list.)
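For a 1D grid of 1D blocks, a minimal sketch of the usual pattern (the kernel name is illustrative; 2D configurations are worked out later in this chapter):

```c
// Each block contributes blockDim.x threads, so the global index is the
// thread's position within its block plus the offset of the block itself.
__global__ void fillWithGlobalIndex(int *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = idx;   // guard: the last block may be only partially used
}
```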
Function types in CUDA
- The __device__ and __host__ qualifiers can be used together, in which case the function is compiled for both the host and the device.
- __global__ functions can also be called from the device on devices of compute capability 3.5 or higher (via dynamic parallelism).
- A kernel is a __global__ function. (See the qualifier sketch after this list.)
- A CUDA kernel must
  - access device memory only
  - have a void return type
  - not support a variable number of arguments
  - not support function pointers
  - exhibit asynchronous behavior (control returns to the host before the kernel completes)
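A short sketch showing the three qualifiers side by side (function names are illustrative):

```c
// __device__ function: runs on the device, callable only from device code.
__device__ float square(float x) { return x * x; }

// __host__ __device__ function: compiled for both the host and the device,
// so the same code can be called from either side.
__host__ __device__ float twice(float x) { return 2.0f * x; }

// __global__ function (kernel): runs on the device, launched from the host
// with the <<<grid, block>>> syntax, and must return void.
__global__ void squareAndDouble(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = twice(square(data[i]));
}
```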
Example: adding two vectors together
C function:
void sumArraysOnHost(float *A, float *B, float *C, const int N) {
for (int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}
CUDA kernel (assuming the launch configuration grid(1), block(N), i.e. one block of N threads):
__global__ void sumArraysOnDevice(float *A, float *B, float *C) {
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
Launching this kernel:
sumArraysOnDevice<<<1, N>>>(A, B, C);
The general form is kernel_name<<<grid, block>>>(args...), where grid specifies the number of blocks and block specifies the number of threads per block.
- A kernel call is asynchronous with respect to the host thread, but we can force the host to wait for the kernel to complete by calling cudaDeviceSynchronize(). (See the error-checking sketch after this list.)
- Some CUDA runtime APIs are implicitly synchronizing. For example, cudaMemcpy(): the data copy starts only after all previous kernel calls have completed, and the host must wait for the copy to finish.
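Because the launch is asynchronous, errors are easy to miss; a minimal error-checking sketch, continuing the vector-addition example (fragment; A, B, C are assumed to be device pointers and <stdio.h> to be included):

```c
sumArraysOnDevice<<<1, N>>>(A, B, C);
// cudaGetLastError() reports configuration/launch errors;
// cudaDeviceSynchronize() blocks until the kernel finishes and
// returns any error that occurred during execution.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();
if (err != cudaSuccess) printf("kernel error: %s\n", cudaGetErrorString(err));
```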
Profiling with nvprof:
nvprof [nvprof_args] <application> [application_args]
For example, running nvprof ./sumVectorsOnDevice gives the following output:
==5846== Profiling application: ./sumVectorsOnDevice
==5846== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 63.85% 174.39ms 2 87.194ms 87.103ms 87.285ms [CUDA memcpy HtoD]
32.90% 89.840ms 1 89.840ms 89.840ms 89.840ms [CUDA memcpy DtoH]
3.25% 8.8738ms 1 8.8738ms 8.8738ms 8.8738ms sumArraysOnDevice(float*, float*, float*, int)
API calls: 51.08% 264.39ms 3 88.129ms 87.172ms 89.922ms cudaMemcpy
27.18% 140.66ms 3 46.886ms 681.16us 139.27ms cudaMalloc
16.90% 87.469ms 1 87.469ms 87.469ms 87.469ms cudaDeviceReset
2.68% 13.886ms 3 4.6288ms 677.44us 6.9015ms cudaFree
1.73% 8.9369ms 1 8.9369ms 8.9369ms 8.9369ms cudaDeviceSynchronize
0.17% 872.99us 94 9.2870us 508ns 376.16us cuDeviceGetAttribute
0.16% 814.02us 1 814.02us 814.02us 814.02us cudaGetDeviceProperties
0.08% 402.53us 1 402.53us 402.53us 402.53us cuDeviceTotalMem
0.02% 115.16us 1 115.16us 115.16us 115.16us cuDeviceGetName
0.00% 20.226us 1 20.226us 20.226us 20.226us cudaLaunch
0.00% 15.084us 1 15.084us 15.084us 15.084us cudaSetDevice
0.00% 6.6440us 3 2.2140us 495ns 3.3310us cuDeviceGetCount
0.00% 4.3270us 2 2.1630us 992ns 3.3350us cuDeviceGet
0.00% 827ns 4 206ns 108ns 378ns cudaSetupArgument
0.00% 688ns 1 688ns 688ns 688ns cudaConfigureCall
For HPC workloads, it is important to understand the compute to communication ratio in a program.
- If your application spends more time computing than transferring data, then it may be possible to overlap computation with communication and completely hide the latency associated with transferring data (more details in Chapter 6).
- If your application spends less time computing than transferring data, it is important to minimize the transfer between the host and device.
While performing application optimization, it is important to determine how your application compares to theoretical limits. Counters collected from nvprof can help you derive instruction and memory throughput for your application. If you compare application measured values to theoretical peak values, you can determine if your application is limited by arithmetic or by memory bandwidth. Theoretical ratios can be derived as follows using Tesla K10 as an example:
- Tesla K10 Peak Single Precision FLOPS:
745 MHz core clock * 2 GPUs/board * (8 multiprocessors * 192 fp32 cores/ multiprocessor) * 2 ops/cycle = 4.58 TFLOPS
- Tesla K10 Peak Memory Bandwidth:
2 GPUs/board * 256 bit * 2500 MHz mem-clock * 2 DDR / 8 bits/ byte = 320 GB/s
- Ratio of instructions to bytes:
4.58 TFLOPS / 320 GB/s yields roughly 14.3 instructions per byte
This means: for a Tesla K10, if your application issues more than about 14.3 instructions for every byte accessed, then it is bound by arithmetic performance; otherwise it is bound by memory bandwidth.
(Note: Most HPC workloads are bound by memory bandwidth.)
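The memory-bandwidth part of this calculation can also be done programmatically from the device properties; a small sketch (it assumes the usual DDR factor of 2, the same convention used in the figures above, and reports per-device bandwidth):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // memoryClockRate is reported in kHz and memoryBusWidth in bits;
    // multiply by 2 for DDR and divide by 8 to convert bits to bytes.
    double peakGBs = 2.0 * (prop.memoryClockRate * 1e3) * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("%s theoretical peak memory bandwidth: %.1f GB/s\n", prop.name, peakGBs);
    return 0;
}
```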
Matrix example: thread and block index --> coordinate in the matrix --> offset in linear global memory (an index-printing sketch follows the three configurations below)
- Thread and block index --> Coordinate in the matrix
int ix = threadIdx.x + blockIdx.x * blockDim.x;
int iy = threadIdx.y + blockIdx.y * blockDim.y;
- Coordinate in the matrix --> Offset in linear global memory
int idx = iy * nx + ix;
- 2D Grid and 2D Blocks
dim3 block(16, 16);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int nx, const int ny) {
int ix = threadIdx.x + blockIdx.x * blockDim.x;
int iy = threadIdx.y + blockIdx.y * blockDim.y;
if (ix < nx && iy < ny) {
int idx = iy * nx + ix;
C[idx] = A[idx] + B[idx];
}
}
- 1D Grid and 1D Blocks
dim3 block(32);
dim3 grid((nx + block.x - 1) / block.x, 1);
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int nx, const int ny) {
int ix = threadIdx.x + blockIdx.x * blockDim.x;
if (ix < nx) {
for (int iy = 0; iy < ny; iy++) {
int idx = iy * nx + ix;
C[idx] = A[idx] + B[idx];
}
}
}
- 2D Grid and 1D Blocks
dim3 block(32);
dim3 grid((nx + block.x - 1) / block.x, ny);
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int nx, const int ny) {
int ix = threadIdx.x + blockIdx.x * blockDim.x;
int iy = blockIdx.y;
if (ix < nx && iy < ny) {
int idx = iy * nx + ix;
C[idx] = A[idx] + B[idx];
}
}
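To see the mapping from thread/block indices to matrix coordinates and linear offsets in action, a small index-printing sketch (illustrative kernel name; device-side printf requires compute capability 2.0 or higher):

```c
#include <stdio.h>

__global__ void printThreadIndex(int nx, int ny) {
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    if (ix < nx && iy < ny) {
        int idx = iy * nx + ix;
        printf("thread (%d, %d) in block (%d, %d) -> matrix coord (%d, %d) -> linear offset %d\n",
               threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y, ix, iy, idx);
    }
}
```

Launched with, for example, printThreadIndex<<<dim3(2, 2), dim3(4, 2)>>>(8, 4) followed by cudaDeviceSynchronize(), this prints one line per thread that falls inside the 8 x 4 matrix.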
- Changing the execution configuration affects performance.
- For a given kernel, trying different grid and block dimensions may yield better performance; a timing sketch follows this list.
(In Chapter 3, you will learn more about what causes these differences from a hardware perspective.)
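One way to compare configurations is to time the kernel with CUDA events; a sketch (fragment; it assumes device arrays d_A, d_B, d_C of size nx * ny are already allocated and initialized, and that <stdio.h> is included):

```c
dim3 block(32, 32);   // try e.g. (32, 32), (32, 16), (16, 16), ...
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
sumArraysOnDevice<<<grid, block>>>(d_A, d_B, d_C, nx, ny);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until the kernel and the stop event have completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("grid (%d, %d), block (%d, %d): %f ms\n", grid.x, grid.y, block.x, block.y, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```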
- Query GPU info (a property-printing sketch follows the device-selection code below)
- Determining the best GPU
Some systems support multiple GPUs. In the case where each GPU is different, it may be important to select the best GPU to run your kernel. One way to identify the most computationally capable GPU is by the number of multiprocessors it contains.
int numDevices = 0;
cudaGetDeviceCount(&numDevices);
if (numDevices > 1) {
int maxMultiprocessors = 0, maxDevice = 0;
for (int device=0; device<numDevices; device++) {
cudaDeviceProp props;
cudaGetDeviceProperties(&props, device);
if (maxMultiprocessors < props.multiProcessorCount) {
maxMultiprocessors = props.multiProcessorCount;
maxDevice = device;
}
}
cudaSetDevice(maxDevice);
}
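As a companion to the "Query GPU info" item above, a minimal sketch that prints a few properties of device 0:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0: %s\n", prop.name);
    printf("  Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("  Multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("  Global memory:         %.2f GB\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```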
Set environment variables (on the command line):
- Set the variable for the lifespan of the current shell:
export CUDA_VISIBLE_DEVICES=0,1
- Set the variable for a single execution only:
CUDA_VISIBLE_DEVICES=0,1 ./checkDeviceInfor