Chapter 02 CUDA programming model
CUDA programming structure
- allocate GPU memory
- copy data from the CPU to the GPU
- invoke the CUDA kernel
- copy the results from the GPU back to the CPU
- free the GPU memory (see the sketch below)
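A minimal end-to-end sketch of these five steps (the kernel, array size, and launch configuration are illustrative assumptions; error checking is omitted for brevity):

```c
#include <cuda_runtime.h>
#include <stdlib.h>

// Illustrative kernel: doubles every element of the array.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    size_t nBytes = n * sizeof(float);
    float *h_data = (float *)malloc(nBytes);
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void **)&d_data, nBytes);                        // 1. allocate GPU memory
    cudaMemcpy(d_data, h_data, nBytes, cudaMemcpyHostToDevice);  // 2. copy data from CPU to GPU
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);            // 3. invoke the CUDA kernel
    cudaMemcpy(h_data, d_data, nBytes, cudaMemcpyDeviceToHost);  // 4. copy results from GPU back to CPU
    cudaFree(d_data);                                            // 5. free GPU memory
    free(h_data);
    return 0;
}
```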
Thread abstraction: grid of blocks & block of threads
- Grid: contains all threads spawned by a single kernel launch.
- Block: contains a group of threads that can cooperate with each other. (Threads in different blocks cannot communicate.)
Memory structure
- Shared memory: each block has its own shared memory, which can be accessed by the threads of that block.
- Global memory: all threads in a grid share the same global memory.
Accessing block/thread information within a kernel
- Dimensions of the grid and block: two built-in variables, gridDim and blockDim (data type dim3)
- IDs of the block and thread: two built-in variables, blockIdx and threadIdx (data type uint3)
- How do we compute the global thread index within a kernel? (See the sketch after this list.)
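For a 1D grid of 1D blocks, a minimal sketch of the usual pattern (the kernel name is illustrative; 2D configurations are worked out later in this chapter):

```c
// Each block contributes blockDim.x threads, so the global index is the
// thread's position within its block plus the offset of the block itself.
__global__ void fillWithGlobalIndex(int *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = idx;   // guard: the last block may be only partially used
}
```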
Function types in CUDA
- The __device__ and __host__ qualifiers can be used together, in which case the function is compiled for both the host and the device.
- __global__ functions can also be called from the device on devices of compute capability 3.5 or higher (via dynamic parallelism).
- A kernel is a __global__ function. (See the qualifier sketch after this list.)
- A CUDA kernel must
  - access device memory only
  - have a void return type
  - not support a variable number of arguments
  - not support function pointers
  - exhibit asynchronous behavior (control returns to the host before the kernel completes)
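A short sketch showing the three qualifiers side by side (function names are illustrative):

```c
// __device__ function: runs on the device, callable only from device code.
__device__ float square(float x) { return x * x; }

// __host__ __device__ function: compiled for both the host and the device,
// so the same code can be called from either side.
__host__ __device__ float twice(float x) { return 2.0f * x; }

// __global__ function (kernel): runs on the device, launched from the host
// with the <<<grid, block>>> syntax, and must return void.
__global__ void squareAndDouble(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = twice(square(data[i]));
}
```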
Example: adding two vectors together
C function:
void sumArraysOnHost(float *A, float *B, float *C, const int N) {
for (int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}
CUDA kernel (assuming the launch configuration grid(1), block(N), i.e. one block of N threads):
__global__ void sumArraysOnDevice(float *A, float *B, float *C) {
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
Launching this kernel:
sumArraysOnDevice<<<1, N>>>(A, B, C);
The general form is kernel_name<<<grid, block>>>(args...), where grid specifies the number of blocks and block specifies the number of threads per block.
- A kernel call is asynchronous with respect to the host thread, but we can force the host to wait for the kernel to complete by calling cudaDeviceSynchronize(). (See the error-checking sketch after this list.)
- Some CUDA runtime APIs are implicitly synchronizing. For example, cudaMemcpy(): the data copy starts only after all previous kernel calls have completed, and the host must wait for the copy to finish.
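Because the launch is asynchronous, errors are easy to miss; a minimal error-checking sketch, continuing the vector-addition example (fragment; A, B, C are assumed to be device pointers and <stdio.h> to be included):

```c
sumArraysOnDevice<<<1, N>>>(A, B, C);
// cudaGetLastError() reports configuration/launch errors;
// cudaDeviceSynchronize() blocks until the kernel finishes and
// returns any error that occurred during execution.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();
if (err != cudaSuccess) printf("kernel error: %s\n", cudaGetErrorString(err));
```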
Profiling with nvprof:
nvprof [nvprof_args] <application> [application_args]
For example, running nvprof ./sumVectorsOnDevice gives the following output:
==5846== Profiling application: ./sumVectorsOnDevice
==5846== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 63.85% 174.39ms 2 87.194ms 87.103ms 87.285ms [CUDA memcpy HtoD]
32.90% 89.840ms 1 89.840ms 89.840ms 89.840ms [CUDA memcpy DtoH]
3.25% 8.8738ms 1 8.8738ms 8.8738ms 8.8738ms sumArraysOnDevice(float*, float*, float*, int)
API calls: 51.08% 264.39ms 3 88.129ms 87.172ms 89.922ms cudaMemcpy
27.18% 140.66ms 3 46.886ms 681.16us 139.27ms cudaMalloc
16.90% 87.469ms 1 87.469ms 87.469ms 87.469ms cudaDeviceReset
2.68% 13.886ms 3 4.6288ms 677.44us 6.9015ms cudaFree
1.73% 8.9369ms 1 8.9369ms 8.9369ms 8.9369ms cudaDeviceSynchronize
0.17% 872.99us 94 9.2870us 508ns 376.16us cuDeviceGetAttribute
0.16% 814.02us 1 814.02us 814.02us 814.02us cudaGetDeviceProperties
0.08% 402.53us 1 402.53us 402.53us 402.53us cuDeviceTotalMem
0.02% 115.16us 1 115.16us 115.16us 115.16us cuDeviceGetName
0.00% 20.226us 1 20.226us 20.226us 20.226us cudaLaunch
0.00% 15.084us 1 15.084us 15.084us 15.084us cudaSetDevice
0.00% 6.6440us 3 2.2140us 495ns 3.3310us cuDeviceGetCount
0.00% 4.3270us 2 2.1630us 992ns 3.3350us cuDeviceGet
0.00% 827ns 4 206ns 108ns 378ns cudaSetupArgument
0.00% 688ns 1 688ns 688ns 688ns cudaConfigureCall
For HPC workloads, it is important to understand the compute to communication ratio in a program.
- If your application spends more time computing than transferring data, then it may be possible to overlap computation with communication and completely hide the latency associated with transferring data (more details in Chapter 6).
- If your application spends less time computing than transferring data, it is important to minimize the transfer between the host and device.
While performing application optimization, it is important to determine how your application compares to theoretical limits. Counters collected from nvprof can help you derive instruction and memory throughput for your application. If you compare application measured values to theoretical peak values, you can determine if your application is limited by arithmetic or by memory bandwidth. Theoretical ratios can be derived as follows using Tesla K10 as an example:
- Tesla K10 Peak Single Precision FLOPS:
745 MHz core clock * 2 GPUs/board * (8 multiprocessors * 192 fp32 cores/ multiprocessor) * 2 ops/cycle = 4.58 TFLOPS
- Tesla K10 Peak Memory Bandwidth:
2 GPUs/board * 256 bit * 2500 MHz mem-clock * 2 DDR / 8 bits/ byte = 320 GB/s
- Ratio of instructions to bytes:
4.58 TFLOPS / 320 GB/s yields roughly 14.3 instructions per byte
This means: for a Tesla K10, if your application issues more than about 14.3 instructions for every byte accessed, then it is bound by arithmetic performance; otherwise it is bound by memory bandwidth.
(Note: Most HPC workloads are bound by memory bandwidth.)
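The memory-bandwidth part of this calculation can also be done programmatically from the device properties; a small sketch (it assumes the usual DDR factor of 2, the same convention used in the figures above, and reports per-device bandwidth):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // memoryClockRate is reported in kHz and memoryBusWidth in bits;
    // multiply by 2 for DDR and divide by 8 to convert bits to bytes.
    double peakGBs = 2.0 * (prop.memoryClockRate * 1e3) * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("%s theoretical peak memory bandwidth: %.1f GB/s\n", prop.name, peakGBs);
    return 0;
}
```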
Matrix example: thread and block index --> coordinate in the matrix --> offset in linear global memory (an index-printing sketch follows the three configurations below)
- Thread and block index --> Coordinate in the matrix
int ix = threadIdx.x + blockIdx.x * blockDim.x;
int iy = threadIdx.y + blockIdx.y * blockDim.y;
- Coordinate in the matrix --> Offset in linear global memory
int idx = iy * nx + ix;
- 2D Grid and 2D Blocks
dim3 block(16, 16);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int nx, const int ny) {
int ix = threadIdx.x + blockIdx.x * blockDim.x;
int iy = threadIdx.y + blockIdx.y * blockDim.y;
if (ix < nx && iy < ny) {
int idx = iy * nx + ix;
C[idx] = A[idx] + B[idx];
}
}
- 1D Grid and 1D Blocks
dim3 block(32);
dim3 grid((nx + block.x - 1) / block.x, 1);
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int nx, const int ny) {
int ix = threadIdx.x + blockIdx.x * blockDim.x;
if (ix < nx) {
for (int iy = 0; iy < ny; iy++) {
int idx = iy * nx + ix;
C[idx] = A[idx] + B[idx];
}
}
}
- 2D Grid and 1D Blocks
dim3 block(32);
dim3 grid((nx + block.x - 1) / block.x, ny);
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int nx, const int ny) {
int ix = threadIdx.x + blockIdx.x * blockDim.x;
int iy = blockIdx.y;
if (ix < nx && iy < ny) {
int idx = iy * nx + ix;
C[idx] = A[idx] + B[idx];
}
}
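To see the mapping from thread/block indices to matrix coordinates and linear offsets in action, a small index-printing sketch (illustrative kernel name; device-side printf requires compute capability 2.0 or higher):

```c
#include <stdio.h>

__global__ void printThreadIndex(int nx, int ny) {
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    if (ix < nx && iy < ny) {
        int idx = iy * nx + ix;
        printf("thread (%d, %d) in block (%d, %d) -> matrix coord (%d, %d) -> linear offset %d\n",
               threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y, ix, iy, idx);
    }
}
```

Launched with, for example, printThreadIndex<<<dim3(2, 2), dim3(4, 2)>>>(8, 4) followed by cudaDeviceSynchronize(), this prints one line per thread that falls inside the 8 x 4 matrix.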
- Changing the execution configuration affects performance.
- For a given kernel, trying different grid and block dimensions may yield better performance; a timing sketch follows this list.
(In Chapter 3, you will learn more about what causes these differences from a hardware perspective.)
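One way to compare configurations is to time the kernel with CUDA events; a sketch (fragment; it assumes device arrays d_A, d_B, d_C of size nx * ny are already allocated and initialized, and that <stdio.h> is included):

```c
dim3 block(32, 32);   // try e.g. (32, 32), (32, 16), (16, 16), ...
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
sumArraysOnDevice<<<grid, block>>>(d_A, d_B, d_C, nx, ny);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until the kernel and the stop event have completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("grid (%d, %d), block (%d, %d): %f ms\n", grid.x, grid.y, block.x, block.y, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```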
- Query GPU info (a property-printing sketch follows the device-selection code below)
- Determining the best GPU
Some systems support multiple GPUs. In the case where each GPU is different, it may be important to select the best GPU to run your kernel. One way to identify the most computationally capable GPU is by the number of multiprocessors it contains.
int numDevices = 0;
cudaGetDeviceCount(&numDevices);
if (numDevices > 1) {
int maxMultiprocessors = 0, maxDevice = 0;
for (int device=0; device<numDevices; device++) {
cudaDeviceProp props;
cudaGetDeviceProperties(&props, device);
if (maxMultiprocessors < props.multiProcessorCount) {
maxMultiprocessors = props.multiProcessorCount;
maxDevice = device;
}
}
cudaSetDevice(maxDevice);
}
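As a companion to the "Query GPU info" item above, a minimal sketch that prints a few properties of device 0:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0: %s\n", prop.name);
    printf("  Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("  Multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("  Global memory:         %.2f GB\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```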
Set environment variables (on the command line):
- Set the variable for the lifespan of the current shell:
export CUDA_VISIBLE_DEVICES=0,1
- Set the variable for a single execution only:
CUDA_VISIBLE_DEVICES=0,1 ./checkDeviceInfor