CUDA

Assembly

PTX

SASS


predicate

When a branch diverges, the hardware also keeps an internal mask that records whether each thread in the warp is currently active, but this mask cannot be modified directly by the user. In PTX, the current warp's mask can be obtained through warp vote instructions or by loading special registers such as %lanemask_*.
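A minimal sketch of both approaches (the helper names are our own; the %lanemask_lt special register and __activemask() are real):

// Read the %lanemask_lt special register: a bitmask of all lanes below ours.
__device__ unsigned lanemask_lt() {
    unsigned mask;
    asm("mov.u32 %0, %%lanemask_lt;" : "=r"(mask));
    return mask;
}

// Warp vote: a bitmask with one bit set per currently-active lane.
__device__ unsigned active_lanes() {
    return __activemask();
}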

Opcode and Opcode Modifier

Float instructions

There is no direct division instruction. Floating-point division is expensive: the approximate algorithm computes x/y as x * rcp(y), while the accurate algorithm generally starts from rcp(y) as an initial value and refines it with several iteration steps. Floating-point division is therefore a relatively slow operation.
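A rough sketch of the two variants (illustrative only: the function names are made up and the compiler's actual emitted sequence differs):

// Approximate x/y: one hardware reciprocal estimate (SASS MUFU.RCP) plus a
// multiply; roughly what __fdividef(x, y) compiles to.
__device__ float div_approx(float x, float y) {
    float r;
    asm("rcp.approx.f32 %0, %1;" : "=f"(r) : "f"(y));
    return x * r;
}

// More accurate: refine the estimate with a Newton-Raphson step r' = r*(2 - y*r)
// before the final multiply; precise division needs several such steps.
__device__ float div_refined(float x, float y) {
    float r;
    asm("rcp.approx.f32 %0, %1;" : "=f"(r) : "f"(y));
    r = r * (2.0f - y * r);
    return x * r;
}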

Integer instructions

Logic instructions: most logic operations are now implemented with the 3-input logic instruction LOP3, which supports arbitrary bitwise logic functions of three inputs.
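In PTX this is lop3.b32, whose 8-bit immediate is the truth table of the desired function, obtained by evaluating it on a = 0xF0, b = 0xCC, c = 0xAA. A sketch for d = (a & b) ^ c, where (0xF0 & 0xCC) ^ 0xAA = 0x6A:

// d = (a & b) ^ c in a single LOP3 via inline PTX.
__device__ unsigned lop3_and_xor(unsigned a, unsigned b, unsigned c) {
    unsigned d;
    asm("lop3.b32 %0, %1, %2, %3, 0x6A;"
        : "=r"(d) : "r"(a), "r"(b), "r"(c));
    return d;
}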

Integer multiply. For generic 32-bit multiply or multiply-add, Maxwell and Pascal use XMAD, while Kepler, Volta, Turing, and Ampere all use IMAD. But many address computations follow the pattern d = a*Stride + c; when Stride is a power of two, this can be done with a shift plus an add, which is exactly the working mode of the LEA instruction. On Turing, IMAD and LEA belong to different dispatch ports, so the two can be issued independently. This makes it a small optimization that may increase ILP.
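For example (a sketch; the exact SASS depends on architecture and compiler version), a power-of-two stride lets the compiler lower the index arithmetic to LEA's shift-plus-add instead of a full IMAD:

// Stride 256 = 1<<8: row * 256 + col is a candidate for LEA (shift + add).
__device__ float load_strided(const float *base, int row, int col) {
    return base[row * 256 + col];
}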

Turing's IMAD is a rather remarkable instruction. In a large number of cases it is used as a MOV: for example, IMAD.MOV.U32 R1, RZ, RZ, R0; is equivalent to MOV R1, R0;. What is the benefit? This is presumably related to Turing separating the Float32 and ordinary-ALU dispatch ports: IMAD uses the float32 pipe, so its issue can be interleaved with MOV's. More on this when we get to instruction issue logic. IMAD also has a shift mode, e.g. IMAD.SHL.U32 R0, R0, 0x10, RZ;, plus IMAD.WIDE, which can take a 64-bit third operand, and so on.

Format conversion instructions

Data movement instructions

The warp shuffle instruction SHFL. Whenever data needs to be exchanged within a warp, this is the first instruction to think of. It supports multiple exchange modes and has no dependence on other warps, which makes it very useful in a number of scenarios. One typical application is warp-level reduction, such as scan (a.k.a. prefix sum). Interested readers can look at the shfl_scan example in the CUDA samples.
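A minimal warp-wide inclusive scan in the spirit of that sample (a sketch assuming a full, converged warp):

__device__ int warp_inclusive_scan(int val) {
    const int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        int n = __shfl_up_sync(0xffffffff, val, offset);
        if (lane >= offset) val += n;  // lanes below 'offset' keep their value
    }
    return val;  // lane i now holds the sum over lanes 0..i
}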

Predicate operation instructions

Memory operation instructions

Jump and branch instructions

Uniform DataPath instructions

Operand and Operand Modifier

  • GPR
  • Predicate Register
  • Constant memory
  • Immediate
  • Uniform Register and Uniform Predicate
  • Address operands

Control codes

Register Reuse Cache

Wait Dependency Barrier

Read Dependency Barrier

Write Dependency Barrier

Yield Hint Flag

If the Yield flag is set, it means that on the next cycle the scheduler will prefer to issue instructions from other warps.

Stall Count



Compute Cache

FMAD

context switch

Header

intrinsic

printf

Math API

type conversion

// Type punning between a float4 vector and a 4-element scalar array:
typedef union {
    float4 vec;
    float a[4];
} U4;

U4 u;

for (int i = 0; i < 4; ++i) u.a[i] = ...;  // fill per component, read back as u.vec

sqrt

Half Precision FP16

error: identifier not found

error: class "__half2" has no member "y"
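Recent CUDA versions no longer expose .x/.y members on __half2; the components are accessed through the <cuda_fp16.h> intrinsics instead. A small sketch (the function name is made up):

#include <cuda_fp16.h>

// Unpack the two halves of a __half2 and repack them swapped.
__device__ __half2 swap_halves(__half2 v) {
    return __halves2half2(__high2half(v), __low2half(v));
}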

__umul24

CUDA Stream

dependency

cudaStreamAddCallback

vectorization

In almost all cases vectorized loads are preferable to scalar loads. Note however that using vectorized loads increases register pressure and reduces overall parallelism. So if you have a kernel that is already register limited or has very low parallelism, you may want to stick to scalar loads. Also, as discussed earlier, if your pointer is not aligned or your data type size in bytes is not a power of two you cannot use vectorized loads.
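A minimal sketch of a vectorized copy (assuming the pointers are 16-byte aligned and n4 counts float4 elements):

// Each thread moves one float4 (16 bytes), so four scalar loads/stores
// become a single LDG.E.128 / STG.E.128 pair.
__global__ void copy_vec4(const float4 *d_in, float4 *d_out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) d_out[i] = d_in[i];
}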

Macro

Atomic


The CUDA 9 NVCC compiler now performs warp aggregation for atomics automatically in many cases, so you can get higher performance with no extra effort.

One way to improve filtering performance is to use shared memory atomics.

Another approach is to first use a parallel prefix sum to compute the output index of each element.

Warp-Aggregated Atomics

  1. Threads in the warp elect a leader thread.
  2. Threads in the warp compute the total atomic increment for the warp.
  3. The leader thread performs an atomic add to compute the offset for the warp.
  4. The leader thread broadcasts the offset to all other threads in the warp.
  5. Each thread adds its own index within the warp to the warp offset to get its position in the output array.
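Put together, a hedged sketch of a filtering kernel using this scheme (atomicAggInc is the discovery-pattern helper shown later on this page; the kernel name is made up):

// Each thread that passes the predicate reserves one output slot;
// the atomic traffic is aggregated into one atomicAdd per warp.
__global__ void filter_k(int *dst, int *nres, const int *src, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && src[i] > 0)
        dst[atomicAggInc(nres)] = src[i];
}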



Warp-level Primitives


  • Using CUDA Warp-Level Primitives

  • In a SIMD architecture, each instruction applies the same operation in parallel across many data elements. SIMD is typically implemented using processors with vector registers and execution units; a scalar thread issues vector instructions that execute in SIMD fashion.

  • In a SIMT architecture, rather than a single thread issuing vector instructions applied to data vectors, multiple threads issue common instructions to arbitrary data.

#define FULL_MASK 0xffffffff
// Tree reduction over a warp: after 5 rounds, lane 0 holds the total.
for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(FULL_MASK, val, offset);

For a thread at lane X in the warp, __shfl_down_sync(FULL_MASK, val, offset) gets the value of the val variable from the thread at lane X+offset of the same warp.

Synchronized Data Exchange

Active Mask Query

__activemask()

Warp Synchronization

void __syncwarp(unsigned mask=0xffffffff);

The __syncwarp() primitive causes the executing thread to wait until all threads specified in mask have executed a __syncwarp() (with the same mask) before resuming execution. It also provides a memory fence to allow threads to communicate via memory before and after calling the primitive.

Make sure that __syncwarp() separates shared memory reads and writes to avoid race conditions.
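A sketch of that pattern for the tail of a shared-memory reduction (assumes tid < 32 and that shmem has at least 48 allocated elements, e.g. a 64-entry array; only shmem[0..31] contribute to the result):

__device__ float warp_reduce_smem(float *shmem, unsigned tid) {
    float v = shmem[tid];
    for (int offset = 16; offset > 0; offset >>= 1) {
        float other = shmem[tid + offset];  // read phase
        __syncwarp();                       // all reads finish before any write
        v += other;
        shmem[tid] = v;                     // write phase
        __syncwarp();                       // all writes land before the next read
    }
    return v;  // tid 0 holds the sum of shmem[0..31]
}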

Opportunistic Warp-level Programming

Implicit Warp-Synchronous Programming is Unsafe

Update Legacy Warp-Level Programming


Cooperative Groups


C.2.4. Coalesced Groups

In CUDA’s SIMT architecture, at the hardware level the multiprocessor executes threads in groups of 32 called warps. If there exists a data-dependent conditional branch in the application code such that threads within a warp diverge, then the warp serially executes each branch disabling threads not on that path. The threads that remain active on the path are referred to as coalesced.

C.2.5.1. Discovery Pattern

// Warp-aggregated atomic increment with legacy warp-level primitives.
// __lanemask_lt() stands for the %lanemask_lt special register (see above).
__device__ int atomicAggInc(int *p)
{
    unsigned int writemask = __activemask();
    unsigned int total = __popc(writemask);
    unsigned int prefix = __popc(writemask & __lanemask_lt());
    // Find the lowest-numbered active lane
    int elected_lane = __ffs(writemask) - 1;
    int base_offset = 0;
    if (prefix == 0) {
        base_offset = atomicAdd(p, total);
    }
    base_offset = __shfl_sync(writemask, base_offset, elected_lane);
    int thread_offset = prefix + base_offset;
    return thread_offset;
}

// The same pattern expressed with Cooperative Groups
// (#include <cooperative_groups.h>, namespace cg = cooperative_groups):
__device__ int atomicAggInc(int *p)
{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    if (g.thread_rank() == 0) {
        prev = atomicAdd(p, g.size());
    }
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}


Discovering Thread Concurrency

coalesced_group active = coalesced_threads();

Keep in mind that since threads from different warps are never coalesced, the largest group that coalesced_threads() can return is a full warp.


Mutex / Lock

cudaSetDeviceFlags

CUDA Launch

__cudaRegisterFatBinary

Tensor Core

Layout

DP4A


MegEngine

MindSpore AKG (Auto Kernel Generator)

Trouble-shooting

code=4(cudaErrorLaunchFailure) "cudaFreeHost"

“invalid argument” error when starting kernel

The block size exceeds the device limit.
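A minimal check sketch (threadsPerBlock stands in for your launch configuration):

// Launching more threads per block than prop.maxThreadsPerBlock
// (or exceeding prop.maxThreadsDim / prop.maxGridSize) fails with "invalid argument".
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (threadsPerBlock > prop.maxThreadsPerBlock) {
    // reduce the block size
}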

Driver/library version mismatch

misaligned address

error : argument of type "cudaStream_t" is incompatible with parameter of type "size_t"

// correct: the 3rd launch parameter is dynamic shared memory bytes, the 4th is the stream
kernel<<< blocks, threads, bytes, streamID >>>();
// wrong: streamID is interpreted as the shared memory size (a size_t)
kernel<<< blocks, threads, streamID >>>();

cudaErrorInvalidValue

memcpy: illegal address

clamp

Template

// kernel.cu
// The kernel template must be instantiated in a .cu file compiled by nvcc,
// so explicit host-callable overloads are provided for each element type.
template <class T>
__global__ void kernel_axpy(T* x, T* y, int len) { ... }

void axpy(float* x, float* y, int len) { kernel_axpy<<<...>>>(x, y, len); }
void axpy(double* x, double* y, int len) { kernel_axpy<<<...>>>(x, y, len); }

// axpy.h
// Host code sees only these declarations; the template specializations
// dispatch to the overloads compiled by nvcc.

extern void axpy(float* x, float* y, int len);
extern void axpy(double* x, double* y, int len);

template <class T> void cpp_axpy(T* x, T* y, int len) { std::cerr << "Not implemented." << std::endl; }
template <> void cpp_axpy<float>(float* x, float* y, int len) { axpy(x, y, len); }
template <> void cpp_axpy<double>(double* x, double* y, int len) { axpy(x, y, len); }

// main.cpp

#include "axpy.h"

...
{
    axpy(xx,yy,length);
    cpp_axpy<double>(xxx,yyy,lll);
}
...

extern "C"

constant: identifier undefined in device code

Program hit cudaErrorCudartUnloading (error 29) due to "driver shutting down" on CUDA API call to cudaFree.

related bug reports


CUDA APIs within destructor

From community wiki

Your code is unknowingly relying on undefined behaviour (the order of destruction of translation unit objects) and there is no real workaround other than to explicitly control the lifespan of objects containing CUDA runtime API calls in their destructors, or simply avoid using those API calls in destructors altogether.

In detail:

The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API calls which rely on a CUDA context can be executed. If your object containing CUDA runtime API calls in its destructor invokes the API after the context is torn down, your code may fail with a runtime error. C++ doesn't define the order of destruction when objects fall out of scope. Your singleton or object needs to be destroyed before the CUDA context is torn down, but there is no guarantee that will occur. This is effectively undefined behaviour.

From Robert Crovella

The placement of CUDA calls in a global object outside of main scope will lead to problematic behavior. See here. Although that description mostly focuses on kernel calls in such a class/object, the hazard applies to any CUDA call, as you have discovered.

To be clear, I should have said "The placement of CUDA calls in constructors and destructors of a global object outside of main scope will lead to problematic behavior." Use of CUDA in other class methods may be possible (assuming e.g. these methods don't get called by constructors/destructors, etc.)

From talonmies

There is an internally generated routine (__cudaRegisterFatBinary) which must be run to load and register kernels, textures and statically defined device symbols contained in the fatbin payload of any runtime API program with the CUDA driver API before the kernel can be called without error.

For instance, can I have my class maintain certain variables/handles that will force the CUDA runtime library to stay loaded?

No. It is a bad design practice to put calls to the CUDA runtime API in constructors that may run before main and destructors that may run after main.


loaded shared lib

  • /lib64/libcuda.so
  • /lib64/libnvidia-fatbinaryloader.so

Solutions

Do not call CUDA APIs in destructors of static/global objects

Skip checking cudaErrorCudartUnloading

Have a separate de-initialisation / finalize method which calls the CUDA API

From talonmies

The obvious answer is: don't put CUDA API calls in the destructor. In your class you have an explicit initialisation method not called through the constructor, so why not have an explicit de-initialisation method as well? That way scope becomes a non-issue.

Register an exit function after the CUDA driver is loaded

#include <cstdlib>

// Any CUDA runtime call (here cudaGetDeviceCount) forces the driver and
// runtime to initialize first, so the handler registered afterwards runs
// before they are torn down (atexit handlers run in reverse order).
int gpu_num;
cudaError_t err = cudaGetDeviceCount(&gpu_num);

std::atexit([](){
    // Call CUDA APIs to clean up
});

CUDA error 10 "invalid device ordinal" (when using cudaMemAdvise)

// cudaMemAdvise is only meaningful on devices that support concurrent
// managed access; query the attribute first to avoid "invalid device ordinal".
int device_id = 0, result = 0;
cudaDeviceGetAttribute(&result, cudaDevAttrConcurrentManagedAccess, device_id);
if (result) {
    // Call cudaMemAdvise
}

an illegal memory access was encountered

This error has been reported several times, usually being resolved as the GPU's fault, not Caffe's.


From rizwansarwar

Some more information: depending on your driver version, you get a different crash error. With driver version 381.22 I got an illegal memory access error, but with 375.66 I get an unspecified launch failure.

From derubm

On Nvidia cards, the illegal memory access error can happen when a card runs with maximally overclocked memory in power state P2. When your miner switches to the P0 state for whatever reason, the memory gets an additional 200 MHz and can (or will) become unstable, which causes this error.

From GPU Performance State Interface

        P-States are GPU active/executing performance capability states.
        They range from P0 to P15, with P0 being the highest performance state
        and P15 being the lowest. Each P-State, if available, maps to a
        performance level. Not all P-States are available on a given system.
        The definition of each P-State is currently as follows:
        - P0/P1 - Maximum 3D performance
        - P2/P3 - Balanced 3D performance-power
        - P8 - Basic HD video playback
        - P10 - DVD playback
        - P12 - Minimum idle power consumption

GTX1060 +150/+500/65%TDP @ 23-24MHs

  1. Try updating drivers. Download and install the latest.

  2. Try updating Ethminer. Download (or better, build) the latest.

  3. Try using -U for CUDA devices. CUDA hardware test launch command: ethminer -RH -U -S eu1.ethermine.org:4444 -FS us1.ethermine.org:4444 -O 0x7013275311fc37ccc1e40193D75086293eCb43A4.issue128

  4. Try changing the P2 state and power management mode. You can use NVidiaProfileInspectorDmW. For the best mining hashrate choose from sector "5 - Common": CUDA - Force P2 State (set to "Off") and Power management mode (set to "Prefer maximum performance").

  5. Try tweaking Win10. You can use Windows10MiningTweaksDmW (#695).

  6. Try optimizing/overclocking GPUs. You can use MSI Afterburner for GPU overclock/optimization.

  7. Try using a watchdog. You can use ETHminerWatchDogDmW (#735).

NVRM Xid Graphics SM Warp Exception on : MMU Fault

CUDNN_STATUS_MAPPING_ERROR

CUDNN_STATUS_EXECUTION_FAILED

warning: Cuda API error detected: cudaMemcpy returned (0xb)
