CUDA

Assembly

Compute Cache

FMAD

context switch

Header

intrinsic

printf

Math API

type conversion

// Access the same 16 bytes either as a float4 (for vectorized loads/stores) or element-wise.
typedef union {
    float4 vec;
    float a[4];
} U4;

U4 u;

for (int i = 0; i < 4; ++i) u.a[i] = ...;

sqrt

Half Precision FP16

error: identifier not found

error: class "__half2" has no member "y"

__umul24

CUDA Stream

dependency

cudaStreamAddCallback
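
A minimal sketch of registering a host callback that fires once all work enqueued before it in the stream has completed (callback name and user data are made up; CUDA API calls are not allowed inside the callback):

#include <cstdio>
#include <cuda_runtime.h>

void CUDART_CB onStreamDone(cudaStream_t stream, cudaError_t status, void *userData)
{
    // Runs on a host thread after everything enqueued earlier in `stream` has finished.
    printf("stream finished, status = %d, tag = %s\n", (int)status, (const char *)userData);
}

// ... after launching kernels / async copies into `stream`:
// cudaStreamAddCallback(stream, onStreamDone, (void *)"batch-0", 0 /* flags must be 0 */);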

vectorization

In almost all cases vectorized loads are preferable to scalar loads. Note, however, that using vectorized loads increases register pressure and reduces overall parallelism. So if you have a kernel that is already register limited or has very low parallelism, you may want to stick to scalar loads. Also, if your pointer is not aligned or your data type size in bytes is not a power of two, you cannot use vectorized loads.
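
A minimal sketch of a vectorized copy kernel, assuming d_in/d_out are 16-byte aligned and N is a multiple of 4 (otherwise a scalar remainder loop is needed):

__global__ void copy_vec4(float *d_out, const float *d_in, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N / 4) {
        // One 128-bit load/store per thread instead of four 32-bit transactions.
        reinterpret_cast<float4 *>(d_out)[i] =
            reinterpret_cast<const float4 *>(d_in)[i];
    }
}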

Macro

Atomic


The CUDA 9 NVCC compiler now performs warp aggregation for atomics automatically in many cases, so you can get higher performance with no extra effort.

One way to improve filtering performance is to use shared memory atomics.
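
A minimal sketch of that idea, assuming a filter that keeps positive values (dst, nres, src and the predicate are placeholders): each block counts its surviving elements with a cheap shared-memory atomic, then reserves a contiguous slice of the output with a single global atomicAdd.

__global__ void filter_smem(int *dst, int *nres, const int *src, int n)
{
    __shared__ int l_n;     // number of elements this block keeps
    __shared__ int l_base;  // this block's starting offset in dst

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0) l_n = 0;
    __syncthreads();

    int v = 0, pos = -1;
    if (i < n) {
        v = src[i];
        if (v > 0) pos = atomicAdd(&l_n, 1);               // shared-memory atomic
    }
    __syncthreads();

    if (threadIdx.x == 0) l_base = atomicAdd(nres, l_n);   // one global atomic per block
    __syncthreads();

    if (pos >= 0) dst[l_base + pos] = v;
}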

Another approach is to first use a parallel prefix sum to compute the output index of each element.

Warp-Aggregated Atomics

  1. Threads in the warp elect a leader thread.
  2. Threads in the warp compute the total atomic increment for the warp.
  3. The leader thread performs an atomic add to compute the offset for the warp.
  4. The leader thread broadcasts the offset to all other threads in the warp.
  5. Each thread adds its own index within the warp to the warp offset to get its position in the output array.



Warp-level Primitives


  • Using CUDA Warp-Level Primitives

  • In a SIMD architecture, each instruction applies the same operation in parallel across many data elements. SIMD is typically implemented using processors with vector registers and execution units; a scalar thread issues vector instructions that execute in SIMD fashion.

  • In a SIMT architecture, rather than a single thread issuing vector instructions applied to data vectors, multiple threads issue common instructions to arbitrary data.

#define FULL_MASK 0xffffffff
// Warp-level tree reduction: after the loop, lane 0 holds the sum of val across all 32 lanes.
for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(FULL_MASK, val, offset);

For a thread at lane X in the warp, __shfl_down_sync(FULL_MASK, val, offset) gets the value of the val variable from the thread at lane X+offset of the same warp.

Synchronized Data Exchange

Active Mask Query

__activemask()

Warp Synchronization

void __syncwarp(unsigned mask=0xffffffff);

The __syncwarp() primitive causes the executing thread to wait until all threads specified in mask have executed a __syncwarp() (with the same mask) before resuming execution. It also provides a memory fence to allow threads to communicate via memory before and after calling the primitive.

Make sure that __syncwarp() separates shared memory reads and writes to avoid race conditions.
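
For example, in a warp-level tree reduction through shared memory, the read of another lane's slot and the following write must be fenced on both sides (a sketch assuming a fully active warp, shmem a 32-entry shared array already holding one partial sum per lane, and tid the lane index):

int v = shmem[tid];
__syncwarp();
for (int offset = 16; offset > 0; offset /= 2) {
    int other = (tid + offset < 32) ? shmem[tid + offset] : 0;  // read phase
    __syncwarp();            // all reads finish before anyone writes
    v += other;
    shmem[tid] = v;          // write phase
    __syncwarp();            // all writes finish before the next round of reads
}
// shmem[0] now holds the warp's total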

Opportunistic Warp-level Programming

Implicit Warp-Synchronous Programming is Unsafe

Update Legacy Warp-Level Programming


Cooperative Groups


C.2.4. Coalesced Groups

In CUDA’s SIMT architecture, at the hardware level the multiprocessor executes threads in groups of 32 called warps. If there is a data-dependent conditional branch in the application code such that threads within a warp diverge, then the warp serially executes each branch path, disabling the threads that are not on that path. The threads that remain active on the path are referred to as coalesced.

C.2.5.1. Discovery Pattern

// Aggregated atomic increment using warp intrinsics
// (body of something like __device__ int atomicAggInc(int *p))
{
    unsigned int writemask = __activemask();
    unsigned int total = __popc(writemask);
    unsigned int prefix = __popc(writemask & __lanemask_lt());
    // Find the lowest-numbered active lane
    int elected_lane = __ffs(writemask) - 1;
    int base_offset = 0;
    if (prefix == 0) {
        base_offset = atomicAdd(p, total);
    }
    base_offset = __shfl_sync(writemask, base_offset, elected_lane);
    int thread_offset = prefix + base_offset;
    return thread_offset;
}
// Equivalent aggregated increment using cooperative groups
{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    if (g.thread_rank() == 0) {
        prev = atomicAdd(p, g.size());
    }
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}


Discovering Thread Concurrency

coalesced_group active = coalesced_threads();

Keep in mind that since threads from different warps are never coalesced, the largest group that coalesced_threads() can return is a full warp.


Mutex / Lock

cudaSetDeviceFlags

CUDA Launch

__cudaRegisterFatBinary

Troubleshooting

code=4(cudaErrorLaunchFailure) "cudaFreeHost"

“invalid argument” error when starting kernel

block size exceeds the allowed maximum

Driver/library version mismatch

misaligned address

error : argument of type "cudaStream_t" is incompatible with parameter of type "size_t"

// correct: the 3rd launch-config argument is the dynamic shared memory size, the 4th is the stream
kernel<<< blocks, threads, bytes, streamID >>>();
// wrong: streamID lands in the shared-memory-size slot (size_t), hence the type error
kernel<<< blocks, threads, streamID >>>();

cudaErrorInvalidValue

memcpy: illegal address

clamp

Template

// kernel.cu
template <class T>
__global__ void kernel_axpy(T* x, T* y, int len) { ... }

void axpy(float* x, float* y, int len){ kernel_axpy<<<...>>>(x,y,len); }
void axpy(double* x, double* y, int len){ kernel_axpy<<<...>>>(x,y,len); }

// axpy.h

extern void axpy(float* x, float* y, int len);
extern void axpy(double* x, double* y, int len);

template <class T> void cpp_axpy(T* x, T* y, int len) { std::cerr << "Not implemented." << std::endl; }
template <> void cpp_axpy<float>(float* x, float* y, int len) { axpy(x,y,len); }
template <> void cpp_axpy<double>(double* x, double* y, int len) { axpy(x,y,len); }

// main.cpp

#include "axpy.h"

...
{
    axpy(xx,yy,length);
    cpp_axpy<double>(xxx,yyy,lll);
}
...

extern "C"

constant: identifier undefined in device code

Program hit cudaErrorCudartUnloading (error 29) due to "driver shutting down" on CUDA API call to cudaFree.

related bug reports


CUDA APIs within destructor

From community wiki

Your code is unknowingly relying on undefined behaviour (the order of destruction of translation unit objects), and there is no real workaround other than to explicitly control the lifespan of objects containing CUDA runtime API calls in their destructors, or to simply avoid using those API calls in destructors altogether.

In detail:

The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API calls which rely on a CUDA context can be executed. If your object containing CUDA runtime API calls in its destructor invokes the API after the context is torn down, your code may fail with a runtime error. C++ doesn't define the order of destruction when objects fall out of scope. Your singleton or object needs to be destroyed before the CUDA context is torn down, but there is no guarantee that will occur. This is effectively undefined behaviour.
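
A hypothetical illustration of the hazard (names made up): a namespace-scope object whose destructor calls the runtime may be destroyed only after nvcc's teardown code has already destroyed the CUDA context.

#include <cuda_runtime.h>

struct DeviceBuffer {
    float *ptr = nullptr;
    DeviceBuffer()  { cudaMalloc(&ptr, 1024 * sizeof(float)); }  // constructors of globals have the mirror-image hazard
    ~DeviceBuffer() { cudaFree(ptr); }  // may run after CUDA teardown -> cudaErrorCudartUnloading
};

DeviceBuffer g_buffer;  // destruction order relative to CUDA's own teardown is undefined

int main()
{
    // using g_buffer inside main is fine; the problem is what happens after main returns
    return 0;
}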

From Robert Crovella

The placement of CUDA calls in a global object outside of main scope will lead to problematic behavior. See here. Although that description mostly focuses on kernel calls in such a class/object, the hazard applies to any CUDA call, as you have discovered.

To be clear, I should have said "The placement of CUDA calls in constructors and destructors of a global object outside of main scope will lead to problematic behavior." Use of CUDA in other class methods may be possible (assuming, e.g., these methods don't get called by constructors/destructors, etc.).

From talonmies

There is an internally generated routine (__cudaRegisterFatBinary) which must be run to load and register kernels, textures and statically defined device symbols contained in the fatbin payload of any runtime API program with the CUDA driver API before the kernel can be called without error.

For instance, can I have my class maintain certain variables/handles that will force the CUDA runtime library to stay loaded?

No. It is a bad design practice to put calls to the CUDA runtime API in constructors that may run before main and destructors that may run after main.


loaded shared libraries

  • /lib64/libcuda.so
  • /lib64/libnvidia-fatbinaryloader.so

Solutions

Do not call CUDA APIs in destructors of static/global objects

Skip checking cudaErrorCudartUnloading
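
A minimal sketch of that workaround (helper name made up): treat cudaErrorCudartUnloading as benign in destructor-time cleanup and report everything else.

#include <cstdio>
#include <cuda_runtime.h>

inline void freeIgnoringUnload(void *p)
{
    cudaError_t err = cudaFree(p);
    if (err != cudaSuccess && err != cudaErrorCudartUnloading) {
        fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(err));
    }
}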

Have a separate de-initialisation / finalize method which calls the CUDA API

From talonmies

The obvious answer is: don't put CUDA API calls in the destructor. In your class you have an explicit initialisation method not called through the constructor, so why not have an explicit de-initialisation method as well? That way scope becomes a non-issue.

Register an exit function after the CUDA driver is loaded

#include <cstdlib>   // std::atexit

int gpu_num;
// Any runtime API call (here cudaGetDeviceCount) initializes the CUDA runtime first,
// so the exit handler registered below should run before the runtime is torn down.
cudaError_t err = cudaGetDeviceCount(&gpu_num);

std::atexit([](){
    // Call CUDA APIs to clean up
});

CUDA error 10 "invalid device ordinal" (when using cudaMemAdvise)

int device_id = 0, result = 0;
// cudaMemAdvise is only supported on devices with concurrent managed access;
// querying the attribute first avoids the "invalid device ordinal" error.
cudaDeviceGetAttribute(&result, cudaDevAttrConcurrentManagedAccess, device_id);
if (result) {
    // Call cudaMemAdvise
}

an illegal memory access was encountered

This error has been reported several times, usually being resolved as the GPU's fault, not Caffe's.


From rizwansarwar

Some more information: depending on your driver version, you get a different crash error. With driver version 381.22 I got an illegal memory access error, but with 375.66 I get an unspecified launch failure.

From derubm

In the case of NVIDIA cards, the illegal memory access error happens when a card runs with maximally overclocked memory in power state P2. When your miner switches to the P0 state for whatever reason, the memory gets an additional 200 MHz and can (or will) become unstable, which causes this error.

From GPU Performance State Interface

        P-States are GPU active/executing performance capability states.
        They range from P0 to P15, with P0 being the highest performance state
        and P15 being the lowest. Each P-State, if available, maps to a
        performance level. Not all P-States are available on a given system.
        The definition of each P-State is currently as follows:
        - P0/P1 - Maximum 3D performance
        - P2/P3 - Balanced 3D performance-power
        - P8 - Basic HD video playback
        - P10 - DVD playback
        - P12 - Minimum idle power consumption

GTX1060 +150/+500/65% TDP @ 23-24 MH/s

  1. Try updating the drivers. Download and install the latest.

  2. Try updating Ethminer. Download (or better, build) the latest.

  3. Try using -U for CUDA devices. CUDA hardware test launch command: ethminer -RH -U -S eu1.ethermine.org:4444 -FS us1.ethermine.org:4444 -O 0x7013275311fc37ccc1e40193D75086293eCb43A4.issue128

  4. Try changing the P2 state and the power management mode. You can use NVidiaProfileInspectorDmW. For the best mining hashrate, choose from sector "5 - Common":

     CUDA - Force P2 State (set to "Off")
     Power management mode (set to "Prefer maximum performance")

  5. Try tweaking Win10. You can use Windows10MiningTweaksDmW (#695).

  6. Try optimizing/overclocking the GPUs. You can use MSI Afterburner for GPU overclock/optimization.

  7. Try using a watchdog. You can use ETHminerWatchDogDmW (#735).

NVRM Xid Graphics SM Warp Exception on : MMU Fault

CUDNN_STATUS_MAPPING_ERROR

CUDNN_STATUS_EXECUTION_FAILED

warning: Cuda API error detected: cudaMemcpy returned (0xb)
