CUDA - yszheda/wiki GitHub Wiki
When a branch diverges, the hardware also keeps an internal mask indicating whether each thread is currently active, but this mask cannot be modified by the user directly. In PTX, the mask situation within the current warp can be obtained via warp vote instructions or by loading the special registers `%lanemask_*`.
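A minimal sketch of reading this information from device code (assuming a recent toolkit; `__activemask()` is the warp-vote-style intrinsic, and the inline PTX reads the `%lanemask_lt` special register):

```cuda
__device__ unsigned int num_active_lanes_below() {
    // Mask of lanes that are currently active in this warp.
    unsigned int active = __activemask();
    // %lanemask_lt: bits set for all lanes with a lower lane id than this one.
    unsigned int lt;
    asm("mov.u32 %0, %%lanemask_lt;" : "=r"(lt));
    // Count of active lanes below this thread, e.g. a warp-local prefix index.
    return __popc(active & lt);
}
```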
There is no direct division instruction. Floating-point division is expensive: the approximate form of `x/y` is computed as `x * rcp(y)`, while the accurate form generally takes `rcp(y)` as an initial value and then runs several refinement iterations. Floating-point division is therefore a relatively slow operation.
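A hedged illustration of the trade-off (`__fdividef` and `__frcp_rn` are documented CUDA device intrinsics; the exact instruction sequences the compiler emits also depend on `-prec-div` / `--use_fast_math`):

```cuda
__device__ float divide_three_ways(float x, float y) {
    float exact  = x / y;             // IEEE-rounded: reciprocal plus refinement steps, slower
    float approx = __fdividef(x, y);  // fast approximate division
    float recip  = x * __frcp_rn(y);  // explicit multiply by a reciprocal
    return exact + approx + recip;    // summed only so the compiler keeps all three
}
```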
Logic instructions: most bitwise logic operations are now implemented with the three-input logic instruction `LOP3`, which supports arbitrary bitwise logic over three inputs.
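A small sketch (the compiler usually emits `LOP3` on its own for expressions such as `(a & b) | c`; it can also be requested explicitly through the PTX `lop3.b32` instruction, where the immediate 0x96 below is the truth table of `a ^ b ^ c`):

```cuda
__device__ unsigned int xor3(unsigned int a, unsigned int b, unsigned int c) {
    unsigned int d;
    // lop3.b32 applies an arbitrary 3-input boolean function selected by immLut.
    asm("lop3.b32 %0, %1, %2, %3, 0x96;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
    return d;
}
```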
Implementation of integer multiplication: general 32-bit multiply or multiply-add uses `XMAD` on Maxwell and Pascal, while Kepler, Volta, Turing, and Ampere all use `IMAD`. However, many address calculations follow the pattern `d = a*Stride + c`; when `Stride` is a power of two, this can be implemented with a shift and an add, which is exactly the working mode of the `LEA` instruction. On Turing, `IMAD` and `LEA` belong to different dispatch ports, so the two can be issued independently. This is therefore a small optimization that may increase ILP.
Turing's `IMAD` is a rather remarkable instruction. In a large number of cases it is used as a `MOV`: for example, `IMAD.MOV.U32 R1, RZ, RZ, R0;` is equivalent to `MOV R1, R0;`. So what is the benefit? It is probably related to Turing separating the Float32 dispatch port from the ordinary ALU port: `IMAD` also uses the float32 pipe, so its issue can be interleaved with `MOV`; more on this when we get to instruction issue logic. `IMAD` also has a shift form, such as `IMAD.SHL.U32 R0, R0, 0x10, RZ;`, as well as `IMAD.WIDE`, which can take a 64-bit value as the third operand, and so on.
The warp shuffle instruction `SHFL`: if data needs to be exchanged within a warp, this is the first instruction to think of. It supports several exchange modes and has no dependency on other warps, which makes it very useful in some scenarios. One typical application is intra-warp reduction, such as scan (prefix sum). Interested readers can look at the shfl_scan example in the CUDA samples.
- GPR
- Predicate Register
- Constant memory
- Immediate
- Uniform Register and Uniform Predicate
- Address operands
If the Yield bit is set, it means that on the next cycle instructions from other warps will be issued preferentially.
- https://stackoverflow.com/questions/24254975/measure-the-overhead-of-context-switching-in-gpu
- https://stackoverflow.com/questions/6605581/what-is-the-context-switching-mechanism-in-gpu
- Difference between cuda.h, cuda_runtime.h, cuda_runtime_api.h
- How to properly link cuda header file with device functions?
- intrinsic math functions for float2, float4
- SIMD intrinsics - are they usable on gpus?
- Can CUDA use SIMD extensions?
- Performance in CUDA
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#formatted-output
- printf inside CUDA global function
```cuda
typedef union {
    float4 vec;
    float a[4];
} U4;

U4 u;
for (int i = 0; i < 4; ++i) u.a[i] = ...;
```
- Float16 and Quantized Int8 Type
- Can anyone provide sample code demonstrating the use of 16 bit floating point in cuda?
- How FP32 and FP16 units are implemented in GP100 GPU's
- fp16 support in cuda thrust
- error when trying to use half (fp16)
- (github nccl) Undefined identifiers in all_reduce.cu
- CUDA compilation error: __hmul and __hneg are undefined
- https://stackoverflow.com/questions/37133128/use-of-half2-in-cuda
- https://stackoverflow.com/questions/43120062/cuda-cublas-and-half-precision-data-types
- Get rid of busy waiting during asynchronous cuda stream executions
- Does CPU waits for DEVICE to let it finish its kernel execution…?
- http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-callbacks
- cuda streams: callback not getting called after stream execution
- CUDA Example: Stream Callbacks
- how can I use cudaStreamAddCallback() with a class member method?
- https://github.com/sarvex/multicore/blob/fb6ce5e6814c1b63044e4a40573de8ad687e6a4b/Chapter6_GPU/memcpyTestCallback.cu
- Why does cudaStreamAddCallback serialize kernel execution and break concurrency?
In almost all cases vectorized loads are preferable to scalar loads. Note however that using vectorized loads increases register pressure and reduces overall parallelism. So if you have a kernel that is already register limited or has very low parallelism, you may want to stick to scalar loads. Also, as discussed earlier, if your pointer is not aligned or your data type size in bytes is not a power of two you cannot use vectorized loads.
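A hedged sketch of the idea (the `reinterpret_cast` to `float4` assumes the pointers are 16-byte aligned and `n` is a multiple of 4; the kernel names are illustrative):

```cuda
// Scalar copy: one 4-byte load and store per thread per element.
__global__ void copy_scalar(const float* d_in, float* d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = d_in[i];
}

// Vectorized copy: one 16-byte load and store (LDG.128 / STG.128) per thread.
__global__ void copy_vec4(const float* d_in, float* d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 4) {
        reinterpret_cast<float4*>(d_out)[i] =
            reinterpret_cast<const float4*>(d_in)[i];
    }
}
```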
- How can I load the 128 bit data the fastest and with compatibility both GPU (CUDA C++) and with CPU (C++)?
- Efficiency of CUDA vector types (float2, float3, float4)
- Are there advantages to using the CUDA vector types?
- Why are CUDA vector types (int4, float4) faster?
- Vector operations in cuda?
- How to properly cast a global memory array using the uint4 vector in CUDA to increase memory throughput?
CUDA 9 NVCC compiler now performs warp aggregation for atomics automatically in many cases, so you can get higher performance with no extra effort.
One way to improve filtering performance is to use shared memory atomics.
Another approach is to first use a parallel prefix sum to compute the output index of each element.
- Threads in the warp elect a leader thread.
- Threads in the warp compute the total atomic increment for the warp.
- The leader thread performs an atomic add to compute the offset for the warp.
- The leader thread broadcasts the offset to all other threads in the warp.
- Each thread adds its own index within the warp to the warp offset to get its position in the output array.
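As a usage sketch of these steps (assuming an `atomicAggInc` helper like the discovery-pattern code further down this page; the kernel and variable names are illustrative):

```cuda
// Stream compaction: keep only positive elements. Only one atomicAdd is issued
// per warp (inside atomicAggInc) instead of one per thread.
__global__ void filter_positive(const int* src, int* dst, int* counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && src[i] > 0) {
        dst[atomicAggInc(counter)] = src[i];
    }
}
```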
- atomicAdd and shared memory issue Running the histogram code from "Cuda by example" book
- CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics
- How to use atomicCAS for multiple variables with conditionals in CUDA
- How much faster are atomicAdd() operations to shared on SM >= 5X?
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-vote-functions
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-shuffle-functions
- In a SIMD architecture, each instruction applies the same operation in parallel across many data elements. SIMD is typically implemented using processors with vector registers and execution units; a scalar thread issues vector instructions that execute in SIMD fashion.
- In a SIMT architecture, rather than a single thread issuing vector instructions applied to data vectors, multiple threads issue common instructions to arbitrary data.
```cuda
#define FULL_MASK 0xffffffff
for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(FULL_MASK, val, offset);
```
For a thread at lane X in the warp, `__shfl_down_sync(FULL_MASK, val, offset)` gets the value of the `val` variable from the thread at lane X+offset of the same warp.
__activemask()
```cuda
void __syncwarp(unsigned mask=0xffffffff);
```
The `__syncwarp()` primitive causes the executing thread to wait until all threads specified in mask have executed a `__syncwarp()` (with the same mask) before resuming execution. It also provides a memory fence to allow threads to communicate via memory before and after calling the primitive.
Make sure that `__syncwarp()` separates shared memory reads and writes to avoid race conditions.
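A minimal sketch of that separation, assuming the classic block-reduction ending where `sdata` holds at least 64 partial sums and only the first warp still has work to do; each `__syncwarp()` separates the shared memory reads of one step from the writes of the next:

```cuda
__device__ void final_warp_reduce(int* sdata, unsigned int tid) {
    if (tid < 32) {
        int v = sdata[tid];
        v += sdata[tid + 32]; __syncwarp();
        sdata[tid] = v;       __syncwarp();
        v += sdata[tid + 16]; __syncwarp();
        sdata[tid] = v;       __syncwarp();
        v += sdata[tid + 8];  __syncwarp();
        sdata[tid] = v;       __syncwarp();
        v += sdata[tid + 4];  __syncwarp();
        sdata[tid] = v;       __syncwarp();
        v += sdata[tid + 2];  __syncwarp();
        sdata[tid] = v;       __syncwarp();
        v += sdata[tid + 1];  __syncwarp();
        if (tid == 0) sdata[0] = v;  // block total ends up in sdata[0]
    }
}
```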
C.2.4. Coalesced Groups
In CUDA’s SIMT architecture, at the hardware level the multiprocessor executes threads in groups of 32 called warps. If there exists a data-dependent conditional branch in the application code such that threads within a warp diverge, then the warp serially executes each branch disabling threads not on that path. The threads that remain active on the path are referred to as coalesced.
C.2.5.1. Discovery Pattern
```cuda
// Aggregated atomic increment without cooperative groups: each active thread
// gets a unique consecutive offset while only one atomicAdd is issued per warp.
__device__ int atomicAggInc(int *p)
{
    unsigned int writemask = __activemask();
    unsigned int total = __popc(writemask);
    unsigned int prefix = __popc(writemask & __lanemask_lt());
    // Find the lowest-numbered active lane
    int elected_lane = __ffs(writemask) - 1;
    int base_offset = 0;
    if (prefix == 0) {
        base_offset = atomicAdd(p, total);
    }
    base_offset = __shfl_sync(writemask, base_offset, elected_lane);
    int thread_offset = prefix + base_offset;
    return thread_offset;
}
```
```cuda
// The same aggregated increment expressed with cooperative groups
// (requires #include <cooperative_groups.h> and namespace cg = cooperative_groups).
__device__ int atomicAggInc(int *p)
{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    // Elect the first active thread to perform the atomic add,
    // then broadcast its result and add each thread's rank to it.
    if (g.thread_rank() == 0) {
        prev = atomicAdd(p, g.size());
    }
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}
```
```cuda
coalesced_group active = coalesced_threads();
```
Keep in mind that since threads from different warps are never coalesced, the largest group that `coalesced_threads()` can return is a full warp.
- Cuda atomics change flag
- Try to use lock and unlock in CUDA
- CUDA, mutex and atomicCAS()
- Lock reading/writing for rows in two dimension array in global memory
- Implementing a critical section in CUDA
- Atomic float operations. especially add
- atomic read or write
- CUDA: Forgetting kernel launch configuration does not result in NVCC compiler warning or error
- CUDA kernel launch parameters explained right?
- https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma
- Nvidia Tensor Core-Getting Started with WMMA API Programming
- CUDA Tensor Layouts for Convolution
- NHWC vs NCHW : A memory access perspective on GPUs
- How much faster is NCHW compared to NHWC in TensorFlow/cuDNN?
- tensorflow layout optimizer && conv autotune
- https://www.megengine.org.cn/doc/stable/zh/user-guide/model-development/tensor/layout.html
- Global graph optimization: another tool for boosting MegEngine model inference performance
- Black tech: low-cost, high-performance custom convolution operators with cutlass
- How to Optimize GEMM By Using Tensor Cores
- How to Program Tensor Cores with Polyhedral Model for GEMM
- How to optimize Convolution using Tensor Cores
The block size exceeds the allowed limit.
- https://stackoverflow.com/questions/37323053/misaligned-address-in-cuda
- https://stackoverflow.com/questions/12778949/cuda-memory-alignment
- https://en.cppreference.com/w/cpp/language/alignas
- CUDA Runtime API error 74: misaligned address
```cuda
// correct
kernel<<< blocks, threads, bytes, streamID >>>();
// wrong
kernel<<< blocks, threads, streamID >>>();
```
memcpy: illegal address
- cudaMemcpy returns invalid value
- https://stackoverflow.com/questions/3079880/cuda-cudamemcpy-returns-cudaerrorinvalidvalue-for-device-array
- CUDA C++: Using a template function which calls a template kernel
- template cuda kernel function cannot be called in another template function on vs2013
- Unresolved externals in CUDA expression template library under Visual Studio 2010
```cuda
// kernel.cu
template <class T>
__global__ void kernel_axpy(T* x, T* y, int len) { ... }

void axpy(float* x, float* y, int len) { kernel_axpy<<<...>>>(x, y, len); }
void axpy(double* x, double* y, int len) { kernel_axpy<<<...>>>(x, y, len); }
```

```cpp
// axpy.h
extern void axpy(float* x, float* y, int len);
extern void axpy(double* x, double* y, int len);

template <class T> void cpp_axpy(T* x, T* y, int len) { std::cerr << "Not implemented.\n" << std::endl; }
template <> void cpp_axpy<float>(float* x, float* y, int len) { axpy(x, y, len); }
template <> void cpp_axpy<double>(double* x, double* y, int len) { axpy(x, y, len); }
```
```cpp
// main.cpp
#include "axpy.h"
...
{
    axpy(xx, yy, length);
    cpp_axpy<double>(xxx, yyy, lll);
}
...
```
- how solve [extern "C" template]?
- How to make CUDA object file with C linkage?
- Is extern “C” no longer needed anymore in cuda? [closed]
- Compiling C and CUDA code Problems linking CUDA code and C code
- CUDA study notes 2
- CUDA, extern "C" -- NVCC compiler issues: whole program compilation vs. separate compilation
- Noob Q: How to extern c function?
- [!]host float constant usage in a kernel in CUDA
- How can I use static const members in CUDA?
- CUDA - Using constant variables and cudaMemcpyFromSymbol
- constant memory which is device-side only (avoiding cudaMemcpyToSymbol)
Program hit cudaErrorCudartUnloading (error 29) due to "driver shutting down" on CUDA API call to cudaFree.
- cudaErrorCudartUnloading (error 29) due to “driver shutting down”
- This error appeared only after the program had already run successfully: all functionality worked correctly, and the error was reported at the very end. After repeated attempts, the cause was traced to the model-loading code; the variables involved were defined as global variables, which led to the problem above. The detailed reason is still unclear.
From community wiki
Your code is unknowingly relying on undefined behaviour (the order of destruction of translation unit objects) and there is no real workaround other than to explicitly control the lifespan of objects containing CUDA runtime API calls in their destructors, or simply avoid using those API calls in destructors altogether.
In detail:
The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API calls which rely on a CUDA context can be executed. If your object containing CUDA runtime API calls in its destructor invokes the API after the context is torn down, your code may fail with a runtime error. C++ doesn't define the order of destruction when objects fall out of scope. Your singleton or object needs to be destroyed before the CUDA context is torn down, but there is no guarantee that will occur. This is effectively undefined behaviour.
From Robert Crovella
The placement of CUDA calls in a global object outside of main scope will lead to problematic behavior. See here. Although that description mostly focuses on kernel calls in such a class/object, the hazard applies to any CUDA call, as you have discovered.
To be clear, I should have said "The placement of CUDA calls in constructors and destructors of a global object outside of main scope will lead to problematic behavior. " Use of CUDA in other class methods may be possible (assuming e.g these methods don't get called by constructors/destructors, etc.)
From talonmies
There is an internally generated routine (__cudaRegisterFatBinary) which must be run to load and register kernels, textures and statically defined device symbols contained in the fatbin payload of any runtime API program with the CUDA driver API before the kernel can be called without error.
For instance, can I have my class maintain certain variables/handles that will force the CUDA runtime library to stay loaded?
No. It is a bad design practice to put calls to the CUDA runtime API in constructors that may run before main and destructors that may run after main.
- Global static destructors called when a dynamic library is unloaded?
- CUDA: Why is it not possible to define static global member functions?
- Why is there no support for static class members?
- /lib64/libcuda.so
- /lib64/libnvidia-fatbinaryloader.so
From talonmies
The obvious answer is don't put CUDA API calls in the destructor. In your class you have an explicit initialisation method not called through the constructor, so why not have an explicit de-initialisation method as well? That way scope becomes a non-issue.
- https://github.com/apache/incubator-mxnet/issues/4219
- MXNotifyShutdown (https://github.com/apache/incubator-mxnet/blob/master/src/c_api/c_api.cc)
```cuda
int gpu_num;
cudaError_t err = cudaGetDeviceCount(&gpu_num);

std::atexit([](){
    // Call CUDA APIs to clean up
});
```
```cuda
int device_id = 0, result = 0;
cudaDeviceGetAttribute(&result, cudaDevAttrConcurrentManagedAccess, device_id);
if (result) {
    // Call cudaMemAdvise
}
```
this error has been reported several times, usually being resolved as the GPU's fault, not Caffe's.
From rizwansarwar
Some more information: depending on your driver version, you get a different crash error. With driver version 381.22 I got an illegal memory access error, but with 375.66 I get an unspecified launch failure.
From derubm
On Nvidia cards, the illegal memory access error happens when a card runs with maximally overclocked memory in power state P2. When the miner switches to the P0 state for whatever reason, the memory gets an additional 200 MHz and can (or will) become unstable, which causes this error.
From GPU Performance State Interface
P-States are GPU active/executing performance capability states. They range from P0 to P15, with P0 being the highest performance state and P15 being the lowest. Each P-State, if available, maps to a performance level. Not all P-States are available on a given system. The definition of each P-State is currently as follows:
- P0/P1 - Maximum 3D performance
- P2/P3 - Balanced 3D performance-power
- P8 - Basic HD video playback
- P10 - DVD playback
- P12 - Minimum idle power consumption
GTX1060 +150/+500/65%TDP @ 23-24MHs
Try updating the drivers. Download and install the latest.
Try updating ethminer. Download (or better, build) the latest.
Try using -U for CUDA devices. CUDA hardware test launch command: ethminer -RH -U -S eu1.ethermine.org:4444 -FS us1.ethermine.org:4444 -O 0x7013275311fc37ccc1e40193D75086293eCb43A4.issue128
Try changing the P2 state and the power management mode. You can use NVidiaProfileInspectorDmW. For the best mining hashrate choose from sector "5 - Common":
CUDA - Force P2 State (set to "Off"); Power management mode (set to "Prefer maximum performance").
Try tweaking Win10. You can use Windows10MiningTweaksDmW (#695).
Try optimizing/overclocking the GPUs. You can use MSI Afterburner for GPU overclock/optimization.
Try using a watchdog. You can use ETHminerWatchDogDmW (#735).
- NVRM Xid error 59 with Kepler card (CUDA) on 4th PCIe 3.0 port
- How to Squeeze Some Extra Performance Mining Ethereum on Nvidia
- GTX 970 with KDE/KWIN :NVRM: Xid (PCI:0000:01:00): 31, Ch 00000028, engmask 0000...