CUDA - yszheda/wiki GitHub Wiki
- https://stackoverflow.com/questions/24254975/measure-the-overhead-of-context-switching-in-gpu
- https://stackoverflow.com/questions/6605581/what-is-the-context-switching-mechanism-in-gpu
- Difference between cuda.h, cuda_runtime.h, cuda_runtime_api.h
- How to properly link cuda header file with device functions?
- intrinsic math functions for float2, float4
- SIMD intrinsics - are they usable on gpus?
- Can CUDA use SIMD extensions?
- Performance in CUDA
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#formatted-output
- printf inside CUDA global function
typedef union {
float4 vec;
float a[4];
} U4;
U4 u;
for (int i = 0; i < 4; ++i) u.a[i] = ...;
- Float16 and Quantized Int8 Type
- Can anyone provide sample code demonstrating the use of 16 bit floating point in cuda?
- How FP32 and FP16 units are implemented in GP100 GPU's
- fp16 support in cuda thrust
- error when trying to use half (fp16)
- (github nccl) Undefined identifiers in all_reduce.cu
- CUDA compilation error: __hmul and __hneg are undefined
- https://stackoverflow.com/questions/37133128/use-of-half2-in-cuda
- https://stackoverflow.com/questions/43120062/cuda-cublas-and-half-precision-data-types
- Get rid of busy waiting during asynchronous cuda stream executions
- Does CPU waits for DEVICE to let it finish its kernel execution…?
- http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-callbacks
- cuda streams: callback not getting called after stream execution
- CUDA Example: Stream Callbacks
- how can I use cudaStreamAddCallback() with a class member method?
- https://github.com/sarvex/multicore/blob/fb6ce5e6814c1b63044e4a40573de8ad687e6a4b/Chapter6_GPU/memcpyTestCallback.cu
- Why does cudaStreamAddCallback serialize kernel execution and break concurrency?
In almost all cases vectorized loads are preferable to scalar loads. Note however that using vectorized loads increases register pressure and reduces overall parallelism. So if you have a kernel that is already register limited or has very low parallelism, you may want to stick to scalar loads. Also, as discussed earlier, if your pointer is not aligned or your data type size in bytes is not a power of two you cannot use vectorized loads.
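The alignment condition in the paragraph above can be checked on the host before a vectorized kernel path is chosen. The following is a minimal sketch, not part of the quoted article: `is_aligned` and `can_vectorize_as_float4` are hypothetical helper names, and 16 bytes is assumed as the access width (the size of `float4`/`int4`). Pointers returned directly by `cudaMalloc` are already sufficiently aligned; the check matters mostly for pointers offset into a larger buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `ptr` sits on an `alignment`-byte boundary.
// `alignment` must be a power of two (16 for float4/int4-sized accesses).
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical helper: can `count` floats starting at `ptr` be processed
// purely as 16-byte (4-float) vectors, with no scalar remainder?
inline bool can_vectorize_as_float4(const float* ptr, std::size_t count) {
    return is_aligned(ptr, 16) && count % 4 == 0;
}
```

When the check fails, falling back to the scalar kernel (or handling the unaligned head and tail separately) avoids a misaligned address error.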
- How can I load the 128 bit data the fastest and with compatibility both GPU (CUDA C++) and with CPU (C++)?
- Efficiency of CUDA vector types (float2, float3, float4)
- Are there advantages to using the CUDA vector types?
- Why are CUDA vector types (int4, float4) faster?
- Vector operations in cuda?
- How to properly cast a global memory array using the uint4 vector in CUDA to increase memory throughput?
CUDA 9 NVCC compiler now performs warp aggregation for atomics automatically in many cases, so you can get higher performance with no extra effort.
One way to improve filtering performance is to use shared memory atomics.
Another approach is to first use a parallel prefix sum to compute the output index of each element.
- Threads in the warp elect a leader thread.
- Threads in the warp compute the total atomic increment for the warp.
- The leader thread performs an atomic add to compute the offset for the warp.
- The leader thread broadcasts the offset to all other threads in the warp.
- Each thread adds its own index within the warp to the warp offset to get its position in the output array.
- atomicAdd and shared memory issue Running the histogram code from "Cuda by example" book.
- CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics
- How to use atomicCAS for multiple variables with conditionals in CUDA
- How much faster are atomicAdd() operations to shared on SM >= 5X?
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-vote-functions
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-shuffle-functions
- In a SIMD architecture, each instruction applies the same operation in parallel across many data elements. SIMD is typically implemented using processors with vector registers and execution units; a scalar thread issues vector instructions that execute in SIMD fashion.
- In a SIMT architecture, rather than a single thread issuing vector instructions applied to data vectors, multiple threads issue common instructions to arbitrary data.
#define FULL_MASK 0xffffffff
for (int offset = 16; offset > 0; offset /= 2)
val += __shfl_down_sync(FULL_MASK, val, offset);
For a thread at lane X in the warp, `__shfl_down_sync(FULL_MASK, val, offset)` gets the value of the `val` variable from the thread at lane X+offset of the same warp.
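To see why the loop leaves the warp-wide sum in lane 0, the shuffle can be modelled on the host. This is only a sketch of the data movement, not device code: `shfl_down_model` and `warp_reduce_model` are hypothetical names, a full mask is assumed, and lanes whose source lane would fall outside the warp are assumed to keep their own value.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;

// Host-side model of one __shfl_down_sync step with a full mask:
// lane X reads the value held by lane X+offset; lanes whose source would
// fall outside the warp get their own value back.
inline std::array<int, kWarpSize> shfl_down_model(
        const std::array<int, kWarpSize>& val, int offset) {
    std::array<int, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int src = lane + offset;
        out[lane] = (src < kWarpSize) ? val[src] : val[lane];
    }
    return out;
}

// Applies the same offset sequence as the reduction loop (16, 8, 4, 2, 1)
// and returns what lane 0 holds at the end.
inline int warp_reduce_model(std::array<int, kWarpSize> val) {
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const std::array<int, kWarpSize> shifted = shfl_down_model(val, offset);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            val[lane] += shifted[lane];
        }
    }
    return val[0];
}
```

Only lane 0 is guaranteed to hold the full sum; the other lanes end up with partial sums.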
__activemask()
void __syncwarp(unsigned mask=0xffffffff);
The `__syncwarp()` primitive causes the executing thread to wait until all threads specified in mask have executed a `__syncwarp()` (with the same mask) before resuming execution. It also provides a memory fence to allow threads to communicate via memory before and after calling the primitive.
Make sure that `__syncwarp()` separates shared memory reads and writes to avoid race conditions.
C.2.4. Coalesced Groups
In CUDA’s SIMT architecture, at the hardware level the multiprocessor executes threads in groups of 32 called warps. If there exists a data-dependent conditional branch in the application code such that threads within a warp diverge, then the warp serially executes each branch disabling threads not on that path. The threads that remain active on the path are referred to as coalesced.
C.2.5.1. Discovery Pattern
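// Warp-aggregated increment of the counter *p.
// The function name and signature are assumed; the original snippet shows only the body.
// __lanemask_lt() is not a CUDA built-in and may need to be supplied by the user
// (for example by reading the %lanemask_lt PTX register).
__device__ int atomicAggInc(int *p)
{
    unsigned int writemask = __activemask();
    unsigned int total = __popc(writemask);
    unsigned int prefix = __popc(writemask & __lanemask_lt());
    // Find the lowest-numbered active lane
    int elected_lane = __ffs(writemask) - 1;
    int base_offset = 0;
    // Only the leader (the active lane with no active lanes below it) touches the counter
    if (prefix == 0) {
        base_offset = atomicAdd(p, total);
    }
    // Broadcast the leader's base offset to every active lane
    base_offset = __shfl_sync(writemask, base_offset, elected_lane);
    int thread_offset = prefix + base_offset;
    return thread_offset;
}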
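The mask arithmetic in the discovery pattern can be checked on the host with a small worked example. This is a sketch under stated assumptions: `popc`, `lanemask_lt`, `prefix_rank`, and `elected_lane` are hypothetical host-side stand-ins for `__popc`, the lane mask, the `prefix` computation, and `__ffs(writemask) - 1`.

```cpp
#include <cassert>
#include <cstdint>

// Portable population count (host stand-in for the device intrinsic __popc).
inline int popc(std::uint32_t x) {
    int n = 0;
    for (; x != 0; x &= x - 1) ++n;  // clears the lowest set bit each round
    return n;
}

// Bits set for every lane with an index lower than `lane` (0 <= lane < 32).
inline std::uint32_t lanemask_lt(int lane) {
    return (std::uint32_t{1} << lane) - 1u;
}

// Rank of `lane` among the active lanes in `mask`.
inline int prefix_rank(std::uint32_t mask, int lane) {
    return popc(mask & lanemask_lt(lane));
}

// Lowest-numbered active lane (what __ffs(mask) - 1 yields); -1 if none.
inline int elected_lane(std::uint32_t mask) {
    for (int lane = 0; lane < 32; ++lane) {
        if (mask & (std::uint32_t{1} << lane)) return lane;
    }
    return -1;
}
```

With lanes 2, 4, 5 and 7 active (mask `0xB4`), the leader is lane 2, the warp reserves 4 slots with a single atomic, and the lanes receive ranks 0, 1, 2 and 3 on top of the returned base offset.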
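// Same aggregation written with cooperative groups
// (assumes namespace cg = cooperative_groups; p is the counter to increment).
{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    // Rank 0 of the coalesced group reserves g.size() slots with one atomic
    if (g.thread_rank() == 0) {
        prev = atomicAdd(p, g.size());
    }
    // Every thread reads rank 0's result and adds its own rank
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}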
coalesced_group active = coalesced_threads();
Keep in mind that since threads from different warps are never coalesced, the largest group that `coalesced_threads()` can return is a full warp.
- Cuda atomics change flag
- Try to use lock and unlock in CUDA
- CUDA, mutex and atomicCAS()
- Lock reading/writing for rows in two dimension array in global memory
- Implementing a critical section in CUDA
- Atomic float operations. especially add
- atomic read or write
- CUDA: Forgetting kernel launch configuration does not result in NVCC compiler warning or error
- CUDA kernel launch parameters explained right?
Block size exceeds the limit
- https://stackoverflow.com/questions/37323053/misaligned-address-in-cuda
- https://stackoverflow.com/questions/12778949/cuda-memory-alignment
- https://en.cppreference.com/w/cpp/language/alignas
- CUDA Runtime API error 74: misaligned address
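// correct: the third launch parameter is the dynamic shared memory size in bytes, the fourth is the stream
kernel<<< blocks, threads, bytes, streamID >>>();
// wrong: streamID sits in the shared memory size position, not the stream position
kernel<<< blocks, threads, streamID >>>();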
memcpy: illegal address
- cudaMemcpy returns invalid value
- https://stackoverflow.com/questions/3079880/cuda-cudamemcpy-returns-cudaerrorinvalidvalue-for-device-array
- CUDA C++: Using a template function which calls a template kernel
- template cuda kernel function cannot be called in another template function on vs2013
- Unresolved externals in CUDA expression template library under Visual Studio 2010
// kernel.cu
template <class T>
__global__ void kernel_axpy(T* x, T* y, int len) { ... }
void axpy(float* x, float* y, int len){ kernel_axpy<<<...>>>(x,y,len); }
void axpy(double* x, double* y, int len){ kernel_axpy<<<...>>>(x,y,len); }
// axpy.h
#include <iostream>
extern void axpy(float* x, float* y, int len);
extern void axpy(double* x, double* y, int len);
template <class T> void cpp_axpy(T* x, T* y, int len) { std::cerr<<"Not implemented.\n"<<std::endl; }
// Full specializations defined in a header must be inline to avoid multiple-definition link errors.
template <> inline void cpp_axpy<float>(float* x, float* y, int len) { axpy(x,y,len); }
template <> inline void cpp_axpy<double>(double* x, double* y, int len) { axpy(x,y,len); }
// main.cpp
#include "axpy.h"
...
{
axpy(xx,yy,length);
cpp_axpy<double>(xxx,yyy,lll);
}
...
- how solve [extern "C" template]?
- How to make CUDA object file with C linkage?
- Is extern “C” no longer needed anymore in cuda? [closed]
- Compiling C and CUDA code Problems linking CUDA code and C code
- CUDA Study Notes 2
- CUDA, extern "C": NVCC compiler issues, whole program compilation and separate compilation
- Noob Q: How to extern c function?
- [!]host float constant usage in a kernel in CUDA
- How can I use static const members in CUDA?
- CUDA - Using constant variables and cudaMemcpyFromSymbol
- constant memory which is device-side only (avoiding cudaMemcpyToSymbol)
Program hit cudaErrorCudartUnloading (error 29) due to "driver shutting down" on CUDA API call to cudaFree.
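- cudaErrorCudartUnloading (error 29) due to “driver shutting down”
- This error appeared after the program had run successfully; that is, all functionality worked correctly and the error was reported only at the end. After repeated attempts, the cause was traced to model loading: the variables involved were defined as global variables, which led to the problem above. The detailed reason is not yet understood.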
From community wiki
Your code is unknowingly relying on undefined behaviour (the order of destruction of translation unit objects) and there is no real workaround other than to explicitly control the lifespan of objects containing CUDA runtime API calls in their destructor, or simply avoid using those API calls in destructors altogether.
In detail:
The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API calls which rely on a CUDA context can be executed. If your object containing CUDA runtime API calls in its destructor invokes the API after the context is torn down, your code may fail with a runtime error. C++ doesn't define the order of destruction when objects fall out of scope. Your singleton or object needs to be destroyed before the CUDA context is torn down, but there is no guarantee that will occur. This is effectively undefined behaviour.
From Robert Crovella
The placement of CUDA calls in a global object outside of main scope will lead to problematic behavior. See here. Although that description mostly focuses on kernel calls in such a class/object, the hazard applies to any CUDA call, as you have discovered.
To be clear, I should have said "The placement of CUDA calls in constructors and destructors of a global object outside of main scope will lead to problematic behavior." Use of CUDA in other class methods may be possible (assuming, e.g., these methods don't get called by constructors/destructors, etc.)
From talonmies
There is an internally generated routine (__cudaRegisterFatBinary) which must be run to load and register kernels, textures and statically defined device symbols contained in the fatbin payload of any runtime API program with the CUDA driver API before the kernel can be called without error.
For instance, can I have my class maintain certain variables/handles that will force cuda run time library to stay loaded.
No. It is a bad design practice to put calls to the CUDA runtime API in constructors that may run before main and destructors that may run after main.
- Global static destructors called when a dynamic library is unloaded?
- CUDA: Why is it not possible to define static global member functions?
- Why is there no support for static class members?
- /lib64/libcuda.so
- /lib64/libnvidia-fatbinaryloader.so
From talonmies
The obvious answer is don't put CUDA API calls in the destructor. In your class you have an explicit initialisation method not called through the constructor, so why not have an explicit de-initialisation method as well? That way scope becomes a non-issue.
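The explicit initialisation/de-initialisation idea can be sketched as follows. This is a host-only illustration, not code from the quoted answer: `DeviceBuffer`, `init`, `shutdown`, and `ready` are hypothetical names, and `std::malloc`/`std::free` stand in for CUDA runtime calls such as `cudaMalloc`/`cudaFree` so that the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical resource wrapper whose constructor and destructor make no
// runtime API calls, so a global instance is safe regardless of the order
// in which static objects are created and destroyed.
class DeviceBuffer {
public:
    DeviceBuffer() = default;
    ~DeviceBuffer() {}  // intentionally empty: cleanup happens in shutdown()

    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    // Call after main() has started.
    bool init(std::size_t bytes) {
        assert(ptr_ == nullptr);
        ptr_ = std::malloc(bytes);  // stand-in for cudaMalloc
        return ptr_ != nullptr;
    }

    // Call before main() returns; safe to call more than once.
    void shutdown() {
        std::free(ptr_);            // stand-in for cudaFree
        ptr_ = nullptr;
    }

    bool ready() const { return ptr_ != nullptr; }

private:
    void* ptr_ = nullptr;
};
```

A global `DeviceBuffer` can then be declared at file scope, initialised at the top of `main`, and shut down just before `main` returns, so every runtime call happens while the CUDA context is still alive.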
- https://github.com/apache/incubator-mxnet/issues/4219
- MXNotifyShutdown (https://github.com/apache/incubator-mxnet/blob/master/src/c_api/c_api.cc)
int gpu_num;
cudaError_t err = cudaGetDeviceCount(&gpu_num);
std::atexit([](){
// Call CUDA APIs to clean up
});
int device_id = 0, result = 0;
cudaDeviceGetAttribute (&result, cudaDevAttrConcurrentManagedAccess, device_id);
if (result) {
// Call cudaMemAdvise
}
This error has been reported several times, usually being resolved as the GPU's fault, not Caffe's.
From rizwansarwar
Some more information: depending on your driver version, you get a different crash error. With driver version 381.22 I got an illegal memory error, but with 375.66 I get an unspecified launch failure.
From derubm
On Nvidia cards, the illegal memory access error happens when a card runs with maximally overclocked memory in power state P2. When your miner switches to the P0 state for whatever reason, the memory gets an additional 200 MHz and can (or will) become unstable, which causes this error.
From GPU Performance State Interface
P-States are GPU active/executing performance capability states. They range from P0 to P15, with P0 being the highest performance state, and P15 being the lowest performance state. Each P-State, if available, maps to a performance level. Not all P-States are available on a given system. The P-States are currently defined as follows:
- P0/P1 - Maximum 3D performance
- P2/P3 - Balanced 3D performance-power
- P8 - Basic HD video playback
- P10 - DVD playback
- P12 - Minimum idle power consumption
GTX1060 +150/+500/65%TDP @ 23-24MHs
Try updating the drivers. Download and install the latest ones.
Try updating Ethminer. Download (or better, build) the latest version.
Try use -U for CUDA devices. CUDA Hardware Test Launch Command: ethminer -RH -U -S eu1.ethermine.org:4444 -FS us1.ethermine.org:4444 -O 0x7013275311fc37ccc1e40193D75086293eCb43A4.issue128
Try changing the P2 state and power management mode. You can use NVidiaProfileInspectorDmW. For the best mining hashrate, choose from section "5 - Common":
CUDA - Force P2 State (Set to "Off")
Power management mode (Set to "Prefer maximum performance")
Try tweaking Win10. You can use Windows10MiningTweaksDmW (#695).
Try optimizing/overclocking the GPUs. You can use MSI Afterburner for GPU overclocking/optimization.
Try using a watchdog. You can use ETHminerWatchDogDmW (#735).
- NVRM Xid error 59 with Kepler card (CUDA) on 4th PCIe 3.0 port
- How to Squeeze Some Extra Performance Mining Ethereum on Nvidia
- GTX 970 with KDE/KWIN :NVRM: Xid (PCI:0000:01:00): 31, Ch 00000028, engmask 0000...