CUDA Tools - yszheda/wiki GitHub Wiki
https://docs.nvidia.com/cuda/cuda-memcheck/index.html#leak-checking
For an accurate leak-checking summary to be generated, the application's CUDA context must be destroyed at the end. This can be done explicitly by calling `cuCtxDestroy()` in applications using the CUDA driver API, or by calling `cudaDeviceReset()` in applications programmed against the CUDA runtime API. The `--leak-check full` option must be specified to enable leak checking.
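As a minimal sketch of the above (hypothetical file name and allocation; the intentional leak is there only so the tool has something to report), a runtime-API program just needs `cudaDeviceReset()` before exit:

```cuda
// leaktest.cu -- illustrative example: cuda-memcheck can only produce an
// accurate leak summary if the context is torn down before the process exits.
#include <cuda_runtime.h>

int main() {
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, 1024 * sizeof(float));  // intentionally never freed
    // ... kernel work would go here ...
    cudaDeviceReset();  // destroys the context so leaks are reported
    return 0;
}
// Run with:  cuda-memcheck --leak-check full ./leaktest
```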
Internal Memcheck Error: Memcheck failed initialization
as some other tools is currently attached. Please make
sure that nvprof and Nsight Visual Studio Edition are
not being run simultaneously
unset CUDA_INJECTION32_PATH
unset CUDA_INJECTION64_PATH
- JCuda Debugging
- https://stackoverflow.com/questions/19323560/internal-memcheck-error-memcheck-failed-initialization-as-profiler-is-attached
- Error: Unspecified launch failure
- cudaMemcpyAsync() reports this error when copying data from the device to the host
- http://docs.nvidia.com/gameworks/content/developertools/desktop/timeout_detection_recovery.htm
- https://stackoverflow.com/questions/13177214/disabling-tdr-for-cuda-in-windows-8
nvprof --events all --metrics all <your application>
--default-stream {legacy|null|per-thread} (-default-stream)
--gpu-architecture arch (-arch)
Specify the name of the class of NVIDIA virtual GPU architecture for which the CUDA input files must be compiled.
--gpu-code code,... (-code)
Specify the name of the NVIDIA GPU to assemble and optimize PTX for. During runtime, such embedded PTX code is dynamically compiled by the CUDA runtime system if no binary load image is found for the current GPU.
The virtual architecture should always be chosen as low as possible, thereby maximizing the actual GPUs to run on. The real architecture should be chosen as high as possible (assuming that this always generates better code), but this is only possible with knowledge of the actual GPUs on which the application is expected to run.
The disadvantage of just in time compilation is increased application startup delay, but this can be alleviated by letting the CUDA driver use a compilation cache which is persistent over multiple runs of the applications.
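A hypothetical fat-binary build following this advice combines a low virtual architecture with the real architectures you expect to target, plus embedded PTX so the driver can JIT-compile for newer GPUs (the kernel and the chosen targets below are illustrative, not from the original page):

```cuda
// saxpy.cu -- illustrative kernel for the build-flag example below.
//
// Example nvcc invocation (assumed targets; adjust to your actual GPUs):
//   nvcc saxpy.cu -o saxpy \
//        -gencode arch=compute_50,code=sm_50 \
//        -gencode arch=compute_70,code=sm_70 \
//        -gencode arch=compute_70,code=compute_70   # embed PTX for JIT on newer GPUs
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```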
| Real architecture | Features |
|---|---|
| sm_30 and sm_32 | Basic features + Kepler support + Unified memory programming |
| sm_35 | + Dynamic parallelism support |
| sm_50, sm_52, and sm_53 | + Maxwell support |
| sm_60, sm_61, and sm_62 | + Pascal support |
| sm_70 and sm_72 | + Volta support |
| sm_75 | + Turing support |
| Virtual architecture | Features |
|---|---|
| compute_30 and compute_32 | Basic features + Kepler support + Unified memory programming |
| compute_35 | + Dynamic parallelism support |
| compute_50, compute_52, and compute_53 | + Maxwell support |
| compute_60, compute_61, and compute_62 | + Pascal support |
| compute_70 and compute_72 | + Volta support |
| compute_75 | + Turing support |
Note that only static libraries are supported by the device linker.
--resource-usage
- Does 'code=sm_X' embed only binary (cubin) code, or also PTX code, or both?
- CUDA: How to use -arch and -code and SM vs COMPUTE
- What is the purpose of using multiple “arch” flags in Nvidia's NVCC compiler?
The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release
- https://devtalk.nvidia.com/default/topic/995286/nvcc-compiler-warning-compute_20-/?offset=6
- https://stackoverflow.com/questions/42382987/nvcc-warning-in-cuda-8-0
- http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-cuda-compilation
- https://github.com/BVLC/caffe/pull/2077
- Concurrency about default stream
- How to enable CUDA 7.0+ per-thread default stream in Visual Studio 2013?
- CUDA stream per-thread and library behaviour
- 3. Stream synchronization behavior
- CUDA per-thread and cudnn behaviour
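A minimal sketch of the per-thread default stream behavior the links above discuss (assuming compilation with `--default-stream per-thread`; the kernel and thread count are made up):

```cuda
// streams.cu -- illustrative sketch; compile with:
//   nvcc --default-stream per-thread streams.cu -o streams
// Each host thread then gets its own default stream, so kernels launched
// from different threads may overlap instead of serializing on the single
// legacy NULL stream.
#include <cuda_runtime.h>
#include <thread>

__global__ void busy() { /* placeholder work */ }

static void worker() {
    busy<<<1, 1>>>();          // launches into this thread's default stream
    cudaStreamSynchronize(0);  // 0 refers to the per-thread default stream here
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    cudaDeviceReset();
    return 0;
}
```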
```
relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC
```

```cmake
# In CMakeLists.txt:
list(APPEND CUDA_NVCC_FLAGS "--compiler-options -fPIC")
```
- https://stackoverflow.com/questions/16272368/cuda-perfomance-profiling-with-nvidia-nsight-in-vs2010-nvreport-report-file
- NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide: Memory Transactions
- Use the Memory Checker
- Uncheck "Enable concurrent kernel profiling". NSight Profiler Signal 139
org.eclipse.swt.SWTException: Failed to execute runnable (java.lang.OutOfMemoryError: Java heap space)
- Using multi-threaded programs with multiple GPUs in EXCLUSIVE_PROCESS compute mode
- https://stackoverflow.com/questions/31731535/switch-cuda-compute-mode-to-default-mode
nvidia-smi --query | grep 'Compute Mode'
sudo nvidia-smi -c $i
# i=0 Default
# i=1 Exclusive_Thread
# i=2 Prohibited
# i=3 Exclusive_Process
sudo fuser -v /dev/nvidia*