CUDA Tools

cuda-memcheck

Cannot find memory leaks

https://docs.nvidia.com/cuda/cuda-memcheck/index.html#leak-checking

For an accurate leak checking summary to be generated, the application's CUDA context must be destroyed at the end. This can be done explicitly by calling cuCtxDestroy() in applications using the CUDA driver API, or by calling cudaDeviceReset() in applications programmed against the CUDA run time API.

The --leak-check full option must be specified to enable leak checking.
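
A minimal sketch (the file name and the deliberately leaked buffer are illustrative, not from the wiki): without the final cudaDeviceReset() the leak below goes unreported.

    // leaktest.cu -- illustrative sketch
    // Build: nvcc leaktest.cu -o leaktest
    // Run:   cuda-memcheck --leak-check full ./leaktest
    #include <cuda_runtime.h>

    int main() {
        int *d_buf = nullptr;
        cudaMalloc(&d_buf, 1024 * sizeof(int));  // deliberately never freed
        cudaDeviceReset();  // destroys the context so the leak summary is accurate
        return 0;
    }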

nvprof and Nsight Visual Studio Edition are not being run simultaneously

    Internal Memcheck Error: Memcheck failed initialization
    as some other tools is currently attached. Please make 
    sure that nvprof and Nsight Visual Studio Edition are 
    not being run simultaneously
unset CUDA_INJECTION32_PATH
unset CUDA_INJECTION64_PATH

Unspecified launch failure

This generic error typically corresponds to an invalid device memory access (the GPU-side analogue of a segfault); running the application under cuda-memcheck can pinpoint the offending access.

nvprof

nvprof --events all --metrics all <your application>
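
Note that --events all --metrics all forces many kernel replays and can be very slow. Narrower standard invocations (the application name is a placeholder):

    nvprof ./app                                # per-kernel time summary
    nvprof --print-gpu-trace ./app              # timeline of each launch and memcpy
    nvprof --metrics achieved_occupancy ./app   # collect a single metric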

nvcc


CUDA compilation from .cu to executable

--default-stream {legacy|null|per-thread} (-default-stream)

4.2.7. Options for Steering GPU Code Generation

--gpu-architecture arch (-arch) Specify the name of the class of NVIDIA virtual GPU architecture for which the CUDA input files must be compiled.

--gpu-code code,... (-code) Specify the name of the NVIDIA GPU to assemble and optimize PTX for. During runtime, such embedded PTX code is dynamically compiled by the CUDA runtime system if no binary load image is found for the current GPU.

5. GPU Compilation

The virtual architecture should always be chosen as low as possible, thereby maximizing the actual GPUs to run on. The real architecture should be chosen as high as possible (assuming that this always generates better code), but this is only possible with knowledge of the actual GPUs on which the application is expected to run.
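
For example (a sketch; the sm_50/sm_70 targets are placeholders for the GPUs you actually expect): embed real code for each known GPU, plus PTX from the lowest virtual architecture for JIT on future ones.

    nvcc x.cu \
      -gencode arch=compute_50,code=sm_50 \
      -gencode arch=compute_70,code=sm_70 \
      -gencode arch=compute_50,code=compute_50   # PTX kept for JIT on newer GPUs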

5.6.1. Just-in-Time Compilation

The disadvantage of just in time compilation is increased application startup delay, but this can be alleviated by letting the CUDA driver use a compilation cache which is persistent over multiple runs of the applications.
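
The cache is controlled by documented environment variables (the values shown are illustrative):

    export CUDA_CACHE_DISABLE=0                  # set to 1 to disable the JIT cache
    export CUDA_CACHE_PATH=~/.nv/ComputeCache    # cache location (Linux default)
    export CUDA_CACHE_MAXSIZE=268435456          # size limit in bytes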

GPU Feature List

sm_30 and sm_32            Basic features + Kepler support + Unified memory programming
sm_35                      + Dynamic parallelism support
sm_50, sm_52, and sm_53    + Maxwell support
sm_60, sm_61, and sm_62    + Pascal support
sm_70 and sm_72            + Volta support
sm_75                      + Turing support

Virtual Architecture Feature List

compute_30 and compute_32            Basic features + Kepler support + Unified memory programming
compute_35                           + Dynamic parallelism support
compute_50, compute_52, and compute_53    + Maxwell support
compute_60, compute_61, and compute_62    + Pascal support
compute_70 and compute_72            + Volta support
compute_75                           + Turing support

6. Using Separate Compilation in CUDA

Note that only static libraries are supported by the device linker.
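
A sketch of the flow (file names are placeholders): compile each translation unit with relocatable device code, then let nvcc perform the device link.

    nvcc -arch=sm_50 -dc a.cu b.cu    # -dc (--device-c): relocatable device code
    nvcc -arch=sm_50 a.o b.o -o app   # nvcc device-links and builds the executable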

Printing Code Generation Statistics

--resource-usage
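
For example (kernel.cu is a placeholder), this prints per-kernel register, shared, and constant memory usage at compile time:

    nvcc -arch=sm_50 --resource-usage -c kernel.cu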


Compute Capability

The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release

--use_fast_math (-use_fast_math)

--default-stream per-thread

cudaStreamPerThread
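
A minimal sketch (names illustrative): built with --default-stream per-thread, each host thread launches into its own default stream, and cudaStreamPerThread refers to that stream explicitly.

    // streams.cu -- illustrative sketch
    // Build: nvcc --default-stream per-thread streams.cu -o streams
    #include <cuda_runtime.h>

    __global__ void scale(float *p) { p[threadIdx.x] *= 2.0f; }

    int main() {
        float *d = nullptr;
        cudaMalloc(&d, 256 * sizeof(float));
        scale<<<1, 256>>>(d);                        // this thread's default stream
        cudaStreamSynchronize(cudaStreamPerThread);  // sync only that stream
        cudaFree(d);
        return 0;
    }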

relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC

# In CMakeLists.txt:
list(APPEND CUDA_NVCC_FLAGS "--compiler-options -fPIC")

--maxrregcount amount (-maxrregcount)

-Xcompiler

cuobjdump nvdisasm nvprune
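
Typical invocations of the binary utilities (a sketch; file names are placeholders):

    cuobjdump -sass app                       # dump embedded SASS from an executable
    cuobjdump -ptx app                        # dump embedded PTX
    nvdisasm kernel.cubin                     # disassemble a standalone cubin
    nvprune -arch sm_50 libx.a -o libx_50.a   # keep only sm_50 device code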

nvidia-xconfig

Nsight

Application received signal 139

Status 139 is 128 + 11, i.e. the process was killed by SIGSEGV (a segmentation fault).

org.eclipse.swt.SWTException: Failed to execute runnable (java.lang.OutOfMemoryError: Java heap space)
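
A possible fix, assuming the Eclipse-based Nsight (the file name nsight.ini and the heap values are assumptions, not from the wiki): raise the JVM heap limit in the .ini file next to the Nsight launcher.

    -vmargs
    -Xms512m
    -Xmx2048m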

nvidia-smi

nvidia-smi  --query | grep 'Compute Mode'
sudo nvidia-smi -c $i
# i=0 Default
# i=1 Exclusive_Thread
# i=2 Prohibited 
# i=3 Exclusive_Process
sudo fuser -v /dev/nvidia*

NVML

monitor
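
A minimal monitor sketch using documented NVML calls (the file name is illustrative; a polling loop is omitted for brevity); link with -lnvidia-ml.

    // monitor.c -- illustrative sketch
    // Build: gcc monitor.c -o monitor -lnvidia-ml
    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
            nvmlUtilization_t util;
            unsigned int temp;
            if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
                printf("GPU %u%%  MEM %u%%\n", util.gpu, util.memory);
            if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
                printf("TEMP %u C\n", temp);
        }
        nvmlShutdown();
        return 0;
    }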
