CUDA Tools

cuda-memcheck

Cannot find memory leaks

https://docs.nvidia.com/cuda/cuda-memcheck/index.html#leak-checking

For an accurate leak checking summary to be generated, the application's CUDA context must be destroyed at the end. This can be done explicitly by calling cuCtxDestroy() in applications using the CUDA driver API, or by calling cudaDeviceReset() in applications programmed against the CUDA run time API.

The --leak-check full option must be specified to enable leak checking.
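
A minimal sketch (the file name and the deliberately leaked buffer are illustrative, not from the wiki): without the final cudaDeviceReset() the leak below goes unreported.

    // leaktest.cu -- illustrative sketch
    // Build: nvcc leaktest.cu -o leaktest
    // Run:   cuda-memcheck --leak-check full ./leaktest
    #include <cuda_runtime.h>

    int main() {
        int *d_buf = nullptr;
        cudaMalloc(&d_buf, 1024 * sizeof(int));  // deliberately never freed
        cudaDeviceReset();  // destroys the context so the leak summary is accurate
        return 0;
    }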

nvprof and Nsight Visual Studio Edition are not being run simultaneously

    Internal Memcheck Error: Memcheck failed initialization
    as some other tools is currently attached. Please make 
    sure that nvprof and Nsight Visual Studio Edition are 
    not being run simultaneously
unset CUDA_INJECTION32_PATH
unset CUDA_INJECTION64_PATH

Unspecified launch failure

This generic error typically corresponds to an invalid device memory access (the GPU-side analogue of a segfault); running the application under cuda-memcheck can pinpoint the offending access.

nvprof

nvprof --events all --metrics all <your application>
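
Note that --events all --metrics all forces many kernel replays and can be very slow. Narrower standard invocations (the application name is a placeholder):

    nvprof ./app                                # per-kernel time summary
    nvprof --print-gpu-trace ./app              # timeline of each launch and memcpy
    nvprof --metrics achieved_occupancy ./app   # collect a single metric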

nvcc


CUDA compilation from .cu to executable

--default-stream {legacy|null|per-thread} (-default-stream)

4.2.7. Options for Steering GPU Code Generation

--gpu-architecture arch (-arch) Specify the name of the class of NVIDIA virtual GPU architecture for which the CUDA input files must be compiled.

--gpu-code code,... (-code) Specify the name of the NVIDIA GPU to assemble and optimize PTX for. During runtime, such embedded PTX code is dynamically compiled by the CUDA runtime system if no binary load image is found for the current GPU.

5. GPU Compilation

The virtual architecture should always be chosen as low as possible, thereby maximizing the actual GPUs to run on. The real architecture should be chosen as high as possible (assuming that this always generates better code), but this is only possible with knowledge of the actual GPUs on which the application is expected to run.
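
For example (a sketch; the sm_50/sm_70 targets are placeholders for the GPUs you actually expect): embed real code for each known GPU, plus PTX from the lowest virtual architecture for JIT on future ones.

    nvcc x.cu \
      -gencode arch=compute_50,code=sm_50 \
      -gencode arch=compute_70,code=sm_70 \
      -gencode arch=compute_50,code=compute_50   # PTX kept for JIT on newer GPUs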

5.6.1. Just-in-Time Compilation

The disadvantage of just in time compilation is increased application startup delay, but this can be alleviated by letting the CUDA driver use a compilation cache which is persistent over multiple runs of the applications.
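
The cache is controlled by documented environment variables (the values shown are illustrative):

    export CUDA_CACHE_DISABLE=0                  # set to 1 to disable the JIT cache
    export CUDA_CACHE_PATH=~/.nv/ComputeCache    # cache location (Linux default)
    export CUDA_CACHE_MAXSIZE=268435456          # size limit in bytes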

GPU Feature List

sm_30 and sm_32            Basic features + Kepler support + Unified memory programming
sm_35                      + Dynamic parallelism support
sm_50, sm_52, and sm_53    + Maxwell support
sm_60, sm_61, and sm_62    + Pascal support
sm_70 and sm_72            + Volta support
sm_75                      + Turing support

Virtual Architecture Feature List

compute_30 and compute_32            Basic features + Kepler support + Unified memory programming
compute_35                           + Dynamic parallelism support
compute_50, compute_52, and compute_53    + Maxwell support
compute_60, compute_61, and compute_62    + Pascal support
compute_70 and compute_72            + Volta support
compute_75                           + Turing support

6. Using Separate Compilation in CUDA

Note that only static libraries are supported by the device linker.
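
A sketch of the flow (file names are placeholders): compile each translation unit with relocatable device code, then let nvcc perform the device link.

    nvcc -arch=sm_50 -dc a.cu b.cu    # -dc (--device-c): relocatable device code
    nvcc -arch=sm_50 a.o b.o -o app   # nvcc device-links and builds the executable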

Printing Code Generation Statistics

--resource-usage
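
For example (kernel.cu is a placeholder), this prints per-kernel register, shared, and constant memory usage at compile time:

    nvcc -arch=sm_50 --resource-usage -c kernel.cu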


Compute Capability

The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release

--use_fast_math (-use_fast_math)

--default-stream per-thread

cudaStreamPerThread
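
A minimal sketch (names illustrative): built with --default-stream per-thread, each host thread launches into its own default stream, and cudaStreamPerThread refers to that stream explicitly.

    // streams.cu -- illustrative sketch
    // Build: nvcc --default-stream per-thread streams.cu -o streams
    #include <cuda_runtime.h>

    __global__ void scale(float *p) { p[threadIdx.x] *= 2.0f; }

    int main() {
        float *d = nullptr;
        cudaMalloc(&d, 256 * sizeof(float));
        scale<<<1, 256>>>(d);                        // this thread's default stream
        cudaStreamSynchronize(cudaStreamPerThread);  // sync only that stream
        cudaFree(d);
        return 0;
    }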

relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC

# In CMakeLists.txt:
list(APPEND CUDA_NVCC_FLAGS "--compiler-options -fPIC")

--maxrregcount amount (-maxrregcount)

-Xcompiler

cuobjdump nvdisasm nvprune
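
Typical invocations of the binary utilities (a sketch; file names are placeholders):

    cuobjdump -sass app                       # dump embedded SASS from an executable
    cuobjdump -ptx app                        # dump embedded PTX
    nvdisasm kernel.cubin                     # disassemble a standalone cubin
    nvprune -arch sm_50 libx.a -o libx_50.a   # keep only sm_50 device code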

nvidia-xconfig

Nsight

Application received signal 139

Status 139 is 128 + 11, i.e. the process was killed by SIGSEGV (a segmentation fault).

org.eclipse.swt.SWTException: Failed to execute runnable (java.lang.OutOfMemoryError: Java heap space)
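
A possible fix, assuming the Eclipse-based Nsight (the file name nsight.ini and the heap values are assumptions, not from the wiki): raise the JVM heap limit in the .ini file next to the Nsight launcher.

    -vmargs
    -Xms512m
    -Xmx2048m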

nvidia-smi

nvidia-smi  --query | grep 'Compute Mode'
sudo nvidia-smi -c $i
# i=0 Default
# i=1 Exclusive_Thread
# i=2 Prohibited 
# i=3 Exclusive_Process
sudo fuser -v /dev/nvidia*

NVML

monitor
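
A minimal monitor sketch using documented NVML calls (the file name is illustrative; a polling loop is omitted for brevity); link with -lnvidia-ml.

    // monitor.c -- illustrative sketch
    // Build: gcc monitor.c -o monitor -lnvidia-ml
    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
            nvmlUtilization_t util;
            unsigned int temp;
            if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
                printf("GPU %u%%  MEM %u%%\n", util.gpu, util.memory);
            if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
                printf("TEMP %u C\n", temp);
        }
        nvmlShutdown();
        return 0;
    }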
