QUDA Debugging - lattice/quda GitHub Wiki
compute-sanitizer
The CUDA toolkit includes a valgrind-like tool, compute-sanitizer, which can be used to isolate memory errors. However, because QUDA has a large number of kernels, the tool's default patching behavior can exhaust host (CPU) memory, and the resulting virtual-memory swapping causes a huge slowdown. To avoid this, set the environment variable CUDA_MEMCHECK_PATCH_MODULE=1. Note that this setting requires NVIDIA driver version 418 or later.
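As a rough sketch of how the setting above might be combined with a run, the snippet below exports the variable and then launches a binary under compute-sanitizer. The path ./tests/dslash_test is only an assumed example target in a QUDA build directory; substitute whichever executable you are debugging. The guard simply skips the run when the tool or the binary is not present.

```shell
# Setting taken from the text above; requires NVIDIA driver 418 or later
export CUDA_MEMCHECK_PATCH_MODULE=1

# Assumed example target; replace with the binary you want to check
target=./tests/dslash_test

if command -v compute-sanitizer >/dev/null 2>&1 && [ -x "$target" ]; then
  # memcheck is the memory-error tool of compute-sanitizer
  compute-sanitizer --tool memcheck "$target"
else
  echo "compute-sanitizer or $target not found; skipping run" >&2
fi
```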
Sanitizers
QUDA supports the address and undefined-behavior sanitizers (running on the host) to help find subtle memory bugs. To enable them, build with g++ or clang++ and set -DCMAKE_BUILD_TYPE=SANITIZE. This builds QUDA with both the address and undefined-behavior sanitizers enabled (-fsanitize=address,undefined).
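A configure line for such a build might look like the following sketch. It assumes gcc/g++ are the desired host compilers and that PATH_TO_QUDA points at the QUDA source tree (the same placeholder used in the Linters section); any other options your build normally needs would be added alongside.

```shell
# PATH_TO_QUDA is a placeholder for the QUDA source directory
cmake -DCMAKE_BUILD_TYPE=SANITIZE \
      -DCMAKE_C_COMPILER=gcc \
      -DCMAKE_CXX_COMPILER=g++ \
      "$PATH_TO_QUDA"
```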
Because the address sanitizer conflicts with the CUDA driver, it is necessary to set the run-time environment variable ASAN_OPTIONS="protect_shadow_gap=0" when running. This variable is not needed for the internal QUDA test programs, which set it automatically. Memory-leak checking can be disabled if desired with detect_leaks=0 (it is enabled by default).
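For an application that is not one of the internal test programs, the two options above could be combined as in this sketch. The name my_quda_app is hypothetical and stands for any executable linked against a SANITIZE build of QUDA; the run line is left commented out for that reason.

```shell
# Required when running under the CUDA driver; leak checking turned off as well
export ASAN_OPTIONS="protect_shadow_gap=0:detect_leaks=0"
echo "ASAN_OPTIONS=$ASAN_OPTIONS"

# Hypothetical application linked against a sanitized QUDA build:
# ./my_quda_app
```

Drop detect_leaks=0 from the string to keep leak checking at its default (enabled).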
To run with maximum checking enabled, use:
ASAN_OPTIONS=protect_shadow_gap=0,strict_string_checks=1:detect_stack_use_after_return=1:check_initialization_order=1:strict_init_order=1
When using the clang compiler (as opposed to gcc), you also need to set the environment variable ASAN_SYMBOLIZER_PATH to point to the appropriate llvm-symbolizer so that stack traces are printed correctly. It appears that this must be an absolute path that does not pass through any symbolic links. For example, when using clang-6 on nvsocal2:
ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-6.0/bin/llvm-symbolizer
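On other machines, one way to obtain a link-free absolute path is to let the shell resolve it. The sketch below assumes the symbolizer is on PATH under the name llvm-symbolizer (some distributions install it with a version suffix instead) and that GNU readlink is available.

```shell
# Locate the symbolizer on PATH; the executable name is an assumption
symbolizer="$(command -v llvm-symbolizer || true)"

if [ -n "$symbolizer" ]; then
  # readlink -f follows every symbolic link and prints the absolute path
  export ASAN_SYMBOLIZER_PATH="$(readlink -f "$symbolizer")"
  echo "ASAN_SYMBOLIZER_PATH=$ASAN_SYMBOLIZER_PATH"
else
  echo "llvm-symbolizer not found on PATH" >&2
fi
```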
The use of these run-time sanitizers has been critical in finding bugs in QUDA. They should be treated as complementary to running with valgrind (but not at the same time).
Note that on some systems linking errors have been observed when ASAN is enabled and a mixture of gcc for compilation and clang for linking is used. Specifically, linking errors of the form
../lib/libquda.so: undefined reference to `__ubsan_handle_type_mismatch'
clang: error: linker command failed with exit code 1 (use -v to see invocation)
have been observed. One resolution is to explicitly add the linker flags needed to pull in the ubsan library associated with the gcc compiler. For example, this edit to CMakeLists.txt:
-set(CMAKE_EXE_LINKER_FLAGS_SANITIZE ${CMAKE_EXE_LINKER_FLAGS_SANITIZE} "-fsanitize=address,undefined")
+set(CMAKE_EXE_LINKER_FLAGS_SANITIZE ${CMAKE_EXE_LINKER_FLAGS_SANITIZE} "-fsanitize=address,undefined -L/usr/lib/gcc/x86_64-linux-gnu/7 -lubsan")
fixes the above issue on Ubuntu 18.04.
Linters
One can configure cmake to use clang-tidy, which will run on the non-CUDA files, with this command:
cmake "-DCMAKE_CXX_CLANG_TIDY=/usr/bin/clang-tidy-8;-checks=*" $PATH_TO_QUDA
When building, you will be greeted with a slew of warnings that may or may not be important. At present this does nothing for CUDA source files (*.cu), though we may find a solution to this in the future.
Backwards
QUDA supports the stack trace tool Backward, which prints a backtrace to the screen whenever an error is found in QUDA. This can aid in debugging faults. To enable this feature, set the CMake option QUDA_BACKWARDS=ON.