QUDA Debugging - lattice/quda GitHub Wiki

compute-sanitizer

The CUDA toolkit includes a valgrind-like tool, compute-sanitizer, which can be used to isolate memory errors. However, because QUDA has a large number of kernels, the tool's default patching behavior can exhaust host (CPU) memory, and the resulting virtual-memory swapping causes a huge slowdown. To avoid this, set the environment variable CUDA_MEMCHECK_PATCH_MODULE=1. Note that this setting requires NVIDIA driver 418 or greater.
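A minimal sketch of such a run follows. The path ./tests/dslash_test is a hypothetical stand-in for whichever QUDA executable you want to check, and QUDA_TEST_BIN is a variable name introduced here purely for illustration. The guard skips the run when compute-sanitizer or the binary is unavailable.

```shell
# Hypothetical path to a QUDA test executable; override via QUDA_TEST_BIN.
QUDA_TEST_BIN="${QUDA_TEST_BIN:-./tests/dslash_test}"

# Setting described above, intended to keep host memory use manageable.
export CUDA_MEMCHECK_PATCH_MODULE=1

if command -v compute-sanitizer >/dev/null 2>&1 \
   && [ -f "$QUDA_TEST_BIN" ] && [ -x "$QUDA_TEST_BIN" ]; then
  # memcheck is the default compute-sanitizer tool; it is named here for clarity.
  compute-sanitizer --tool memcheck "$QUDA_TEST_BIN"
else
  echo "compute-sanitizer or $QUDA_TEST_BIN not available; skipping run" >&2
fi
```

Exporting the variable (rather than prefixing a single command) keeps the setting in effect for any later sanitizer runs in the same shell.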

Sanitizers

QUDA supports the address and undefined-behavior sanitizers (running on the host) to aid in finding subtle, hard-to-find memory bugs. To enable them, build with g++ or clang++ and set -DCMAKE_BUILD_TYPE=SANITIZE. This builds QUDA with both the address and undefined sanitizers enabled (-fsanitize=address,undefined).
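As a rough sketch, a sanitizer configure step might look like the following. The default locations for PATH_TO_QUDA and BUILD_DIR, and the variable QUDA_BUILD_TYPE, are assumptions made for this example; the -S and -B options assume CMake 3.13 or newer.

```shell
# Assumed locations: adjust to your own source tree and build directory.
PATH_TO_QUDA="${PATH_TO_QUDA:-$HOME/quda}"
BUILD_DIR="${BUILD_DIR:-$PWD/build-sanitize}"
QUDA_BUILD_TYPE=SANITIZE

if command -v cmake >/dev/null 2>&1 && [ -f "$PATH_TO_QUDA/CMakeLists.txt" ]; then
  # g++ is used here; clang++ is the other compiler named in the text.
  cmake -S "$PATH_TO_QUDA" -B "$BUILD_DIR" \
        -DCMAKE_BUILD_TYPE="$QUDA_BUILD_TYPE" \
        -DCMAKE_CXX_COMPILER=g++
else
  echo "cmake or $PATH_TO_QUDA/CMakeLists.txt not found; skipping configure" >&2
fi
```

Using a separate build directory for the sanitizer build keeps it from overwriting a regular build.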

Because the address sanitizer conflicts with the CUDA driver, it is necessary to set the run-time environment variable ASAN_OPTIONS="protect_shadow_gap=0" when running. This environment variable is not needed when using the internal QUDA test programs, which set it automatically. Memory leak checking is enabled by default and can be disabled if desired with detect_leaks=0.

To run with maximum checking enabled, use

ASAN_OPTIONS=protect_shadow_gap=0:strict_string_checks=1:detect_stack_use_after_return=1:check_initialization_order=1:strict_init_order=1
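The option string can also be assembled piece by piece, which makes it easier to toggle individual checks. A minimal sketch, using ':' as the separator between options:

```shell
# Start from the option required when running alongside the CUDA driver.
ASAN_OPTIONS="protect_shadow_gap=0"

# Append the extra checks listed above, one per loop iteration.
for opt in \
    strict_string_checks=1 \
    detect_stack_use_after_return=1 \
    check_initialization_order=1 \
    strict_init_order=1
do
  ASAN_OPTIONS="${ASAN_OPTIONS}:${opt}"
done

export ASAN_OPTIONS
echo "ASAN_OPTIONS=$ASAN_OPTIONS"
```

Any program launched from the same shell afterwards inherits the exported value. Removing a line from the loop drops the corresponding check.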

When using the clang compiler (as opposed to gcc), you will also need to set the environment variable ASAN_SYMBOLIZER_PATH to point to the appropriate llvm-symbolizer so that stack traces are printed correctly. It appears that this must be an absolute path that does not go through any symbolic links. E.g., when using clang-6 on nvsocal2:

ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-6.0/bin/llvm-symbolizer
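On other machines, the symlink-free absolute path can be derived rather than typed by hand. The sketch below assumes llvm-symbolizer is on PATH and that readlink supports -f (as in GNU coreutils); resolve_tool is a helper name invented for this example.

```shell
# Print the absolute, symlink-free path of a command found on PATH.
resolve_tool() {
  tool_path="$(command -v "$1")" || return 1
  readlink -f "$tool_path"
}

if ASAN_SYMBOLIZER_PATH="$(resolve_tool llvm-symbolizer)"; then
  export ASAN_SYMBOLIZER_PATH
  echo "ASAN_SYMBOLIZER_PATH=$ASAN_SYMBOLIZER_PATH"
else
  # Leave the variable unset rather than exporting an empty value.
  unset ASAN_SYMBOLIZER_PATH
  echo "llvm-symbolizer not found on PATH" >&2
fi
```

If several LLVM versions are installed, pass the versioned name that matches your compiler to resolve_tool instead.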

The use of these run-time sanitizers has been critical in finding bugs in QUDA, and should be complementary to running with valgrind (but not at the same time).

Note that on some systems, linking errors have been observed when ASAN is enabled and a mixture of gcc for compilation and clang for linking is used. Specifically, linking errors of the form

../lib/libquda.so: undefined reference to `__ubsan_handle_type_mismatch'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

have been observed. One resolution is to explicitly add the linking flags needed to include the ubsan library associated with the gcc compiler. For example, the following edit to CMakeLists.txt

-set(CMAKE_EXE_LINKER_FLAGS_SANITIZE ${CMAKE_EXE_LINKER_FLAGS_SANITIZE} "-fsanitize=address,undefined")
+set(CMAKE_EXE_LINKER_FLAGS_SANITIZE ${CMAKE_EXE_LINKER_FLAGS_SANITIZE} "-fsanitize=address,undefined -L/usr/lib/gcc/x86_64-linux-gnu/7 -lubsan")

fixes the above issue on Ubuntu 18.04.

Linters

One can configure cmake to use clang-tidy, which will run on the non-CUDA files, with this command:

cmake "-DCMAKE_CXX_CLANG_TIDY=/usr/bin/clang-tidy-8;-checks=*" $PATH_TO_QUDA

When building, you will be greeted with a slew of warnings that may or may not be important. At present this does nothing for CUDA source files (*.cu), though we may find a solution to this in the future.

Backwards

QUDA supports use of the stack trace tool Backward, which will print to screen a backtrace whenever an error is found in QUDA. This can aid debugging faults. To enable this feature set QUDA_BACKWARDS=ON.