Installing cutlass on galvani cluster - bamler-lab/cutlass-gemv GitHub Wiki

C++ CUTLASS

To install CUTLASS on the Galvani cluster for an A100 (adapted from here), run:

cd $WORK
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
export CUDA_INSTALL_PATH=/usr/local/cuda
export CUDACXX=$CUDA_INSTALL_PATH/bin/nvcc
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_s8*     # NVCC_ARCHS: 80 for A100 and 75 for 2080TI, only compiles gemm_s8 kernels!
make test_unit -j8                         # this takes a while

To allocate an interactive session on an A100 use:

srun --job-name "InteractiveJob" --partition=a100-galvani --ntasks=1 --nodes=1 --gres=gpu:1 --time 1:00:00 --pty bash

Then benchmark an example problem by running ./cutlass_profiler from the cutlass/build/tools/profiler folder:

cd tools/profiler
make -j8 # this takes a while too
./cutlass_profiler --kernels=cutlass_tensorop_s*gemm_s8* --enable-best-kernel-for-fixed-shape --m=4096 --k=4096 --n=1 --batch_count=300 --sort-results-flops-per-sec --dist=gaussian,mean:0,stdev:3 

To compile your own CUTLASS code, keep in mind that the NVIDIA compiler nvcc lives at /usr/local/cuda/bin/nvcc and is not globally available. To compile, use:

/usr/local/cuda/bin/nvcc --std=c++17 -I$WORK/cutlass/tools/util/include -I$WORK/cutlass/include ../file.cu -o output
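A minimal file.cu to test the toolchain could look like the following sketch. It uses CUTLASS's single-precision device-level GEMM (the same API as CUTLASS's basic_gemm example); the matrix sizes, layouts, and leading dimensions below are arbitrary illustrative choices, not anything this wiki prescribes:

```cpp
// Minimal CUTLASS SGEMM sketch: C = alpha * A * B + beta * C on the GPU.
// Compile with the nvcc invocation above; requires a CUDA-capable GPU to run.
#include <iostream>

#include <cutlass/gemm/device/gemm.h>

int main() {
  int M = 128, N = 128, K = 128;
  float alpha = 1.0f, beta = 0.0f;

  // Column-major single-precision GEMM with default kernel settings.
  using Gemm = cutlass::gemm::device::Gemm<
      float, cutlass::layout::ColumnMajor,   // A
      float, cutlass::layout::ColumnMajor,   // B
      float, cutlass::layout::ColumnMajor>;  // C

  // Allocate (uninitialized) device buffers just to exercise the kernel.
  float *A, *B, *C;
  cudaMalloc(&A, sizeof(float) * M * K);
  cudaMalloc(&B, sizeof(float) * K * N);
  cudaMalloc(&C, sizeof(float) * M * N);

  Gemm gemm_op;
  cutlass::Status status = gemm_op({{M, N, K},
                                    {A, M},        // A with leading dimension M
                                    {B, K},        // B with leading dimension K
                                    {C, M},        // C (source operand)
                                    {C, M},        // C (destination)
                                    {alpha, beta}});

  std::cout << (status == cutlass::Status::kSuccess ? "GEMM ok" : "GEMM failed")
            << std::endl;
  cudaFree(A); cudaFree(B); cudaFree(C);
  return status == cutlass::Status::kSuccess ? 0 : 1;
}
```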

Using nsys to profile PyTorch kernels

To measure the latencies of PyTorch kernels, use:

/usr/local/bin/nsys profile --stats=true --force-overwrite true python example.py
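An example.py for this purpose only needs to launch some GPU work for nsys to record. The script below is a hypothetical placeholder (the matrix shapes and iteration count are arbitrary, loosely mirroring the profiler problem above):

```python
import torch

# Run a batch of matrix-vector products so nsys has kernels to trace.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
x = torch.randn(4096, 1, device=device)
for _ in range(300):
    y = a @ x
if device == "cuda":
    # CUDA launches are asynchronous; wait for all kernels to finish
    # so their timings land inside the nsys trace window.
    torch.cuda.synchronize()
print(y.shape)
```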

Python CUTLASS DSL

Python 3.12 is required (not 3.11). Then install with pip install nvidia-cutlass-dsl.
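A quick way to check the interpreter and smoke-test the install (assuming the package is imported as cutlass):

```shell
python --version                         # must report 3.12.x
pip install nvidia-cutlass-dsl
python -c "import cutlass; print('ok')"  # import succeeds if the wheel matched
```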