list of abbreviations - bamler-lab/cutlass-gemv GitHub Wiki

CTA cooperative thread array: A group of threads, essentially synonymous with a CUDA thread block (CTA is the term used in PTX), that executes in a SIMT (Single Instruction, Multiple Threads) fashion and is scheduled in warps of 32 threads. Threads within a CTA can synchronize with each other and communicate via shared memory.
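clusters of CTAs (thread block clusters): An optional intermediate level of the thread hierarchy between CTA and grid, available on devices with compute capability ≥ 9.0 (sm_90). CTAs in the same cluster are guaranteed to be co-scheduled and can access each other's shared memory (distributed shared memory).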
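wmma Warp Matrix Multiply-Accumulate: specialized functions (in CUDA C++, nvcuda::wmma namespace) and instructions (in PTX) that perform matrix multiply-accumulate operations on small tiles at the warp level, i.e. all threads of a warp cooperate on one tile. These operations run on Tensor Cores.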
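CUDA Graphs: A graph defines a set of operations (kernel launches, host functions, memcpys) together with the dependencies between them. A graph can be built explicitly or recorded from streams via stream capture. Once instantiated, a graph can be launched repeatedly with less launch overhead than issuing each kernel individually.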
HBM high bandwidth memory: stacked DRAM that serves as the off-chip device (global) memory of data-center GPUs (newest generation at the time of writing: HBM3E)
MLIR Multi-Level Intermediate Representation