list of abbreviations - bamler-lab/cutlass-gemv GitHub Wiki
- CTA cooperative thread array: The PTX term for what the CUDA programming model calls a thread block: a group of threads that executes in SIMT (Single Instruction, Multiple Threads) fashion, organized in warps. Threads within a CTA can synchronize and communicate via shared memory.
- clusters of CTAs: An optional intermediate layer of the thread hierarchy, between CTAs and the grid, available on devices with compute capability ≥ 9.0 (sm_90)
- wmma Warp Matrix Multiply-Accumulate: specialized functions (in C++) and instructions (in PTX) that perform matrix multiply-accumulate operations at the warp level, executed on Tensor Cores
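- CUDA Graphs: define a sequence of operations (kernel launches, host functions, or memcpys) together with their dependencies, which can be recorded via stream capture. A recorded graph is instantiated once and can then be launched repeatedly, with less overhead than launching each kernel individually.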
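- HBM high bandwidth memory: stacked DRAM that serves as the device (global) memory of data-center GPUs. It is not the link between host and device, which is PCIe or NVLink (current version at the time of writing is HBM3E)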
- MLIR Multi-Level Intermediate Representation
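
The CTA entry above says that threads are organized in warps. As a rough illustration, the following host-side C++ sketch shows how a thread's linear index within a CTA maps to a warp index and a lane index. The names `ThreadPosition`, `linear_thread_id`, `decompose`, and `warps_per_cta` are hypothetical helpers introduced here, not part of CUDA or CUTLASS, and the warp size of 32 is an assumption that holds on current NVIDIA GPUs (device code should use the built-in `warpSize`).

```cpp
#include <cstdint>

// Assumption: 32 threads per warp, as on current NVIDIA GPUs.
constexpr std::uint32_t kWarpSize = 32;

// Hypothetical helper type: where a thread sits inside its CTA.
struct ThreadPosition {
    std::uint32_t warp;  // index of the warp within the CTA
    std::uint32_t lane;  // index of the thread within its warp
};

// Linear thread index within a CTA, following the CUDA convention
// x + y * Dx + z * Dx * Dy for a block of shape (Dx, Dy, Dz).
constexpr std::uint32_t linear_thread_id(std::uint32_t x, std::uint32_t y,
                                         std::uint32_t z, std::uint32_t dim_x,
                                         std::uint32_t dim_y) {
    return x + y * dim_x + z * dim_x * dim_y;
}

// Split a linear thread index into (warp, lane).
constexpr ThreadPosition decompose(std::uint32_t linear_tid) {
    return ThreadPosition{linear_tid / kWarpSize, linear_tid % kWarpSize};
}

// Number of warps needed for a CTA; a partially filled last warp
// still occupies a full warp slot, hence the rounding up.
constexpr std::uint32_t warps_per_cta(std::uint32_t threads_per_cta) {
    return (threads_per_cta + kWarpSize - 1) / kWarpSize;
}
```

For example, `decompose(70)` yields warp 2 and lane 6, and `warps_per_cta(100)` is 4, because the last warp is only partially filled. This is one reason CTA sizes are usually chosen as multiples of 32.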