matrix multiplication algorithm

Emulation of FP32 and FP64 matmuls

DGEMM on Integer Matrix Multiplication Unit

Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit

Hardware Trends Impacting Floating-Point Computations In Scientific Applications

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

FP8 and FP4 Quantization

Optimization

TVM

CUDA

DeepGEMM

DeepGEMM V2

Marlin kernel

Marlin W4A16 & W4A8 Code Walkthrough

Autotuning

Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2

BLAS

Systolic Array