Matrix multiplication algorithm

Emulation of FP32 and FP64 matmuls

DGEMM on Integer Matrix Multiplication Unit

Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit

Hardware Trends Impacting Floating-Point Computations In Scientific Applications

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

FP8 and FP4 Quantization

Optimization

TVM

CUDA

DeepGEMM

DeepGEMM V2

Marlin kernel

Autotuning

BLAS

Systolic Array