matrix multiplication algorithm

Emulation of FP32 and FP64 matmuls

DGEMM on Integer Matrix Multiplication Unit

Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit

Hardware Trends Impacting Floating-Point Computations In Scientific Applications

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

Optimization

TVM

CUDA

Marlin kernel

BLAS

Systolic Array