matrix multiplication algorithm
Optimization
TVM
CUDA
- Nvidia CUDA Core-CUDA HGEMV Optimization : How to extremely optimize CUDA HGEMV with CUDA Core?
- https://github.com/xgqdut2016/hpc_project/tree/main/cuda/matrix
Marlin kernel
BLAS
- https://stackoverflow.com/questions/1303182/how-does-blas-get-such-extreme-performance
- https://www.quora.com/What-algorithm-does-BLAS-use-for-matrix-multiplication-Of-all-the-considerations-e-g-cache-popular-instruction-sets-Big-O-etc-which-one-turned-out-to-be-the-primary-bottleneck
- GEMM: From Pure C to SSE Optimized Micro Kernels
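- Analysis of the Source Code Structure of OpenBLAS Matrix Multiplication (OpenBLAS矩阵乘法源码结构分析)
- The OpenBLAS Project and Matrix Multiplication Optimization | AI 研习社 (OpenBLAS项目与矩阵乘法优化)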