Tensor Core
Architecture
NVIDIA Tensor Core Evolution: From Volta To Blackwell
Tensor Core Architecture Evolution
Volta
Why NVIDIA Added Tensor Cores
1st Generation Tensor Core - Quadpair-scoped MMA
Turing (sm75)
- 2nd Generation Tensor Core: adds INT8 and INT4 data types
Ampere (sm80)
Asynchronous Data Copy
Pre-Ampere, data loaded from global memory had to pass through the register file on its way to shared memory, so MMA instructions had to share the register file with data-loading operations, causing high register pressure and wasting bandwidth on copying data in and out of the RF. Ampere adds asynchronous copy instructions (cp.async) that move data directly from global memory into shared memory, bypassing the register file and letting the copy overlap with computation.
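A minimal sketch of the async copy path using the CUDA C++ pipeline primitives (`__pipeline_memcpy_async` / `__pipeline_commit` / `__pipeline_wait_prior`, which compile to cp.async on sm_80+). The 256-element tile and the trivial "multiply by 2" compute step are placeholders for illustration only.

```cuda
#include <cuda_pipeline.h>

// Launch with 256 threads per block. Each thread stages one float from global
// to shared memory with cp.async, then waits before computing on the tile.
__global__ void async_copy_tile(const float *g_in, float *g_out) {
    __shared__ float tile[256];

    // Issues a cp.async: global -> shared, without passing through registers.
    __pipeline_memcpy_async(&tile[threadIdx.x],
                            &g_in[blockIdx.x * 256 + threadIdx.x],
                            sizeof(float));
    __pipeline_commit();          // close the current async-copy batch
    __pipeline_wait_prior(0);     // wait until that batch has landed in SMEM
    __syncthreads();

    g_out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```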
3rd Generation Tensor Core - Warp-level Synchronous MMA
- warp-level: a full warp of 32 threads cooperatively executes one MMA operation (see the sketch after this list)
- ldmatrix: loads matrix fragments from shared memory into registers in the layout the MMA instructions expect
- BF16 data type
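A minimal warp-level MMA sketch using the CUDA WMMA API (the API itself dates back to Volta; it is used here only to illustrate the warp-synchronous programming model, and on sm_80 the compiler typically lowers `load_matrix_sync`/`mma_sync` to LDSM (ldmatrix) and HMMA instructions). The 16x16x16 tile shape and row-major layouts are arbitrary choices for the example.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 FP32 tile: C = A (FP16) * B (FP16).
__global__ void warp_mma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);           // each thread loads its share of the tile
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // warp-synchronous Tensor Core MMA
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```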
Hopper (sm90)
Thread Block Cluster
- cooperative grid array (CGA): a cluster of thread blocks that are co-scheduled onto the SMs of one GPC
- distributed shared memory (DSMEM): thread blocks in a cluster can directly read, write, and perform atomics on each other's shared memory (see the sketch below)
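A minimal DSMEM sketch (not from the article): two blocks in a cluster each write their rank into their own shared memory, then read the neighbor's copy through a mapped DSMEM pointer. Assumes sm_90+ and CUDA 11.8 or newer; the 2-block cluster shape, buffer size, and "exchange one int" workload are placeholders.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Compile with -arch=sm_90; launch with a grid whose block count is a multiple of 2.
__global__ void __cluster_dims__(2, 1, 1) dsmem_exchange(int *out) {
    __shared__ int smem[32];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int my_rank = cluster.block_rank();

    // Each block fills its own shared memory.
    if (threadIdx.x < 32) smem[threadIdx.x] = (int)my_rank;

    // Make all SMEM writes visible across the cluster before remote reads.
    cluster.sync();

    // Map the peer block's shared memory into this block's address space (DSMEM).
    unsigned int peer_rank = my_rank ^ 1;
    int *peer_smem = cluster.map_shared_rank(smem, peer_rank);

    if (threadIdx.x == 0)
        out[blockIdx.x] = peer_smem[0];   // read from the other block's SMEM

    // Keep every block resident until all remote reads have completed.
    cluster.sync();
}
```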
Tensor Memory Accelerator
- Hopper adds a Tensor Memory Accelerator (TMA) engine to each SM
- TMA performs address generation and out-of-bounds handling in hardware, freeing threads to execute other independent work (see the host-side sketch after this list)
- TMA transfers are issued with cp.async.bulk instructions; however, for small requests, TMA loads have higher latency than regular async data copies because of the address-generation overhead
- => in LLM inference, TMA is not a good fit for workloads that load the KV cache in small chunks, but works well when each chunk is a multiple of 16 bytes
  - SGLang prefix caching
  - paper: FlashInfer, section 3.2.1
  - paper: Hardware-Efficient Attention for Fast Decoding, section 4.2
  - ThunderKittens MLA decode
- TMA supports a multicast load mode, in which a single load is delivered to the shared memory of multiple SMs in a cluster => reduces L2 cache traffic and, in turn, HBM traffic
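As a concrete illustration of the "address generation and out-of-bounds handling" point, here is a hedged host-side sketch of building a TMA descriptor with the CUDA driver API's `cuTensorMapEncodeTiled` (CUDA 12+). The 2-D float32 matrix, the 64x64 shared-memory box, and the no-swizzle settings are arbitrary choices for the example.

```cuda
#include <cuda.h>   // CUDA driver API: CUtensorMap, cuTensorMapEncodeTiled

// Encode a TMA descriptor for a row-major [rows x cols] float32 matrix at d_ptr.
// The descriptor carries the global shape/strides, the SMEM box shape, and the
// out-of-bounds fill policy, so the TMA engine can generate addresses itself.
CUtensorMap make_tma_desc(void *d_ptr, cuuint64_t rows, cuuint64_t cols) {
    CUtensorMap desc{};
    cuuint64_t global_dim[2]    = {cols, rows};             // dim 0 is innermost
    cuuint64_t global_stride[1] = {cols * sizeof(float)};   // byte stride between rows
    cuuint32_t box_dim[2]       = {64, 64};                  // tile copied per TMA request
    cuuint32_t elem_stride[2]   = {1, 1};

    cuTensorMapEncodeTiled(&desc,
                           CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*tensorRank=*/2, d_ptr,
                           global_dim, global_stride,
                           box_dim, elem_stride,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_NONE,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return desc;
}
```

The descriptor is then typically passed to the kernel (e.g., as a `const __grid_constant__` parameter) and consumed by cp.async.bulk.tensor together with an mbarrier; that device-side handshake is omitted here.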
4th Generation Tensor Core - Warpgroup-level Asynchronous MMA
Hopper introduces warpgroup-level MMA (the wgmma.mma_async instructions), issued by a warpgroup of four warps (128 threads) and executed asynchronously. While all threads in a warpgroup collectively hold the output matrix in their registers, Hopper Tensor Cores can load operands directly from shared memory instead of registers, saving register space and bandwidth. Specifically, operand matrix A can reside in either registers or shared memory, while operand matrix B can only be read from shared memory.
- FP8 (E4M3 and E5M2)
References:
- GTC talk: Inside the NVIDIA Hopper Architecture
- NVIDIA blog post overview: NVIDIA Hopper Architecture In-Depth
- Whitepaper: NVIDIA H100 Tensor Core GPU Architecture
- Microbenchmarking: Benchmarking and Dissecting the Nvidia Hopper GPU Architecture
- Microbenchmarking: Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis
- Programming:
Blackwell (sm100)
Tensor Memory
- Tensor Memory (TMEM): a new on-chip memory specialized for Tensor Core operations
- On every SM, TMEM has 128 rows (lanes) and 512 columns of 4-byte cells, totaling 256 KB, which matches the size of the register file on an SM.
- restricted memory access pattern: it takes a full warpgroup to access the whole TMEM, and each warp in the warpgroup can only access a specific set of lanes =>
  - hardware designers can reduce the number of access ports, saving chip area
  - epilogue operations need a whole warpgroup to operate
CTA Pair
A CTA pair maps to a Texture Processing Cluster (TPC), which consists of two SMs and combines with other TPCs to form a GPC. When Blackwell Tensor Core operations execute at CTA-pair granularity, the two CTAs can share input operands => reduces both SMEM capacity and bandwidth requirements.
5th Generation Tensor Core MMA
- tcgen05.mma: single-thread semantics (the MMA is issued by a single thread and executes asynchronously, instead of being issued by a whole warp or warpgroup)
- Operands now reside in shared memory and Tensor Memory.
- MMA.2SM: a single MMA operation executes across the two SMs of a CTA pair
- convolutions: weight-stationary MMA instruction
- microscaling floating-point formats (MXFP), including MXFP8, MXFP6, and MXFP4 (see the quantization sketch below)
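A hedged host-side sketch of what "microscaling" means for MXFP8: a block of 32 values shares one power-of-two scale (stored as an 8-bit exponent, E8M0 in the OCP MX spec) while each element is quantized to FP8 E4M3. The scale selection below (keep the block maximum just inside E4M3's range of +/-448) is a simplified illustration, not NVIDIA's exact recipe.

```cuda
#include <cuda_fp8.h>   // __nv_fp8_e4m3 (CUDA 11.8+)
#include <math.h>

// Quantize one 32-element block to MXFP8: FP8 E4M3 elements plus a shared
// power-of-two scale 2^e. Dequantization is out[i] * 2^e.
void quantize_mxfp8_block(const float in[32], __nv_fp8_e4m3 out[32], int *shared_exp) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = fmaxf(amax, fabsf(in[i]));

    // Pick e so that amax / 2^e fits within E4M3's max finite value (448).
    int e = (amax > 0.0f) ? (int)ceilf(log2f(amax / 448.0f)) : 0;
    *shared_exp = e;   // the real format stores this as an 8-bit E8M0 exponent

    float inv_scale = exp2f((float)-e);
    for (int i = 0; i < 32; ++i)
        out[i] = __nv_fp8_e4m3(in[i] * inv_scale);   // round each element to E4M3
}
```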
Side Note: Structured Sparsity
- Ampere: 2:4 structured sparsity
- Blackwell: pair-wise 4:8 structured sparsity for the NVFP4 data type.
Tensor Core Size Increases
Cons:
- a large number of SMs/Tensor Cores worsens the wave quantization effect: when the number of tiles launched is not a multiple of the SM count, the last wave underutilizes the GPU (e.g., 134 tiles on a 132-SM GPU run as two waves, the second occupying only 2 SMs)
- a large tile size worsens the tile quantization effect: output dimensions that are not multiples of the tile size leave partially filled tiles that do wasted work
Pros:
- Larger MMA shapes increase the operand-sharing granularity: launching fewer, larger tiles increases data reuse, saving register-file and shared-memory footprint and bandwidth.
- The operand-sharing scope has grown from a quadpair of 8 threads (Volta) to a warp of 32 threads (Ampere) to a warpgroup of 128 threads (Hopper).
Memory Size Increase
Tensor Core throughput has doubled every generation, but global memory load latency has not decreased (it has in fact increased). As a result, the staging memory used to buffer in-flight data (shared memory and, on Blackwell, TMEM) must grow to keep the Tensor Cores fed.
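A hedged sketch of why staging memory grows: with more buffers, the copy of the next tile can overlap the computation on the current one. This is a simple two-stage double-buffering loop using the cp.async pipeline primitives; the tile size, tile count, and the stand-in "accumulate" step are placeholders.

```cuda
#include <cuda_pipeline.h>

#define TILE 256   // elements per tile; launch with TILE threads per block

__global__ void double_buffered(const float *g_in, float *g_out, int num_tiles) {
    __shared__ float buf[2][TILE];   // two staging buffers in shared memory
    float acc = 0.0f;

    // Prefetch tile 0 into stage 0.
    __pipeline_memcpy_async(&buf[0][threadIdx.x], &g_in[threadIdx.x], sizeof(float));
    __pipeline_commit();

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1, nxt = (t + 1) & 1;

        // Start fetching the next tile before consuming the current one.
        if (t + 1 < num_tiles) {
            __pipeline_memcpy_async(&buf[nxt][threadIdx.x],
                                    &g_in[(t + 1) * TILE + threadIdx.x], sizeof(float));
        }
        __pipeline_commit();
        __pipeline_wait_prior(1);   // wait only for the older batch (the current tile)
        __syncthreads();

        acc += buf[cur][threadIdx.x];   // stand-in for the real per-tile computation
        __syncthreads();
    }
    g_out[threadIdx.x] = acc;
}
```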
Asynchrony of MMA Instruction
MMA went from synchronous to asynchronous => MMA execution can overlap with LDSM (ldmatrix) and other instructions.
Data Type Precision Reduction
FP16 (Volta) -> INT8/INT4 (Turing) -> BF16/TF32 (Ampere) -> FP8 (Hopper) -> MXFP8/MXFP6/MXFP4 and NVFP4 (Blackwell): each generation adds lower-precision data types to trade precision for throughput.
Hopper
- NVIDIA Hopper Architecture Tensor Core Analysis (1)
- NVIDIA Hopper Architecture Tensor Core Analysis (2)
- NVIDIA Hopper Architecture Tensor Core Analysis (3)
- NVIDIA Hopper Architecture Tensor Core Analysis (4)
Blackwell
- https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma
- Nvidia Tensor Core - Getting Started with WMMA API Programming
Layout
- CUDA Tensor Layouts for Convolution
- NHWC vs NCHW : A memory access perspective on GPUs
- How much faster is NCHW compared to NHWC in TensorFlow/cuDNN?
- tensorflow layout optimizer && conv autotune