Tensor Core
Architecture
NVIDIA Tensor Core Evolution: From Volta To Blackwell
Tensor Core Architecture Evolution
Volta
Why NVIDIA Added Tensor Cores
1st Generation Tensor Core - Quadpair-scoped MMA
Turing (sm75)
- 2nd Generation Tensor Core: adds INT8 and INT4 data types
Ampere (sm80)
Asynchronous Data Copy
Pre-Ampere, data loaded from global memory had to pass through the register file on its way to shared memory, so MMA instructions had to share the register file with data-loading operations, causing high register pressure and wasting bandwidth on copying data in and out of the RF. Ampere adds asynchronous copy instructions (cp.async) that move data directly from global memory into shared memory, bypassing the register file and letting the copy overlap with computation.
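A minimal sketch of the async copy path using the CUDA C++ pipeline primitives (`__pipeline_memcpy_async` / `__pipeline_commit` / `__pipeline_wait_prior`, which compile to cp.async on sm_80+). The 256-element tile and the trivial "multiply by 2" compute step are placeholders for illustration only.

```cuda
#include <cuda_pipeline.h>

// Launch with 256 threads per block. Each thread stages one float from global
// to shared memory with cp.async, then waits before computing on the tile.
__global__ void async_copy_tile(const float *g_in, float *g_out) {
    __shared__ float tile[256];

    // Issues a cp.async: global -> shared, without passing through registers.
    __pipeline_memcpy_async(&tile[threadIdx.x],
                            &g_in[blockIdx.x * 256 + threadIdx.x],
                            sizeof(float));
    __pipeline_commit();          // close the current async-copy batch
    __pipeline_wait_prior(0);     // wait until that batch has landed in SMEM
    __syncthreads();

    g_out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```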
3rd Generation Tensor Core - Warp-level Synchronous MMA
- warp-level: a full warp of 32 threads cooperatively executes one MMA operation (see the sketch after this list)
- ldmatrix: loads matrix fragments from shared memory into registers in the layout the MMA instructions expect
- BF16 data type
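A minimal warp-level MMA sketch using the CUDA WMMA API (the API itself dates back to Volta; it is used here only to illustrate the warp-synchronous programming model, and on sm_80 the compiler typically lowers `load_matrix_sync`/`mma_sync` to LDSM (ldmatrix) and HMMA instructions). The 16x16x16 tile shape and row-major layouts are arbitrary choices for the example.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 FP32 tile: C = A (FP16) * B (FP16).
__global__ void warp_mma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);           // each thread loads its share of the tile
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // warp-synchronous Tensor Core MMA
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```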
Hopper (sm90)
Thread Block Cluster
- cooperative grid array (CGA): a cluster of thread blocks that are co-scheduled onto the SMs of one GPC
- distributed shared memory (DSMEM): thread blocks in a cluster can directly read, write, and perform atomics on each other's shared memory (see the sketch below)
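A minimal DSMEM sketch (not from the article): two blocks in a cluster each write their rank into their own shared memory, then read the neighbor's copy through a mapped DSMEM pointer. Assumes sm_90+ and CUDA 11.8 or newer; the 2-block cluster shape, buffer size, and "exchange one int" workload are placeholders.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Compile with -arch=sm_90; launch with a grid whose block count is a multiple of 2.
__global__ void __cluster_dims__(2, 1, 1) dsmem_exchange(int *out) {
    __shared__ int smem[32];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int my_rank = cluster.block_rank();

    // Each block fills its own shared memory.
    if (threadIdx.x < 32) smem[threadIdx.x] = (int)my_rank;

    // Make all SMEM writes visible across the cluster before remote reads.
    cluster.sync();

    // Map the peer block's shared memory into this block's address space (DSMEM).
    unsigned int peer_rank = my_rank ^ 1;
    int *peer_smem = cluster.map_shared_rank(smem, peer_rank);

    if (threadIdx.x == 0)
        out[blockIdx.x] = peer_smem[0];   // read from the other block's SMEM

    // Keep every block resident until all remote reads have completed.
    cluster.sync();
}
```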
Tensor Memory Accelerator
- Hopper adds a Tensor Memory Accelerator (TMA) engine to each SM
- TMA performs address generation and out-of-bounds handling in hardware, freeing threads to execute other independent work (see the host-side sketch after this list)
- TMA transfers are issued with cp.async.bulk instructions; however, for small requests, TMA loads have higher latency than regular async data copies because of the address-generation overhead
- => in LLM inference, TMA is not a good fit for workloads that load the KV cache in small chunks, but works well when each chunk is a multiple of 16 bytes
  - SGLang prefix caching
  - paper: FlashInfer, section 3.2.1
  - paper: Hardware-Efficient Attention for Fast Decoding, section 4.2
  - ThunderKittens MLA decode
- TMA supports a multicast load mode, in which a single load is delivered to the shared memory of multiple SMs in a cluster => reduces L2 cache traffic and, in turn, HBM traffic
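As a concrete illustration of the "address generation and out-of-bounds handling" point, here is a hedged host-side sketch of building a TMA descriptor with the CUDA driver API's `cuTensorMapEncodeTiled` (CUDA 12+). The 2-D float32 matrix, the 64x64 shared-memory box, and the no-swizzle settings are arbitrary choices for the example.

```cuda
#include <cuda.h>   // CUDA driver API: CUtensorMap, cuTensorMapEncodeTiled

// Encode a TMA descriptor for a row-major [rows x cols] float32 matrix at d_ptr.
// The descriptor carries the global shape/strides, the SMEM box shape, and the
// out-of-bounds fill policy, so the TMA engine can generate addresses itself.
CUtensorMap make_tma_desc(void *d_ptr, cuuint64_t rows, cuuint64_t cols) {
    CUtensorMap desc{};
    cuuint64_t global_dim[2]    = {cols, rows};             // dim 0 is innermost
    cuuint64_t global_stride[1] = {cols * sizeof(float)};   // byte stride between rows
    cuuint32_t box_dim[2]       = {64, 64};                  // tile copied per TMA request
    cuuint32_t elem_stride[2]   = {1, 1};

    cuTensorMapEncodeTiled(&desc,
                           CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*tensorRank=*/2, d_ptr,
                           global_dim, global_stride,
                           box_dim, elem_stride,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_NONE,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return desc;
}
```

The descriptor is then typically passed to the kernel (e.g., as a `const __grid_constant__` parameter) and consumed by cp.async.bulk.tensor together with an mbarrier; that device-side handshake is omitted here.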
4th Generation Tensor Core - Warpgroup-level Asynchronous MMA
Hopper introduces warpgroup-level MMA (the wgmma.mma_async instructions), issued by a warpgroup of four warps (128 threads) and executed asynchronously. While all threads in a warpgroup collectively hold the output matrix in their registers, Hopper Tensor Cores can load operands directly from shared memory instead of registers, saving register space and bandwidth. Specifically, operand matrix A can reside in either registers or shared memory, while operand matrix B can only be read from shared memory.
- FP8 (E4M3 and E5M2)
References:
- GTC talk: Inside the NVIDIA Hopper Architecture
- NVIDIA blog post overview: NVIDIA Hopper Architecture In-Depth
- Whitepaper: NVIDIA H100 Tensor Core GPU Architecture
- Microbenchmarking: Benchmarking and Dissecting the Nvidia Hopper GPU Architecture
- Microbenchmarking: Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis
- Programming:
Blackwell (sm100)
Tensor Memory
- Tensor Memory (TMEM): a new on-chip memory specialized for Tensor Core operations
- On every SM, TMEM has 128 rows (lanes) and 512 columns of 4-byte cells, totaling 256 KB, which matches the size of the register file on an SM.
- restricted memory access pattern: it takes a full warpgroup to access the whole TMEM, and each warp in the warpgroup can only access a specific set of lanes =>
  - hardware designers can reduce the number of access ports, saving chip area
  - epilogue operations need a whole warpgroup to operate
CTA Pair
A CTA pair maps to a Texture Processing Cluster (TPC), which consists of two SMs and combines with other TPCs to form a GPC. When Blackwell Tensor Core operations execute at CTA-pair granularity, the two CTAs can share input operands => reduces both SMEM capacity and bandwidth requirements.
5th Generation Tensor Core MMA
- tcgen05.mma: single-thread semantics (the MMA is issued by a single thread and executes asynchronously, instead of being issued by a whole warp or warpgroup)
- Operands now reside in shared memory and Tensor Memory.
- MMA.2SM: a single MMA operation executes across the two SMs of a CTA pair
- convolutions: weight-stationary MMA instruction
- microscaling floating-point formats (MXFP), including MXFP8, MXFP6, and MXFP4 (see the quantization sketch below)
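A hedged host-side sketch of what "microscaling" means for MXFP8: a block of 32 values shares one power-of-two scale (stored as an 8-bit exponent, E8M0 in the OCP MX spec) while each element is quantized to FP8 E4M3. The scale selection below (keep the block maximum just inside E4M3's range of +/-448) is a simplified illustration, not NVIDIA's exact recipe.

```cuda
#include <cuda_fp8.h>   // __nv_fp8_e4m3 (CUDA 11.8+)
#include <math.h>

// Quantize one 32-element block to MXFP8: FP8 E4M3 elements plus a shared
// power-of-two scale 2^e. Dequantization is out[i] * 2^e.
void quantize_mxfp8_block(const float in[32], __nv_fp8_e4m3 out[32], int *shared_exp) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = fmaxf(amax, fabsf(in[i]));

    // Pick e so that amax / 2^e fits within E4M3's max finite value (448).
    int e = (amax > 0.0f) ? (int)ceilf(log2f(amax / 448.0f)) : 0;
    *shared_exp = e;   // the real format stores this as an 8-bit E8M0 exponent

    float inv_scale = exp2f((float)-e);
    for (int i = 0; i < 32; ++i)
        out[i] = __nv_fp8_e4m3(in[i] * inv_scale);   // round each element to E4M3
}
```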
Side Note: Structured Sparsity
- Ampere: 2:4 structured sparsity
- Blackwell: pair-wise 4:8 structured sparsity for the NVFP4 data type.
Tensor Core Size Increases
Cons:
- a large number of SMs/Tensor Cores worsens the wave quantization effect: when the number of tiles launched is not a multiple of the SM count, the last wave underutilizes the GPU (e.g., 134 tiles on a 132-SM GPU run as two waves, the second occupying only 2 SMs)
- a large tile size worsens the tile quantization effect: output dimensions that are not multiples of the tile size leave partially filled tiles that do wasted work
Pros:
- Larger MMA shapes increase the operand-sharing granularity: launching fewer, larger tiles increases data reuse, saving register-file and shared-memory footprint and bandwidth.
- The operand-sharing scope has grown from a quadpair of 8 threads (Volta) to a warp of 32 threads (Ampere) to a warpgroup of 128 threads (Hopper).
Memory Size Increase
Tensor Core throughput has doubled every generation, but global memory load latency has not decreased (it has in fact increased). As a result, the staging memory used to buffer in-flight data (shared memory and, on Blackwell, TMEM) must grow to keep the Tensor Cores fed.
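A hedged sketch of why staging memory grows: with more buffers, the copy of the next tile can overlap the computation on the current one. This is a simple two-stage double-buffering loop using the cp.async pipeline primitives; the tile size, tile count, and the stand-in "accumulate" step are placeholders.

```cuda
#include <cuda_pipeline.h>

#define TILE 256   // elements per tile; launch with TILE threads per block

__global__ void double_buffered(const float *g_in, float *g_out, int num_tiles) {
    __shared__ float buf[2][TILE];   // two staging buffers in shared memory
    float acc = 0.0f;

    // Prefetch tile 0 into stage 0.
    __pipeline_memcpy_async(&buf[0][threadIdx.x], &g_in[threadIdx.x], sizeof(float));
    __pipeline_commit();

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1, nxt = (t + 1) & 1;

        // Start fetching the next tile before consuming the current one.
        if (t + 1 < num_tiles) {
            __pipeline_memcpy_async(&buf[nxt][threadIdx.x],
                                    &g_in[(t + 1) * TILE + threadIdx.x], sizeof(float));
        }
        __pipeline_commit();
        __pipeline_wait_prior(1);   // wait only for the older batch (the current tile)
        __syncthreads();

        acc += buf[cur][threadIdx.x];   // stand-in for the real per-tile computation
        __syncthreads();
    }
    g_out[threadIdx.x] = acc;
}
```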
Asynchrony of MMA Instruction
MMA went from synchronous to asynchronous => MMA execution can overlap with LDSM (ldmatrix) and other instructions.
Data Type Precision Reduction
FP16 (Volta) -> INT8/INT4 (Turing) -> BF16/TF32 (Ampere) -> FP8 (Hopper) -> MXFP8/MXFP6/MXFP4 and NVFP4 (Blackwell): each generation adds lower-precision data types to trade precision for throughput.
Hopper
- NVIDIA Hopper Architecture Tensor Core Analysis (1)
- NVIDIA Hopper Architecture Tensor Core Analysis (2)
- NVIDIA Hopper Architecture Tensor Core Analysis (3)
- NVIDIA Hopper Architecture Tensor Core Analysis (4)
Blackwell
- https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma
- Nvidia Tensor Core - Getting Started with WMMA API Programming
Layout
- CUDA Tensor Layouts for Convolution
- NHWC vs NCHW : A memory access perspective on GPUs
- How much faster is NCHW compared to NHWC in TensorFlow/cuDNN?
- tensorflow layout optimizer && conv autotune