Tensor Core

Architecture


NVIDIA Tensor Core Evolution: From Volta To Blackwell

Tensor Core Architecture Evolution

Volta

Why NVIDIA Added Tensor Cores
1st Generation Tensor Core - Warp-scoped MMA

Turing (sm75)

2nd Generation Tensor Core

  • adds INT8 and INT4 precisions

Ampere (sm80)

Asynchronous Data Copy

Pre-Ampere, data had to be staged through the register file on its way from global to shared memory (an LDG into registers followed by an STS), so data-loading operations shared the RF with the already register-hungry MMA instructions, causing high register pressure and wasting bandwidth copying data in and out of the RF. Ampere's cp.async copies global memory directly into shared memory, bypassing the RF.
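A minimal sketch (mine, not the wiki's) of the Ampere fix: on sm_80+, cooperative_groups::memcpy_async lowers to cp.async, so the tile moves global -> shared without being staged in registers.

```cuda
// cp.async via cooperative groups: the copy bypasses the register file on sm_80+.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void stage_tile(const float* __restrict__ in, float* __restrict__ out) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // All threads collectively issue the async copy; no LDG->STS through the RF.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);  // block until the staged tile has landed in shared memory

    out[blockIdx.x * 256 + threadIdx.x] = 2.0f * tile[threadIdx.x];  // consume it
}
```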

3rd Generation Tensor Core - Warp-level Synchronous MMA
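As an illustration (a minimal sketch, not from the wiki), the CUDA WMMA C++ API is the portable way to issue this kind of warp-synchronous MMA: all 32 threads of a warp cooperatively own the fragments and must call mma_sync together, which the compiler lowers to warp-wide HMMA/mma.sync instructions.

```cuda
// One warp computes a single 16x16x16 tile: C = A * B + C (requires sm_70+).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);         // zero the accumulator fragment
    wmma::load_matrix_sync(fa, a, 16);      // each thread loads its slice of A
    wmma::load_matrix_sync(fb, b, 16);      // ...and of B
    wmma::mma_sync(acc, fa, fb, acc);       // all 32 threads issue the MMA together
    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}
```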

Hopper (sm90)

Thread Block Cluster
Tensor Memory Accelerator
  • Hopper adds a Tensor Memory Accelerator (TMA) to each SM.
  • TMA handles address generation in hardware, freeing threads to execute other independent work, and offers additional benefits such as out-of-bounds handling.
  • exposed via cp.async.bulk (see the sketch after this list)
  • However, for small requests, TMA loads have higher latency than regular async data copies because of the address-generation overhead. => in LLM inference, TMA is not suitable for workloads that load the KV cache in small chunks, but works well when each chunk is a multiple of 16 bytes.
  • TMA supports a multicast load mode, where one load delivers data to multiple SMs in a thread block cluster => reduces L2 cache traffic and subsequently reduces HBM traffic
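A condensed sketch following the cp.async.bulk pattern in the CUDA C++ Programming Guide (not this wiki): on sm_90, a 16-byte-aligned cuda::memcpy_async through a block-scoped barrier lowers to a TMA bulk copy, and a single thread issues the whole transfer while the TMA unit does the address generation.

```cuda
#include <cuda/barrier>
#include <cuda/ptx>
using barrier = cuda::barrier<cuda::thread_scope_block>;

__global__ void tma_bulk_tile(const int* gmem, int* out) {
    __shared__ alignas(16) int smem[1024];  // bulk destinations need 16 B alignment
#pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ barrier bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);                                 // one arrival per thread
        cuda::ptx::fence_proxy_async(cuda::ptx::space_shared);  // publish barrier to the async proxy
    }
    __syncthreads();

    if (threadIdx.x == 0)  // a single thread launches the entire bulk copy
        cuda::memcpy_async(smem, gmem + blockIdx.x * 1024,
                           cuda::aligned_size_t<16>(sizeof(smem)), bar);

    barrier::arrival_token tok = bar.arrive();  // every thread arrives...
    bar.wait(std::move(tok));                   // ...then waits for the bytes to land

    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        out[blockIdx.x * 1024 + i] = smem[i];
}
```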
4th Generation Tensor Core - Warpgroup-level Asynchronous MMA

While all threads in a warpgroup collectively hold the output matrix in their registers, Hopper Tensor Cores can directly load operands from shared memory instead of registers, saving register space and bandwidth. Specifically, operand matrix A can reside in either registers or shared memory, while operand matrix B can only be accessed through shared memory.


Blackwell (sm100)

Tensor Memory
  • Blackwell adds Tensor Memory (TMEM) specialized for Tensor Core operations.
  • On every SM, TMEM has 128 rows (lanes) and 512 columns of 4-byte cells, totaling 256 KB, which is also the size of the register file on an SM.
  • Restricted memory access pattern: it takes a full warpgroup to access the whole TMEM, and each warp in the warpgroup can only access a specific set of lanes. =>
    • hardware designers can reduce the number of access ports, saving chip space
    • epilogue operations need a whole warpgroup to operate
CTA Pair

A CTA pair maps to a Texture Processing Cluster (TPC), which consists of two SMs and combines with other TPCs to form a GPC. When Blackwell Tensor Core operations execute at CTA-pair granularity, the two CTAs can share input operands. => reduces both SMEM capacity and bandwidth requirements

5th Generation Tensor Core MMA
  • tcgen05.mma has single-thread semantics: a single thread issues the MMA, instead of a whole warp (Ampere) or warpgroup (Hopper).
  • Operands now reside in shared memory and Tensor Memory.
  • MMA.2SM: a CTA pair issues a single MMA that spans both SMs of its TPC.

Side Note: Structured Sparsity

Tensor Core Size Increases

  • Pros:

    • Having larger MMA shapes enhances the operand-sharing granularity. Specifically, launching fewer, larger tiles increases data reuse, saving memory footprint and bandwidth of RF and SMEM.
      • a quadpair of 8 threads (Volta) -> a warp of 32 threads (Ampere) -> a warpgroup of 128 threads (Hopper)

  • Cons:

Memory Size Increase

Tensor Core throughput has doubled every generation, but global memory load latency hasn't decreased; it has in fact increased. As a result, the staging memory used to buffer in-flight data has to grow (see the estimate below).
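A back-of-the-envelope way to see the scaling (my framing via Little's law, not the wiki's): the bytes that must be in flight, and hence buffered, satisfy

```latex
S_{\text{staging}} \;\gtrsim\; R_{\text{consume}} \times L_{\text{load}}
```

so doubling the Tensor Core consumption rate R at flat-or-growing load latency L at least doubles the staging footprint S each generation.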

Asynchrony of MMA Instruction

  • MMA went from synchronous to asynchronous => Tensor Core execution can overlap with LDSM (shared-memory load) instructions and other independent work.

Data Type Precision Reduction


Hopper

  • adds FP8 (E4M3 / E5M2)

Blackwell

  • adds FP6 and FP4, including the microscaling (MX) formats


Layout

NC/32HW32 Memory Layout
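A small sketch of the index arithmetic, assuming cuDNN's NCHW_VECT_C definition of NC/32HW32 (channels split into C/32 groups, with a 32-channel vector as the innermost contiguous dimension, as used for INT8 Tensor Core paths); the function name is illustrative.

```cuda
// NC/32HW32 (a.k.a. NCHW_VECT_C, INT8x32): element (n, c, h, w) of an
// N x C x H x W tensor lives in channel slice c/32, at vector lane c%32.
#include <cstddef>

__host__ __device__ inline size_t nc32hw32_offset(int n, int c, int h, int w,
                                                  int C, int H, int W) {
    const int groups  = (C + 31) / 32;  // number of 32-channel slices
    const int c_outer = c / 32;         // which slice
    const int c_inner = c % 32;         // lane within the slice
    return ((((size_t)n * groups + c_outer) * H + h) * (size_t)W + w) * 32 + c_inner;
}
```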