Modern GPUs Explained

What is a GPU?

The Graphics Processing Unit (GPU) is a hardware component of growing relevance, with applications in graphics rendering, animation, cryptocurrency mining, and machine learning. These specialized chips are optimized for specific tasks, often complex calculations on massive amounts of data that can be performed in parallel. Today, as computer architecture has advanced and AI/ML has taken center stage, GPUs are often integrated directly onto chips with dedicated memory, and they have grown in scale and variety to suit their various purposes.

What makes them special

We have likely all heard of the Central Processing Unit (CPU), which is designed to handle the vast majority of common computational tasks done by everyday computers and phones. Classically, individual CPUs operate on a Single Instruction Single Data (SISD) design, where each instruction operates on one set of data.

Meanwhile, the GPU is designed to handle a certain class of calculations with high amounts of parallelism and a high tolerance for latency, producing high throughput in these situations at the cost of turnaround time. More tangibly, in applications ranging from AI to loading frames in games, the same instruction is applied to different pieces of data to finish a task. When training AI models, the same multiply instruction is applied to matrices of different weights and biases to update each neuron. Similarly, to render frames and generate worlds in video games, the same instruction is used to calculate the position and rotation of many objects, each with its own set of values, across various coordinate spaces.
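
To make "same instruction, different data" concrete, here is a minimal serial sketch of the world-placement example (the names Vec3, placeObjectsSerial, and worldOffset are illustrative, not from any real engine). A CPU walks the objects one at a time, repeating identical additions; this is exactly the pattern a GPU spreads across many cores, as shown in the SIMD sketch further below.

```cuda
// Serial (CPU-style) sketch: a single instruction stream visits each object in turn.
struct Vec3 { float x, y, z; };

void placeObjectsSerial(Vec3* positions, int n, Vec3 worldOffset) {
    for (int i = 0; i < n; ++i) {
        // The same three additions repeat for every object, one per loop iteration.
        positions[i].x += worldOffset.x;
        positions[i].y += worldOffset.y;
        positions[i].z += worldOffset.z;
    }
}
```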

Figure (gpuvscpu.png): Simplified illustration of the difference in core and memory organization between a CPU and a GPU. The GPU's much larger number of ALUs (green squares) stands out, as does the grouping of many ALUs under a single control unit with shared cache.

| CPU | GPU |
| --- | --- |
| Low Latency | High Latency |
| Low Throughput | High Throughput |
| SISD | SIMD / SIMT |
| Most computer tasks | Embarrassingly parallel tasks |

SIMD

GPUs originally operated on the principle of Single Instruction Multiple Data (SIMD), which involves repeating one instruction across thousands to millions of data elements. Each computational unit must move in lockstep, meaning they all execute the same instruction during the same clock cycle. Each subtask must be independent of the others and easy to distribute across separate computational units; such workloads are often called embarrassingly parallel. For example, as video game models are placed into the world environment, they need to be mapped from model space to world space using coordinate addition operations. On a modern GPU, a single instruction can be applied to roughly 5,629 objects, amounting to some 25,000,000 addition calculations at once!
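
As a hedged sketch of how such an embarrassingly parallel addition might be expressed on a GPU, here is a minimal CUDA kernel (the names are illustrative) in which every thread executes the same addition instruction on a different object's position:

```cuda
#include <cuda_runtime.h>

struct Vec3 { float x, y, z; };

// One thread per object: all threads run the same instruction on different data.
__global__ void placeObjectsKernel(Vec3* positions, int n, Vec3 worldOffset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index for this thread
    if (i < n) {
        positions[i].x += worldOffset.x;
        positions[i].y += worldOffset.y;
        positions[i].z += worldOffset.z;
    }
}
```

A launch such as `placeObjectsKernel<<<(n + 255) / 256, 256>>>(d_positions, n, offset);` would cover all n objects with 256-thread blocks.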

SIMT

More recent GPUs operate on the Single Instruction Multiple Threads (SIMT) paradigm. Threads no longer need to move in lockstep with one another, since each thread has its own program counter. A thread is a lightweight analogue of a CPU process with much lower overhead: simply a series of instructions to be executed in order.

Threads in the same block also share fast on-chip memory carved out of the L1 cache, which allows them to diverge and later resynchronize while reducing memory bottlenecks. This eases the problems that more complex operations face when spread across separate cores, and it expands the range of workloads GPUs can handle.
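
Below is a minimal sketch of divergence and resynchronization (an illustrative toy workload, not from the source): even and odd threads in a block take different branches, then meet at an explicit barrier before reading each other's results from the block's shared memory.

```cuda
// Assumes a launch with exactly 256 threads per block and
// in/out arrays of at least gridDim.x * blockDim.x elements.
__global__ void divergeAndSync(float* out, const float* in) {
    __shared__ float tile[256];                      // on-chip memory shared by the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergence: threads in the same warp follow different instruction paths.
    if (threadIdx.x % 2 == 0) {
        tile[threadIdx.x] = in[i] * 2.0f;
    } else {
        tile[threadIdx.x] = in[i] + 1.0f;
    }

    __syncthreads();                                 // defined resynchronization point

    // After the barrier, it is safe to read a neighboring thread's result.
    out[i] = tile[threadIdx.x] + tile[threadIdx.x ^ 1];
}
```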

| SIMD | SIMT |
| --- | --- |
| Older GPUs | Most modern GPUs |
| Threads move in lockstep | Threads not in lockstep |
| One shared program counter (PC) | Separate program counters |
| Synchronization at each step | Defined synchronization points |

Modern Example: GA102 Architecture

As an architectural example, we can look at an Nvidia chip released in 2020. Nvidia's GeForce RTX 3080, 3080 Ti, 3090, and 3090 Ti cards all rely on the GA102 chip. Interestingly, while the cards in this series have different price points and performance, they all use the same model of chip; the number of manufacturing defects in each die determines which card series it ends up in.

Figure (DieShot.jpg): Labeled infrared image of Nvidia's GA102 silicon, with the various cores, memory, cache, and I/O visible.

Each chip is surrounded by GDDR6X SDRAM modules, each with a 32-bit interface, which together form a 384-bit-wide bus capable of moving data on and off the chip at roughly 1.15 TB/s (a rough sanity check of this figure follows the list below).

  • This is very fast compared to replaceable CPU main memory, which typically offers a 64-bit-wide bus per channel with speeds up to about 64 GB/s.
  • Nvidia accomplishes this with a signaling scheme that uses three different voltage levels and a complex encoding, mapping 276 binary bits onto 176 ternary digits.
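
The per-pin transfer rate below is inferred from the quoted bandwidth and bus width, not taken from a datasheet; it is only a back-of-the-envelope check:

$$
\frac{384\ \text{bits}}{8\ \text{bits/byte}} \times 24 \times 10^{9}\ \tfrac{\text{transfers}}{\text{s}}
\;=\; 48\ \text{B} \times 24\ \text{GT/s}
\;\approx\; 1.15\ \text{TB/s}
$$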

Each chip also contains 7 Graphics Processing Clusters (GPCs), each made up of 12 Streaming Multiprocessors (SMs). Every SM in turn contains:

  • Ray-tracing cores
  • Tensor cores
  • CUDA cores

In the bottom section, we can see 2 L2 cache units with a combined capacity of only 6 MB (relatively little), as well as the GigaThread Engine, which acts as a scheduler and task manager.

The chip as a whole has discrete sections designed for graphics tasks, floating-point arithmetic, and matrix multiplication. Each task can be handled by a separate specialized core (a sketch of the tensor-core style operation follows this list):

  • Tensor cores are optimized for the large matrix multiply and accumulate operations used in machine learning applications
  • Ray-tracing cores accelerate graphics rendering for animation and video games
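
For intuition, the core operation a tensor core accelerates is a matrix multiply-accumulate, D = A·B + C. The naive CUDA sketch below (illustrative only; real tensor-core code goes through Nvidia's dedicated warp matrix interfaces) shows the same arithmetic computed element by element on ordinary cores:

```cuda
// One thread computes one element of D = A * B + C for N x N row-major matrices.
// A tensor core performs an equivalent multiply-accumulate on a whole small tile
// in dedicated hardware rather than element by element.
__global__ void matmulAccumulate(const float* A, const float* B,
                                 const float* C, float* D, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = C[row * N + col];
        for (int k = 0; k < N; ++k) {
            acc += A[row * N + k] * B[k * N + col];
        }
        D[row * N + col] = acc;
    }
}
```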

Programming for GPUs

Compute Unified Device Architecture (CUDA) is a software platform that allows user-written programs to take full advantage of specialized GPU hardware on Nvidia chips, designed specifically for programs written in C/C++ (a minimal end-to-end sketch follows the list below).

  • Threads: CUDA allows code to be broken up into a large number of threads designed to run in parallel on separate GPU cores.
  • Kernel Functions: special CUDA functions that define the computation to be performed on the GPU.
  • Blocks: threads are grouped into blocks; CUDA programs can make use of shared memory between the threads in a block, as well as global and local memory.
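
Putting these pieces together, here is a minimal end-to-end CUDA sketch (a hypothetical array-scaling example, with error handling omitted): a kernel function is defined, data is copied into the GPU's global memory, the kernel is launched over a grid of thread blocks, and the result is copied back.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel function: each thread scales one element of the array.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                           // about one million elements
    std::vector<float> host(n, 1.0f);

    float* device = nullptr;
    cudaMalloc(&device, n * sizeof(float));          // allocate GPU global memory
    cudaMemcpy(device, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;                 // threads grouped into blocks
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(device, n, 2.0f);

    cudaMemcpy(host.data(), device, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device);
    return 0;
}
```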

So what really happens when a GPU is given a task? Once the task is decoded into simple instructions, each instruction is assigned to a thread that will carry it out. Threads executing the same instruction are grouped into sets of 32, known as warps, which are further grouped into thread blocks. The GigaThread Engine then distributes these thread blocks, mapping each to an available streaming multiprocessor (SM in the diagram). The streaming multiprocessors, composed of various cores (ray-tracing cores, tensor cores, and CUDA cores), compute the instructions and then synchronize their newly computed data with the rest of the system.
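
To make the grouping concrete, here is a small illustrative calculation (the block size of 256 is just a common choice made by the programmer; only the warp size of 32 is fixed by Nvidia hardware):

```cuda
// Bookkeeping for a hypothetical launch of about one million threads.
const int totalThreads    = 1 << 20;                         // 1,048,576 threads
const int threadsPerBlock = 256;                             // chosen by the programmer
const int warpSize        = 32;                              // fixed on Nvidia GPUs

const int numBlocks     = totalThreads / threadsPerBlock;    // 4,096 thread blocks
const int warpsPerBlock = threadsPerBlock / warpSize;        // 8 warps per block
// The GigaThread Engine distributes the 4,096 blocks across available SMs;
// each SM's schedulers then issue instructions one warp at a time.
```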

Integrated Graphics

In discussing modern graphics cards, the value of integrated graphics should not be overlooked. Thus far, we have only covered dedicated graphics cards, which are separate pieces of hardware that connect to the motherboard to handle intensive computations. Integrated graphics processing units (IGPUs), as the name implies, are units that are either built directly into the motherboard or, more commonly, packaged with the CPU. Unlike dedicated GPUs, integrated graphics don't have dedicated VRAM and must share system RAM with the CPU. The distinction matters: VRAM (Video Random Access Memory) serves the same purpose as RAM but is specialized for caching graphics data, from a frame in a scene to the weights of a neural network, whereas RAM is tailored toward general use, applications, and the operating system. Particularly for graphics generation, VRAM generally operates 4-5x faster than its counterpart, which is vital for the millions of computations needed to render scenes. As a result, dedicated graphics cards ultimately perform better.

If this is the case, why are integrated graphics so important? Simply put, not everyone needs such a powerful GPU. For the average user, video game frame rates, video-editing render times, and training neural network models aren't at the forefront of what they require from their computers and, more commonly, their laptops. At most, the average consumer needs smooth navigation between tasks, with their most "intensive" workload being smooth video playback. These are all tasks IGPUs handle with ease. As a result, IGPUs can replace dedicated GPUs in most instances, allowing for:

  • Less money spent on dedicated graphics => cheaper builds
  • Fewer transistors => less power and less heat => better battery life
  • Less space needed for the GPU and its dedicated cooling => thinner, lighter, more portable laptops

Figure: Intel Gen 11 Processor Graphics high level layout

| IGPU | Dedicated GPU |
| --- | --- |
| Slower | Faster |
| Shares RAM with CPU | Dedicated VRAM |
| Designed for general use | Designed for computationally intensive processes (e.g. gaming, training NNs) |
| More power efficient | Separate unit, power intensive |
| Cheaper | |
| More power and space efficient | |

The Future of GPUs

As complex graphics, computer vision, and machine learning algorithms become all the more prevalent, it is likely GPU hardware will continue to specialize for complex tasks. Increasingly, major tech companies are building out their own chips for their own hardware rather than relying on preexisting hardware produced by other companies.

Virtual and Augmented Reality

Products such as Apple's Vision Pro platform make special use of System on a Chip (SoC) designs. SoC circuits compress all of a system's components onto one piece of silicon, allowing for increased power and speed with a smaller physical footprint. On the Vision Pro, Apple makes use of an integrated 10-core GPU, as well as a separate neural engine.

Meta's much-anticipated Orion augmented reality platform packs complex hardware into a small pocket-sized puck and a pair of glasses, an impressive feat for such a lightweight product. To do this, it uses a specialized SoC and a custom discrete GPU designed in-house at Meta. In fact, in recent years many major tech companies have developed their own in-house GPU hardware to fit specialized tasks.

Machine Learning

As machine learning based programs enter our daily lives, companies have adapted their hardware to give programmers the tools needed to develop them further. Tensor Processing Units (TPUs) are application-specific integrated circuits (ASICs) designed to accelerate machine learning workloads. Evolved from the same principle behind the GPU, they offload parallel tasks to external hardware, in this case matrix multiply-and-accumulate operations. Google's TPU runs specialized machine code generated by the XLA just-in-time compiler, which turns machine learning parameters directly into usable machine code.
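
The operation the TPU's matrix unit accelerates is the multiply-accumulate at the heart of a dense neural network layer, which can be written as:

$$
\mathbf{y} = f(\mathbf{W}\mathbf{x} + \mathbf{b}), \qquad y_i = f\Big(\sum_k W_{ik}\, x_k + b_i\Big)
$$

Here W is the layer's weight matrix, x the input activations (e.g. pixel values), b a bias vector, and f a nonlinearity; the systolic array evaluates the many multiply-add terms in parallel.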

Figure (TPU.png): Diagram of Google's TPU converting pixels into input values for matrix multiply-and-accumulate operations in a simple classification model. Pixel inputs appear as red dots on the left; the matrix multiply-and-accumulate units appear as a grid.

Apple's Neural Engine technology has been integrated into everything from MacBooks to iPhones to the Vision Pro and is an example of a Neural Processing Unit (NPU). These chips focus on inference and on machine learning-specific parallel processing tasks, and they are often optimized for multimedia workloads such as speech processing and computer vision on smaller devices.

References