The Maxwell architecture

1. Overview and characteristics

To date, Maxwell is NVIDIA's penultimate architecture for CUDA applications.

The trend of recent years has been to move computational performance from PCs, workstations and supercomputers down to mobile chips fitting in a pocket. Accordingly, the primary goal of engineers has been to further reduce GPU power consumption and extract more performance for the same power level as compared to Kepler.

So, Maxwell is the successor of Kepler and introduces a new SM design (referred to from now on as SMM) that improves performance per Watt and performance per area. Maxwell delivers twice the performance per Watt of Kepler while using the same 28nm manufacturing process. This makes Maxwell GPUs well suited to power-limited environments like notebooks and Small Form Factor PCs, often used for gaming and home entertainment, besides mainstream desktops.

Concerning performance, occupancy is the same or better on SMM than on SMX, instruction latency is reduced, and so utilization and throughput are improved. More in detail, the throughput of many integer operations, including multiply, logical operations and shift, is improved, and there are now specialized integer instructions that can accelerate pointer arithmetic. These instructions are most efficient when data structures are a power of two in size.
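
As a minimal sketch of this last point, the following hypothetical example (the Particle type and the kernel are illustrative, not taken from NVIDIA material) pads a structure to a power-of-two size of 64 bytes, so that the address computation `particles + i` reduces to a shift rather than a generic integer multiply:

```cuda
// Minimal sketch: a structure padded to a power-of-two size (64 bytes), so that
// indexing particles[i] involves a byte offset of i * 64, a simple shift.
struct Particle {
    float4 position;   // 16 bytes
    float4 velocity;   // 16 bytes
    float  mass;       //  4 bytes
    float  pad[7];     // 28 bytes of padding -> sizeof(Particle) == 64
};

static_assert(sizeof(Particle) == 64, "Particle size should be a power of two");

// Hypothetical kernel advancing particle positions; each access particles[i]
// benefits from the power-of-two element size.
__global__ void advance(Particle* particles, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Particle p = particles[i];
        p.position.x += p.velocity.x * dt;
        p.position.y += p.velocity.y * dt;
        p.position.z += p.velocity.z * dt;
        particles[i] = p;
    }
}
```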

Maxwell exploits and extends the same CUDA programming model as Fermi and Kepler, so that applications meeting the best practices for those architectures should typically run fast on Maxwell without code changes.

Below, the main features of Maxwell are presented. The discussion is incremental with respect to Kepler, so please have a look at The Kepler architecture article first.

2. Streaming Multiprocessor (SMM)

Second-generation Maxwell GPUs based on GM204 feature from 13 to 16 SMMs, depending on the card.

To achieve better power efficiency, the SMM uses a partitioned, quadrant-based design with four 32-core processing blocks, each with a dedicated warp scheduler capable of dispatching two instructions per clock. The number of CUDA cores per SMM has been reduced to a power of two, namely from 192 for Kepler to 128 for Maxwell. With 16 SMMs, the GeForce GTX 980 features a total of 2048 CUDA cores.
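
As a minimal check on one's own card, the host code below queries the multiprocessor count at runtime with cudaGetDeviceProperties; on a GeForce GTX 980 it should report 16 multiprocessors and, since every compute capability 5.x multiprocessor holds 128 CUDA cores, a total of 2048 cores:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: query the number of multiprocessors of device 0 at runtime.
// For Maxwell (compute capability 5.x), each SMM contains 128 CUDA cores.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device            : %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors   : %d\n", prop.multiProcessorCount);
    if (prop.major == 5)
        printf("CUDA cores        : %d\n", prop.multiProcessorCount * 128);
    return 0;
}
```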

Each SMM provides eight texture units, as well as a dedicated register file and shared memory.

The register file is composed of 64K 32-bit registers and is the same size as that of the SMX. Similarly, the maximum number of registers per thread is 255, as for Kepler GK110.

3. Warp scheduling, non-shared resources and power efficiency

Each warp scheduler controls one set of 32 single precision CUDA cores, one set of 8 load/store units and one set of 8 SFUs. This differs from Kepler, where each SMX has 4 schedulers that issue to a shared pool of execution units. The power-of-two number of CUDA cores per partition simplifies scheduling, as each of the SMM's warp schedulers issues to a dedicated set of CUDA cores equal to the warp width. In an SMX, the 4 warp schedulers share most of their execution resources and must work out which warp is on which execution resource for any given cycle. Shared resources, though extremely useful when one has workloads to fill them, are demanding in terms of space and power, and there is additional scheduling overhead from having to coordinate the actions of the warp schedulers.

By forgoing the shared resources, NVIDIA loses out on some of the performance benefits of that design, but what it gains in power and space efficiency more than makes up for it: a single 128-core SMM can deliver about 90% of the performance of a 192-core SMX at a much smaller size. Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA cores.

Moving on, along with the SMM layout changes NVIDIA has also made a number of small tweaks to improve the IPC of the GPU. The scheduler has been rewritten to avoid stalls and otherwise behave more intelligently.

4. Registers and Active Thread Blocks Per SMM

The number of active warps per multiprocessor (64), the register file size (64K 32-bit registers), and the maximum registers used per thread (255) are the same on Maxwell as they were on (most) Kepler GPUs. But Maxwell increases the maximum active thread blocks per SM from 16 to 32. This should help improve occupancy of kernels running on small thread blocks (such as 64 threads per block or fewer, assuming available registers and shared memory are not the occupancy limiter).
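
As a minimal sketch of how this can be verified, the occupancy API introduced with CUDA 6.5 reports how many thread blocks of a given size can be resident per multiprocessor; with 64-thread blocks, reaching the 64-warp limit requires 32 resident blocks, which Maxwell allows and Kepler does not. The kernel below is a hypothetical placeholder used only to drive the query:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate the occupancy query.
__global__ void smallBlockKernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main()
{
    int blocksPerSM = 0;

    // How many 64-thread blocks can be resident on one multiprocessor?
    // On Maxwell the answer can be as high as 32 (Kepler stops at 16).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, smallBlockKernel, 64 /* threads per block */, 0 /* dynamic smem */);

    printf("Resident blocks per SM at 64 threads/block: %d\n", blocksPerSM);
    printf("Active warps per SM                       : %d\n", blocksPerSM * 64 / 32);
    return 0;
}
```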

5. Maxwell memory system

Maxwell separates shared memory from L1 cache, providing a dedicated 64KB of shared memory in each SMM, unlike Fermi and Kepler, which partitioned the 64KB of on-chip memory between L1 cache and shared memory. GM204 does even better, increasing shared memory to 96KB per SMM. It should be noted, however, that the maximum shared memory per thread block is still 48KB, just like on Kepler and Fermi; but, as a result of the larger available shared memory, kernels whose occupancy is limited by shared memory capacity (or that use all 48KB in each thread block) can achieve up to twice the occupancy they achieved on Kepler.
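
The hypothetical kernel below sketches this situation: it statically allocates the full 48KB of shared memory allowed per thread block. On Kepler, where at most 48KB of the SMX on-chip memory can be configured as shared memory, only one such block can be resident per SM; on GM204, with 96KB of dedicated shared memory per SMM, two blocks fit, doubling occupancy without any code change:

```cuda
// Minimal sketch (hypothetical kernel): a thread block statically allocating
// the full 48KB of shared memory allowed per block.
__global__ void fullSharedKernel(const float* in, float* out, int n)
{
    const int ELEMS = (48 * 1024) / sizeof(float);   // 12288 floats == 48KB
    __shared__ float tile[ELEMS];

    // Cooperatively stage one tile of the input in shared memory.
    for (int i = threadIdx.x; i < ELEMS; i += blockDim.x) {
        int g = blockIdx.x * ELEMS + i;
        tile[i] = (g < n) ? in[g] : 0.0f;
    }
    __syncthreads();

    // Trivial use of the staged data.
    for (int i = threadIdx.x; i < ELEMS; i += blockDim.x) {
        int g = blockIdx.x * ELEMS + i;
        if (g < n) out[g] = tile[i] * 2.0f;
    }
}
```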

The L2 cache is now 2MB, namely, eight times larger than the 256KB of Kepler. As a consequence, the memory bus has been reduced from 192 bits on comparable Kepler cards to 128 bits, to save power. With more L2 cache, fewer requests to DRAM are needed, which contributes to power reduction and improves performance.
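
Both figures can be inspected at runtime; as a minimal sketch, the snippet below prints the L2 cache size and memory bus width reported for device 0 (on a GeForce GTX 980, for instance, the L2 cache size is 2097152 bytes):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: report the L2 cache size and memory bus width of device 0.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("L2 cache size   : %d bytes\n", prop.l2CacheSize);
    printf("Memory bus width: %d bits\n", prop.memoryBusWidth);
    return 0;
}
```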

6. Shared Memory Atomics

Kepler introduced improvements to the performance of atomic operations to device memory, but shared memory atomics used a lock/update/unlock pattern that could be expensive in the case of high contention for updates to particular locations in shared memory. Maxwell introduces native shared memory atomic operations for 32-bit integers and native shared memory 32-bit and 64-bit compare-and-swap (CAS), which can be used to implement other atomic functions with reduced overhead compared to the Fermi and Kepler methods. This should make it much more efficient to implement things like list and stack data structures shared by the threads of a block.
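
A typical beneficiary is a shared memory histogram. In the hypothetical kernel below, each block accumulates per-block counts with atomicAdd on shared memory and then merges them into a global histogram; on Fermi and Kepler each shared memory update compiles to a lock/update/unlock sequence, while on Maxwell it maps to a native shared memory atomic, which pays off under heavy contention (many threads hitting the same bin):

```cuda
#define NUM_BINS 256

// Minimal sketch (hypothetical kernel): per-block histogram in shared memory.
__global__ void histogram256(const unsigned char* data, int n, unsigned int* bins)
{
    __shared__ unsigned int smemBins[NUM_BINS];

    // Clear the per-block histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        smemBins[i] = 0;
    __syncthreads();

    // Accumulate into shared memory; on Maxwell these are native shared atomics.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&smemBins[data[i]], 1u);
    __syncthreads();

    // Merge the per-block result into the global histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], smemBins[i]);
}
```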

7. Dynamic parallelism and HyperQ

Dynamic Parallelism and HyperQ are supported.