GPGPU

GPUDirect

HBM

Drivers

DRM

NVLINK

MIG

NCCL

MPS

Render

Architecture

Volta

Pascal


Pascal Streaming Multiprocessor

The GP100 SM is partitioned into two processing blocks, each having 32 single-precision CUDA Cores, an instruction buffer, a warp scheduler, and two dispatch units.

Because GP100 provides more SMs, each with its own register file and shared memory, threads across the GPU have access to more registers overall, and GP100 supports more threads, warps, and thread blocks in flight than prior GPU generations.

Each warp scheduler (one per processing block) is capable of dispatching two warp instructions per clock.

Support for FP16 Arithmetic Speeds Up Deep Learning

Note: In GP100, two FP16 operations can be performed using a single paired-operation instruction.
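
In CUDA this is exposed through the `half2` vector type and paired-operation intrinsics in `cuda_fp16.h`. A minimal sketch (the kernel name and usage are illustrative, not from the whitepaper):

```cuda
#include <cuda_fp16.h>

// Minimal sketch (illustrative kernel): each __hfma2 performs two FP16 fused
// multiply-adds in one paired-operation instruction, which is how GP100
// reaches twice the FP16 throughput of FP32.
__global__ void saxpy_half2(int n, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = __hfma2(a, x[i], y[i]);   // y[i] = a * x[i] + y[i], two halves at once
    }
}
```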

L1/L2 Cache Changes in GP100

A dedicated shared memory per SM means applications no longer need to select a preference of the L1/shared split for optimal performance; the full 64 KB per SM is always available for shared memory.
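
For context, on architectures with a configurable split the preference was expressed through `cudaFuncSetCacheConfig`; on GP100 the hint becomes unnecessary. A sketch, with a hypothetical kernel name:

```cuda
#include <cuda_runtime.h>

// Hypothetical shared-memory-heavy kernel.
__global__ void tiledKernel(float *data) { /* ... */ }

void configureCache()
{
    // Fermi/Kepler: ask for the larger shared-memory carve-out of the split.
    // On GP100 this hint is unnecessary, since the 64 KB of shared memory per
    // SM is dedicated and always available.
    cudaFuncSetCacheConfig(tiledKernel, cudaFuncCachePreferShared);
}
```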

Compute Preemption

The new Pascal GP100 Compute Preemption feature allows compute tasks running on the GPU to be interrupted at instruction-level granularity, and their context swapped to GPU DRAM. This permits other applications to be swapped in and run, followed by the original task’s context being swapped back in to continue execution where it left off.

In contrast, the Kepler GPU architecture only provided coarser-grained preemption at the level of a block of threads in a compute kernel. This block-level preemption required that all threads of a thread block complete before the hardware could context switch to a different context.


Maxwell


1. The Heart of Maxwell: More Efficient Multiprocessors

Improved Instruction Scheduling

As with SMX, each SMM has four warp schedulers, but unlike SMX, all core SMM functional units are assigned to a particular scheduler, with no shared units.

Increased Occupancy for Existing Code

The register file size and the maximum number of concurrent warps in SMM are the same as in SMX (64k 32-bit registers and 64 warps, respectively), as is the maximum number of registers per thread (255).

However, the maximum number of active thread blocks per multiprocessor has been doubled over SMX to 32, which should result in an automatic occupancy improvement for kernels that use small thread blocks of 64 or fewer threads (assuming available registers and shared memory are not the occupancy limiter).
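
One way to see this effect is to ask the runtime occupancy API how many blocks of a small-block kernel can be resident per SM; the kernel below is a hypothetical stand-in:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel launched with small, 64-thread blocks.
__global__ void smallBlockKernel(float *data) { /* ... */ }

int main()
{
    int blocksPerSM = 0;
    // With 64-thread blocks, the 32-block limit on Maxwell (vs. 16 on Kepler)
    // allows more resident blocks, as long as registers and shared memory
    // are not the limiting factor.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, smallBlockKernel,
                                                  /*blockSize=*/64,
                                                  /*dynamicSMemSize=*/0);
    printf("Resident thread blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```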

Reduced Arithmetic Instruction Latency

2. Larger, Dedicated Shared Memory

A significant improvement in SMM is that it provides 64KB of dedicated shared memory per SM—unlike Fermi and Kepler, which partitioned the 64KB of memory between L1 cache and shared memory.

3. Fast Shared Memory Atomics

Maxwell provides native shared memory atomic operations for 32-bit integers and native shared memory 32-bit and 64-bit compare-and-swap (CAS), which can be used to implement other atomic functions.

4. Support for Dynamic Parallelism
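
A minimal sketch of what the feature enables (kernel names are illustrative): a parent kernel launches a child grid directly from the device, with no round trip to the CPU. Building requires relocatable device code and the device runtime library:

```cuda
// Minimal dynamic-parallelism sketch (illustrative kernels). Build with
// relocatable device code, e.g.: nvcc -arch=sm_52 -rdc=true dynpar.cu -lcudadevrt
__global__ void childKernel(int *out, int value)
{
    out[threadIdx.x] = value;
}

__global__ void parentKernel(int *out)
{
    // A single thread decides the child launch configuration at run time
    // and launches it directly from the device.
    if (threadIdx.x == 0) {
        childKernel<<<1, 32>>>(out, 42);
    }
}
```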


SMM: The Maxwell Multiprocessor

SMM uses a quadrant-based design with four 32-core processing blocks, each with a dedicated warp scheduler capable of dispatching two instructions per clock. Each SMM provides eight texture units, one polymorph engine (geometry processing for graphics), and dedicated register file and shared memory.

There are now specialized integer instructions that can accelerate pointer arithmetic. These instructions are most efficient when data structures are a power of two in size, and here is a tip provided by the Maxwell Tuning Guide:

Note: As was already the recommended best practice, signed arithmetic should be preferred over unsigned arithmetic wherever possible for best throughput on SMM. The C language standard places more restrictions on overflow behavior for unsigned math, limiting compiler optimization opportunities.
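
An illustrative example (mine, not the tuning guide's) of the kind of indexing this note is about; with a signed index the compiler may assume the arithmetic never overflows and strength-reduce the address math, whereas unsigned wraparound semantics can block that:

```cuda
// Illustrative only (not from the tuning guide): a signed loop index lets the
// compiler assume i never overflows, so index/address computations can be
// strength-reduced more aggressively than with an unsigned index.
__global__ void scaleStrided(float *out, const float *in, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // signed, preferred
    int stride = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += stride) {
        out[i] = 2.0f * in[i];
    }
}
```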

Larger, Dedicated Shared Memory (96KB)

Larger L2 Cache

Shared Memory Atomics

Maxwell introduces native shared memory atomic operations for 32-bit integers and native shared memory 32-bit and 64-bit compare-and-swap (CAS), which can be used to implement other atomic functions with reduced overhead compared to the Fermi and Kepler methods. This should make it much more efficient to implement things like list and stack data structures shared by the threads of a block.
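
A classic use case is a per-block histogram accumulated with shared-memory `atomicAdd`; the sketch below is illustrative, assuming 256 bins and byte-valued input:

```cuda
// Illustrative per-block histogram with 256 bins over byte-valued input.
// On Maxwell, atomicAdd on shared-memory 32-bit integers is a native hardware
// operation; Fermi and Kepler emulated it with a lock/CAS sequence.
__global__ void histogram256(const unsigned char *data, int n, unsigned int *bins)
{
    __shared__ unsigned int localBins[256];

    for (int i = threadIdx.x; i < 256; i += blockDim.x)   // clear shared bins
        localBins[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localBins[data[i]], 1u);                // shared-memory atomic
    __syncthreads();

    for (int i = threadIdx.x; i < 256; i += blockDim.x)    // flush to global
        atomicAdd(&bins[i], localBins[i]);
}
```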

More Active Thread Blocks Per SM

Maxwell increases the maximum active thread blocks per SM from 16 to 32. This should help improve occupancy of kernels running on small thread blocks (such as 64 threads per block).


Kepler

Fermi


The key architectural highlights of Fermi are:

  • Third Generation Streaming Multiprocessor (SM)
    ◦ 32 CUDA cores per SM, 4x over GT200
    ◦ 8x the peak double precision floating point performance over GT200
    ◦ Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
    ◦ 64 KB of RAM with a configurable partitioning of shared memory and L1 cache
  • Second Generation Parallel Thread Execution ISA
    ◦ Unified Address Space with Full C++ Support
    ◦ Optimized for OpenCL and DirectCompute
    ◦ Full IEEE 754-2008 32-bit and 64-bit precision
    ◦ Full 32-bit integer path with 64-bit extensions
    ◦ Memory access instructions to support transition to 64-bit addressing
    ◦ Improved Performance through Predication
  • Improved Memory Subsystem
    ◦ NVIDIA Parallel DataCache hierarchy with Configurable L1 and Unified L2 Caches
    ◦ First GPU with ECC memory support
    ◦ Greatly improved atomic memory operation performance
  • NVIDIA GigaThread Engine
    ◦ 10x faster application context switching
    ◦ Concurrent kernel execution
    ◦ Out-of-order thread block execution
    ◦ Dual overlapped memory transfer engines

Each thread within a thread block executes an instance of the kernel, and has a thread ID within its thread block, program counter, registers, per-thread private memory, inputs, and output results.

In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-Block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in Global Memory space after kernel-wide global synchronization.

CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp.
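
A small sketch of how that hierarchy looks in CUDA code (the kernel name and 256-thread block size are illustrative): per-thread values live in registers, the per-block tile lives in shared memory, and the block's result is written to global memory for the rest of the grid:

```cuda
// Illustrative kernel, launched with 256-thread blocks.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float partial[256];                    // per-block shared memory

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread register values
    partial[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                                  // block-wide barrier

    // Tree reduction within the thread block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = partial[0];        // global memory result
}
```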

Third Generation Streaming Multiprocessor

512 High Performance CUDA cores

  • Each SM features 32 CUDA processors.
  • Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU).
  • The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately.
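
A tiny numeric sketch of the difference (illustrative host code): `fmaf` rounds once, so it keeps a low-order bit that the separately rounded multiply-then-subtract loses:

```cuda
#include <cmath>
#include <cstdio>

// Illustrative host code: fused multiply-add rounds once, so a*a - 1 retains
// the low-order bit that the twice-rounded multiply-then-subtract loses.
int main()
{
    float a = 1.0f + ldexpf(1.0f, -12);   // a = 1 + 2^-12
    float fused    = fmaf(a, a, -1.0f);   // exact 2^-11 + 2^-24, one rounding
    float separate = a * a - 1.0f;        // a*a is rounded to 1 + 2^-11 first
    printf("fused    = %.10e\nseparate = %.10e\n", fused, separate);
    return 0;
}
```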

16 Load/Store Units

Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.

Four Special Function Units

Special Function Units (SFUs) execute transcendental instructions such as sin, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
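
In CUDA code, the SFUs are reached through the fast-math intrinsics such as `__sinf` and `__expf` (and via `--use_fast_math`), which trade some accuracy for throughput; a minimal sketch with an illustrative kernel:

```cuda
// Illustrative kernel: __sinf and __expf map to the SFU's hardware
// approximation instructions; the standard sinf/expf calls are slower but
// more accurate software routines.
__global__ void attenuate(float *out, const float *phase, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __expf(-phase[i]) * __sinf(phase[i]);
    }
}
```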

Designed for Double Precision

Up to 16 double precision fused multiply-add operations can be performed per SM, per clock.

Dual Warp Scheduler

Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, Fermi’s scheduler does not need to check for dependencies from within the instruction stream.

64 KB Configurable Shared Memory and L1 Cache

Second Generation Parallel Thread Execution ISA

The primary goals of PTX are:

  • Provide a stable ISA that spans multiple GPU generations
  • Achieve full GPU performance in compiled applications
  • Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets
  • Provide a code distribution ISA for application and middleware developers
  • Provide a common ISA for optimizing code generators and translators, which map PTX to specific target machines
  • Facilitate hand-coding of libraries and performance kernels
  • Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores

PTX 2.0 introduces several new features that greatly improve GPU programmability, accuracy, and performance: full IEEE 32-bit floating point precision, unified address space for all variables and pointers, 64-bit addressing, and new instructions for OpenCL and DirectCompute. Most importantly, PTX 2.0 was specifically designed to provide full support for the C++ programming language.

Unified Address Space enables Full C++ Support

Fermi and the PTX 2.0 ISA implement a unified address space that unifies the three separate address spaces (thread private local, block shared, and global) for load and store operations.
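
One consequence is that a single `__device__` function taking an ordinary pointer can be handed global, shared, or per-thread local data, with the hardware resolving the address space at run time; a sketch with illustrative names:

```cuda
// Illustrative sketch: one generic-pointer helper serves global, shared, and
// per-thread local data; the address space is resolved at run time.
__device__ float sumThree(const float *p)
{
    return p[0] + p[1] + p[2];
}

__global__ void unifiedDemo(const float *globalData, float *out)
{
    __shared__ float tile[3];
    float local[3] = {1.0f, 2.0f, 3.0f};

    if (threadIdx.x < 3)
        tile[threadIdx.x] = globalData[threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0)
        *out = sumThree(globalData)   // global memory
             + sumThree(tile)         // shared memory
             + sumThree(local);       // per-thread local memory
}
```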

Optimized for OpenCL and DirectCompute

IEEE 32-bit Floating Point Precision

Memory Subsystem Innovations

NVIDIA Parallel DataCache with Configurable L1 and Unified L2 Cache

Traditional GPU architectures support a read-only "load" path for texture operations and a write-only "export" path for pixel data output. However, this approach is poorly suited to executing general purpose C or C++ thread programs that expect reads and writes to be ordered (for example, a read following a write to the same address must observe the newly written value).

The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2 cache that services all operations (load, store and texture).

First GPU with ECC Memory Support

Naturally occurring radiation can cause a bit stored in memory to be altered, resulting in a soft error. ECC technology detects and corrects single-bit soft errors before they affect the system.

Fermi supports Single-Error Correct Double-Error Detect (SECDED) ECC codes that correct any single bit error in hardware as the data is accessed. In addition, SECDED ECC ensures that all double bit errors and many multi-bit errors are also detected and reported, so that the program can be re-run rather than being allowed to continue executing with bad data.

Fermi’s register files, shared memories, L1 caches, L2 cache, and DRAM memory are ECC protected.
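
Whether ECC is currently enabled on a device can be checked from the CUDA runtime API; a minimal sketch:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // ECCEnabled is 1 when the device currently has ECC turned on.
    printf("%s: ECC %s\n", prop.name, prop.ECCEnabled ? "enabled" : "disabled");
    return 0;
}
```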

Fermi supports industry standards for checking data during transmission from chip to chip.

Fast Atomic Memory Operations

GigaThread Thread Scheduler

At the chip level, a global work distribution engine schedules thread blocks to various SMs, while at the SM level, each warp scheduler distributes warps of 32 threads to its execution units.

10x Faster Application Context Switching

The Fermi pipeline is optimized to reduce the cost of an application context switch to below 25 microseconds.

Concurrent Kernel Execution

On the Fermi architecture, different kernels of the same CUDA context can execute concurrently, allowing maximum utilization of GPU resources. Kernels from different application contexts can still run sequentially with great efficiency thanks to the improved context switching performance.
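
Concurrency between kernels of the same context is expressed with CUDA streams; a sketch with illustrative kernels:

```cuda
#include <cuda_runtime.h>

// Illustrative kernels belonging to the same CUDA context.
__global__ void kernelA(float *x, int n) { /* ... */ }
__global__ void kernelB(float *y, int n) { /* ... */ }

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Kernels issued to different non-default streams may run concurrently
    // on Fermi and later, when resources allow.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernelA<<<256, 256, 0, s1>>>(x, n);
    kernelB<<<256, 256, 0, s2>>>(y, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```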


Tesla