CUDA Memory

Device Memory Spaces

Memory     Location (on/off chip)   Cached   Access   Scope                  Lifetime
Register   On                       n/a      R/W      1 thread               Thread
Local      Off                      Yes      R/W      1 thread               Thread
Shared     On                       n/a      R/W      All threads in block   Block
Global     Off                      Yes      R/W      All threads + host     Host allocation
Constant   Off                      Yes      R        All threads + host     Host allocation
Texture    Off                      Yes      R        All threads + host     Host allocation

(Local and global accesses are cached in L2; whether they are also cached in L1 depends on the architecture.)

Texture Memory


When to Use Texture Memory in CUDA

  • Use it when you update your data rarely but read it often, especially when the read access pattern has some kind of spatial locality
    • i.e. nearby threads access nearby locations in the texture
    • and especially when the precise read access pattern is difficult to predict
  • You also need to use texture memory in order to visualize your data through the graphics pipeline

When Not to Use Texture Memory in CUDA

  • Do not use texture memory when the input data is read only once after each update.

Texture Object


  • There is no need to know at compile time which textures will be used at run time, which enables much more dynamic execution and flexible programming
  • Texture objects only need to be instantiated once and are not subject to the hardware limit of 128 texture references, so there is no need to continuously bind and unbind them
  • Using texture objects eliminates the overhead of binding (up to 1 μs) and unbinding (up to 0.5 μs) textures
  • What is not commonly known is that each texture reference that is still bound when a kernel is launched adds launch latency (up to 0.5 μs per reference), even if the kernel never references it; texture objects remove this overhead as well
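
A minimal sketch of creating a texture object over a linear device buffer (buffer, N, grid, block, and mykernel are placeholders; error checking omitted):

// Describe the underlying resource: a plain device buffer of N floats
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = buffer;                      // allocated earlier with cudaMalloc
resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
resDesc.res.linear.sizeInBytes = N * sizeof(float);

// Describe how it is sampled
cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL); // no static texture reference to bind
mykernel<<<grid, block>>>(tex, N);                       // inside the kernel: tex1Dfetch<float>(tex, i)
cudaDestroyTextureObject(tex);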

Host Memory

Pinned memory

  • Host memory that is allocated as page-locked instead of ordinary pageable virtual memory (sketch below)
  • The OS guarantees that page-locked memory is never swapped out, so it always resides in physical memory
  • Also called page-locked memory
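
A minimal allocation sketch (size and d_ptr are placeholders; error checking omitted):

float *h_pinned = NULL;
cudaMallocHost((void**)&h_pinned, size);   // page-locked host allocation (cudaHostAlloc also works)
// Copies from pinned memory can use DMA directly and can be made truly
// asynchronous with cudaMemcpyAsync
cudaMemcpy(d_ptr, h_pinned, size, cudaMemcpyHostToDevice);
cudaFreeHost(h_pinned);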

Max pinned memory size

Mapped memory

  • Page-locked host memory can be mapped into the address space of the device by allocating it with the cudaHostAllocMapped flag (sketch below)

  • Not all hardware supports this feature

    • Use cudaGetDeviceProperties to check the canMapHostMemory property
  • Advantages:

    • No explicit memory copies are needed.
    • Data transfers implicitly overlap with kernel execution, since the kernel reads and writes host memory directly over the bus
  • Disadvantages:

    • The application must ensure the GPU and CPU do not write the same address simultaneously
    • Device atomic functions are not atomic with respect to concurrent CPU writes to the mapped memory
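
A minimal zero-copy sketch, assuming device 0 and placeholders mykernel, grid, block, N:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.canMapHostMemory) {
    cudaSetDeviceFlags(cudaDeviceMapHost);                    // must be set before the context is created
    float *h_mapped = NULL, *d_alias = NULL;
    cudaHostAlloc((void**)&h_mapped, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_alias, h_mapped, 0);  // device-side alias of the host buffer
    mykernel<<<grid, block>>>(d_alias, N);                    // kernel reads/writes host memory over the bus
    cudaDeviceSynchronize();                                  // finish before the CPU touches h_mapped again
    cudaFreeHost(h_mapped);
}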


Write-combining memory

  • Pinned memory is allocated as cacheable by default

  • Allocating it as write-combining memory instead frees up the host's L1 and L2 cache resources.

  • Advantages and disadvantages (sketch below)

    • Write-combining memory is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%
    • Reading write-combining memory from the host is slow, so it should in general only be used for memory that the host only writes to.
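
A minimal sketch for a host-to-device staging buffer that the CPU only writes (d_in, stream, fill_input, N are placeholders):

float *h_wc = NULL;
// Pinned + write-combined: fast for the CPU to write and for bus transfers to read,
// but very slow for the CPU to read back
cudaHostAlloc((void**)&h_wc, N * sizeof(float), cudaHostAllocWriteCombined);
fill_input(h_wc, N);                                     // host only writes to it
cudaMemcpyAsync(d_in, h_wc, N * sizeof(float), cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);                           // do not free while the copy is in flight
cudaFreeHost(h_wc);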

Unified Memory


Pascal GPUs such as the NVIDIA Titan X and the NVIDIA Tesla P100 are the first GPUs to include the Page Migration Engine, which is hardware support for Unified Memory page faulting and migration.

On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows that the pages are resident on that GPU.

On Pascal and later GPUs, managed memory may not be physically allocated when cudaMallocManaged() returns; it may only be populated on access (or prefetching). In other words, pages and page table entries may not be created until they are accessed by the GPU or the CPU. The pages can migrate to any processor’s memory at any time, and the driver employs heuristics to maintain data locality and prevent excessive page faults.

The kernel launches without any migration overhead, and when it accesses any absent pages, the GPU stalls execution of the accessing threads, and the Page Migration Engine migrates the pages to the device before resuming the threads.

Use Unified Memory prefetching to move the data to the GPU after initializing it: cudaMemPrefetchAsync()

Simultaneous access to managed memory from the CPU and GPUs of compute capability lower than 6.0 is not possible. This is because pre-Pascal GPUs lack hardware page faulting, so coherence can’t be guaranteed. On these GPUs, an access from the CPU while a kernel is running will cause a segmentation fault.

On Pascal and later GPUs, the CPU and the GPU can simultaneously access managed memory, since they can both handle page faults; however, it is up to the application developer to ensure there are no race conditions caused by simultaneous accesses.

Calling cudaDeviceSynchronize() after the kernel launch ensures that the kernel runs to completion before the CPU tries to read the results from the managed memory pointer. Otherwise, the CPU may read invalid data (on Pascal and later), or get a segmentation fault (on pre-Pascal GPUs).
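
A minimal sketch tying these pieces together (device 0 assumed; init_data, use_data, mykernel, grid, block, N are placeholders):

float *data = NULL;
cudaMallocManaged(&data, N * sizeof(float));             // one pointer, valid on CPU and GPU
init_data(data, N);                                      // CPU touches the pages first
cudaMemPrefetchAsync(data, N * sizeof(float), 0, 0);     // optional: migrate to GPU 0 up front (Pascal+)
mykernel<<<grid, block>>>(data, N);
cudaDeviceSynchronize();                                 // required before the CPU reads the results
use_data(data, N);
cudaFree(data);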


Memory page faulting support in GP100 is a crucial new feature that provides more seamless Unified Memory functionality. Combined with the system-wide virtual address space, page faulting provides several benefits. First, page faulting means that the CUDA system software doesn’t need to synchronize all managed memory allocations to the GPU before each kernel launch. If a kernel running on the GPU accesses a page that is not resident in its memory, it faults, allowing the page to be automatically migrated to the GPU memory on-demand. Alternatively, the page may be mapped into the GPU address space for access over the PCIe or NVLink interconnects (mapping on access can sometimes be faster than migration). Note that Unified Memory is system-wide: GPUs (and CPUs) can fault on and migrate memory pages either from CPU memory or from the memory of other GPUs in the system.

With the new page fault mechanism, global data coherency is guaranteed with Unified Memory. This means that with GP100, the CPUs and GPUs can access Unified Memory allocations simultaneously. This was illegal on Kepler and Maxwell GPUs, because coherence could not be guaranteed if the CPU accessed a Unified Memory allocation while a GPU kernel was active. Note, as with any parallel application, developers need to ensure correct synchronization to avoid data hazards between processors.




  • Issues related to “transfer/execution overlap”:
    • Pages of managed allocations that were touched by the CPU are migrated back to the GPU before any kernel launch
      • Consequence: there is no kernel execution/data transfer overlap in that stream
      • Overlap is still possible with Unified Memory, but just as before it requires multiple kernels in separate streams (see the sketch after this list)
        • Enabled by the fact that a managed allocation can be associated with a specific stream
        • This lets you control which allocations are synchronized on which kernel launches, enabling concurrency
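
A rough sketch of attaching managed allocations to streams with cudaStreamAttachMemAsync (kernelA, kernelB, bytesA, bytesB, grid, block are placeholders):

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
float *a = NULL, *b = NULL;
cudaMallocManaged(&a, bytesA);
cudaMallocManaged(&b, bytesB);
// Attach each allocation to one stream so the driver only has to synchronize
// it with work in that stream, not with every kernel launch
cudaStreamAttachMemAsync(s1, a, 0, cudaMemAttachSingle);
cudaStreamAttachMemAsync(s2, b, 0, cudaMemAttachSingle);
kernelA<<<grid, block, 0, s1>>>(a);
kernelB<<<grid, block, 0, s2>>>(b);
cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);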

With enough active warps, Unified Memory can overlap data transfers with kernel execution.


When Is This Helpful?

  • When it doesn’t matter how data moves to a processor
    1. Quick and dirty algorithm prototyping
    2. Iterative processes with lots of data reuse, where the migration cost can be amortized
    3. Simplified application debugging
  • When it’s difficult to isolate the working set
    1. Irregular or dynamic data structures, unpredictable access patterns
    2. Data partitioning between multiple processors

Memory Oversubscription

  • When you have a large dataset and not enough physical memory
  • Moving pieces by hand is error-prone and requires tuning for the memory size
  • It is better to run slowly than to fail with an out-of-memory error
  • You can actually get high performance with Unified Memory!

System-Wide Atomics with Exclusive Access

  • GPUs are very good at handling atomics from thousands of threads
  • It therefore makes sense to use atomics between GPUs, or between the CPU and the GPU (sketch below)
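
On compute capability 6.0+ the *_system variants of atomics are atomic across CPU and GPU; a rough sketch with a counter in managed memory:

__global__ void count_positive(int *counter, const float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd_system(counter, 1);      // atomic even with respect to concurrent CPU updates
}

// Host side, with counter obtained from cudaMallocManaged:
//   __sync_fetch_and_add(counter, 1);     // CPU can update the same counter concurrently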

The Unified Memory driver is doing intelligent things under the hood:

  • Prefetching: migrate pages proactively to reduce number of faults
  • Thrashing mitigation: heuristics to avoid frequent migration of shared pages
  • Eviction: deciding which pages to evict when room is needed for new ones

DRIVER PREFETCHING

  • GPU architecture supports different page sizes
  • Contiguous pages up to a larger page size are promoted to the larger size
  • Driver prefetches whole regions if pages are accessed densely

ANTI-THRASHING POLICY

  • Triggered when processors share the same page and frequently read or write to it; the driver then applies heuristics to avoid migrating that page back and forth

EVICTION ALGORITHM

  • Driver keeps a single list of physical chunks of GPU memory
  • Chunks from the front of the list are evicted first (LRU)
  • A chunk is considered “in use” when it is fully-populated or migrated
// Similar to move_pages() in Linux
cudaMemPrefetchAsync(ptr, size, processor, stream)
// Similar to madvise() in Linux
cudaMemAdvise(ptr, size, advice, processor)

Page Granularity Overhead

  • cudaMallocManaged alignment: 512B on Pascal/Volta, 4KB on Kepler/Maxwell
    • Too many small allocations will use up many pages
  • cudaMallocManaged memory is moved at system page granularity
    • For small allocations, more data could be moved than is necessary
  • Solution: use a caching allocator or memory pools (toy sketch below)
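
A toy sketch of the pooling idea: grab one large managed block up front and hand out aligned sub-allocations from it, so many small objects share pages (illustrative only, no free list, not thread-safe):

struct ManagedPool {
    char  *base;
    size_t capacity;
    size_t offset;
};

void pool_init(ManagedPool *p, size_t capacity) {
    cudaMallocManaged(&p->base, capacity);               // one big allocation instead of many tiny ones
    p->capacity = capacity;
    p->offset = 0;
}

void *pool_alloc(ManagedPool *p, size_t bytes) {
    size_t aligned = (bytes + 255) & ~(size_t)255;       // keep sub-allocations 256B-aligned
    if (p->offset + aligned > p->capacity) return NULL;
    void *ptr = p->base + p->offset;
    p->offset += aligned;
    return ptr;
}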

READ DUPLICATION cudaMemAdviseSetReadMostly

// Use when data is mostly read and occasionally written to

init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
// Read-only copy will be created on GPU page fault
mykernel<<<...>>>(data, N);
// CPU reads will not page fault
use_data(data, N);
// Prefetching creates read-duplicated copy of data and avoids page faults
// Note: writes are allowed but will generate page fault and remapping


init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
// Read-only copy will be created during prefetch
cudaMemPrefetchAsync(data, N, myGpuId, cudaStreamLegacy);
// GPU reads will not page fault
mykernel<<<...>>>(data, N);
// CPU reads will not page fault
use_data(data, N);

DIRECT MAPPING

  • cudaMemAdviseSetPreferredLocation
    • Set preferred location to avoid migrations
    • First access will page fault and establish mapping
  • cudaMemAdviseSetAccessedBy
    • Pre-map data to avoid page faults
    • First access will not page fault
    • Actual data location can be anywhere
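
A rough sketch combining the two hints (GPU 0 assumed; data, bytes, mykernel, grid, block are placeholders):

// Keep the physical pages in CPU memory so they are not migrated back and forth
cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
// Pre-map the pages into GPU 0's page tables so its first access does not fault
cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, 0);
mykernel<<<grid, block>>>(data);                         // GPU accesses the data remotely over the interconnect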




CUDA Array

memcpy

cudaMemcpy2D

cudaMemcpyDefault

Shared Memory Bank Conflict

32-Bit Strided Access

// One-dimensional shared array accessed with a stride of s 32-bit words
extern __shared__ float shared[];
float data = shared[BaseIndex + s * tid];
// Threads tid and tid+n hit the same bank whenever s*n is a multiple of the
// number of banks (32), so this access is conflict-free only when s is odd
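
For a 2D tile in shared memory, the usual workaround for the 32-word stride is to pad each row by one element; a minimal sketch:

#define TILE 32
// Inside a kernel:
__shared__ float tile[TILE][TILE + 1];   // the extra column changes the row stride from 32 to 33 words
// Column accesses such as tile[threadIdx.x][threadIdx.y] now map to 32 different banks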

smart pointer

cudaMemset

Memory fence
