CUDA Memory

Device Memory Spaces

Memory     Location (on/off chip)   Cached   Access   Scope                  Lifetime
Register   On                       n/a      R/W      1 thread               Thread
Local      Off                      Yes      R/W      1 thread               Thread
Shared     On                       n/a      R/W      All threads in block   Block
Global     Off                      Yes      R/W      All threads + host     Host allocation
Constant   Off                      Yes      R        All threads + host     Host allocation
Texture    Off                      Yes      R        All threads + host     Host allocation

(Local and global accesses are cached in L2; whether they are also cached in L1 depends on the architecture.)

Texture Memory


When to Use Texture Memory in CUDA

  • Use it when you update your data rarely but read it often, especially when the read access pattern has some kind of spatial locality
    • i.e. nearby threads access nearby locations in the texture
    • and especially when the precise read access pattern is difficult to predict
  • You also need to use texture memory in order to visualize your data through the graphics pipeline

When Not to Use Texture Memory in CUDA

  • Do not use texture memory when the input data is read only once after each update.

Texture Object


  • There is no need to know at compile time which textures will be used at run time, which enables much more dynamic execution and flexible programming
  • Texture objects only need to be instantiated once and are not subject to the hardware limit of 128 texture references, so there is no need to continuously bind and unbind them
  • Using texture objects eliminates the overhead of binding (up to 1 μs) and unbinding (up to 0.5 μs) textures
  • What is not commonly known is that each texture reference that is still bound when a kernel is launched adds launch latency (up to 0.5 μs per reference), even if the kernel never references it; texture objects remove this overhead as well
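
A minimal sketch of creating a texture object over a linear device buffer (buffer, N, grid, block, and mykernel are placeholders; error checking omitted):

// Describe the underlying resource: a plain device buffer of N floats
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = buffer;                      // allocated earlier with cudaMalloc
resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
resDesc.res.linear.sizeInBytes = N * sizeof(float);

// Describe how it is sampled
cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL); // no static texture reference to bind
mykernel<<<grid, block>>>(tex, N);                       // inside the kernel: tex1Dfetch<float>(tex, i)
cudaDestroyTextureObject(tex);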

Host Memory

Pinned memory

  • Host memory that is allocated as page-locked instead of ordinary pageable virtual memory (sketch below)
  • The OS guarantees that page-locked memory is never swapped out, so it always resides in physical memory
  • Also called page-locked memory
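
A minimal allocation sketch (size and d_ptr are placeholders; error checking omitted):

float *h_pinned = NULL;
cudaMallocHost((void**)&h_pinned, size);   // page-locked host allocation (cudaHostAlloc also works)
// Copies from pinned memory can use DMA directly and can be made truly
// asynchronous with cudaMemcpyAsync
cudaMemcpy(d_ptr, h_pinned, size, cudaMemcpyHostToDevice);
cudaFreeHost(h_pinned);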

Max pinned memory size

Mapped memory

  • Page-locked host memory can be mapped into the address space of the device by allocating it with the cudaHostAllocMapped flag (sketch below)

  • Not all hardware supports this feature

    • Use cudaGetDeviceProperties to check the canMapHostMemory property
  • Advantages:

    • No explicit memory copies are needed.
    • Data transfers implicitly overlap with kernel execution, since the kernel reads and writes host memory directly over the bus
  • Disadvantages:

    • The application must ensure the GPU and CPU do not write the same address simultaneously
    • Device atomic functions are not atomic with respect to concurrent CPU writes to the mapped memory
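
A minimal zero-copy sketch, assuming device 0 and placeholders mykernel, grid, block, N:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.canMapHostMemory) {
    cudaSetDeviceFlags(cudaDeviceMapHost);                    // must be set before the context is created
    float *h_mapped = NULL, *d_alias = NULL;
    cudaHostAlloc((void**)&h_mapped, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_alias, h_mapped, 0);  // device-side alias of the host buffer
    mykernel<<<grid, block>>>(d_alias, N);                    // kernel reads/writes host memory over the bus
    cudaDeviceSynchronize();                                  // finish before the CPU touches h_mapped again
    cudaFreeHost(h_mapped);
}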


Write-combining memory

  • Pinned memory is allocated as cacheable by default

  • Allocating it as write-combining memory instead frees up the host's L1 and L2 cache resources.

  • Advantages and disadvantages (sketch below)

    • Write-combining memory is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%
    • Reading write-combining memory from the host is slow, so it should in general only be used for memory that the host only writes to.
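
A minimal sketch for a host-to-device staging buffer that the CPU only writes (d_in, stream, fill_input, N are placeholders):

float *h_wc = NULL;
// Pinned + write-combined: fast for the CPU to write and for bus transfers to read,
// but very slow for the CPU to read back
cudaHostAlloc((void**)&h_wc, N * sizeof(float), cudaHostAllocWriteCombined);
fill_input(h_wc, N);                                     // host only writes to it
cudaMemcpyAsync(d_in, h_wc, N * sizeof(float), cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);                           // do not free while the copy is in flight
cudaFreeHost(h_wc);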

Unified Memory


Pascal GPUs such as the NVIDIA Titan X and the NVIDIA Tesla P100 are the first GPUs to include the Page Migration Engine, which is hardware support for Unified Memory page faulting and migration.

On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows that the pages are resident on that GPU.

On Pascal and later GPUs, managed memory may not be physically allocated when cudaMallocManaged() returns; it may only be populated on access (or prefetching). In other words, pages and page table entries may not be created until they are accessed by the GPU or the CPU. The pages can migrate to any processor’s memory at any time, and the driver employs heuristics to maintain data locality and prevent excessive page faults.

The kernel launches without any migration overhead, and when it accesses any absent pages, the GPU stalls execution of the accessing threads, and the Page Migration Engine migrates the pages to the device before resuming the threads.

Use Unified Memory prefetching to move the data to the GPU after initializing it: cudaMemPrefetchAsync()

Simultaneous access to managed memory from the CPU and GPUs of compute capability lower than 6.0 is not possible. This is because pre-Pascal GPUs lack hardware page faulting, so coherence can’t be guaranteed. On these GPUs, an access from the CPU while a kernel is running will cause a segmentation fault.

On Pascal and later GPUs, the CPU and the GPU can simultaneously access managed memory, since they can both handle page faults; however, it is up to the application developer to ensure there are no race conditions caused by simultaneous accesses.

Calling cudaDeviceSynchronize() after the kernel launch ensures that the kernel runs to completion before the CPU tries to read the results from the managed memory pointer. Otherwise, the CPU may read invalid data (on Pascal and later), or get a segmentation fault (on pre-Pascal GPUs).
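
A minimal sketch tying these pieces together (device 0 assumed; init_data, use_data, mykernel, grid, block, N are placeholders):

float *data = NULL;
cudaMallocManaged(&data, N * sizeof(float));             // one pointer, valid on CPU and GPU
init_data(data, N);                                      // CPU touches the pages first
cudaMemPrefetchAsync(data, N * sizeof(float), 0, 0);     // optional: migrate to GPU 0 up front (Pascal+)
mykernel<<<grid, block>>>(data, N);
cudaDeviceSynchronize();                                 // required before the CPU reads the results
use_data(data, N);
cudaFree(data);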


Memory page faulting support in GP100 is a crucial new feature that provides more seamless Unified Memory functionality. Combined with the system-wide virtual address space, page faulting provides several benefits. First, page faulting means that the CUDA system software doesn’t need to synchronize all managed memory allocations to the GPU before each kernel launch. If a kernel running on the GPU accesses a page that is not resident in its memory, it faults, allowing the page to be automatically migrated to the GPU memory on-demand. Alternatively, the page may be mapped into the GPU address space for access over the PCIe or NVLink interconnects (mapping on access can sometimes be faster than migration). Note that Unified Memory is system-wide: GPUs (and CPUs) can fault on and migrate memory pages either from CPU memory or from the memory of other GPUs in the system.

With the new page fault mechanism, global data coherency is guaranteed with Unified Memory. This means that with GP100, the CPUs and GPUs can access Unified Memory allocations simultaneously. This was illegal on Kepler and Maxwell GPUs, because coherence could not be guaranteed if the CPU accessed a Unified Memory allocation while a GPU kernel was active. Note, as with any parallel application, developers need to ensure correct synchronization to avoid data hazards between processors.




  • Issues related to “transfer/execution overlap”:
    • Pages of managed allocations that were touched by the CPU are migrated back to the GPU before any kernel launch
      • Consequence: there is no kernel execution/data transfer overlap in that stream
      • Overlap is still possible with Unified Memory, but just as before it requires multiple kernels in separate streams (see the sketch after this list)
        • Enabled by the fact that a managed allocation can be associated with a specific stream
        • This lets you control which allocations are synchronized on which kernel launches, enabling concurrency
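
A rough sketch of attaching managed allocations to streams with cudaStreamAttachMemAsync (kernelA, kernelB, bytesA, bytesB, grid, block are placeholders):

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
float *a = NULL, *b = NULL;
cudaMallocManaged(&a, bytesA);
cudaMallocManaged(&b, bytesB);
// Attach each allocation to one stream so the driver only has to synchronize
// it with work in that stream, not with every kernel launch
cudaStreamAttachMemAsync(s1, a, 0, cudaMemAttachSingle);
cudaStreamAttachMemAsync(s2, b, 0, cudaMemAttachSingle);
kernelA<<<grid, block, 0, s1>>>(a);
kernelB<<<grid, block, 0, s2>>>(b);
cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);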

With enough active warps, Unified Memory can overlap data transfers with kernel execution.


When Is This Helpful?

  • When it doesn’t matter how data moves to a processor
    1. Quick and dirty algorithm prototyping
    2. Iterative processes with lots of data reuse, where the migration cost can be amortized
    3. Simplified application debugging
  • When it’s difficult to isolate the working set
    1. Irregular or dynamic data structures, unpredictable access patterns
    2. Data partitioning between multiple processors

Memory Oversubscription

  • When you have a large dataset and not enough physical memory
  • Moving pieces by hand is error-prone and requires tuning for the memory size
  • It is better to run slowly than to fail with an out-of-memory error
  • You can actually get high performance with Unified Memory!

System-Wide Atomics with Exclusive Access

  • GPUs are very good at handling atomics from thousands of threads
  • It therefore makes sense to use atomics between GPUs, or between the CPU and the GPU (sketch below)
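
On compute capability 6.0+ the *_system variants of atomics are atomic across CPU and GPU; a rough sketch with a counter in managed memory:

__global__ void count_positive(int *counter, const float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd_system(counter, 1);      // atomic even with respect to concurrent CPU updates
}

// Host side, with counter obtained from cudaMallocManaged:
//   __sync_fetch_and_add(counter, 1);     // CPU can update the same counter concurrently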

The Unified Memory driver is doing intelligent things under the hood:

  • Prefetching: migrate pages proactively to reduce number of faults
  • Thrashing mitigation: heuristics to avoid frequent migration of shared pages
  • Eviction: deciding which pages to evict when room is needed for new ones

DRIVER PREFETCHING

  • GPU architecture supports different page sizes
  • Contiguous pages up to a larger page size are promoted to the larger size
  • Driver prefetches whole regions if pages are accessed densely

ANTI-THRASHING POLICY

  • Triggered when processors share the same page and frequently read or write to it; the driver then applies heuristics to avoid migrating that page back and forth

EVICTION ALGORITHM

  • Driver keeps a single list of physical chunks of GPU memory
  • Chunks from the front of the list are evicted first (LRU)
  • A chunk is considered “in use” when it is fully-populated or migrated
// Similar to move_pages() in Linux
cudaMemPrefetchAsync(ptr, size, processor, stream)
// Similar to madvise() in Linux
cudaMemAdvise(ptr, size, advice, processor)

Page Granularity Overhead

  • cudaMallocManaged alignment: 512B on Pascal/Volta, 4KB on Kepler/Maxwell
    • Too many small allocations will use up many pages
  • cudaMallocManaged memory is moved at system page granularity
    • For small allocations, more data could be moved than is necessary
  • Solution: use a caching allocator or memory pools (toy sketch below)
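
A toy sketch of the pooling idea: grab one large managed block up front and hand out aligned sub-allocations from it, so many small objects share pages (illustrative only, no free list, not thread-safe):

struct ManagedPool {
    char  *base;
    size_t capacity;
    size_t offset;
};

void pool_init(ManagedPool *p, size_t capacity) {
    cudaMallocManaged(&p->base, capacity);               // one big allocation instead of many tiny ones
    p->capacity = capacity;
    p->offset = 0;
}

void *pool_alloc(ManagedPool *p, size_t bytes) {
    size_t aligned = (bytes + 255) & ~(size_t)255;       // keep sub-allocations 256B-aligned
    if (p->offset + aligned > p->capacity) return NULL;
    void *ptr = p->base + p->offset;
    p->offset += aligned;
    return ptr;
}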

READ DUPLICATION cudaMemAdviseSetReadMostly

// Use when data is mostly read and occasionally written to

init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
// Read-only copy will be created on GPU page fault
mykernel<<<...>>>(data, N);
// CPU reads will not page fault
use_data(data, N);
// Prefetching creates read-duplicated copy of data and avoids page faults
// Note: writes are allowed but will generate page fault and remapping


init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
// Read-only copy will be created during prefetch
cudaMemPrefetchAsync(data, N, myGpuId, cudaStreamLegacy);
// GPU reads will not page fault
mykernel<<<...>>>(data, N);
// CPU reads will not page fault
use_data(data, N);

DIRECT MAPPING

  • cudaMemAdviseSetPreferredLocation
    • Set preferred location to avoid migrations
    • First access will page fault and establish mapping
  • cudaMemAdviseSetAccessedBy
    • Pre-map data to avoid page faults
    • First access will not page fault
    • Actual data location can be anywhere
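
A rough sketch combining the two hints (GPU 0 assumed; data, bytes, mykernel, grid, block are placeholders):

// Keep the physical pages in CPU memory so they are not migrated back and forth
cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
// Pre-map the pages into GPU 0's page tables so its first access does not fault
cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, 0);
mykernel<<<grid, block>>>(data);                         // GPU accesses the data remotely over the interconnect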




CUDA Array

memcpy

cudaMemcpy2D

cudaMemcpyDefault

Shared Memory Bank Conflict

32-Bit Strided Access

// One-dimensional shared array accessed with a stride of s 32-bit words
extern __shared__ float shared[];
float data = shared[BaseIndex + s * tid];
// Threads tid and tid+n hit the same bank whenever s*n is a multiple of the
// number of banks (32), so this access is conflict-free only when s is odd
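
For a 2D tile in shared memory, the usual workaround for the 32-word stride is to pad each row by one element; a minimal sketch:

#define TILE 32
// Inside a kernel:
__shared__ float tile[TILE][TILE + 1];   // the extra column changes the row stride from 32 to 33 words
// Column accesses such as tile[threadIdx.x][threadIdx.y] now map to 32 different banks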

smart pointer

cudaMemset

Memory fence
