Chapter 04 Global memory

1. CUDA memory model

1.1 Memory hierarchy

  1. principle of locality
  • Temporal locality: this assumes that if a data location is referenced, then it is more likely to be referenced again within a short time period and less likely to be referenced as more and more time passes.
  • Spatial locality: this assumes that if a memory location is referenced, nearby locations are likely to be referenced as well.
  2. memory hierarchy
  • A memory hierarchy consists of multiple levels of memory with different latencies, bandwidths, and capacities. In general, as processor-to-memory latency increases, the capacity of that memory increases.
  • This memory hierarchy is only useful because of the principle of locality.
  • A typical hierarchy
  • The types of storage at the bottom of the figure are generally characterized by:
    • Lower cost per bit
    • Higher capacity
    • Higher latency
    • Less frequently accessed by the processor
  • Main memory for both CPUs and GPUs is implemented using DRAM (Dynamic Random Access Memory) while lower-latency memory (such as CPU L1 cache) is implemented using SRAM (Static Random Access Memory). The largest and slowest level in the memory hierarchy is generally implemented using a magnetic disk or flash drive.

1.2 CUDA memory model

The CUDA memory model exposes several types of programmable memory: registers, shared memory, local memory, constant memory, texture memory, and global memory.

  • Shared memory: owned by a block; shared by all threads within the block
  • Local memory & register: owned by a thread
  • Global memory: accessible to all threads
  • Constant and texture memory: accessible to all threads; read-only
  • speed: registers > shared memory > local memory ~ global memory


Registers

  • Registers are the fastest memory space on a GPU
  • What is stored in registers?
    • An automatic variable declared in a kernel without any other type qualifiers is generally stored in a register.
    • Arrays declared in a kernel may also be stored in registers, but only if the indices used to reference the array are constant and can be determined at compile time (a sketch follows at the end of this section).
  • Lifespan: the same as the corresponding kernel
  • What's the influence on performance?
    • Using fewer registers in your kernels may allow more thread blocks to reside on an SM, which increases occupancy.
    • If a kernel uses more registers than the hardware limit, the excess registers will spill over to local memory, which can have adverse performance consequences.
  • How to query info / control the resources?
    • The following compiler option will print the number of registers, bytes of shared memory, and bytes of constant memory used by each thread.
    nvcc ..... -Xptxas -v,-abi=no
    
    • providing additional information for each kernel to the compiler in the form of launch bounds
      __global__ void
      __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) kernel(...) {
          // your kernel body 
      }
    
    • specifying maximum number of registers used by all kernels in a compilation unit using the maxrregcount compiler option; e.g.
    nvcc .... -maxrregcount=32
    
    Note that the restriction set by maxrregcount is ignored for any kernels that have launch bounds specified.
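
As a rough illustration of the array-placement rule above (whether the first kernel really keeps its array in registers is ultimately up to the compiler and can be checked with -Xptxas -v; the kernel names are made up):

// Indices are compile-time constants after full unrolling -> the array can live in registers.
__global__ void likelyInRegisters(float *out) {
    float buf[4];
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        buf[i] = i * 2.0f;
    }
    out[threadIdx.x] = buf[0] + buf[3];
}

// The index depends on a runtime value -> the array is typically placed in local memory.
__global__ void likelySpilledToLocal(float *out, int k) {
    float buf[4];
    for (int i = 0; i < 4; i++) {
        buf[i] = i * 2.0f;
    }
    out[threadIdx.x] = buf[k & 3];
}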

Local memory

  • Hardware view:
    • Values spilled to local memory reside in the same physical location as global memory.
    • For GPUs with compute capability 2.0 and higher, local memory data is also cached in a per-SM L1 and per-device L2 cache.
  • What are stored in local memory? (in general: variables that are eligible for registers but cannot fit into the register space) Variables that the compiler is likely to place in local memory are:
    • Local arrays referenced with indices whose values cannot be determined at compile time.
    • Large local structures or arrays that would consume too much register space.
    • Any variable that does not fit within the kernel register limit.

Shared memory

  • Hardware view: Shared memory is on-chip (so it is faster than local/global memory); on Fermi and Kepler GPUs the L1 cache and shared memory of an SM share the same 64 KB of on-chip memory.
  • What are stored in shared memory? Variables decorated with __shared__
  • Lifespan: same as the corresponding block
  • What's the influence on performance?
    • Each SM has a limited amount of shared memory that is partitioned among thread blocks. Over-utilizing shared memory will limit the number of active warps.
    • Inter-thread communication requires synchronization, but __syncthreads may affect performance by forcing the SM to idle frequently.
  • When to use? Inter-thread communication (a minimal sketch follows this list). Note: access to shared memory must be synchronized using __syncthreads()
  • How to query info / control the resources? The -Xptxas -v compiler option shown above also reports the static shared memory used per kernel; the per-block limit can be queried at runtime via cudaDeviceProp::sharedMemPerBlock
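
A minimal sketch of the shared-memory pattern above, assuming a one-dimensional block of BDIM threads (the kernel name is illustrative): each block stages data in shared memory, synchronizes, and then each thread reads a value written by a different thread of the same block.

#define BDIM 128

// Each block reverses its own tile of the input; launch with blockDim.x == BDIM.
__global__ void reverseWithinBlock(float *out, const float *in) {
    __shared__ float tile[BDIM];                   // one tile per block, visible to all its threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[idx];                   // stage data in shared memory
    __syncthreads();                               // make all writes visible to the whole block

    out[idx] = tile[blockDim.x - 1 - threadIdx.x]; // read a value written by another thread
}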

Constant memory

  • Hardware view: A limited amount of constant memory can be declared — 64 KB for all compute capabilities
  • What are stored in constant memory? Variables decorated with __constant__
    • Constant variables must be declared with global scope, outside of any kernels.
    • Constant memory must therefore be initialized by the host using:
    cudaError_t cudaMemcpyToSymbol(const void* symbol, const void* src, size_t count);
    
  • Lifespan: the whole application
  • When to use?
    • USE when all threads in a warp read from the same memory address; example: a coefficient for a mathematical formula (see the sketch below)
    • DO NOT USE if each thread in a warp reads from a different address and only reads once. Reason: a single read from constant memory broadcasts to all threads in a warp.
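
A minimal sketch of the coefficient use case above (the polynomial, kernel name, and coefficient values are made up for illustration):

// Coefficients of a quadratic, read by every thread of every warp.
__constant__ float coef[3];   // declared at global scope; 12 bytes of the 64 KB budget

__global__ void evalPoly(float *y, const float *x, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // All threads in a warp read the same constant addresses -> one broadcast per value.
        y[i] = coef[0] * x[i] * x[i] + coef[1] * x[i] + coef[2];
    }
}

// Host side: constant memory is initialized with cudaMemcpyToSymbol, e.g.
// float h_coef[3] = {1.0f, -2.0f, 0.5f};
// cudaMemcpyToSymbol(coef, h_coef, 3 * sizeof(float));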

Texture Memory

  • Hardware view: texture memory is a type of global memory that is accessed through a dedicated read-only cache. The read-only cache includes support for hardware filtering, which can perform floating-point interpolation as part of the read process.
  • Lifespan: the whole application
  • When to use? Texture memory is optimized for 2D spatial locality, so threads in a warp that use texture memory to access 2D data will achieve the best performance.
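
A rough sketch of routing reads through the texture path with the texture-object API (the kernel and buffer names are illustrative; the hardware filtering and 2D-locality features are not shown here):

// The kernel reads the input through a texture object.
__global__ void copyViaTexture(float *out, cudaTextureObject_t tex, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);
}

// Host side: wrap an existing linear device buffer d_in (n floats) in a texture object.
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = d_in;
resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
resDesc.res.linear.sizeInBytes = n * sizeof(float);

cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
// copyViaTexture<<<grid, block>>>(d_out, tex, n);
cudaDestroyTextureObject(tex);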

Global Memory

  • Largest, highest-latency
  • What are stored in global memory?
    • dynamically allocated: using cudaMalloc() (we've done this often in Chapters 2 & 3)
    • statically declared in device code using the qualifier __device__
  • Lifespan: the whole application
  • What's the influence on performance? Global memory accesses are issued per warp, so alignment and coalescing largely determine the achieved bandwidth; see section 3.

GPU caches

There are four types of cache on a GPU: L1, L2, read-only constant, and read-only texture. Unlike CPU caches, GPU caches are non-programmable: you cannot control which data is cached or when it is evicted.

1.3 Case study - global variable

__device__ float devData;

__global__ void checkGlobalVariable() {
    printf("Device: the value of the global variable is %f\n", devData);
    devData += 2.0f;
    __syncthreads();
} 

int main(int argc, char **argv) {
    // initialize the global variable
    float value = 3.14f;
    cudaMemcpyToSymbol(devData, &value, sizeof(float));
    printf("Host:   copied %f to the global variable\n", value);

    // invoke kernel
    checkGlobalVariable<<<1, 64>>>();
    
    // copy back to host
    cudaMemcpyFromSymbol(&value, devData, sizeof(float));
    printf("Host:   the value changed by the kernel to %f\n", value);

    cudaDeviceReset();
    return 0;
}

Note: The host code cannot directly access a device variable.

  • In cudaMemcpyToSymbol(devData, &value, sizeof(float)); devData is passed as a symbol, not as the address of the variable.
  • This code is wrong: cudaMemcpy(&devData, &value, sizeof(float), cudaMemcpyHostToDevice); We cannot use the address-of operator & on a device variable from the host, because it is simply a symbol representing a physical location on the GPU.
  • However, we can acquire the address of a global variable explicitly, then apply cudaMemcpy:
float *dptr = NULL;
cudaGetSymbolAddress((void**)&dptr, devData);
cudaMemcpy(dptr, &value, sizeof(float), cudaMemcpyHostToDevice);

There is a single exception to being able to directly reference GPU memory from the host: CUDA pinned memory. Both host code and device code can access pinned memory directly by simply dereferencing a pointer. You will learn about pinned memory in the next section.

2. Memory management

2.1 Memory allocation and deallocation

cudaError_t cudaMalloc(void **devPtr, size_t count);

allocates count bytes of global memory on the device and returns the location of that memory in pointer devPtr.

cudaError_t cudaMemset(void *devPtr, int value, size_t count);

fills each of the count bytes starting at the device memory address devPtr with the value stored in the variable value.

cudaError_t cudaFree(void *devPtr);

frees the global memory pointed to by devPtr.

Device memory allocation and deallocation are expensive operations, so device memory should be reused by applications whenever possible to minimize the impact on overall performance.
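
A minimal sketch of the allocation API above (the buffer size is arbitrary):

float *d_buf = NULL;
size_t nBytes = 1 << 20;                // 1 MB, arbitrary

cudaMalloc((void **)&d_buf, nBytes);    // allocate global memory on the device
cudaMemset(d_buf, 0, nBytes);           // zero every byte of the allocation
// ... launch kernels that use d_buf ...
cudaFree(d_buf);                        // release it once no kernel needs it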

2.2 Memory transfer

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);

copies count bytes from the memory location src to the memory location dst.

Data transfers between the host and device can throttle overall application performance if not managed properly.
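
A minimal host-to-device-and-back round trip combining the APIs of 2.1 and 2.2 (sizes and names are arbitrary):

int nElem = 1 << 20;
size_t nBytes = nElem * sizeof(float);

float *h_data = (float *)malloc(nBytes);
float *d_data = NULL;
cudaMalloc((void **)&d_data, nBytes);
// ... fill h_data ...

// Host -> device; cudaMemcpy blocks until the copy has completed.
cudaMemcpy(d_data, h_data, nBytes, cudaMemcpyHostToDevice);
// ... launch kernels that operate on d_data ...
// Device -> host.
cudaMemcpy(h_data, d_data, nBytes, cudaMemcpyDeviceToHost);

cudaFree(d_data);
free(h_data);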

2.3 Pinned memory

Allocated host memory is pageable by default, but the GPU cannot safely access data in pageable host memory. When transferring data from pageable host memory to device memory, the CUDA driver first allocates temporary page-locked (pinned) host memory, copies the source host data into it, and then transfers the data from pinned memory to device memory.

The CUDA runtime allows you to directly allocate pinned host memory using:

cudaError_t cudaMallocHost(void **devPtr, size_t count);

and free allocated pinned memory using:

cudaError_t cudaFreeHost(void *ptr);

This function allocates count bytes of host memory that is page-locked and accessible to the device. Since the pinned memory can be accessed directly by the device, it can be read and written with much higher bandwidth than pageable memory. However, allocating excessive amounts of pinned memory might degrade host system performance, since it reduces the amount of pageable memory available to the host system for storing virtual memory data.
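
A minimal sketch that swaps the pageable host buffer of the previous example for pinned memory; only the host allocation and deallocation change:

float *h_pinned = NULL;
cudaMallocHost((void **)&h_pinned, nBytes);  // page-locked host memory
// ... fill h_pinned ...

// Transfers from/to pinned memory avoid the intermediate staging copy.
cudaMemcpy(d_data, h_pinned, nBytes, cudaMemcpyHostToDevice);
// ...
cudaFreeHost(h_pinned);                      // must be freed with cudaFreeHost, not free()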

2.4 Zero-copy memory

  • Zero-copy memory is pinned (non-pageable) memory that is mapped into the device address space. It can be accessed by both host and device.
  • Advantages of using zero-copy memory:
    • Leveraging host memory when there is insufficient device memory.
    • Avoiding explicit data transfer between the host and device.
    • Improving PCIe transfer rates.
  • When using zero-copy memory to share data between the host and device, you must synchronize memory accesses across the host and device.
  • Zero-copy memory can be allocated using cudaHostAlloc (with cudaHostAllocMapped flag)
cudaError_t cudaHostAlloc(void **pHost, size_t count, unsigned int flags);

and must be freed using

cudaFreeHost();
  • The flags parameter of cudaHostAlloc enables further configuration of special properties of the allocated memory:

    • cudaHostAllocDefault: making the behavior of cudaHostAlloc identical to cudaMallocHost.
    • cudaHostAllocPortable: setting pinned memory that can be used by all CUDA contexts, not just the one that performed the allocation.
    • cudaHostAllocWriteCombined: setting write-combined memory, which can be transferred across the PCI Express bus more quickly on some system configurations but cannot be read efficiently by most hosts. Write-combined memory is a good option for buffers that will be written by the host and read by the device using either mapped pinned memory or host-to-device transfers.
    • cudaHostAllocMapped: this is the most relevant flag to zero-copy memory; setting host memory that is mapped into the device address space.
  • obtaining the device pointer for mapped pinned memory:

cudaError_t cudaHostGetDevicePointer(void **pDevice, void *pHost, unsigned int flags);

returns a device pointer in pDevice that can be referenced on the device to access mapped, pinned host memory. The flag parameter is reserved for future use. For now, it must be set to zero.

Warning: Using zero-copy memory as a supplement to device memory with frequent read/write operations will significantly slow performance. Because every memory transaction to mapped memory must pass over the PCIe bus, a significant amount of latency is added even when compared to global memory.

2.5 Unified virtual addressing (UVA)

Eligibility: compute capability >= 2.0; CUDA >= 4.0; 64-bit Linux systems

With UVA, there is no need to acquire the device pointer or manage two pointers to what is physically the same data. For example, the following code

// allocate zerocpy memory
cudaHostAlloc((void **)&h_A, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void **)&h_B, nBytes, cudaHostAllocMapped);

// initialize data at host side
initialData(h_A, nElem);
initialData(h_B, nElem);

// pass the pointer to device
CHECK(cudaHostGetDevicePointer((void **)&d_A, (void *)h_A, 0));
CHECK(cudaHostGetDevicePointer((void **)&d_B, (void *)h_B, 0));

// execute kernel with zero copy memory
sumArraysZeroCopy<<<grid, block>>>(d_A, d_B, d_C, nElem);

can be simplified as

// allocate zero-copy memory
cudaHostAlloc((void **)&h_A, nBytes, cudaHostAllocMapped); 
cudaHostAlloc((void **)&h_B, nBytes, cudaHostAllocMapped);

// initialize data at the host side 
initialData(h_A, nElem); 
initialData(h_B, nElem);

// invoke the kernel with zero-copy memory  
// no need to pass the pointer to device
sumArraysZeroCopy<<<grid, block>>>(h_A, h_B, d_C, nElem);

2.6 Unified memory

Eligibility: CUDA >= 6.0

Unified Memory creates a pool of managed memory, where each allocation from this memory pool is accessible on both the CPU and GPU with the same memory address (that is, pointer). The underlying system automatically migrates data in the unified memory space between the host and device.

  • Difference from UVA: Unified memory depends on UVA support, but they are entirely different technologies. UVA provides a single virtual memory address space for all processors in the system, i.e. the same pointer refers to the same physical memory. However, UVA does not automatically migrate data between different physical locations.
  • Difference from zero-copy memory: zero-copy memory is allocated in host memory; unified memory decouples memory and execution spaces

Managed memory refers to Unified Memory allocations that are automatically managed by the underlying system; un-managed memory is explicitly allocated and transferred by the application (for example, using cudaMalloc and cudaMemcpy). Both types of memory can be used together in a kernel. The primary difference is that managed memory can also be accessed on the host side.

Allocating unified memory:

  • statically: declare by __managed__
  • dynamically: using
cudaError_t cudaMallocManaged(void **devPtr, size_t size, unsigned int flags=0);

In CUDA 6.0, device code cannot call cudaMallocManaged. All managed memory must be dynamically allocated from the host or statically declared in global scope.
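
A minimal sketch of the static form (the variable name and kernel are made up for illustration):

// Visible to both host and device code; migrated automatically by the driver.
__device__ __managed__ int counter = 0;

__global__ void incrementCounter() {
    atomicAdd(&counter, 1);
}

// Host code can read or write counter directly, after synchronizing with the device:
// incrementCounter<<<1, 32>>>();
// cudaDeviceSynchronize();
// printf("counter = %d\n", counter);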

An example of dynamic allocation:

// allocate unified memory
float *A, *B, *C;
cudaMallocManaged((float**)&A, nBytes);
cudaMallocManaged((float**)&B, nBytes);
cudaMallocManaged((float**)&C, nBytes);

// initialize data at host side
initialData(A, nElem);
initialData(B, nElem);
memset(C, 0, nBytes);

// launch CUDA kernel
int threadPerBlock = 1024;
dim3 block (threadPerBlock);
dim3 grid((nElem + block.x - 1) / block.x);
sumArraysOnDevice<<<grid, block>>>(A, B, C, nElem);
cudaDeviceSynchronize(); // synchronization is necessary when using unified memory!

// free memory
cudaFree(A);
cudaFree(B);
cudaFree(C);

Note

  • Because the kernel launch is asynchronous with respect to the host and there is no longer a blocking call to cudaMemcpy, you need to explicitly synchronize on the host side before directly accessing the output of the kernel.
  • When testing on a system with more than one GPU devices, you should set the CUDA_VISIBLE_DEVICES environment variable by export CUDA_VISIBLE_DEVICES=0. Since managed memory allocations are visible to all devices in a system, you will want to constrain which device is visible to your application so that managed memory is allocated on just one device.

I observed bad performance when using unified memory on GTX 1080 Ti.
Refer to: https://devtalk.nvidia.com/default/topic/1050541/cuda-programming-and-performance/bad-performance-when-using-unified-memory/

Why is this, and how can it be solved?


3. Memory access patterns

Similar to the CUDA execution model, memory operations are also issued per warp: each thread in a warp provides a memory address it is loading or storing, and the 32 threads in a warp together present a single memory access request, which is serviced by one or more device memory transactions.

Depending on the distribution of memory addresses within a warp, memory accesses can be classified into different patterns:

  • The access pattern for memory loads can be characterized by the following combinations:
    • Cached versus uncached: The load is cached if L1 cache is enabled
    • Aligned versus misaligned: The load is aligned if the first address of a memory access is an even multiple of the cache granularity (either 32 bytes for the L2 cache or 128 bytes for the L1 cache).
    • Coalesced versus uncoalesced: The load is coalesced if a warp accesses a contiguous chunk of data
  • Memory store operations are relatively simple. Store operations are only cached in the L2 cache before being sent to device memory.

There are two goals to strive for when optimizing device memory bandwidth utilization:

  • Aligned and coalesced memory accesses that reduce wasted bandwidth (3.1-3.4)
  • Sufficient concurrent memory operations to hide memory latency (3.5)

Context:
On Pascal the data access unit is 32B regardless of whether global loads are cached in L1. So it is no longer necessary to turn off L1 caching in order to reduce wasted global memory transactions associated with uncoalesced accesses.


3.1 Aligned and coalesced access

All accesses to global memory go through the L2 cache. Many accesses also pass through the L1 cache, depending on the type of access and your GPU's architecture. On architectures that allow the L1 cache to be used for global memory caching, the L1 cache can be explicitly enabled or disabled using compile flag:

enable:

-Xptxas -dlcm=ca

disable:

-Xptxas -dlcm=cg
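
For example, a hypothetical full compile line (the optimization level and architecture flag are arbitrary):

nvcc -O3 -arch=sm_35 -Xptxas -dlcm=cg -o app app.cu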

If both L1 and L2 caches are used, a memory access is serviced by a 128-byte memory transaction. If only the L2 cache is used, a memory access is serviced by a 32-byte memory transaction.

There are two characteristics of device memory accesses that you should strive for when optimizing your application

  • Aligned memory accesses: These occur when the first address of a device memory transaction is an even multiple of the cache granularity being used to service the transaction (either 32 bytes for L2 cache or 128 bytes for L1 cache). Performing a misaligned load wastes bandwidth.
  • Coalesced memory accesses: These occur when all 32 threads in a warp access a contiguous chunk of memory.

To maximize global memory throughput, it is important to organize memory operations to be both aligned and coalesced. Because access patterns are often determined by the algorithm an application implements, it can be challenging to coalesce memory loads for some applications. However, there are techniques that can help to align application memory accesses in most cases.

Example:

aligned and coalesced memory access: only a single 128-byte memory transaction is required to read the data from device memory

misaligned and uncoalesced memory access: there may be as many as three 128-byte memory transactions to read the data from device memory
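
The misaligned case can be reproduced with an offset kernel similar in spirit to the book's readOffset example (the kernel name and offset parameter below are illustrative); launching it with an offset that is not a multiple of the transaction granularity forces misaligned loads:

// Reads start at element 'offset'; for offsets that are not a multiple of 32 elements,
// each warp's 128-byte request straddles an extra cache line / segment.
__global__ void sumArraysOffset(float *A, float *B, float *C, const int n, const int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int k = i + offset;
    if (k < n) C[i] = A[k] + B[k];
}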

3.2 Global memory reads

Cached loads (L1 cache enabled)

  • aligned and coalesced: a single 128-byte transaction is required; bus utilization is 100 percent; no unused data in this transaction.

  • aligned and coalesced, but the referenced addresses are not consecutive by thread ID, rather randomized within a 128-byte range; such randomized accesses will not inhibit kernel performance.

  • misaligned and coalesced: two 128-byte transactions are required; bus utilization is 50 percent; half the bytes loaded in these two transactions are unused

  • all threads in the warp request the same address: bus utilization is very low; if the value loaded is 4 bytes, the bus utilization is 4 bytes requested / 128 bytes loaded = 3.125%.

  • (worst case) misaligned and uncoalesced: the threads in a warp request 32 addresses scattered across global memory

One more thing about L1 cache:

Uncached loads (L2 cache only)

Uncached loads do not pass through the L1 cache and are performed at the granularity of memory segments (32-bytes) and not cache lines (128-bytes). These are more fine-grained loads, and can lead to better bus utilization for misaligned or uncoalesced memory accesses.

  • aligned and coalesced (consecutive by thread ID or randomized): addresses for the 128 bytes requested fall within 4 segments

  • misaligned and coalesced: 5 segments; bus utilization 80%; Performance is improved with uncached loads compared to cached loads for these types of requests because fewer unrequested bytes are loaded.

  • all threads in the warp request the same address: bus utilization is 4 bytes requested / 32 bytes loaded = 12.5%, again better than cached load performance in this scenario.

  • misaligned and uncoalesced

Read-only cache

The read-only cache was originally reserved for use by texture memory loads. For GPUs of compute capability >= 3.5, the read-only cache can also support global memory loads as an alternative to the L1 cache.

The granularity of loads through the read-only cache is 32 bytes. In general, these finer granularity loads are better for scattered reads than the L1 cache.

Code sample:

__global__ void copyKernel(int *out, int *in) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx];
}

There are two ways to direct memory reads through the read-only cache:

  • Using the function __ldg
__global__ void copyKernel(int *out, int *in) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = __ldg(&in[idx]);
}
  • Applying const __restrict__ qualifiers to pointers
__global__ void copyKernel(int * __restrict__ out, const int * __restrict__ in) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx];
}

3.3 Global memory writes

Memory store operations are relatively simple. The L1 cache is not used for store operations on either Fermi or Kepler GPUs. Store operations are only cached in the L2 cache before being sent to device memory. Stores are performed at a 32-byte segment granularity. Memory transactions can be one, two, or four segments at a time.

  • aligned and coalesced: the store request is serviced by one four-segment transaction:

  • aligned, with addresses falling within a consecutive 64-byte range: the store request is serviced by one two-segment transaction:

  • aligned and uncoalesced: this store request is serviced by three one-segment transactions

3.4 Array of structures versus structure of arrays

Consider the following example that stores a set of paired float data elements.

  • Array of structures (AoS)
typedef struct innerStruct {
    float x;
    float y;
} innerStruct;

struct innerStruct myAoS[N];
  • Structure of arrays (SoA)
typedef struct innerArray {
    float x[LEN];
    float y[LEN];
} innerArray;

struct innerArray mySoA;
  • AoS vs. SoA
    Storing the example data in AoS format on the GPU and performing an operation that only requires the x field would result in a 50 percent loss of bandwidth as y values are implicitly loaded in each 32-byte segment or 128-byte cache line.

Many parallel programming paradigms, in particular SIMD-style paradigms, prefer SoA. In CUDA C programming, SoA is also typically preferred because data elements are prearranged for efficient coalesced access to global memory.

Case study

  • AoS
__global__ void incrementInnerStruct(innerStruct *input, innerStruct *output, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        innerStruct tmp = input[i];
        tmp.x += 10.f;
        tmp.y += 20.f;
        output[i] = tmp;
    }
}

Both gld_efficiency and gst_efficiency are 50 percent, which indicates that both load and store memory requests are replayed for the AoS data layout.

Reason: Because the fields x and y are stored adjacent in memory and have the same size, every time a memory transaction is performed to load the values of a particular field, exactly half of the bytes loaded must also belong to the other field.

  • SoA
__global__ void incrementInnerArray(innerArray *input, innerArray * output, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float tmpx = input->x[i];
        float tmpy = input->y[i];
        output->x[i] = tmpx + 10.f;
        output->y[i] = tmpy + 20.f;
    }
}

Both gld_efficiency and gst_efficiency are 100 percent.

Aligning structure memory

typedef struct __align__(8) innerStruct {
    float x;
    float y;
} innerStruct;

The __align__ compiler directive in CUDA ensures that all instances of a type start at the specified byte alignment. Doing so allows nvcc to perform more advanced optimizations in how global memory accesses are performed. When aligned, both gld_efficiency and gst_efficiency for AoS demo are 100 percent.

3.5 Hiding memory latency

[inconclusive] Unrolling


This part is inconclusive. Check the discussion here: https://devtalk.nvidia.com/default/topic/1050376/cuda-programming-and-performance/unrolling-makes-performance-worse/

Some highlighted replies under this post:

  • The kernel code is limited by memory bandwidth, trying to "save instructions" isn't going to do any good. For memory-bound code, access patterns can be crucial to achieve the highest possible memory throughput, and the unrolled code is already using the most suitable access pattern.
  • I have never seen the term unrolling being applied to anything other than loops (which are the only "rolled" form of control transfer) and I have been involved with software engineering for 30 years now.
  • The process of adding more work-items per thread can improve performance by
    1. reducing registers per work-item as two work-items in the same thread may share common values
    2. adding additional compute work between memory accesses
    3. potentially eliminating tail effect in a thread block

A simple vector kernel that is memory-throughput limited does not benefit from any of the above optimizations. For this code the SM will be at maximum occupancy, all warps will be stalled on memory loads (long_scoreboard), and the SM math pipelines will be mostly idle.


Exposing more parallelism

To expose sufficient parallelism, you should experiment with the grid and block size of a kernel launch to find the best execution configuration for that kernel.

Experiments in the book are as follows:

4. What bandwidth can a kernel achieve

Basic question: Sometimes a bad access pattern is inherent to the nature of the problem at hand. How good is good enough for such a kernel?

4.1 Case study: matrix transpose

  1. Host-side

Host-side code 1: read by rows (coalesced access); store by columns (strided access)

void transposeHost1(float *out, float *in, const int nx, const int ny) { 
    for (int iy = 0; iy < ny; ++iy) {
        for (int ix = 0; ix < nx; ++ix) { 
            out[ix*ny+iy] = in[iy*nx+ix];
        }
    }
}

Host-side code 2: read by columns (strided access); store by rows (coalesced access)
(switching the two for loops)

void transposeHost2(float *out, float *in, const int nx, const int ny) { 
    for (int ix = 0; ix < nx; ++ix) {
        for (int iy = 0; iy < ny; ++iy) { 
            out[ix*ny+iy] = in[iy*nx+ix];
        }
    }
}
  2. CUDA kernels:

CUDA kernel 1 - read by rows (coalesced access); store by columns (strided access)

__global__ void transposeRowCol(float *out, float *in, int nx, int ny) {
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < nx && iy < ny) {
        out[ix * ny + iy] = in[iy * nx + ix];
    }
}

Analysis of kernel 1:

  • Memory load: 4 32-byte segments required; efficiency: 100 percent
  • Memory store: (consider each 32-byte segment) efficiency = 4 bytes used / 32 bytes scheduled = 12.5 percent

CUDA kernel 2 - read by columns (strided access); store by rows (coalesced access)

__global__ void transposeColRow(float *out, float *in, int nx, int ny) {
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < nx && iy < ny) {
        out[iy * nx + ix] = in[ix * ny + iy];
    }
}

Analysis of kernel 2:

  • Memory load: 25 percent
    • The code is quite symmetric, so why isn't the efficiency 12.5 percent here as well? Memory loads can benefit from caching: a later read may be serviced out of the cache rather than global memory. For example, when loading array index nx, the data may come from the cache line fetched when index 0 was loaded. Thus efficiency = (4 bytes * 2) used / 32 bytes scheduled = 25 percent (considering each segment).
  • Memory store: 100 percent

Performance: kernel 2 better than kernel 1

4.2 Diagonal transpose

From the hardware perspective all blocks are arranged one-dimensionally. Each block has its own unique identifier (CUDA Thread Indexing Cheatsheet). When a kernel is launched, the order in which thread blocks are assigned to SMs is determined by the block ID, but the speed and order in which thread blocks complete is not deterministic.

Although you do not have direct control over the order in which thread blocks are scheduled, you can be flexible in how you interpret the block coordinates (i.e. how to map thread blocks to data blocks). Previously we always interpreted block coordinates using the Cartesian coordinate system. Here we'll explore another strategy: using a diagonal block coordinate system.

The diagonal coordinate system is used to determine 1D thread block ID. However, for data access, you still need to use the Cartesian coordinate system. Therefore, when interpreting block IDs as diagonal coordinates, you need to map from diagonal coordinates to Cartesian coordinates.

In the following code, blockIdx.x and blockIdx.y are interpreted as diagonal coordinates. blk_x and blk_y are their corresponding Cartesian coordinates.

__global__ void transposeDiagonalRowCol(float *out, float *in, const int nx, const int ny) {
    int blk_x = (blockIdx.x + blockIdx.y) % gridDim.x;
    int blk_y = blockIdx.x;

    int ix = blockDim.x * blk_x + threadIdx.x;
    int iy = blockDim.y * blk_y + threadIdx.y;

    if (ix < nx && iy < ny) {
        out[ix * ny + iy] = in[iy * nx + ix];
    }
}

__global__ void transposeDiagonalColRow(float *out, float *in, const int nx, const int ny) {
    int blk_x = (blockIdx.x + blockIdx.y) % gridDim.x;
    int blk_y = blockIdx.x;

    int ix = blockDim.x * blk_x + threadIdx.x;
    int iy = blockDim.y * blk_y + threadIdx.y;

    if (ix < nx && iy < ny) {
        out[iy * nx + ix] = in[ix * ny + iy];
    }
}

In the book, the diagonal transpose performs better than the Cartesian version in the read-by-rows, store-by-columns case. Analysis:


In my experiments (GTX 1080 Ti), the degree of performance gain depends on the execution configuration. Below are some results for a 2048 x 2048 matrix, plotting execution time (ms) against block size.



Using shared memory can avoid uncoalesced global memory access: matrix transpose with shared memory (sketched below).
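
A minimal sketch of the shared-memory version (a 32 x 32 tile is a common choice, not a requirement; launch with dim3 block(TILE_DIM, TILE_DIM)): each tile is read from global memory by rows (coalesced), transposed inside shared memory, and written to the output by rows (also coalesced). The padding column avoids shared-memory bank conflicts.

#define TILE_DIM 32

__global__ void transposeShared(float *out, const float *in, const int nx, const int ny) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column to avoid bank conflicts

    // element this thread loads (coalesced read along a row of in)
    int ix = blockIdx.x * TILE_DIM + threadIdx.x;
    int iy = blockIdx.y * TILE_DIM + threadIdx.y;
    if (ix < nx && iy < ny) {
        tile[threadIdx.y][threadIdx.x] = in[iy * nx + ix];
    }
    __syncthreads();

    // element this thread stores (coalesced write along a row of out)
    int ox = blockIdx.y * TILE_DIM + threadIdx.x;    // column index of out (out has ny columns)
    int oy = blockIdx.x * TILE_DIM + threadIdx.y;    // row index of out (out has nx rows)
    if (ox < ny && oy < nx) {
        out[oy * ny + ox] = tile[threadIdx.x][threadIdx.y];
    }
}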


APPENDIX A - More on unified memory

Maximizing Unified Memory Performance in CUDA

Reference blog
Code

No pre-fetching; normal kernel

template <typename data_type>
__global__ void stream_thread(data_type *input, data_type *output, const int n) { 
    int tid = threadIdx.x + blockIdx.x * blockDim.x; 
    if (tid < n) {
        output[tid] = input[tid];
    }
}

... ...  

dim3 block(256);
dim3 grid((n + block.x - 1) / block.x);
stream_thread<float> <<<grid, block>>> (in, out, n);

Unified memory profiling result:

==12912== Unified Memory profiling result:
Device "GeForce GTX 1080 Ti (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      41  99.902KB  4.0000KB  0.9805MB  4.000000MB  678.1120us  Host To Device
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  1.294208ms  Device To Host
      24         -         -         -           -  2.319712ms  Gpu page fault groups
Total CPU Page faults: 36

Note: The number 24 above is not the total number of GPU faults, but rather the number of page fault groups. The faults are written to a special buffer in system memory and multiple faults forming a group are processed simultaneously by the Unified Memory driver. You can get the total number of faults for each group by specifying nvprof --print-gpu-trace ....

No pre-fetching; one warp per page

Q: Why are there so many GPU page faults?
A: In the kernel above, many different warps touch each memory page, and the driver must filter duplicate faults and transfer each page just once. Moreover, for this simple implementation very few different pages are accessed at the same time, so during fault processing the driver doesn't have enough information about what data can be migrated to the GPU.

We can modify the kernel to let one warp process one page. This decreases GPU page faults to a large degree.

#define PAGE_STRIDE 65536 // page size: 64K -> 65536 bytes

template <typename data_type>
__global__ void stream_warp(data_type *input, data_type *output, const int n) { 
    int laneId = threadIdx.x & 31; 
    int tid = threadIdx.x + blockIdx.x * blockDim.x; 
    int warpId = tid >> 5;
    size_t size = n * sizeof(data_type);
    int pages = (size + PAGE_STRIDE - 1) / PAGE_STRIDE; // how many pages
    if (warpId < pages) {
        #pragma unroll // tell compiler to specifically unroll a loop
        for(int rep = 0; rep < PAGE_STRIDE / sizeof(data_type) / 32; rep++) {
            int ind = warpId * PAGE_STRIDE / sizeof(data_type) + rep * 32 + laneId;
            if (ind < n)
                output[ind] = input[ind];
        }
    }
}

... ...  

int pages = (nBytes + PAGE_STRIDE - 1) / PAGE_STRIDE; // number of pages (one warp per page)
dim3 block(256);
dim3 grid((pages * 32 + block.x - 1) / block.x); // # warps = # pages
stream_warp<float> <<<grid, block>>> (in, out, n);

Unified memory profiling result:

==14320== Unified Memory profiling result:
Device "GeForce GTX 1080 Ti (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
     128  32.000KB  4.0000KB  60.000KB  4.000000MB  721.6640us  Host To Device
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  1.294112ms  Device To Host
       3         -         -         -           -  1.106208ms  Gpu page fault groups
Total CPU Page faults: 36

Pre-fetching

int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(in,  nBytes, device, NULL);
cudaMemPrefetchAsync(out, nBytes, device, NULL);
dim3 block(256);
dim3 grid((n + block.x - 1) / block.x);
stream_thread<float> <<<grid, block>>> (in, out, n);

Unified memory profiling result:

==14492== Unified Memory profiling result:
Device "GeForce GTX 1080 Ti (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       2  2.0000MB  2.0000MB  2.0000MB  4.000000MB  652.4480us  Host To Device
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  1.293984ms  Device To Host
Total CPU Page faults: 36

The Min Size from host to device increases from 4 KB to 2 MB due to pre-fetching.

Note that there are still CPU page faults. If we also prefetch data every time before accessing it on the host side, there will be no CPU page faults either.

// allocate unified memory
float *in, *out;
cudaMallocManaged((float**)&in, nBytes);
cudaMallocManaged((float**)&out, nBytes);
cudaMemPrefetchAsync(in,  nBytes, cudaCpuDeviceId, 0);
initialData(in, n);

... ...

int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(in,  nBytes, device, 0);
cudaMemPrefetchAsync(out, nBytes, device, 0);

dim3 block(256);
dim3 grid((n + block.x - 1) / block.x);
stream_thread<float> <<<grid, block>>> (in, out, n);
cudaDeviceSynchronize();

cudaMemPrefetchAsync(in,  nBytes, cudaCpuDeviceId, 0);
cudaMemPrefetchAsync(out, nBytes, cudaCpuDeviceId, 0);
verifyResult(out, in, n);

Overlapping kernels and prefetches

On-demand migration is powerful in the way it enables fine-grain overlap between data transfers and kernel execution. However, this overlap is severely limited due to the SM stalls caused by page fault handling. Even with very sophisticated driver prefetching heuristics, on-demand access with migration will never beat explicit bulk data copies or prefetches in terms of performance for large contiguous memory regions. This is the price for simplicity and ease of use.

(to be continued...)
