cudaMallocPitch and cudaMemcpy2D - OrangeOwlSolutions/General-CUDA-programming GitHub Wiki
When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned. CUDA provides the `cudaMallocPitch` function to "pad" the rows of a 2D matrix with extra bytes so as to achieve the desired alignment. Please refer to the "CUDA C Programming Guide", Sections 3.2.2 and 5.3.2, for more information.
Assuming that we want to allocate a 2D padded array of `Nrows x Ncols` single-precision floating-point elements, the syntax for `cudaMallocPitch` is the following:

```cuda
cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);
```
where

- `devPtr` is an output pointer to float (`float *devPtr`) pointing to the allocated memory space;
- `devPitch` is a `size_t` output variable denoting the length, in bytes, of the padded row;
- `Nrows` and `Ncols` are `size_t` input variables representing the matrix size.
Recalling that C/C++ and CUDA store 2D matrices by row, `cudaMallocPitch` will allocate a memory space of size, in bytes, equal to `Nrows * pitch`. However, only the first `Ncols * sizeof(float)` bytes of each row will contain the matrix data. Accordingly, `cudaMallocPitch` consumes more memory than strictly necessary for the 2D matrix storage, but this is repaid by more efficient memory accesses.
CUDA also provides the `cudaMemcpy2D` function to copy data from/to host memory space to/from device memory space allocated with `cudaMallocPitch`. Under the above hypotheses (single-precision 2D matrix), the syntax is the following:

```cuda
cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice);
```
where

- `devPtr` and `hostPtr` are input pointers to float (`float *devPtr` and `float *hostPtr`) pointing to the (destination) device and (source) host memory spaces, respectively;
- `devPitch` and `hostPitch` are `size_t` input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
- `Nrows` and `Ncols` are `size_t` input variables representing the matrix size.
Note that `cudaMemcpy2D` also allows for pitched memory allocation on the host side. If the host memory has no pitch, then `hostPitch = Ncols * sizeof(float)`. Furthermore, `cudaMemcpy2D` is bidirectional. In the above example, we are copying data from host to device. If we want to copy data from device to host, then the above line changes to

```cuda
cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost);
```
The access to the elements of a 2D matrix allocated by `cudaMallocPitch` can be performed as in the following kernel snippet:

```cuda
int tidx = blockIdx.x * blockDim.x + threadIdx.x;   // column index
int tidy = blockIdx.y * blockDim.y + threadIdx.y;   // row index

if ((tidx < Ncols) && (tidy < Nrows))
{
    // Offset the base pointer by tidy padded rows (pitch is in bytes).
    float *row_a = (float *)((char *)devPtr + tidy * pitch);
    row_a[tidx] = row_a[tidx] * tidx * tidy;
}
```
In this example, `tidx` and `tidy` are used as column and row indices, respectively (remember that, in CUDA, `x`-threads span the columns and `y`-threads span the rows to favor memory coalescing). The pointer to the first element of a row is calculated by offsetting the initial pointer `devPtr` by `tidy * pitch` bytes (`char *` is a pointer to bytes and `sizeof(char)` is `1` byte), where the length of each padded row is given by the pitch.
At `cudaMallocPitch_and_cudaMemcpy2D.cu`, we provide a fully worked example showing these concepts.
Finally, please notice that CUDA provides `cudaMallocPitch` to allocate padded 2D arrays and `cudaMalloc3D` to allocate padded 3D arrays. However, a "`cudaMalloc2D`" function does not exist, its role being fulfilled by `cudaMallocPitch`.
The final question is whether pitched memory allocation, i.e., memory allocation performed with `cudaMallocPitch`, really leads to improved performance as compared to non-pitched memory allocation, i.e., memory allocation performed with `cudaMalloc`. Actually, the improvements arising from the use of `cudaMallocPitch` depend on the compute capability and are expected to be more significant for older ones. For the most recent compute capabilities, however, pitched memory allocation does not seem to lead to a relevant speedup.
The code at `cudaMallocPitch_performance.cu` provides a performance testbench comparing the use of non-pitched and pitched memory. In particular, the code performs the summation of three (non-pitched or pitched) matrices. The reason for dealing with three matrices is to emphasize memory transactions over computation, so that the differences between non-pitched and pitched allocations stand out. Below are the timing results for a GTX 960 card and a GT 920M card.
```
GTX 960
Non-pitched - Time = 3.242208;  Memory = 65320000 bytes
Pitched     - Time = 3.150944;  Memory = 65433600 bytes

GT 920M
Non-pitched - Time = 20.496799; Memory = 65320000 bytes
Pitched     - Time = 20.418560; Memory = 65433600 bytes
```
As can be seen, there is not much difference between the two implementations on either card. The above results also show the increase in memory occupancy due to the use of pitched memory allocation.