# PyCUDA Tutorial
## Introduction to PyCUDA
PyCUDA is a Python wrapper for Nvidia's CUDA API. Its documentation is well maintained and should get you up to speed fairly quickly. Nevertheless, you can use this wiki article as a quick reference for getting started with PyCUDA.
Note: PyCUDA may eventually get deprecated/replaced by Nvidia's own bindings. Link: [CUDA-Python](https://github.com/NVIDIA/cuda-python)
## Demo Code
Below is a chunk of PyCUDA code for doubling the elements of an input numpy array.
"""
Double every element of an array using a CUDA kernel.
"""
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(4,4)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doublify(float *a)
{
int idx = threadIdx.x + threadIdx.y*4;
a[idx] *= 2;
}
""")
func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print "original array:"
print a
print "doubled with kernel:"
print a_doubled
## In-depth discussion of the demo code
- The first step is to import PyCUDA and initialize it.

```python
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
```
As the documentation notes, you don't necessarily have to auto-initialize CUDA, but it is convenient. You could also initialize the driver, create context(s), and perform cleanup manually, as sketched below.
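For reference, here is a minimal sketch of manual initialization with `pycuda.driver`; it assumes you want a single context on the first device (`Device(0)`):

```python
import pycuda.driver as cuda

cuda.init()               # initialize the CUDA driver API
dev = cuda.Device(0)      # select the first GPU
ctx = dev.make_context()  # create and activate a context on it
try:
    pass                  # allocate memory, compile kernels, launch, ...
finally:
    ctx.pop()             # deactivate the context so it can be cleaned up
```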
- numpy creates 64-bit (double-precision) arrays by default, while the kernel below operates on `float`, and many Nvidia GPUs handle single precision far faster than double. So you must enforce 32-bit precision on your numpy arrays after creating them.
```python
a = numpy.random.randn(4, 4)
a = a.astype(numpy.float32)
```
- The array must be copied to device memory so the kernel function can be called to perform an operation on it.
```python
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)
```
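As an aside, host-device transfers can be sped up by staging data in page-locked ("pinned") host memory via `pycuda.driver.pagelocked_empty`; a minimal sketch (the sizes here simply mirror the demo):

```python
# Pinned host buffer: the GPU can DMA from it directly
a_pin = cuda.pagelocked_empty((4, 4), numpy.float32)
a_pin[:] = numpy.random.randn(4, 4)

a_gpu = cuda.mem_alloc(a_pin.nbytes)
cuda.memcpy_htod(a_gpu, a_pin)
```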
- `pycuda.compiler.SourceModule` compiles the CUDA C source handed to it into a module; the Nvidia CUDA compiler (nvcc) is invoked to compile the code you define.
```python
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
```
Enclosed within the `SourceModule` call is the CUDA C kernel code that defines per-thread behavior; how you define the thread indices gives you full control over which element each thread touches. Keep in mind that the numpy array `a` is a 2D array, which can be thought of as a 1D array whose every element is, in turn, a 1D array (row-major order). With this visualization, a per-thread indexing scheme can be devised. The kernel above contains the `__global__` function `doublify`. Since the block size is matched to the array dimensions (i.e. `(4, 4, 1)`), the index `threadIdx.x + threadIdx.y*4` flattens each thread's 2D coordinates into the corresponding row-major offset.
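The demo's index only works because the whole array fits in one block. For larger arrays the block coordinates must enter the index as well; here is a hedged sketch (the kernel name `doublify_large` and the explicit bound `n` are illustrative additions, not part of the demo):

```python
import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod_large = SourceModule("""
__global__ void doublify_large(float *a, int n)
{
    // Combine block and thread coordinates into one global row-major index
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = row * (gridDim.x * blockDim.x) + col;
    if (idx < n)   // guard threads that fall past the end of the array
        a[idx] *= 2;
}
""")

a = numpy.random.randn(8, 8).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

func = mod_large.get_function("doublify_large")
# A 2x2 grid of 4x4 blocks covers the 8x8 array
func(a_gpu, numpy.int32(a.size), block=(4, 4, 1), grid=(2, 2))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
```

Note that scalar kernel arguments must be passed as explicitly sized numpy types such as `numpy.int32`, since PyCUDA cannot guess the C type of a plain Python int.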
- The compiled kernel can be retrieved as a callable using `mod.get_function` and launched with a block size:

```python
func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1))  # one 4x4x1 block of threads
```
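PyCUDA also provides argument handlers (`pycuda.driver.In`, `pycuda.driver.Out`, `pycuda.driver.InOut`) that fold the explicit allocation and copies into the call itself. Reusing the `mod` compiled above:

```python
a = numpy.random.randn(4, 4).astype(numpy.float32)
func = mod.get_function("doublify")
# InOut copies `a` to the device before launch and back into `a` afterwards
func(cuda.InOut(a), block=(4, 4, 1))
```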
- The result must be fetched from the device (GPU) back into host memory.

```python
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
```
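Finally, for simple element-wise operations like this one, the higher-level `pycuda.gpuarray` module can replace the whole allocate/copy/launch/copy cycle; a minimal sketch:

```python
import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)     # allocate on the device and copy in one step
a_doubled = (2 * a_gpu).get()  # element-wise math on the GPU, then copy back

print(a)
print(a_doubled)
```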