PyCUDA Tutorial

Introduction to PyCUDA

PyCUDA is a Python wrapper for Nvidia's CUDA API. The documentation is well maintained and should get you up to speed fairly quickly. Nevertheless, you may refer to this wiki article as a quick reference for getting familiar with PyCUDA coding.

PyCUDA

(PyCUDA may eventually be deprecated or replaced by Nvidia's official CUDA-Python bindings.)

Link: CUDA-Python

Demo Code

Below is a short PyCUDA program that doubles every element of an input numpy array.

"""
Double every element of an array using a CUDA kernel.
"""
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

import numpy
a = numpy.random.randn(4,4)

a = a.astype(numpy.float32)

a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)

cuda.memcpy_htod(a_gpu, a)

mod = SourceModule("""
    __global__ void doublify(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print("original array:")
print(a)
print("doubled with kernel:")
print(a_doubled)

In-depth discussion of demo code

  1. The first step is to import PyCUDA and initialize it.
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

As the documentation notes, you don't have to auto-initialize CUDA, but it is convenient. You can also initialize the driver, create context(s), and perform cleanup manually, as sketched below.
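For reference, here is a minimal sketch of manual initialization with the pycuda.driver API; the device index 0 is an assumption and should be adjusted on multi-GPU machines.

import pycuda.driver as cuda

cuda.init()                   # initialize the CUDA driver
dev = cuda.Device(0)          # select a GPU (device 0 assumed)
ctx = dev.make_context()      # create and push a context on this thread

try:
    pass  # allocate memory, compile kernels, launch, copy results back...
finally:
    ctx.pop()                 # detach the context when you are done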

  2. Numpy arrays default to 64-bit (double) precision, but many GPUs either lack double-precision support entirely or are much slower at it. So you should enforce 32-bit precision on your numpy arrays after creating them.
a = numpy.random.randn(4, 4)
a = a.astype(numpy.float32)
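If you are unsure what your device supports, the driver API can report its compute capability (double precision requires compute capability 1.3 or higher); a quick sanity check:

import numpy
import pycuda.driver as cuda
import pycuda.autoinit

a = numpy.random.randn(4, 4).astype(numpy.float32)
print(a.dtype)                              # float32 after the cast
print(cuda.Device(0).compute_capability())  # e.g. (8, 6); below (1, 3) means no float64 support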
  3. The array must be copied to device memory so that the kernel can operate on it.
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)
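As an aside, PyCUDA's gpuarray module wraps allocation and the host-to-device copy into a single call; a minimal sketch of the same transfer using it:

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)   # allocates device memory and copies the array in one step
a_host = a_gpu.get()         # copies back into a new numpy array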
  4. pycuda.compiler.SourceModule compiles CUDA C source code into a module; under the hood, the Nvidia CUDA compiler (nvcc) is invoked to build it.
mod = SourceModule("""
    __global__ void doublify(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

Enclosed within the SourceModule call is the CUDA C kernel code that defines per-thread behavior. You have full control over how each thread computes the index it operates on.

Keep in mind that the numpy array a is a 2D array stored in row-major order: in memory it is a single flat buffer in which the rows are laid out one after another. With this visualization, a per-thread indexing scheme can be devised.

The above kernel contains the __global__ function doublify. Because the block size matches the array dimensions (4, 4, 1) and the storage is row-major, a thread at (threadIdx.x, threadIdx.y) must address element threadIdx.x + threadIdx.y*4 of the flat buffer.
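This indexing scheme only works while the whole array fits in a single block. For larger arrays you combine blockIdx and blockDim with threadIdx and pass a grid argument at launch. Below is a minimal sketch of that pattern; the kernel name doublify_2d, the 100x100 array size, and the 16x16 block size are illustrative choices, not part of the demo above.

import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
    __global__ void doublify_2d(float *a, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)   // guard threads that fall outside the array
            a[row * width + col] *= 2;
    }
    """)

w, h = 100, 100
a = numpy.random.randn(h, w).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

block = (16, 16, 1)
grid = ((w + block[0] - 1) // block[0], (h + block[1] - 1) // block[1])  # enough blocks to cover the array
func = mod.get_function("doublify_2d")
func(a_gpu, numpy.int32(w), numpy.int32(h), block=block, grid=grid)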

  5. A callable handle to the compiled kernel is obtained with mod.get_function, and the kernel is launched by calling that handle with the device pointer and a block size.
func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1))  # one 4x4 block of threads, matching the array shape
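One PyCUDA-specific gotcha when launching: scalar kernel arguments must be passed as sized numpy types (e.g. numpy.float32, numpy.int32) rather than plain Python numbers, so PyCUDA knows their binary layout. A minimal self-contained sketch with a hypothetical scale kernel (not part of the demo above):

import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

mod = SourceModule("""
    __global__ void scale(float *a, float factor)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= factor;
    }
    """)
scale = mod.get_function("scale")
scale(a_gpu, numpy.float32(3.0), block=(4, 4, 1))  # pass the scalar as numpy.float32, not a bare Python float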
  6. Finally, the result must be copied back from the device (GPU) to the host.
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
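A quick sanity check that the launch actually worked, comparing the GPU result against numpy arithmetic on the host (reusing a and a_doubled from the demo above):

assert numpy.allclose(a_doubled, 2 * a)  # GPU result should match the CPU reference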