# PyCUDA Tutorial

# Introduction to PyCUDA

PyCUDA is a Python wrapper for NVIDIA's CUDA API. The official documentation is well maintained and should get you up to speed fairly quickly; nevertheless, you can use this wiki article as a quick reference for getting familiar with coding in PyCUDA.

# PyCUDA

Note: PyCUDA may eventually be deprecated or replaced by NVIDIA's official CUDA-Python bindings.

Link: CUDA-Python

## Demo Code

Below is a chunk of PyCUDA code for doubling the elements of an input numpy array.

```
"""
Double every element of an array using a CUDA kernel.
"""
import pycuda.driver as cuda
import pycuda.autoinit  # initializes CUDA and creates a context
from pycuda.compiler import SourceModule
import numpy

# Create a 4x4 array of random numbers and convert it to 32-bit floats,
# since the kernel expects float (numpy defaults to float64).
a = numpy.random.randn(4, 4)
a = a.astype(numpy.float32)

# Allocate device memory and copy the array from host to device.
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)

# Compile the kernel: each thread doubles one element of the array.
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")

# Launch the kernel with a single 4x4 block of threads.
func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1))

# Copy the result back to the host and print it.
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print("original array:")
print(a)
print("doubled with kernel:")
print(a_doubled)
```

## In-depth discussion of demo code

- The first step is to import PyCUDA and initialize it.

```
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
```

As the documentation notes, you don't have to auto-initialize CUDA; it is simply convenient. You *could* initialize the driver, create context(s), and perform cleanup manually instead, as sketched below.
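For reference, here is a minimal sketch of manual initialization in place of `pycuda.autoinit` (this assumes a single CUDA-capable GPU at device index 0):

```
import pycuda.driver as cuda

cuda.init()                   # initialize the CUDA driver API
dev = cuda.Device(0)          # grab the first available GPU
ctx = dev.make_context()      # create a context and make it current
try:
    # ... allocate memory, compile kernels, launch them, copy results back ...
    pass
finally:
    ctx.pop()                 # release the context when done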

- numpy creates `float64` arrays by default, but the kernel below takes a `float *` argument (32-bit). The array must therefore be converted to `float32` after it is created, before copying it to the device. (Single precision is also significantly faster than double precision on most GPUs.)

```
a = numpy.random.randn(4, 4)
a = a.astype(numpy.float32)
```
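
You can verify the default behaviour yourself; numpy's random generators return `float64` unless told otherwise (illustrative check, not part of the demo):

```
import numpy

a = numpy.random.randn(4, 4)
print(a.dtype)              # float64 by default
a = a.astype(numpy.float32)
print(a.dtype)              # float32, matching the kernel's float* argument
```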

- The array must be copied to device (GPU) memory so that the kernel can operate on it: allocate a buffer of the right size on the device, then copy the host array into it.

```
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)
```
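
As a quick sanity check, the byte count passed to `mem_alloc` is just the number of elements times the size of one element; for this 4x4 `float32` array that is 16 * 4 = 64 bytes, which is also what numpy's `a.nbytes` reports (illustrative only):

```
import numpy

a = numpy.random.randn(4, 4).astype(numpy.float32)
nbytes = a.size * a.dtype.itemsize   # 16 elements * 4 bytes each
assert nbytes == a.nbytes == 64
```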

- `pycuda.compiler.SourceModule` compiles CUDA C source code into a module and loads it onto the device. Behind the scenes, the NVIDIA CUDA compiler (nvcc) is invoked to compile the code you pass in.

```
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
```
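
A module can hold more than one kernel. The sketch below is an illustrative extension, not part of the demo; the second kernel name `triplify` is made up. Each `__global__` function compiled in the same `SourceModule` can later be retrieved by name with `mod.get_function`:

```
import pycuda.autoinit
from pycuda.compiler import SourceModule

# One module holding two kernels.
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}

__global__ void triplify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 3;
}
""")

doublify = mod.get_function("doublify")
triplify = mod.get_function("triplify")
```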

Enclosed within the `SourceModule` call in the demo is the CUDA C kernel code that defines per-thread behavior; how you compute the indices gives you full control over which element each thread touches.

Keep in mind that the numpy array `a` is a 2D array stored in row-major order: it can be thought of as a flat 1D array in which row `y` begins at offset `4*y`. With this picture in mind, a per-thread indexing scheme can be devised.

The kernel above contains the `__global__` function `doublify`. Because the block size matches the array dimensions, i.e. (4, 4, 1), each thread handles exactly one element, and the flattened index `threadIdx.x + threadIdx.y*4` mirrors the row-major layout of the 4x4 array.
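
To make this concrete, the same flattening can be checked on the host with plain numpy (an illustrative check, not part of the demo; `row`, `col`, and `width` are placeholder names):

```
import numpy

a = numpy.random.randn(4, 4).astype(numpy.float32)
width = 4                      # matches the hard-coded 4 in the kernel
flat = a.ravel()               # row-major (C-order) flattened view

# threadIdx.x plays the role of the column, threadIdx.y the role of the row.
for row in range(4):
    for col in range(4):
        idx = col + row * width
        assert flat[idx] == a[row, col]
```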

- The compiled kernel is retrieved as a callable with `mod.get_function` and launched by calling it with the device pointer and a block size.

```
func = mod.get_function("doublify")
# Launch one block of 4x4 threads (block size (4, 4, 1)).
func(a_gpu, block=(4, 4, 1))
```
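
For arrays larger than a single block, the launch also needs a `grid` argument, and the kernel has to fold `blockIdx`/`blockDim` into its index computation. The following is a rough sketch of such a launch (an illustrative extension of the demo; the 8x8 size, the extra `n` parameter, and the 2x2 grid are arbitrary choices):

```
import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

n = 8                                   # 8x8 input, too big for one 4x4 block
a = numpy.random.randn(n, n).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

mod = SourceModule("""
__global__ void doublify(float *a, int n)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (row < n && col < n)
        a[row * n + col] *= 2;
}
""")
func = mod.get_function("doublify")

# A 2x2 grid of 4x4 blocks covers all 64 elements.
func(a_gpu, numpy.int32(n), block=(4, 4, 1), grid=(2, 2))
```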

- The result must be fetched from the device (GPU).

```
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
```
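
A quick way to confirm the round trip worked is to compare the copied-back result against the same operation done in numpy. This is an illustrative check that continues from the demo above, where `a` is the original host array and `a_doubled` is the array copied back from the GPU:

```
import numpy

# Every element should have been doubled by the kernel.
assert numpy.allclose(a_doubled, 2 * a)
print("kernel output matches 2 * a")
```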