Getting Started with PyOpenCL - eecse4750/e4750_2024Fall_students_repo GitHub Wiki

Getting Started with PyOpenCL

Introduction for PyOpenCL

PyOpenCL is a Python package that enables you to write parallel programs in OpenCL while taking advantage of Python's ease of use and extensive ecosystem of useful packages. While you still need to write your kernels in C code, PyOpenCL obviates the need to write most of the "boilerplate" code required by an OpenCL program written entirely in C or C++.

PyOpenCL has been installed in /opt/PYTHON. You can immediately make use of it in your programs.

Note that there are two different OpenCL platforms on tesseract; the Intel platform provides access to the server's CPU, while the NVIDIA platform provides access to the GPUs. Your program needs to select which platform and device to use.
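
As a minimal sketch of that selection step, the helper below picks a platform by name and fails loudly when nothing matches. The name `pick_platform` is hypothetical (it is not part of PyOpenCL); it relies only on each platform object having a `name` attribute, as the objects returned by `cl.get_platforms()` in the examples on this page do.

```python
def pick_platform(platforms, wanted):
    """Return the first platform whose name equals `wanted`.

    `platforms` is any iterable of objects with a `name` attribute, such as
    the list returned by pyopencl.get_platforms(). Raises LookupError listing
    the available names if nothing matches.
    """
    names = []
    for platform in platforms:
        if platform.name == wanted:
            return platform
        names.append(platform.name)
    raise LookupError("no platform named %r; available: %s" % (wanted, names))


# Typical use (requires PyOpenCL and an installed OpenCL platform):
#   import pyopencl as cl
#   platform = pick_platform(cl.get_platforms(), "NVIDIA CUDA")
#   ctx = cl.Context(platform.get_devices())
```

Raising an error here, rather than leaving the device list as None, makes a misspelled or missing platform name visible at the point where it occurs.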

PyOpenCL Example - Vector Addition

Here is a simple example of a PyOpenCL program that performs vector addition.

"""
Vector addition using PyOpenCL.
"""

import time

import pyopencl as cl
import numpy as np

# Select the desired OpenCL platform; you shouldn't need to change this:
NAME = 'NVIDIA CUDA'
platforms = cl.get_platforms()
devs = None
for platform in platforms:
    if platform.name == NAME:
        devs = platform.get_devices()

# Set up a command queue:
ctx = cl.Context(devs)
queue = cl.CommandQueue(ctx)

# Define the OpenCL kernel you wish to run; most of the interesting stuff you
# will be doing involves modifying or writing kernels:
kernel = """
__kernel void func(__global float* a, __global float* b, __global float* c) {
    unsigned int i = get_global_id(0);
    c[i] = a[i]+b[i];
}
"""

# Load some random data to process. Note that setting the data type is important;
# if your data is stored using one type and your kernel expects a different type,
# your program might either produce the wrong results or fail to run.
# Note that Numerical Python uses names for certain types that differ from those
# used in OpenCL. For example, np.float32 corresponds to the float type in OpenCL:
N = 16
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)

# You need to set the flags of the buffers you create properly; otherwise,
# you might not be able to read or write them as needed:
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

# Launch the kernel; notice that you must specify the global and local sizes to
# determine how many threads of execution are run. We can take advantage of NumPy to
# use the shape of one of the input arrays as the global size. Since our kernel
# only accesses the global work item ID, we simply set the local size to None:
prg = cl.Program(ctx, kernel).build()
prg.func(queue, a.shape, None, a_buf, b_buf, c_buf)

# Retrieve the results from the GPU:
c = np.empty_like(a)
cl.enqueue_copy(queue, c, c_buf)

print('input (a):    ', a)
print('input (b):    ', b)
print('numpy (a+b):  ', a+b)
print('opencl (a+b): ', c)

# Compare the results from the GPU with those obtained using Numerical Python;
# this should print True:
print('equal:        ', np.allclose(a+b, c))

# Here we compare the speed of performing the vector addition with Python and
# PyOpenCL. Since the execution speed of a snippet of code may vary slightly at
# different times depending on what other things the computer is running, we run
# the operation we wish to time several times and average the results:
M = 3
times = []
for i in range(M):
    start = time.time()
    a+b
    times.append(time.time()-start)
print('python time:  ', np.average(times))

# Kernel launches are asynchronous, so wait for each one to finish before
# stopping the clock:
times = []
for i in range(M):
    start = time.time()
    prg.func(queue, a.shape, None, a_buf, b_buf, c_buf).wait()
    times.append(time.time()-start)
print('opencl time:  ', np.average(times))

# Notice that the execution time of the Python code is actually shorter than
# that of the PyOpenCL code for very short arrays. This is because launching a
# kernel carries a fixed overhead that dominates when there is little work to
# do; in a complete program, data transfers between host memory and that of the
# GPU are also relatively slow. Try gradually increasing the number of elements
# in a and b up to 100000 and see what happens.
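
The data-type warning in the comments above can be checked on the host before any buffer is created. The following is a small sketch using only NumPy. It shows that `np.random.rand` produces float64 data, and what a kernel declared with `float*` would see if that data were passed unconverted. The helper name `check_float32` is hypothetical and only illustrates one way to guard against the mismatch.

```python
import numpy as np

N = 16
a64 = np.random.rand(N)          # np.random.rand returns float64 (C double)
a32 = a64.astype(np.float32)     # matches `float` in the OpenCL kernel

print(a64.dtype, a64.nbytes)     # 8 bytes per element
print(a32.dtype, a32.nbytes)     # 4 bytes per element

# A kernel declared with `__global float*` reads memory in 4-byte units, so
# the float64 data would be interpreted as 2*N unrelated float values:
misread = a64.view(np.float32)
print(misread.shape)


def check_float32(array):
    """Raise TypeError unless `array` holds float32 data (hypothetical helper)."""
    if array.dtype != np.float32:
        raise TypeError("expected float32, got %s" % array.dtype)
    return array
```

Calling `check_float32(a)` just before creating a buffer turns a silent wrong-result bug into an immediate error.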

VectorAdd written in PyOpenCL using Array class

PyOpenCL's pyopencl.array.Array class offers a numpy-like interface to GPU memory. Here is the above example rewritten to take advantage of it.

#!/usr/bin/env python
"""
Vector addition using PyOpenCL.
"""

import time

import pyopencl as cl
import pyopencl.array
import numpy as np

# Select the desired OpenCL platform; you shouldn't need to change this:
NAME = 'NVIDIA CUDA'
platforms = cl.get_platforms()
devs = None
for platform in platforms:
    if platform.name == NAME:
        devs = platform.get_devices()

# Set up a command queue; we need to enable profiling to time GPU operations:
ctx = cl.Context(devs)
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

# Define the OpenCL kernel you wish to run; most of the interesting stuff you
# will be doing involves modifying or writing kernels:
kernel = """
__kernel void func(__global float* a, __global float* b, __global float* c) {
    unsigned int i = get_global_id(0);
    c[i] = a[i]+b[i];
}
"""

# Load some random data to process. Note that setting the data type is
# important; if your data is stored using one type and your kernel expects a
# different type, your program might either produce the wrong results or fail to
# run.  Note that Numerical Python uses names for certain types that differ from
# those used in OpenCL. For example, np.float32 corresponds to the float type in
# OpenCL:
N = 16
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)

# We can use PyOpenCL's Array type to easily transfer data from numpy arrays to
# GPU memory (and vice versa):
a_gpu = cl.array.to_device(queue, a)
b_gpu = cl.array.to_device(queue, b)
c_gpu = cl.array.empty(queue, a.shape, a.dtype)

# Launch the kernel; notice that you must specify the global and local sizes to
# determine how many threads of execution are run. We can take advantage of NumPy to
# use the shape of one of the input arrays as the global size. Since our kernel
# only accesses the global work item ID, we simply set the local size to None:
prg = cl.Program(ctx, kernel).build()
prg.func(queue, a.shape, None, a_gpu.data, b_gpu.data, c_gpu.data)

# Retrieve the results from the GPU:
c = c_gpu.get()

print('input (a):    ', a)
print('input (b):    ', b)
print('numpy (a+b):  ', a+b)
print('opencl (a+b): ', c)

# Compare the results from the GPU with those obtained using Numerical Python;
# this should print True:
print('equal:        ', np.allclose(a+b, c))

# Here we compare the speed of performing the vector addition with Python and
# PyOpenCL. Since the execution speed of a snippet of code may vary slightly at
# different times depending on what other things the computer is running, we run
# the operation we wish to time several times and average the results:
M = 3
times = []
for i in range(M):
    start = time.time()
    a+b
    times.append(time.time()-start)
print('python time:  ', np.average(times))

# The profiling timestamps are reported in nanoseconds, so scale to seconds:
times = []
for i in range(M):
    evt = prg.func(queue, a.shape, None, a_gpu.data, b_gpu.data, c_gpu.data)
    evt.wait()
    times.append(1e-9*(evt.profile.end-evt.profile.start))
print('opencl time:  ', np.average(times))

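# Notice that the execution time of the Python code is actually shorter than
# that of the PyOpenCL code for very short arrays. This is because even a
# trivial kernel has a fixed minimum cost on the GPU; the profiling events used
# here time only the kernel itself, and a complete program also pays for data
# transfers between host memory and that of the GPU, which are relatively slow.
# Try gradually increasing the number of elements in a and b up to 100000 and
# see what happens.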
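
Both examples time an operation by repeating it and averaging. As a rough sketch, that pattern can be factored into a small helper. The name `average_time` is hypothetical, and the helper uses `time.perf_counter` from the Python 3 standard library in place of `time.time`. It measures wall-clock time on the host, so for a kernel it includes launch overhead, unlike the profiling events used above.

```python
import time


def average_time(fn, repeats=3):
    """Call fn() `repeats` times and return the mean wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)


# Example with a plain Python workload:
print("sum time:", average_time(lambda: sum(range(10000))))

# For a PyOpenCL kernel, the callable must block until the device is done,
# for example:
#   average_time(lambda: prg.func(queue, a.shape, None,
#                                 a_gpu.data, b_gpu.data, c_gpu.data).wait())
```

Increasing `repeats` reduces the influence of other activity on the machine, at the cost of a longer run.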

VectorAdd written in OpenCL only

For the sake of comparison, here is the same program written entirely in C. To build the program, run the following command:

gcc -I/usr/local/cuda/include -L/usr/local/cuda/lib64 vector_add.c -o vector_add -lOpenCL

/* Vector addition using OpenCL. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <CL/cl.h>

const char *kernel_src =
"    __kernel void func(__global float* a, __global float* b, __global float* c) {\n"
"    unsigned int i = get_global_id(0);\n"
"    c[i] = a[i]+b[i];\n"
"}\n";

void vector_add(float x[], float y[], float z[], unsigned int N) {
    unsigned int i;
    for (i=0; i<N; i++)
        z[i] = x[i]+y[i];
}

int main(void) {

    /* Get available platforms: */
    cl_platform_id *platforms = NULL;
    cl_uint num_platforms;

    cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
    platforms = (cl_platform_id *) malloc(sizeof(cl_platform_id)*num_platforms);
    clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);

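    /* Find the NVIDIA CUDA platform: */
    unsigned int i;
    char queryBuffer[1024];
    for (i=0; i<num_platforms; i++) {
        clStatus = clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(queryBuffer), queryBuffer, NULL);
        if (strcmp(queryBuffer, "NVIDIA CUDA") == 0)
            break;
    }
    /* Stop if the platform was not found; platforms[i] would be out of bounds: */
    if (i == num_platforms) {
        fprintf(stderr, "NVIDIA CUDA platform not found\n");
        free(platforms);
        return 1;
    }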

    /* Get the GPU devices for the selected platform: */
    cl_device_id *device_list = NULL;
    cl_uint num_devices;
    clStatus = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU,
                              0, NULL, &num_devices);
    device_list = (cl_device_id *) malloc(sizeof(cl_device_id)*num_devices);
    clStatus = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU,
                              num_devices, device_list, NULL);

    /* Create context: */
    cl_context context;
    context = clCreateContext(NULL, num_devices, device_list,
                              NULL, NULL, &clStatus);

    /* Create command queue: */
    cl_command_queue command_queue = clCreateCommandQueue(context, device_list[0],
                                                          0, &clStatus);

    /* Load some random data to process: */
    unsigned int N = 16;
    float a[N];
    float b[N];
    for (i=0; i<N; i++) {
        a[i] = (float) rand()/(float) RAND_MAX;
        b[i] = (float) rand()/(float) RAND_MAX;
    }

    /* Create buffers: */
    cl_mem a_buf = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  N*sizeof(float), a, &clStatus);
    cl_mem b_buf = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  N*sizeof(float), b, &clStatus);
    cl_mem c_buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                  N*sizeof(float), NULL, &clStatus);

    /* Create program: */
    cl_program program = clCreateProgramWithSource(context, 1, (const char **) &kernel_src,
                                                   NULL, &clStatus);

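    /* Build the program, then create the kernel; the name passed to
       clCreateKernel must match the __kernel function in kernel_src: */
    clStatus = clBuildProgram(program, num_devices, device_list, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "func", &clStatus);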

    /* Set the kernel arguments: */
    clStatus = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &a_buf);
    clStatus = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &b_buf);
    clStatus = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &c_buf);

    /* Execute the kernel: */
    size_t global_size = N;
    size_t local_size = 1;
    clStatus = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                      &global_size, &local_size, 0, NULL, NULL);

    /* Retrieve the result: */
    float c[N];
    clStatus = clEnqueueReadBuffer(command_queue, c_buf, CL_TRUE, 0,
                                   N*sizeof(float), c, 0, NULL, NULL);

    /* Wait for all commands in the queue to complete: */
    clStatus = clFlush(command_queue);
    clStatus = clFinish(command_queue);

    /* Compute the reference result on the CPU: */
    float c_py[N];
    vector_add(a, b, c_py, N);

    /* Verify that the result is correct: */
    unsigned int is_equal = 1;
    for (i=0; i<N; i++) {
        if (c[i] != c_py[i]) {
            is_equal = 0;
            break;
        }
    }
    printf("equal:      ");
    if (is_equal) {
        printf("True\n");
    } else {
        printf("False\n");
    }

    /* Compare performance: */
    unsigned int M = 3;
    clock_t timing = 0;
    clock_t start;
    for (i=0; i<M; i++) {
        start = clock();
        vector_add(a, b, c_py, N);
        timing += clock()-start;
    }
    printf("c time:      %.10f\n", (double)timing/(CLOCKS_PER_SEC*M));

    timing = 0;
    for (i=0; i<M; i++) {
        start = clock();
        clStatus = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                          &global_size, &local_size, 0, NULL, NULL);
        clStatus = clFlush(command_queue);
        clStatus = clFinish(command_queue);
        timing += clock()-start;
    }
    printf("opencl time: %.10f\n", (double)timing/(CLOCKS_PER_SEC*M));

    /* Clean up: */
    clStatus = clReleaseKernel(kernel);
    clStatus = clReleaseProgram(program);
    clStatus = clReleaseMemObject(c_buf);
    clStatus = clReleaseMemObject(b_buf);
    clStatus = clReleaseMemObject(a_buf);
    clStatus = clReleaseCommandQueue(command_queue);
    clStatus = clReleaseContext(context);
    free(device_list);
    free(platforms);
    return 0;
}

Debugging of PyOpenCL

If an error occurs during kernel compilation, PyOpenCL will raise an exception and report where the error occurred in the kernel. For example, if you remove the semicolon from the line c[i] = a[i]+b[i]; above (which is a syntax error), you will get an error message similar to the following:

Traceback (most recent call last):
  File "vector_add.py", line 42, in <module>
    prg = cl.Program(ctx, kernel).build()
  File "/opt/PYTHON/local/lib/python2.7/site-packages/pyopencl-2014.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py", line 209, in build
    options=options, source=self._source)
  File "/opt/PYTHON/local/lib/python2.7/site-packages/pyopencl-2014.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py", line 249, in _build_and_catch_errors
    raise err
pyopencl.RuntimeError: clBuildProgram failed: build program failure -

Build on <pyopencl.Device 'Tesla K40c' on 'NVIDIA CUDA' at 0x157d240>:

:4:21: error: expected ';' after expression
    c[i] = a[i]+b[i]
                    ^
                    ;

(options: -I /opt/PYTHON/lib/python2.7/site-packages/pyopencl-2014.1-py2.7-linux-x86_64.egg/pyopencl/cl)
(source saved as /tmp/tmp7lVUfS.cl)

Notice also that PyOpenCL tells you where the temporary file containing the broken kernel is stored (/tmp/tmp7lVUfS.cl); you can open this file in a text editor to find the erroneous line reported by PyOpenCL.
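
As a rough illustration of how such a report can be matched back to the kernel string, the sketch below parses diagnostics of the form `:LINE:COL: error: MESSAGE` and looks up the offending source line. The helper name `locate_errors` is hypothetical, and the pattern assumes the log format shown in the traceback above; other OpenCL compilers may format their messages differently.

```python
import re

# Matches diagnostics such as ":4:21: error: expected ';' after expression".
# This format is an assumption taken from the build log shown above.
DIAGNOSTIC = re.compile(r"^:(\d+):(\d+): error: (.*)$", re.MULTILINE)


def locate_errors(build_log, kernel_source):
    """Return (line_number, column, message, source_line) for each error."""
    source_lines = kernel_source.split("\n")
    found = []
    for match in DIAGNOSTIC.finditer(build_log):
        line_no, column = int(match.group(1)), int(match.group(2))
        in_range = 0 < line_no <= len(source_lines)
        text = source_lines[line_no - 1] if in_range else ""
        found.append((line_no, column, match.group(3), text))
    return found


# The kernel from the examples with the semicolon removed. The string starts
# with a newline, so the broken statement is on line 4 of the source:
kernel = """
__kernel void func(__global float* a, __global float* b, __global float* c) {
    unsigned int i = get_global_id(0);
    c[i] = a[i]+b[i]
}
"""

log = """:4:21: error: expected ';' after expression
    c[i] = a[i]+b[i]
                    ^
                    ;
"""

for line_no, column, message, text in locate_errors(log, kernel):
    print("line %d, column %d: %s" % (line_no, column, message))
    print(text)
```

In a real script the log text would come from the exception raised by `cl.Program(ctx, kernel).build()`, for instance by catching it and converting it to a string; that usage is an assumption here and is not shown in the examples above.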
