# Indexing in OpenCL & CUDA - eecse4750/e4750_2024Fall_students_repo GitHub Wiki


## Introduction

This wiki explores the mapping of threads to multidimensional data in both OpenCL and CUDA. Additionally, we compare terminology in both places. This is meant to purely be a quick reference. For a more comprehensive analysis, please refer to the Recommended Reading section.

PyCUDA | PyOpenCL
---|---
Grid | NDRange
Block | Work-group
Thread | Work-item

CUDA refers to the indices using the (x, y, z) system. OpenCL uses an indexing system based on the dimensionality of the data.

The highest dimension takes the lowest index: for a 1D setup, index 0 is the x dimension; for a 2D setup it is the y dimension; for a 3D setup it is the z dimension.

Important Note: In other words, in OpenCL, when working with a

- 1D array: the cols are represented by id(0),
- 2D array: the rows are id(0), cols are id(1),
- 3D array: the layers are id(0), rows are id(1) and cols are id(2).
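The 3D case can be sanity-checked with a plain-NumPy sketch of the same index arithmetic (an illustration only, no OpenCL calls): for an array stored in (layers, rows, cols) order, the (id(0), id(1), id(2)) triple and the flat row-major offset address the same element.

```python
import numpy as np

# A 3D array stored as (layers, rows, cols), matching NumPy/C row-major order.
Z, Y, X = 2, 3, 4
a = np.arange(Z * Y * X).reshape(Z, Y, X)

# A work-item at (layer, row, col) would see id(0) == layer,
# id(1) == row, id(2) == col.  Its flat row-major offset is:
layer, row, col = 1, 2, 3
flat = layer * (Y * X) + row * X + col

assert a[layer, row, col] == a.ravel()[flat]  # both forms hit the same element
```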

## Index in CUDA

We begin our discussion with a simple 2D array.

CUDA allows such an array to be divided into thread blocks, each containing multiple threads. These threads are aligned in the x, y & z dimensions. The blocks themselves are identified by block indices (blockIdx) in each of those directions.

CUDA arranges these blocks over a grid. As a PyCUDA user you configure this using the block and grid parameters of the kernel call.

Consider the image. The array contains (say) N rows & M columns; in the image, let N = 8 & M = 8. When defining a block as a collection of threads of shape (4,4,1), we are saying that the block contains 4 threads in each of the X & Y directions. The threadIdx in each block therefore runs from 0 to 3 in the X direction, and similarly in the Y direction.

Further, we divide the overall workspace into a grid of size (2,2,1), i.e. 2 blocks in each of the X & Y directions. This implies that blockIdx.x runs from 0 to 1, and similarly for blockIdx.y. [Note that the parameters in the block and grid tuples are ordered as (x_dir, y_dir, z_dir) sizes.]

Now, how do we access an arbitrary element of this array?

Consider that any generic thread in the overall array has a row index "ty" & a col index "tx". In C (row-major) terms, this element would be A[ty][tx].

In the context of CUDA, this translates as:

```
ty = (blockIdx.y*blockDim.y) + threadIdx.y
tx = (blockIdx.x*blockDim.x) + threadIdx.x
```
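As a sanity check, a plain-Python sketch (illustrative only, not CUDA code) can enumerate every (blockIdx, threadIdx) pair for the 8×8 example above and confirm that the ty/tx formula touches each element exactly once:

```python
# Emulate the CUDA index computation for an 8x8 array,
# block = (4, 4, 1) threads, grid = (2, 2, 1) blocks.
block_dim = (4, 4)   # (blockDim.x, blockDim.y)
grid_dim = (2, 2)    # (gridDim.x, gridDim.y)

visited = set()
for by in range(grid_dim[1]):            # blockIdx.y
    for bx in range(grid_dim[0]):        # blockIdx.x
        for t_y in range(block_dim[1]):  # threadIdx.y
            for t_x in range(block_dim[0]):  # threadIdx.x
                ty = by * block_dim[1] + t_y  # global row index
                tx = bx * block_dim[0] + t_x  # global col index
                visited.add((ty, tx))

# Every element of the 8x8 array is covered exactly once.
assert visited == {(r, c) for r in range(8) for c in range(8)}
```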

An example of this (from a simple code snippet):

```
import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(unsigned int *dest, unsigned int *a, unsigned int *b)
{
    // Global sizes: blocks per grid * threads per block in each dimension.
    const int HEIGHT = gridDim.y * blockDim.y;
    const int WIDTH  = gridDim.x * blockDim.x;
    const int layer = blockIdx.z*blockDim.z + threadIdx.z;
    const int row   = blockIdx.y*blockDim.y + threadIdx.y;
    const int col   = blockIdx.x*blockDim.x + threadIdx.x;
    // Flatten (layer, row, col) into a row-major offset.
    const int idx = layer*HEIGHT*WIDTH + row*WIDTH + col;
    dest[idx] = a[idx] * b[idx];
}
""")
multiply_them = mod.get_function("multiply_them")
Z = 12  # num of elems along levels "z"
Y = 4   # num of elems along height "y"
X = 6   # num of elems along width "x"
tile_x = 2
tile_y = 2
tile_z = 2
# block and grid dimensions must be Python ints
block_x = int(numpy.ceil(X / tile_x))  # number of blocks along X
block_y = int(numpy.ceil(Y / tile_y))
block_z = int(numpy.ceil(Z / tile_z))
### VERY IMPORTANT: NumPy & C (in general) index elements in
### highest-to-lowest dimension order, so the arrays are shaped (Z, Y, X).
a = numpy.random.randint(0, 255, (Z, Y, X)).astype(numpy.uint32)
b = numpy.random.randint(0, 255, (Z, Y, X)).astype(numpy.uint32)
### Note the grid & block alignments below:
dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(tile_x, tile_y, tile_z), grid=(block_x, block_y, block_z))
```

## Index in OpenCL

Indexing in OpenCL is more consistent with C.

Additionally, OpenCL does not require an explicit grid specification as CUDA does; you supply the global and local sizes instead.

A way to understand indexing in OpenCL is as follows: "Idx 0" is taken on by the highest dimension. In a simple working environment, that implies:

- 1D arrays: int col = get_global_id(0);
- 2D arrays: int row = get_global_id(0); int col = get_global_id(1);
- 3D arrays: int lev = get_global_id(0); int row = get_global_id(1); int col = get_global_id(2);

This is a VERY important concept.

Further, OpenCL, being more generic, allows you to manipulate work-groups just as CUDA allows you to manipulate thread blocks.

To replicate the behaviour seen in CUDA:

- blockIdx in CUDA corresponds to group_id in OpenCL
- threadIdx in CUDA corresponds to local_id in OpenCL

That is, in a 2D system:

int blockId_y = get_group_id(0); // ID 0 goes to the highest dimension of your workspace.
int blockId_x = get_group_id(1);

int threadId_y = get_local_id(0);
int threadId_x = get_local_id(1);

Note that the global_id is equivalent to the tx/ty we calculated in the case of CUDA: get_global_id(d) = get_group_id(d)*get_local_size(d) + get_local_id(d).
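The relationship between global, group and local ids can be verified with a plain-Python sketch of the arithmetic (illustrative only, not the OpenCL API):

```python
# Emulate get_global_id / get_group_id / get_local_id for one dimension.
global_size = 8   # total work-items in this dimension
local_size = 4    # work-items per work-group in this dimension

for gid in range(global_size):
    group_id = gid // local_size   # what get_group_id(d) would return
    local_id = gid % local_size    # what get_local_id(d) would return
    # The identity: get_global_id == group_id*local_size + local_id
    assert gid == group_id * local_size + local_id
```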

The last point of focus is how the "grid" & "block" translate in the context of OpenCL.

OpenCL does not use an explicit "grid" variable. Instead, you define the complete workspace (global_size) and the individual work-group size (local_size), the latter being similar to the "block" in CUDA. The number of work-groups is therefore implied by global_size / local_size in each dimension.
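A quick sketch of that implication (plain Python, with example sizes chosen purely for illustration; note that in OpenCL 1.x each global dimension must divide evenly by the corresponding local dimension):

```python
# Work-group count implied by the global and local sizes, per dimension.
global_size = (2, 6, 4)  # complete workspace, (Z, Y, X) order
local_size = (2, 2, 2)   # one work-group's shape

# In OpenCL 1.x, each global dimension must divide evenly by the local one.
assert all(g % l == 0 for g, l in zip(global_size, local_size))

num_groups = tuple(g // l for g, l in zip(global_size, local_size))
print(num_groups)  # (1, 3, 2)
```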

The global worksize and the local worksize can be defined at the time of the kernel call.

Review the sample below for clarification.

```
import numpy
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
__kernel void sum(__global const unsigned int *a,
                  __global const unsigned int *b, __global unsigned int *c)
{
    int HEI = get_global_size(1);
    int WID = get_global_size(2);
    int lev = get_group_id(0)*get_local_size(0) + get_local_id(0);
    int row = get_group_id(1)*get_local_size(1) + get_local_id(1);
    int col = get_group_id(2)*get_local_size(2) + get_local_id(2);
    // Flatten (lev, row, col) into a row-major offset and add.
    int idx = lev*HEI*WID + row*WID + col;
    c[idx] = a[idx] + b[idx];
}
""").build()
Z = 2  ## number of elems in levels "Z"
Y = 6  ## number of elems in height "Y"
X = 4  ## number of elems in width "X"
tile_shape = (2, 2, 2)  # (tile_z, tile_y, tile_x)
a = numpy.random.randint(11, 31, (Z, Y, X)).astype(numpy.uint32)
b = numpy.random.randint(11, 31, (Z, Y, X)).astype(numpy.uint32)
a_dev = cl_array.to_device(queue, a)
b_dev = cl_array.to_device(queue, b)
dest_dev = cl_array.empty_like(a_dev)
## THIS IS THE KERNEL LAUNCH
## Its parameters are (queue, GLOBAL_SIZE, LOCAL_SIZE, ...kernel arguments...)
prg.sum(queue, a.shape, tile_shape, a_dev.data, b_dev.data, dest_dev.data)
```