Home - abergeron/compyte GitHub Wiki

Goal of compyte

Make a common GPU ndarray (a matrix/tensor of n dimensions) that can be reused by all projects.

Mailing list

Development/user mailing list: http://lists.tiker.net/listinfo/gpundarray
Announce mailing list (low volume): http://lists.tiker.net/listinfo/gpundarray-announce

Comparison of existing implementations

Branch

Current development, which includes a real C back-end and full support for OpenCL, is in this branch: https://github.com/abergeron/compyte/tree/reorg

Motivation

Currently there are at least 6 different GPU arrays in Python
- CudaNdarray(Theano), GPUArray(pycuda), CUDAMatrix(cudamat), GPUArray(pyopencl), Clyther, Copperhead, ...
- There are even more if we include other languages.
They are incompatible
- None have the same properties and interface
All of them implement a subset of numpy.ndarray on the GPU!

Lack of Standard Creates Problems:

Duplicates work
- Writing GPU code that is both correct and fast is harder and slower than writing CPU/Python code
Harder to port/reuse code
Harder to find/distribute code
Divides development work

Pitfalls to Avoid

Start alone
- We need different people/groups to "adopt" the new GpuNdArray
Too simple - other projects won't adopt
Too general - other projects will implement "light" versions... and not adopt
- Having an easy way to check conditions and convert arrays, as numpy does, could alleviate this.

The preferred option is to have a general version with easy check/conversion to allow supporting only a subset!
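
As an illustration of the kind of condition check meant here, a project that only supports C-contiguous arrays could test that property from the shape and strides alone, and request a conversion otherwise. This is a minimal sketch: the helper name `is_c_contiguous` and the tuple-based layout description are hypothetical and not part of compyte. Strides are assumed to be in bytes, as in numpy.

```python
def is_c_contiguous(shape, strides, itemsize):
    """Return True if (shape, strides) describes a C-contiguous layout.

    Hypothetical helper for illustration; strides are in bytes.
    """
    # An array with a zero-length dimension holds no elements,
    # so it is treated as contiguous.
    if 0 in shape:
        return True
    expected = itemsize
    # Walk the dimensions from the fastest-varying (last) to the first.
    for dim, stride in zip(reversed(shape), reversed(strides)):
        # The stride of a length-1 dimension is never used, so skip it.
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True
```

A caller that supports only this subset would copy or convert any array for which the check returns False before using it.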

Design Goals

Make it VERY similar to numpy.ndarray
- Easier to attract other people from the Python community
Have the base object in C to allow collaboration with more projects.
- We want people from C, C++, Ruby, R, ... to all use the same base GPU ndarray.
Be compatible with CUDA and OpenCL

Current behavior not wanted

No CPU code generated from the Python interface (for PyOpenCL and PyCUDA). GPU code is OK.

Implementation plan

All of the basic C code is done. We are currently working on elementwise functionality in anticipation of a PyOpenCL/PyCUDA integration.

Sketch of the file structure and the reasoning behind it

This section details the file structure and gives you a hint of what to expect if you intend to ship a project integrating this code. It applies to the code in the reorg branch, which will become the mainline soon. It is located here: http://github.com/abergeron/compyte/tree/reorg

Some of these files are not in the repository yet, which means that this functionality is being worked on.

The main files are:

ndarray/compyte_buffer.h:
- Defines the base compyte_buffer object
- Also defines the structure for GpuArray and GpuKernel
ndarray/compyte_buffer_cuda.c:
- Implements the CUDA version of the compyte_buffer API
ndarray/compyte_buffer_opencl.c:
- Implements the OpenCL version of the compyte_buffer API
ndarray/pygpu_ndarray.pyx:
- Defines a Cython wrapper that exposes the GpuArray object and a couple of functions to mimic the interface of numpy.ndarray
elemwise.py:
- Supports running arbitrary elementwise kernels on GpuArrays of arbitrary memory layout (python-only).
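
To give an idea of what "arbitrary memory layout" involves for elementwise kernels, each work item has to turn a flat element index into a byte offset using the array's base offset, shape and strides. The following is a pure-Python sketch of that arithmetic only. The function name `element_offset` is hypothetical, and this is not the code found in elemwise.py.

```python
def element_offset(flat_index, shape, strides, base_offset=0):
    """Map a flat (row-major) element index to a byte offset.

    Hypothetical helper for illustration; strides are in bytes.
    """
    offset = base_offset
    # Peel off one coordinate per dimension, starting with the last one,
    # which varies fastest in row-major iteration order.
    for dim, stride in zip(reversed(shape), reversed(strides)):
        flat_index, pos = divmod(flat_index, dim)
        offset += pos * stride
    return offset
```

The same loop works for contiguous arrays, transposed views and slices, since only the strides and the base offset change between them.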

These files serve as support for the functionality above:

ndarray/compyte_types.{c,h}:
- generated by ndarray/gen_types.py
- serves as a type table for operations that need to know some information about the types involved
ndarray/compyte_util.{c,h}:
- some generally useful functions that don't really fit anywhere else.
ndarray/setup.py:
- Builds the python module implemented in pygpu_ndarray.pyx along with all the supporting code

These files serve for portability (mainly to support Windows):

ndarray/compyte_compat.h
ndarray/compyte_mkstemp.c
ndarray/compyte_strl.c
ndarray/wincompat/*

Some tests for the python interface (that also test the underlying C code):

ndarray/test_gpu_ndarray.py (test basic functionality: init, copy, indexing, ...)
tests/test_elemwise.py (test that the numpy-like elemwise operations on array work correctly)

Some gotchas and differences from numpy

Like numpy, we have the updateifcopy flag, but it is always False and we expect it to be False.
Buffer offsets (like those generated when you do a[1:3]) are only partially supported under OpenCL 1.0. You cannot run kernels on such arrays without copying them beforehand.
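
To make the buffer offset point concrete, the sketch below models on the host how a slice such as a[1:3] yields a view with a non-zero offset into the same buffer, and how copying it produces a fresh buffer starting at offset 0. It uses plain Python bytes in place of device memory. The names `slice_view` and `copy_to_fresh_buffer` are hypothetical and do not correspond to compyte functions.

```python
import struct

def slice_view(offset, length, stride, start, stop):
    """Describe a[start:stop] of a 1-d array as (offset, length, stride).

    Assumes 0 <= start <= stop <= length; no data is copied.
    """
    return offset + start * stride, stop - start, stride

def copy_to_fresh_buffer(buf, offset, length, stride, itemsize):
    """Gather the view's elements into a new buffer that starts at offset 0."""
    out = bytearray()
    for i in range(length):
        pos = offset + i * stride
        out += buf[pos:pos + itemsize]
    # The copy is contiguous, so its stride equals the item size.
    return bytes(out), 0, length, itemsize

# Four little-endian 32-bit integers stand in for a device buffer.
data = struct.pack("<4i", 10, 11, 12, 13)
view = slice_view(0, 4, 4, 1, 3)  # same buffer, offset of 4 bytes
fresh, fresh_offset, fresh_length, fresh_stride = copy_to_fresh_buffer(
    data, *view, itemsize=4
)
```

After the copy, the array no longer depends on an offset into the original buffer, which is the state required before running a kernel under the restriction described above.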