# Comparison of existing implementations
functionality | gpu nd array (python interface) | Theano CudaNdarray | GPUmat GPU (single/double) |
--- | --- | --- | --- |
backend | cuda/opencl | cuda | cuda |
dtype | float32 {u}int{8,16,32,64} complex64 (float64 and complex128 possible) | float32 | float32, complex32, float64, complex64 |
ndim | generic | generic | generic |
memory layout | generic | generic | generic |
contiguous transfer to/from gpu | Yes | Yes | Yes |
non-contiguous transfer to/from gpu | copy if needed | copy if needed | copy if needed |
ascontiguousarray | Yes | No | No |
asfortranarray | Yes | No | No |
copy | Yes | Yes | Yes, clone() |
zeros | Yes | Yes | Yes |
empty | Yes | No | Yes: GPUsingle();setSize();GPUallocVector() |
len | Yes | Yes | Yes: length() |
subtensor(var[…]) | Yes | Yes | Yes |
subtensor(var[N]) | Yes | Yes | Yes |
subtensor(var[strides with step]) | Yes | Yes | Yes |
subtensor(var[strides with neg start/stop/step]) | Yes | Yes | Yes |
subtensor(var[tuple with mix of slice, integer and numpy.int64]) | Yes | Yes | No |
elemwise | generic, with dimension collapsing and mixed dtypes | as gpu nd array | as gpu nd array |
elemwise with broadcasting | Yes | Yes | Yes |
reduction | sum/prod, generic for any ndim and any combination of reduced axes | sum only, with these patterns: 1, 11, 10, 01, 001, 010, 100, 110, 011, 111, 0011, 0101, 0111, 1011, 1111; patterns 1+ use only 1 block | sum |
__setitem__ | Yes (with broadcast if necessary) | The value must be a CudaNdarray (no broadcasting done); when the destination is c-contiguous the value can be 0 (memset) or an ndarray (transfer) | Yes: subsasgn(), assign() |
reshape | Yes (copies when numpy would copy; see the NumPy sketch after the table) | Yes (copy if not c_contiguous) | Yes: setSize(), reshape() |
n-dim transpose | Yes | Yes (can add dims with shape 1 at the same time) | No |
dot/gemm | Yes* | Theano op | Yes: times(), GPUtimes() |
gemv | Yes* | Theano op | ? |
(*) It needs an external BLAS; one is included with CUDA (cuBLAS). For the OpenCL back-end you can use clmath, but clmath support isn't good on Mac and Windows.
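The gpu nd array column above is meant to follow NumPy semantics, so plain NumPy is the reference point for the behaviours listed in the table (layout conversion, mixed indexing, reshape copy rules, broadcasting `__setitem__`). The sketch below is NumPy only, with no GPU code involved; the GPU array is assumed to mirror it.

```python
import numpy as np

a = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# ascontiguousarray / asfortranarray: copy only when the layout has to change
c = np.ascontiguousarray(a.T)       # C-contiguous copy of the transposed view
f = np.asfortranarray(a)            # Fortran-ordered copy of a

# subtensor with a mix of slice, integer and numpy.int64
sub = a[1, 0:3:2, np.int64(1)]      # shape (2,)

# reshape: a view when possible, a copy otherwise ("copy when numpy would copy")
v = a.reshape(6, 4)                 # view, a is C-contiguous
w = a.T.reshape(24)                 # copy, the transposed array is not contiguous

# __setitem__ with broadcasting
a[0] = 7.0                          # scalar broadcast over the (3, 4) slab
a[:, :, 1] = np.arange(3)           # (3,) broadcast over the (2, 3) destination
```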
Not done yet, but planned, in gpu nd array:

functionality | gpu nd array (python interface) | Theano CudaNdarray | GPUmat GPU (single/double) |
--- | --- | --- | --- |
ones | No | Theano op only | Yes |
subtensor with a list of indices, var[[1,2,3,4]] (part of numpy advanced indexing) | No | in a branch | Yes: slice(A, {[1,2,3,4]}) |
reduction (max, min, argmax) | No | No | No |
ger | No | Theano op | ? |
flatten | No (you can use reshape for this) | Yes | ? |
random | No | Theano op only with our own implementation | Yes: GPUrand(), GPUrandn() |
join | No | Theano op | ? |
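For the planned items above, the targeted behaviour is again the NumPy one. A minimal NumPy sketch follows (the eventual gpu nd array spelling may differ):

```python
import numpy as np

o = np.ones((3, 4), dtype=np.float32)            # ones

a = np.arange(10, dtype=np.float32)
picked = a[[1, 2, 3, 4]]                         # list-of-indices subtensor (advanced indexing, returns a copy)

m = np.arange(12, dtype=np.float32).reshape(3, 4)
m.max(axis=0), m.min(axis=1), m.argmax(axis=1)   # max/min/argmax reductions, per axis

flat = m.reshape(-1)                             # flatten can be spelled as a reshape, as noted above
```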
Other Theano ops: CrossentropySoftmaxArgmax1HotWithBias, CrossentropySoftmax1HotWithBiasDx, Softmax, SoftmaxWithBias, DownsampleFactorMax, GpuImages2Neibs, Dot22Scalar, GpuEye, ErfinvGPU
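As a rough illustration of how one of these ops is reached from user code, here is a sketch assuming Theano is installed and configured with device=gpu and floatX=float32 (on the CPU the same graph simply runs the CPU Softmax op):

```python
import numpy
import theano
import theano.tensor as T

x = T.fmatrix('x')            # float32 matrix input
y = T.nnet.softmax(x)         # Softmax op; swapped for its GPU version by the optimizer when device=gpu
f = theano.function([x], y)

print(f(numpy.random.rand(2, 3).astype('float32')))
```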
gnumpy module functions: as_garray, as_garray_or_scalar, as_numpy_array, tile (the same as numpy?), rand, randn, empty, zeros, ones, seed_rand, dot(0d, *d), dot(1d, 1d), dot(1d, 2d), dot(2d, 1d), dot(2d, 2d), dot(a1.ndim >= 2, a2.ndim >= 2) with reshape and transpose (transpose done by a loop?), outer, concatenate, where, nonzero, support for newaxis?, eye, diagflat, tensordot, reductions (all, any, sum, mean, max, min; prod and std cpu only), elemwise (abs, exp, isinf, isnan, log, log_1_plus_exp, logistic, negative, sign, sqrt, tanh; cpu only: log10).

gnumpy.garray methods: as_numpy_array, astype, ravel (calls self.reshape(-1)), item (transfer to cpu), sort (cpu only), reshape_2d, T, transpose, shiftAxesRight, copy, diagflat, diagonal, diag, all_real, isinf, isreal, isnan, isnumber, abs, as_bool, exp, log, log_1_plus_exp, logistic, sigmoid, sign, sqrt, tanh, sum, mean, max, argmax (cpu), argmin (cpu), min, all, any, all2, any2, rand, euclid_norm, dot, where, nonzero, __lt__, gt, le, ge, ne, eq, sub, div, rmul, radd, rsub, rdiv, rpow, pos, neg, iadd, imul, isub, idiv, imod, ipow, len, getitem, iter, __setitem__.
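A short usage sketch built only from the gnumpy names listed above (assuming gnumpy and the cudamat backend it relies on are installed; signatures not verified against the gnumpy source):

```python
import numpy as np
import gnumpy as gpu

a = gpu.randn(3, 4)                                  # random garray on the GPU
b = gpu.as_garray(np.ones((4, 2), dtype=np.float32)) # transfer a numpy array to the GPU

c = gpu.dot(a, b)                                    # dot(2d, 2d)
t = gpu.tanh(c)                                      # elemwise op
s = c.sum()                                          # full reduction

host = t.as_numpy_array()                            # transfer back to a numpy ndarray
```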