OpenCL performance tips - ProkopHapala/FireCore GitHub Wiki

Flip-floppin

Consider we have simple kernel like this:

__kernel update(float4* pos, int4* neighs){
     int    i    = get_global_id(0);
     int4   ng   = neighs[i]; 
     float3 f    = force_func( pos[i], pos[ng.x] ) 
                 + force_func( pos[i], pos[ng.y] ) 
                 + force_func( pos[i], pos[ng.z] ) 
                 + force_func( pos[i], pos[ng.w] );
     pos[i].xyz += force[j]; 
}

The problem is that state of pos[ng.?] can be changed by another thread so we do not know if we access the value before or after update. The simple solution of this problem is to split this kernel into two (1) evaluate force for all particles (2) update the positions. However such approach has a disadvantage - it require two clEnqueueNDRangeKernel() instead of one, which may pose a bottleneck for some problems which are limited by CPU-GPU communication. To minimize number of kernel enque calls we can use two copies of the buffer.

OpenCL side:

__kernel update(float4* pos, float4* posOut, int4* neighs){
     int    i    = get_global_id(0);
     int4   ng   = neighs[i]; 
     float3 f    = force_func( pos[i], pos[ng.x] ) 
                 + force_func( pos[i], pos[ng.y] ) 
                 + force_func( pos[i], pos[ng.z] ) 
                 + force_func( pos[i], pos[ng.w] );
     posOut[i].xyz += force[j]; 
}

C++ side:

task1 = setup_update( pos1, pos2, neighs );
task2 = setup_update( pos2, pos1, neighs );

for(int itr=0; itr<niter; itr++){
   task1.enque();
   task2.enque();
}

Device-side enque

In Molecular dynamics we often need to run rather short kernels in long loops (i.e. many iterations), e.g. like this

for(int itr=0;itr<niter;itr++){
    ocl.task_evalForce.enque();
    ocl.task_moveAtoms.enque();
}
ocl.finishRaw();
ocl.download(ibuff_apos, apos);

This cause serious overhead just by calling kernell enque() by the processor. Instead in OpenCL2.0 we can do enque within another directly kernell on OpenCL device by calling enqueue_kernel() function within .cl program.

More Info

Example

Below are some examples of how to enqueue a block.

kernel void my_func_A(global int *a, global int *b, global int *c){ ... }

kernel void my_func_B(global int *a, global int *b, global int *c){
    ndrange_t ndrange; 
    // build ndrange information
    ...
    // example – enqueue a kernel as a block
    enqueue_kernel(get_default_queue(), ndrange, ^{my_func_A(a, b, c);});
    ...
}

kernel void my_func_C(global int *a, global int *b, global int *c){
    ndrange_t ndrange;
    // build ndrange information
    ...
    // note that a, b and c are variables in scope of the block
    void (^my_block_A)(void) = ^{my_func_A(a, b, c);};
    // enqueue the block variable
    enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,  my_block_A);
    ...
}

The example below shows how to declare a block literal and enqueue it.

kernel void my_func(global int *a, global int *b) {
    ndrange_t ndrange;
    // build ndrange information
    ...
    // note that a, b and c are variables in scope of the block
    void (^my_block_A)(void) =  ^{ size_t id = get_global_id(0);  b[id] += a[id]; };

// enqueue the block variable
enqueue_kernel(get_default_queue(),  CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,  my_block_A);

// or we could have done the following
enqueue_kernel(get_default_queue(),  CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,^{ size_t id = get_global_id(0); b[id] += a[id]; }; 
}

NOTE: Blocks passed to enqueue_kernel cannot use global variables or stack variables local to the enclosing lexical scope that are a pointer type in the local or private address space.

kernel voidfoo(global int *a, local int *lptr, …){
    enqueue_kernel(get_default_queue(),CLK_ENQUEUE_FLAGS_WAIT_KERNEL,ndrange, 
           ^{  size_t id = get_global_id(0);local int *p = lptr; // undefined behavior
            };
}