# OpenCL performance tips
## Flip-flopping
Consider a simple kernel like this:

    __kernel void update(__global float4* pos, __global int4* neighs){
        int i = get_global_id(0);
        int4 ng = neighs[i];
        float3 f = force_func( pos[i], pos[ng.x] )
                 + force_func( pos[i], pos[ng.y] )
                 + force_func( pos[i], pos[ng.z] )
                 + force_func( pos[i], pos[ng.w] );
        pos[i].xyz += f;
    }
The problem is that pos[ng.?] may be modified concurrently by another work-item, so we do not know whether we read its value before or after the update.

The simple solution is to split the kernel into two: (1) evaluate the forces for all particles, (2) update the positions (a sketch of this split is shown below). However, this approach has a disadvantage: it requires two clEnqueueNDRangeKernel() calls per step instead of one, which may become a bottleneck for problems limited by CPU-GPU communication. To minimize the number of kernel enqueue calls we can instead keep a single kernel and use two copies of the position buffer, alternating between them (hence the name flip-flopping).
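For contrast, the two-kernel split mentioned above could look roughly like this (a sketch only; force_func is assumed to be defined elsewhere as in the kernel above, and the kernel names and the missing time-step factor are illustrative):

    __kernel void evalForce(__global const float4* pos, __global float4* force, __global const int4* neighs){
        int i = get_global_id(0);
        int4 ng = neighs[i];
        float3 f = force_func( pos[i], pos[ng.x] )
                 + force_func( pos[i], pos[ng.y] )
                 + force_func( pos[i], pos[ng.z] )
                 + force_func( pos[i], pos[ng.w] );
        force[i].xyz = f;            // only forces are written; positions stay untouched
    }

    __kernel void moveAtoms(__global float4* pos, __global const float4* force){
        int i = get_global_id(0);
        pos[i].xyz += force[i].xyz;  // safe: all forces were already evaluated by the first kernel
    }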
OpenCL side of the flip-flop version:

    __kernel void update(__global float4* pos, __global float4* posOut, __global int4* neighs){
        int i = get_global_id(0);
        int4 ng = neighs[i];
        float3 f = force_func( pos[i], pos[ng.x] )
                 + force_func( pos[i], pos[ng.y] )
                 + force_func( pos[i], pos[ng.z] )
                 + force_func( pos[i], pos[ng.w] );
        posOut[i].xyz = pos[i].xyz + f;   // write the new position into the other buffer
    }
C++ side:

    task1 = setup_update( pos1, pos2, neighs );
    task2 = setup_update( pos2, pos1, neighs );
    for(int itr=0; itr<niter; itr++){
        task1.enque();   // reads pos1, writes pos2
        task2.enque();   // reads pos2, writes pos1
    }
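The setup_update()/enque() calls above are wrappers; for reference, the same flip-flop scheme written directly against the plain OpenCL C API might look like this (a sketch: program, queue, the cl_mem buffers pos1, pos2, neighs, and the counts n, niter are assumed to exist; error checking omitted):

    // host code (requires #include <CL/cl.h>); two instances of the same kernel with swapped buffers
    cl_int err;
    cl_kernel k1 = clCreateKernel(program, "update", &err);
    cl_kernel k2 = clCreateKernel(program, "update", &err);
    clSetKernelArg(k1, 0, sizeof(cl_mem), &pos1);   // reads pos1 ...
    clSetKernelArg(k1, 1, sizeof(cl_mem), &pos2);   // ... writes pos2
    clSetKernelArg(k1, 2, sizeof(cl_mem), &neighs);
    clSetKernelArg(k2, 0, sizeof(cl_mem), &pos2);   // reads pos2 ...
    clSetKernelArg(k2, 1, sizeof(cl_mem), &pos1);   // ... writes pos1
    clSetKernelArg(k2, 2, sizeof(cl_mem), &neighs);
    size_t glob = n;                                // one work-item per particle
    for(int itr=0; itr<niter; itr++){               // each pass advances two steps
        clEnqueueNDRangeKernel(queue, k1, 1, NULL, &glob, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, k2, 1, NULL, &glob, NULL, 0, NULL, NULL);
    }
    clFinish(queue);

This still issues two enqueue calls per loop pass, but each call now performs a whole force+move step, i.e. one enqueue per step instead of two.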
## Device-side enqueue
In molecular dynamics we often need to run rather short kernels in long loops (i.e. many iterations), e.g. like this:

    for(int itr=0; itr<niter; itr++){
        ocl.task_evalForce.enque();
        ocl.task_moveAtoms.enque();
    }
    ocl.finishRaw();
    ocl.download(ibuff_apos, apos);
This causes serious overhead just from the host CPU issuing the enqueue calls. In OpenCL 2.0 we can instead enqueue a kernel directly from another kernel running on the device, by calling the enqueue_kernel() function inside the .cl program.
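A minimal sketch of how the loop above could be moved onto the device (assuming evalForce and moveAtoms are kernels defined in the same program; the kernel names, arguments and event-chaining scheme are illustrative, not the actual FireCore code):

    // driver kernel, launched with a single work-item; chains force/move kernels on the device
    __kernel void run_md(__global float4* apos, __global float4* avel,
                         __global float4* aforce, __global int4* neighs,
                         int niter, int natoms){
        queue_t   q  = get_default_queue();
        ndrange_t nd = ndrange_1D(natoms);
        clk_event_t prev;                       // event of the last enqueued kernel
        int have_prev = 0;
        for(int itr=0; itr<niter; itr++){
            clk_event_t evF, evM;
            // force evaluation waits for the previous position update (if any)
            enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, nd,
                           have_prev, have_prev ? &prev : NULL, &evF,
                           ^{ evalForce(apos, aforce, neighs); });
            if(have_prev) release_event(prev);
            // position update waits for the force evaluation
            enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, nd,
                           1, &evF, &evM,
                           ^{ moveAtoms(apos, avel, aforce); });
            release_event(evF);
            prev = evM; have_prev = 1;
        }
        if(have_prev) release_event(prev);
    }

Device-side enqueue requires building the program with -cl-std=CL2.0 and creating a device-side default queue on the host via clCreateCommandQueueWithProperties() with CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT. Also note that the on-device queue has limited capacity, so for large niter the enqueues can fail and the return value of enqueue_kernel() should be checked (see the CL_OUT_OF_RESOURCES link below).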
## More Info
- OpenCL 2.0 is here
- Device-side enqueue causes CL_OUT_OF_RESOURCES
- GPU-Quicksort in OpenCL 2.0 Nested Parallelism
- What does ^{} mean in C?
## Example
Below are some examples of how to enqueue a block.

    kernel void my_func_A(global int *a, global int *b, global int *c){ ... }

    kernel void my_func_B(global int *a, global int *b, global int *c){
        ndrange_t ndrange;
        // build ndrange information
        ...
        // example - enqueue a kernel as a block
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, ^{my_func_A(a, b, c);});
        ...
    }

    kernel void my_func_C(global int *a, global int *b, global int *c){
        ndrange_t ndrange;
        // build ndrange information
        ...
        // note that a, b and c are variables in scope of the block
        void (^my_block_A)(void) = ^{my_func_A(a, b, c);};
        // enqueue the block variable
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, my_block_A);
        ...
    }
The example below shows how to declare a block literal and enqueue it.

    kernel void my_func(global int *a, global int *b){
        ndrange_t ndrange;
        // build ndrange information
        ...
        // note that a and b are variables in scope of the block
        void (^my_block_A)(void) = ^{ size_t id = get_global_id(0); b[id] += a[id]; };
        // enqueue the block variable
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, my_block_A);
        // or we could have done the following
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, ^{ size_t id = get_global_id(0); b[id] += a[id]; });
    }
NOTE: Blocks passed to enqueue_kernel cannot use global variables or stack variables local to the enclosing lexical scope that are a pointer type in the local or private address space.

    kernel void foo(global int *a, local int *lptr, ...){
        ...
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,
            ^{ size_t id = get_global_id(0);
               local int *p = lptr;   // undefined behavior: the block captures a local-pointer argument
             });
    }
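If a device-enqueued block does need local memory, the supported way (rather than capturing a local pointer) is the enqueue_kernel variant that takes a block with local void* parameters together with their sizes in bytes; a sketch (kernel name, sizes and usage are illustrative):

    kernel void bar(global int *a){
        ndrange_t ndrange = ndrange_1D(get_global_size(0), get_local_size(0));
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,
            ^(local void *scratch){                      // local buffer allocated per work-group by the runtime
                local int *p = (local int*)scratch;
                size_t id = get_local_id(0);
                p[id] = a[get_global_id(0)];
                // ... work with p ...
            },
            (uint)(get_local_size(0)*sizeof(int)));      // size of the local buffer in bytes
    }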