SpeedComparisons - GollyGang/ready GitHub Wiki

This page will contain the results from the different implementations on different systems, allowing everyone to see how they compare.

Results

To standardise the results we have chosen the Gray-Scott system, with toroidal topology. The figures are in millions of cell-generations per second (Mcgs).

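| Implementation | System 1 | System 2 | System 3 | System 3b | System 4 | System 5 | System 6 | others (please add) |
|---|---|---|---|---|---|---|---|---|
| GrayScott | 84 | 117 | 67 | | 126 | 14 | 9 | |
| GrayScott_double | 87 | 115 | 69 | | 127 | 10 | 9 | |
| GrayScott_OpenCV | 97 | 117 | | | | | | |
| GrayScott_OpenMP | 250 | 163 | 98 | | 250 | 10 | 8 | |
| GrayScott_SSE | 540 | 450 | 315 | | 490 | 18 | 27 | |
| GrayScott_SSE_OpenMP | 1000 | 160-230 (note 1) | 64-79 (note 1) | | 230-728 (note 1) | 23 | 24 | |
| GrayScott_HWIVector (1 thread) | 540 | 455 | 285 | | 478 | 26 | 27 | |
| HWIVector (2 threads) | 950 | 750 | 478 | | 838 | 23 | 27 | |
| HWIVector (all threads) | 1710 | 825 | 675 | | 1268 | | | |
| GrayScott_OpenCL | 3400 | 137 | 201 | 27 | 1760 | - | - | |
| GrayScott_OpenCL_Local | 3380 | 54 | 76 | 37 | 1400 | - | - | |
| GrayScott_OpenCL_2x2 | 2880 | 339 | 423 | 112 | 1860 | - | - | |
| GrayScott_OpenCL_Image | 2300 | note 2 | note 2 | 324 | note 2 | - | - | |
| GrayScott_OpenCL_Image_2x2 | 4400 | note 2 | note 2 | 437 | note 2 | - | - | |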

System 1: Visual Studio 2008 (32-bit), Windows 7 (64-bit), Intel i7-2600 (4 cores, 8 threads) @ 3.4GHz, nVidia GeForce GTX 460 (962MB global memory, 48 KB local memory, local memory type: local) (Tim's desktop)

System 2: CMake version 2.8.5, GCC 4.2.1, Mac OS 10.6.8 (64-bit), Intel Core i7-2720QM (4 cores, 8 threads) @ 2.2 GHz, AMD Radeon HD 6750 QM with 1024 MB VRAM (Robert's mobile)

System 3: CMake version 2.8.5, GCC 4.2.1, Mac OS 10.6.8 (64-bit), Dual Intel Xeon E5520 (8 cores, 16 threads) @ 2.27 GHz, ATI Radeon HD 5770 with 1024 MB VRAM (Robert's desktop)

System 3b is the same as System 3 but on its second graphics card: NVidia GeForce GT 120 with 512 MB VRAM

System 4: CMake version 2.8.5, GCC 4.2.1, Mac OS 10.6.8 (64-bit), Intel Core i5 (4 cores, 4 threads) @ 3.1 GHz, AMD Radeon HD 6970M with 1024 MB VRAM (Andrew)

System 5: GCC 4.6.1, Debian Linux (32-bit), Intel Pentium 4 HT@ 3.2GHz, AMD Radeon Mobility 9100 IGP (Tim's old laptop)

System 6: Visual Studio 2008, Windows XP (32-bit), Intel Pentium 4 HT@ 3.2GHz, AMD Radeon Mobility 9100 IGP (Tim's old laptop)

Notes:

1. Speeds up as the dots fill the screen (a known issue on Intel processors with denormal values)

2. Does not run: the OS and/or GPU driver does not support a required OpenCL feature.

Issues:

  • Not all systems currently implement the toroidal topology.
  • The grid size affects the speed. 256x256 isn't big enough for some implementations. For example, at 1024x1024 with GrayScott_OpenCL_Image_2x2 I get 4200 fps, the equivalent of 67000 fps on a 256x256 grid, but only 36000 fps on an actual 256x256 grid.
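
The grid-size arithmetic can be sketched as follows, assuming that "fps" here means whole-grid generations per second. The function names are illustrative and are not taken from the project's code.

```cpp
#include <cassert>

// Millions of cell-generations per second (Mcgs) for a width x height grid
// that is updated `generations_per_second` times each second.
double mcgs(int width, int height, double generations_per_second)
{
    assert(width > 0 && height > 0);
    return static_cast<double>(width) * height * generations_per_second / 1e6;
}

// Generation rate on a different grid size that would correspond to the
// same number of cell updates per second.
double equivalent_rate(int from_w, int from_h, double rate, int to_w, int to_h)
{
    assert(from_w > 0 && from_h > 0 && to_w > 0 && to_h > 0);
    return rate * (static_cast<double>(from_w) * from_h)
                / (static_cast<double>(to_w) * to_h);
}
```

With these definitions, `equivalent_rate(1024, 1024, 4200, 256, 256)` gives 67200, `mcgs(1024, 1024, 4200)` is about 4404 and `mcgs(256, 256, 36000)` is about 2359, so the small grid reaches roughly half the throughput of the large one.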

History

Our first implementation (GrayScott) suffered from variable speed. Different patterns would run at different speeds, despite the fact that nothing in the code depended on the values of the floats being manipulated.

We tried using doubles instead of floats (GrayScott_double) and this helped enormously. There was still a speed change but it was much less.

Eventually we worked out that we had the problem of denormals - float values that through repeated division were becoming smaller and smaller in the empty spaces where no Gray-Scott spots were found. Adding a tiny constant got rid of this problem, giving us a speed of 84 Mcgs on System 1. Now the float and the double versions run at about the same speed.
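
A minimal sketch of the effect, assuming default IEEE floating-point behaviour (no flush-to-zero mode and no fast-math compiler flags). The halving step and the bias value `1e-20f` are illustrative choices, not the constants used in the actual code.

```cpp
#include <cassert>
#include <cmath>

// Repeatedly shrink a value, as happens to concentrations in empty regions.
// `bias` is the tiny constant added at every step (0 means "no fix").
float decay(float value, int steps, float bias)
{
    assert(steps >= 0);
    for (int i = 0; i < steps; ++i)
        value = value * 0.5f + bias;
    return value;
}

// True if x is subnormal ("denormal"): non-zero but smaller in magnitude
// than the smallest normal float, which many CPUs process much more slowly.
bool is_denormal(float x)
{
    return std::fpclassify(x) == FP_SUBNORMAL;
}
```

Starting from 1.0f, 130 halvings without the bias end in the denormal range, while the same loop with the bias settles at a small but normal value that is far too small to affect the pattern.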

The most obvious way to speed up reaction-diffusion (RD) code is to parallelise it. Using OpenMP resulted in a 3x speedup on System 1. Not brilliant, but it only takes a single #pragma line before a for loop; see GrayScott_OpenMP.
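
As a rough illustration of where that line goes, here is a sketch of one Gray-Scott step on a toroidal grid. It is not the GrayScott_OpenMP source: the function name, the parameter names and the use of `std::vector` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit Euler step of Gray-Scott on a w x h toroidal grid (row-major).
// a, b hold the current concentrations; a2, b2 receive the next generation.
void gray_scott_step(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& a2, std::vector<float>& b2,
                     int w, int h, float Da, float Db, float F, float k, float dt)
{
    const std::size_t n = static_cast<std::size_t>(w) * h;
    assert(a.size() == n && b.size() == n && a2.size() == n && b2.size() == n);

    // The single added line: each row is independent, so rows can be shared
    // between threads. If OpenMP is not enabled at compile time the pragma
    // is ignored and the loop simply runs serially.
    #pragma omp parallel for
    for (int y = 0; y < h; ++y) {
        const int row  = y * w;
        const int up   = ((y + h - 1) % h) * w;  // wrap around at the top edge
        const int down = ((y + 1) % h) * w;      // wrap around at the bottom edge
        for (int x = 0; x < w; ++x) {
            const int left  = (x + w - 1) % w;   // wrap around at the left edge
            const int right = (x + 1) % w;       // wrap around at the right edge
            const int i = row + x;
            const float lap_a = a[up + x] + a[down + x] + a[row + left] + a[row + right] - 4.0f * a[i];
            const float lap_b = b[up + x] + b[down + x] + b[row + left] + b[row + right] - 4.0f * b[i];
            const float abb = a[i] * b[i] * b[i];
            a2[i] = a[i] + dt * (Da * lap_a - abb + F * (1.0f - a[i]));
            b2[i] = b[i] + dt * (Db * lap_b + abb - (F + k) * b[i]);
        }
    }
}
```

Writing into separate output buffers is what makes the one-line change safe, since no thread reads a cell that another thread is writing.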

OpenCL is perhaps the most promising way to get RD to run fast, using the many cores on a graphics card. Our initial implementation, GrayScott_OpenCL, gave a 10x speedup over the single-core version on System 1. (Initially we thought it was more, because of the denormals issue.)

But a couple of people (Tom and Robert) suggested exploring SSE first, rather than jumping straight to OpenCL. This turned out to be good advice: our GrayScott_SSE version is 6x faster than the base version on System 1, still running on a single core. Putting OpenMP on top took us to 12x faster than the base version - faster than our OpenCL implementation!
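
To give a flavour of what the SSE version involves, the sketch below applies only the Gray-Scott reaction terms to four cells per instruction using standard SSE intrinsics. It is an illustration rather than the GrayScott_SSE source, it assumes an x86 compiler, and the function name and in-place update are assumptions made for the example.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Apply the reaction part of Gray-Scott (no diffusion) in place to n cells,
// four floats at a time. n must be a multiple of 4.
void react_sse(float* a, float* b, int n, float F, float k, float dt)
{
    assert(n % 4 == 0);
    const __m128 vF  = _mm_set1_ps(F);       // F broadcast to all four lanes
    const __m128 vFk = _mm_set1_ps(F + k);
    const __m128 vdt = _mm_set1_ps(dt);
    const __m128 one = _mm_set1_ps(1.0f);
    for (int i = 0; i < n; i += 4) {
        const __m128 va  = _mm_loadu_ps(a + i);  // unaligned loads keep the sketch simple
        const __m128 vb  = _mm_loadu_ps(b + i);
        const __m128 abb = _mm_mul_ps(va, _mm_mul_ps(vb, vb));
        // da = F * (1 - a) - a*b*b
        const __m128 da = _mm_sub_ps(_mm_mul_ps(vF, _mm_sub_ps(one, va)), abb);
        // db = a*b*b - (F + k) * b
        const __m128 db = _mm_sub_ps(abb, _mm_mul_ps(vFk, vb));
        _mm_storeu_ps(a + i, _mm_add_ps(va, _mm_mul_ps(vdt, da)));
        _mm_storeu_ps(b + i, _mm_add_ps(vb, _mm_mul_ps(vdt, db)));
    }
}
```

Each intrinsic handles four cells at once, which is where a single-core speedup of this kind comes from. A full version would also need to vectorise the neighbour reads for the diffusion term.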

In GrayScott_HWIVector we have wrapped all the SSE calls in a set of macros, to allow it to run on CPUs that don't support SSE, through emulation. We also wrap the threading, to allow it to be used on different platforms. Together these give great performance, 1.7x that of GrayScott_SSE_OpenMP on System 1.
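
The wrapping idea can be sketched like this. The `V4F_*` names are hypothetical and are not the macros used in GrayScott_HWIVector; the point is only that calling code is written once, against names that map either to SSE intrinsics or to a plain scalar emulation.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  // SSE is available: the wrapper names map straight onto intrinsics.
  #include <xmmintrin.h>
  typedef __m128 V4F;
  #define V4F_LOAD(p)     _mm_loadu_ps(p)
  #define V4F_STORE(p, v) _mm_storeu_ps((p), (v))
  #define V4F_SET1(s)     _mm_set1_ps(s)
  #define V4F_ADD(x, y)   _mm_add_ps((x), (y))
  #define V4F_MUL(x, y)   _mm_mul_ps((x), (y))
#else
  // No SSE: emulate the same operations with four scalar floats.
  struct V4F { float f[4]; };
  inline V4F V4F_LOAD(const float* p) { V4F v; for (int i = 0; i < 4; ++i) v.f[i] = p[i]; return v; }
  inline void V4F_STORE(float* p, V4F v) { for (int i = 0; i < 4; ++i) p[i] = v.f[i]; }
  inline V4F V4F_SET1(float s) { V4F v; for (int i = 0; i < 4; ++i) v.f[i] = s; return v; }
  inline V4F V4F_ADD(V4F x, V4F y) { for (int i = 0; i < 4; ++i) x.f[i] += y.f[i]; return x; }
  inline V4F V4F_MUL(V4F x, V4F y) { for (int i = 0; i < 4; ++i) x.f[i] *= y.f[i]; return x; }
#endif

// out[i] = x[i] + s * y[i] for n cells (n a multiple of 4).
// This function compiles unchanged with either branch above.
void scaled_add(const float* x, const float* y, float s, float* out, int n)
{
    assert(n % 4 == 0);
    const V4F vs = V4F_SET1(s);
    for (int i = 0; i < n; i += 4)
        V4F_STORE(out + i, V4F_ADD(V4F_LOAD(x + i), V4F_MUL(vs, V4F_LOAD(y + i))));
}
```

The same pattern extends to threading: a small set of wrapper names for creating and joining threads can hide the platform-specific API behind it.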

Returning to OpenCL, the main advice for optimisation is to look at cache hits - how to ensure that the data you want is available in the fastest memory. Our GrayScott_OpenCL version uses NDRange local(8,8) which gives a big speed improvement over local(1,1), presumably because it then makes a local cache. We're still learning about these issues so all help is welcome! We tried manually caching data but this didn't help.

There's an image2d_t structure in OpenCL that is already optimized for 2D caching. We found that GrayScott_OpenCL_Image was 3x faster than our float*-based GrayScott_OpenCL on System 1.

Our final optimization to date uses OpenCL's own version of SSE-like processing, using float4s to operate on 4 values at once, both for the maths and for the reading and writing. GrayScott_OpenCL_Image_2x2 gave another 2x speedup on top of the image version.

Outside help

Tim started a thread at Khronos.org about this: http://www.khronos.org/message_boards/viewtopic.php?f=28&t=4455
