OE 27. Wide Universal Intrinsics - alalek/opencv GitHub Wiki
Expand universal intrinsics to cover AVX-2, AVX-512 etc.
- Author: Vadim Pisarevsky
- Link: The feature request
- Status: Draft
- Platforms: Intel (and maybe other platforms with SIMD registers wider than 128 bits)
- Complexity: A few man-weeks
Introduction and Rationale
Currently OpenCV includes very convenient universal intrinsics that cover different 128-bit SIMD extensions on different platforms, such as SSE2 (or higher) on IA, NEON on ARM and VSX on PPC64. The intrinsics implement the concept "write once - run everywhere". In addition, they do not require runtime dispatching, because those basic instruction sets (e.g. SSE2 on IA) are considered an "always-available" feature.
For AVX2 and AVX-512 dispatching on Intel we currently use a dynamic dispatcher, which is very convenient for users (no need to have a separate build of OpenCV for each platform). At the same time, this technology ("one binary fits all") does not provide the best performance, and the number of those dynamically dispatched code branches is rather small (and those branches need special hardware to test).
So, it would be nice to be able to build specialized versions of OpenCV where AVX2 or AVX-512 (or another similar instruction set) is enabled by default and correspondingly all the universal intrinsics are expanded to those actual intrinsics instead of the baseline SSE2 etc.
Proposed solution
- It's suggested to rename all the vector types used in universal intrinsics to something size-agnostic. Currently the vector types are named v_uint8x16 etc. They can be renamed to v_uint8xn (or simply to v_uint8).
- The intrinsics themselves can stay the same.
- There are already v_uint8x16::nlanes etc. enumeration constants; they just need to be defined properly, depending on the actual data type, e.g. v_uint8xn::nlanes == 16 in the case of SSE2 and v_uint8xn::nlanes == 32 in the case of AVX-2.
- The vectorized loops should be modified accordingly to increment the pointers by this ...::nlanes instead of a literal 16 etc.
- Those expanded universal intrinsics can be used together with the dynamic dispatcher, just like before: the actual expansion of universal intrinsics is defined by the baseline instruction set. For example, OpenCV can be built as an AVX-2 library, and then the universal intrinsics will be expanded as AVX-2 intrinsics. At the same time, AVX-512 branches can be dispatched dynamically if the actual hardware supports it.
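The loop change described in the list above can be sketched without OpenCV itself. The following is a scalar mock-up only: `v_float32`, `v_load_f32`, `v_store_f32`, `add_arrays` and the `SIMD_WIDTH_BYTES` macro are hypothetical names invented for this illustration, not the actual OpenCV API. It is meant to show one thing: when the loop step comes from `nlanes`, the same source works whether the baseline register is 16, 32 or 64 bytes wide.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical build-time knob standing in for the baseline instruction set:
// 16 bytes models SSE2/NEON/VSX, 32 bytes models AVX2, 64 bytes models AVX-512.
#ifndef SIMD_WIDTH_BYTES
#define SIMD_WIDTH_BYTES 32
#endif

// Size-agnostic vector type (scalar emulation, for illustration only).
struct v_float32
{
    enum { nlanes = static_cast<int>(SIMD_WIDTH_BYTES / sizeof(float)) };
    float val[nlanes];
};

// Hypothetical load: reads nlanes consecutive floats.
inline v_float32 v_load_f32(const float* p)
{
    v_float32 r;
    for (int i = 0; i < v_float32::nlanes; i++) r.val[i] = p[i];
    return r;
}

// Hypothetical store: writes nlanes consecutive floats.
inline void v_store_f32(float* p, const v_float32& a)
{
    for (int i = 0; i < v_float32::nlanes; i++) p[i] = a.val[i];
}

// Lane-wise addition.
inline v_float32 operator+(const v_float32& a, const v_float32& b)
{
    v_float32 r;
    for (int i = 0; i < v_float32::nlanes; i++) r.val[i] = a.val[i] + b.val[i];
    return r;
}

// dst[i] = a[i] + b[i]. The vector loop advances by nlanes,
// never by a literal 4 or 8, so it does not depend on the register width.
inline void add_arrays(const float* a, const float* b, float* dst, int n)
{
    assert(n >= 0);
    int i = 0;
    for (; i <= n - v_float32::nlanes; i += v_float32::nlanes)
        v_store_f32(dst + i, v_load_f32(a + i) + v_load_f32(b + i));
    for (; i < n; i++)  // scalar tail for the leftover elements
        dst[i] = a[i] + b[i];
}
```

Rebuilding this sketch with a different `SIMD_WIDTH_BYTES` changes `nlanes` and the number of loop iterations, but not the body of `add_arrays`, which is the property the proposal is after.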
Impact on existing code, compatibility
Some people may already use those SIMD128 universal intrinsics. For them we could retain the previous defines as aliases (and report a compile-time error if they are used with an incompatible default instruction set, e.g. when building with -mavx2 etc.)
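One possible shape for such aliases is sketched below. This is an assumption about how it could be done, not the actual OpenCV headers: `v_reg` and `SIMD_WIDTH_BYTES` are hypothetical names, and the legacy names are simply left undefined on wider baselines, so the compile-time error is a plain "unknown type" diagnostic rather than a custom message.

```cpp
#include <cstddef>

// Hypothetical build-time knob standing in for the baseline instruction set.
// The default of 16 bytes models a 128-bit baseline such as SSE2.
#ifndef SIMD_WIDTH_BYTES
#define SIMD_WIDTH_BYTES 16
#endif

// Generic register mock-up: Bytes wide, holding elements of type T.
template<typename T, int Bytes>
struct v_reg
{
    enum { nlanes = Bytes / static_cast<int>(sizeof(T)) };
    T val[nlanes];
};

// New, size-agnostic names.
typedef v_reg<unsigned char, SIMD_WIDTH_BYTES> v_uint8;
typedef v_reg<float, SIMD_WIDTH_BYTES> v_float32;

#if SIMD_WIDTH_BYTES == 16
// Legacy fixed-size names stay valid only when the baseline is 128-bit.
typedef v_uint8 v_uint8x16;
typedef v_float32 v_float32x4;
#else
// On wider baselines the legacy names are intentionally not defined,
// so code that still uses them fails to compile.
#endif
```

A friendlier variant could replace the empty `#else` branch with a construct that emits an explicit message on use; that is left out here to keep the sketch short.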
Possible alternatives
- leave things as-is
- mini-Halide, embedded into OpenCV, can be a viable alternative for regular kernels that fit Halide. For complex custom loops that need to be optimized manually, the Halide solution will not work.