ARM NEON Programmer’s Guide Reading Notes - yszheda/wiki GitHub Wiki
- Short, simple loops work best (even if it means multiple loops in your code).
- Avoid using a break statement to exit a loop.
- Try to make the number of iterations a power of two.
- Try to make sure the number of iterations is known to the compiler.
- Functions called inside a loop should be inlined.
- Using arrays with indexing vectorizes better than using pointers.
- Indirect addressing (multiple indexing or dereferencing) does not vectorize.
- Use the restrict keyword to tell the compiler that pointers do not reference overlapping areas of memory.
- NEON on ARMv7-A does not support double-precision floating-point (AArch64 Advanced SIMD does).
- NEON technology supports 64-bit integers only for certain operations, so avoid using long long variables where possible.
- NEON technology includes a group of instructions that can perform structured load and store operations. These instructions can only be used for vectorized access to data structures where all members are of the same size.
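The loop guidelines above can be illustrated with a minimal sketch. The function name `add_arrays` is hypothetical and not from the guide; it only shows the shape of a loop that follows the advice (indexed arrays, `restrict`, no `break`, no calls, no indirect addressing). Whether a given compiler actually vectorizes it depends on the compiler and its options.

```c
#include <stddef.h>

/* Hypothetical helper: adds two float arrays element by element.
 * `restrict` promises the compiler that dst, a and b never overlap. */
void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b,
                size_t n)
{
    /* Plain indexed loop: no break, no function calls, no indirect addressing. */
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Passing overlapping buffers to a function declared this way is undefined behavior, so `restrict` should be used only when the caller can guarantee separation.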
Some floating-point operations are not vectorized by default because vectorizing can change the order of the operations. If the algorithm does not require this level of precision, use:
- armcc: --fpmode=fast
- gcc: -ffast-math
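To see why the order of operations matters, here is a small sketch in plain C (no NEON required). The function names are hypothetical. `sum_four_lanes` imitates, in scalar code, the order a 4-lane vectorized reduction would typically use; it is an illustration of the reordering, not what any particular compiler emits.

```c
#include <stddef.h>

/* Scalar order: one running total, elements added left to right. */
float sum_sequential(const float *x, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += x[i];
    }
    return total;
}

/* Order a 4-lane vectorized loop would typically use: four partial
 * sums, combined at the end. Assumes n is a multiple of 4. */
float sum_four_lanes(const float *x, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4) {
        for (size_t j = 0; j < 4; j++) {
            lane[j] += x[i + j];
        }
    }
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For the input `{1e8f, 1, 1, 1, -1e8f, 1, 1, 1}`, a target that evaluates `float` expressions in single precision returns 3 from `sum_sequential` (the small terms are lost next to 1e8) and 6 from `sum_four_lanes`. Both are valid results of the same source loop, which is why the compiler needs explicit permission to reorder.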
The NEON unit always operates in Flush-to-Zero mode, making it non-compliant with IEEE 754.
Related armcc floating-point modes: --fpmode=std and --fpmode=ieee_full.
for (...) { outbuffer[i].r = ...; }
for (...) { outbuffer[i].g = ...; }
for (...) { outbuffer[i].b = ...; }
// Preferred way
for (...) {
outbuffer[i].r = ...;
outbuffer[i].g = ...;
outbuffer[i].b = ...;
}
#pragma unroll (n)
- The NEON load instructions can load unaligned structures.
- NEON structure load instructions require that all items in the structure are the same length.
- VHADD can be used to calculate the mean of two inputs.
- The NEON intrinsic to de-interleave is vld<n>_<datatype>, where n represents the interleave pattern and can be 2, 3, or 4.
#include <arm_neon.h>

int main(void)
{
    // This represents 3 vectors.
    // Each vector has eight lanes of 8-bit data.
    uint8x8x3_t v;
    // This array represents eight 24-bit RGB pixels (R, G, B interleaved).
    // It is zero-initialized so the load below reads defined data.
    unsigned char A[24] = {0};
    // De-interleave the 24-bit pixels from array A
    // and store them in 3 separate vectors.
    v = vld3_u8(A);
    // v.val[0] is the first vector in v. It holds the red channel.
    // v.val[1] is the second vector in v. It holds the green channel.
    // v.val[2] is the third vector in v. It holds the blue channel.
    // Double the red channel (vadd_u8 wraps around on overflow).
    v.val[0] = vadd_u8(v.val[0], v.val[0]);
    // Store the vectors back into the array, re-interleaved, with the red channel doubled.
    vst3_u8(A, v);
    return 0;
}
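The earlier note that VHADD can calculate the mean of two inputs can be sketched with the `vhadd_u8` intrinsic. The helper name `halving_add_8` is hypothetical. The scalar branch is an assumption added so the sketch also builds on non-NEON targets; it mirrors the truncating behavior of VHADD.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = (a[i] + b[i]) >> 1 for 8 unsigned bytes. */
void halving_add_8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* VHADD.U8: the sum is formed at full width and then halved,
     * so the intermediate result cannot overflow. */
    vst1_u8(out, vhadd_u8(vld1_u8(a), vld1_u8(b)));
#else
    /* Scalar fallback with the same truncating behavior. */
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(((unsigned)a[i] + (unsigned)b[i]) >> 1);
    }
#endif
}
```

VHADD truncates the halved result; if a rounded mean is wanted, the rounding variant VRHADD (`vrhadd_u8`) is the one to use.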
Use the VMOV instruction to pass data from NEON registers to ARM registers. However, this is slow, especially on Cortex-A8.
If it is possible to change the size of the input arrays, then increase the length of the array to the next multiple of the vector size using padding elements. This allows the NEON instruction to read and write beyond the end of the input array without corrupting adjacent storage.
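A minimal sketch of the padding approach, assuming an 8-lane vector of 8-bit data. The names `LANES`, `padded_length`, `alloc_padded` and `invert_padded` are hypothetical, and the inner scalar loop stands in for a single NEON operation so that the sketch builds on any target.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LANES 8  /* assumed vector size: 8 lanes of 8-bit data, as in uint8x8_t */

/* Round n up to the next multiple of LANES. */
size_t padded_length(size_t n)
{
    return (n + LANES - 1) / LANES * LANES;
}

/* Allocate a zero-filled buffer with room for the padding elements. */
uint8_t *alloc_padded(size_t n)
{
    return calloc(padded_length(n), sizeof(uint8_t));
}

/* Process whole vectors only; the last one may cover padding elements. */
void invert_padded(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < padded_length(n); i += LANES) {
        /* Stands in for one 8-lane NEON operation. */
        for (size_t j = 0; j < LANES; j++) {
            buf[i + j] = (uint8_t)(255 - buf[i + j]);
        }
    }
}
```

The padding elements are read and written like real data, so they must be initialized to values that are safe for the operation, and callers must ignore them in the result.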
- Overlapping can be used only when the operation is idempotent. This means that the value must not change depending on how many times the operation is applied to the input data.
- The overlapping method can be used only if the number of elements in the input array is more than the vector length.
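The overlapping method can be sketched as follows, using an in-place clamp as an example of an idempotent operation: clamping an already clamped byte leaves it unchanged. The helper names are hypothetical, the 8-lane vector size is an assumption, and the scalar branch exists only so the sketch builds on non-NEON targets.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Clamp 8 consecutive bytes to `limit` in place (one vector of work). */
static void clamp_8(uint8_t *p, uint8_t limit)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1_u8(p, vmin_u8(vld1_u8(p), vdup_n_u8(limit)));
#else
    for (int i = 0; i < 8; i++) {
        if (p[i] > limit) {
            p[i] = limit;
        }
    }
#endif
}

/* Hypothetical helper: clamps n bytes in place. Returns -1 if n < 8,
 * because the overlapping method needs at least one full vector. */
int clamp_overlapping(uint8_t *buf, size_t n, uint8_t limit)
{
    if (n < 8) {
        return -1;
    }
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        clamp_8(buf + i, limit);
    }
    if (i < n) {
        /* Leftover elements: redo the last 8 bytes of the array.
         * Some bytes are clamped twice, which is harmless here. */
        clamp_8(buf + n - 8, limit);
    }
    return 0;
}
```

With an operation that is not idempotent, such as an in-place increment, the bytes in the overlapped region would be modified twice and the result would be wrong.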
// TODO
- Storing to the same area of memory with both ARM and NEON instructions can reduce performance. This is because writes from the ARM pipeline are delayed until writes from the NEON pipeline have completed.
- Avoid writing to the same area of memory, specifically the same cache line, from both ARM and NEON code.