ARM NEON Programmer’s Guide Reading Notes - yszheda/wiki GitHub Wiki

Chap.1 Introduction

1.4 Fundamentals of NEON technology

1.4.1 Registers, vectors, lanes and elements

Chap.2 Compiling NEON Instructions

2.1.10 Optimizing for vectorization

  • Short, simple loops work best (even if it means multiple loops in your code).
  • Avoid using a break statement to exit a loop.
  • Try to make the number of iterations a power of two.
  • Try to make sure the number of iterations is known to the compiler.
  • Functions called inside a loop should be inlined.
  • Using arrays with indexing vectorizes better than using pointers.
  • Indirect addressing (multiple indexing or dereferencing) does not vectorize.
  • Use the restrict keyword to tell the compiler that pointers do not reference overlapping areas of memory.
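Two of the guidelines above (restrict-qualified pointers and an iteration count the compiler can reason about) can be sketched in portable C as follows. The function name add_arrays and the choice of a multiple-of-four trip count are assumptions made for illustration; whether a given compiler actually vectorizes the loop depends on its version and flags.

```c
#include <stddef.h>

/* Hypothetical helper: element-wise sum of two float arrays.
 * restrict promises the compiler that out, a and b never overlap. */
void add_arrays(float *restrict out, const float *restrict a,
                const float *restrict b, size_t n)
{
    /* Clearing the low two bits makes the trip count a visible multiple
     * of four. The 1 to 3 leftover elements are NOT processed here and
     * would need a separate tail loop. */
    size_t n4 = n & ~(size_t)3;

    /* Short, simple loop: array indexing, no break, no function calls. */
    for (size_t i = 0; i < n4; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Calling add_arrays(out, a, b, 6) processes only the first four elements, which is the price of making the multiple-of-four trip count explicit.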

Indicate knowledge of number of loop iterations

Avoid loop-carried dependencies

Use the restrict keyword

Avoid conditions inside loops

Use suitable data types

  • NEON in ARMv7 does not support double-precision floating-point.
  • NEON technology supports 64-bit integers only for certain operations, so avoid using long long variables where possible.
  • NEON technology includes a group of instructions that can perform structured load and store operations. These instructions can only be used for vectorized access to data structures where all members are of the same size.

Floating-point vectorization

Some floating-point operations are not vectorized by default because vectorizing can change the order of the operations. If the algorithm does not require this level of precision, use:

  • armcc: --fpmode=fast
  • gcc: -ffast-math
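To see why reordering matters, the sketch below compares a sum taken in source order with a sum taken through four partial accumulators, which is roughly the order a 4-lane vectorized reduction would use. Both function names are hypothetical, and the second function assumes the length is a multiple of four.

```c
#include <stddef.h>

/* Sums in source order, as a compiler must without fast-math options. */
float sum_sequential(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}

/* Sums through four partial accumulators, mimicking a 4-lane vector
 * reduction. Assumes n is a multiple of 4. */
float sum_four_lanes(const float *x, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4) {
        for (size_t k = 0; k < 4; k++) {
            lane[k] += x[i + k];
        }
    }
    return ((lane[0] + lane[1]) + lane[2]) + lane[3];
}
```

For the input {1e8f, 1, 1, 1, -1e8f, 1, 1, 1}, plain single-precision arithmetic gives 3 for the sequential sum (each 1 added to 1e8 is rounded away) and 6 for the four-lane sum, so the two orders are not interchangeable unless the algorithm tolerates the difference.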

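The NEON unit always operates in Flush-to-Zero mode, making it non-compliant with IEEE 754. The armcc floating-point mode therefore decides whether NEON can be used for floating-point code:

  • --fpmode=std: the armcc default; permits floating-point vectorization with NEON.
  • --fpmode=ieee_full: requires full IEEE 754 compliance, so most NEON floating-point operations cannot be used.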

2.8 Writing code to imply SIMD

Writing loops to imply SIMD

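// Avoid: each structure member is written in a separate loop
for (...) { outbuffer[i].r = ...; }
for (...) { outbuffer[i].g = ...; }
for (...) { outbuffer[i].b = ...; }


// Preferred way: all members of the structure are written in the same loop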
for (...) {
    outbuffer[i].r = ...;
    outbuffer[i].g = ...;
    outbuffer[i].b = ...;
}
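A concrete version of the single-loop form might look like the sketch below. The pixel_t type and the halve_pixels function are assumptions made for illustration; the point is only that every member of the structure is touched in the same iteration and that all members have the same size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pixel type: three members of identical size and no padding,
 * which matches what the NEON structure loads expect. */
typedef struct {
    uint8_t r;
    uint8_t g;
    uint8_t b;
} pixel_t;

/* Halves every channel of every pixel. All three members are handled in
 * the same loop iteration, so a vectorizing compiler can treat the data
 * as one interleaved stream. */
void halve_pixels(pixel_t *restrict out, const pixel_t *restrict in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i].r = (uint8_t)(in[i].r >> 1);
        out[i].g = (uint8_t)(in[i].g >> 1);
        out[i].b = (uint8_t)(in[i].b >> 1);
    }
}
```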

Tell the compiler where to unroll inner loops

#pragma unroll (n)

Write structures to imply SIMD

  • The NEON load instructions can load unaligned structures.
  • NEON structure load instructions require that all items in the structure are the same length.

Chap.3 NEON Instruction Set Architecture

3.2 Instruction syntax

  • VHADD can be used to calculate the mean of two inputs.
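As a rough illustration, the sketch below averages eight pairs of bytes with the vhadd_u8 intrinsic when NEON is available and falls back to a scalar model otherwise. The helper names are hypothetical. Note that VHADD truncates the result; the rounding variant is VRHADD.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar model of one VHADD.U8 lane: (a + b) >> 1, computed in a wider
 * type so the intermediate sum cannot overflow. The result is truncated. */
uint8_t halving_add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b) >> 1);
}

/* Writes the truncated mean of a[i] and b[i] to out[i] for i = 0..7. */
void mean_u8x8(uint8_t *out, const uint8_t *a, const uint8_t *b)
{
#if defined(__ARM_NEON)
    /* One halving add handles all eight lanes. */
    vst1_u8(out, vhadd_u8(vld1_u8(a), vld1_u8(b)));
#else
    /* Portable fallback with the same per-lane result. */
    for (int i = 0; i < 8; i++) {
        out[i] = halving_add_u8(a[i], b[i]);
    }
#endif
}
```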

Chap.4 NEON Intrinsics

4.9 Constructing multiple vectors from interleaved memory

  • The NEON intrinsic to de-interleave is vld<n>_<datatype> where n represents the interleave pattern and can be 2, 3, or 4.
#include <arm_neon.h>
int main (void)
{
    // v holds 3 vectors.
    // Each vector has eight lanes of 8-bit data.
    uint8x8x3_t v;

    // This array represents eight 24-bit RGB pixels (R, G, B interleaved).
    // It is zero-initialized so that the load below reads defined data.
    unsigned char A[24] = {0};

    // De-interleave the 24 bytes from array A
    // and store them in 3 separate vectors.
    v = vld3_u8(A);

    // v.val[0] is the first vector in v. It holds the red channel.
    // v.val[1] is the second vector in v. It holds the green channel.
    // v.val[2] is the third vector in v. It holds the blue channel.
    // Double the red channel (the 8-bit addition wraps on overflow).
    v.val[0] = vadd_u8(v.val[0], v.val[0]);

    // Re-interleave the vectors and store them back into the array,
    // with the red channel doubled.
    vst3_u8(A, v);

    return 0;
}

Chap.5 Optimizing NEON Code

5.2 Scheduling

5.2.1 NEON instruction scheduling

5.2.2 Mixed ARM and NEON instruction sequences

5.2.3 Passing data between ARM general-purpose registers and NEON registers

Use the VMOV instruction to pass data from NEON registers to ARM registers. However, this transfer is slow, especially on Cortex-A8.

5.2.4 Dual issue for NEON instructions

5.2.6 Optimizations by variable spreading

5.2.7 Optimizations when using lengthening instructions

Chap.6 NEON Code Examples with Intrinsics

6.2 Handling non-multiple array lengths

6.2.3 Larger arrays

If it is possible to change the size of the input arrays, then increase the length of the array to the next multiple of the vector size using padding elements. This allows the NEON instruction to read and write beyond the end of the input array without corrupting adjacent storage.
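The padding approach can be sketched as a rounding helper plus an allocation wrapper. The names padded_length and alloc_padded_u8, and the choice of eight lanes (eight 8-bit elements per 64-bit vector), are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed vector size: eight 8-bit elements per 64-bit D register. */
#define LANES 8

/* Rounds n up to the next multiple of LANES (a multiple stays unchanged). */
size_t padded_length(size_t n)
{
    return (n + LANES - 1) / LANES * LANES;
}

/* Allocates a zero-filled buffer large enough for n elements plus padding,
 * so that whole-vector loads and stores stay inside the allocation.
 * The caller releases the buffer with free(). Returns NULL on failure. */
uint8_t *alloc_padded_u8(size_t n)
{
    return calloc(padded_length(n), sizeof(uint8_t));
}
```

With this layout a 21-element array occupies 24 bytes, and the last vector operation touches three padding elements instead of adjacent storage. Zero is used as the padding value here; an operation such as a minimum or a product may need a different neutral value.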

6.2.4 Overlapping

  • Overlapping can be used only when the operation is idempotent. This means that the value must not change depending on how many times the operation is applied to the input data.
  • The overlapping method can be used only if the number of elements in the input array is more than the vector length.
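A minimal sketch of the overlapping method, assuming an in-place clamp as the idempotent operation (clamping an already clamped value changes nothing). The function names and the eight-lane block size are hypothetical; the NEON path uses vmin_u8, and a scalar fallback keeps the sketch portable.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Assumed vector size: eight 8-bit elements per 64-bit D register. */
#define LANES 8

/* Clamps one block of LANES bytes in place to at most limit. */
static void clamp_block(uint8_t *p, uint8_t limit)
{
#if defined(__ARM_NEON)
    vst1_u8(p, vmin_u8(vld1_u8(p), vdup_n_u8(limit)));
#else
    for (int k = 0; k < LANES; k++) {
        if (p[k] > limit) {
            p[k] = limit;
        }
    }
#endif
}

/* Clamps n bytes in place. Returns 0 on success, or -1 when the array
 * is shorter than one vector and the overlapping method cannot be used. */
int clamp_overlapping(uint8_t *data, size_t n, uint8_t limit)
{
    if (n < LANES) {
        return -1;
    }
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        clamp_block(data + i, limit);      /* full blocks */
    }
    if (i < n) {
        /* The final block is moved back so that it ends exactly at the end
         * of the array. Some elements are clamped twice, which is harmless
         * because the operation is idempotent. */
        clamp_block(data + (n - LANES), limit);
    }
    return 0;
}
```

For an 11-element array, the first block covers elements 0 to 7 and the final block covers elements 3 to 10, so elements 3 to 7 are processed twice.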

// TODO

6.2.5 Single element processing

6.2.6 Alignment

6.2.7 Using ARM instructions

  • Storing to the same area of memory with both ARM and NEON instructions can reduce performance. This is because writes from the ARM pipeline are delayed until writes from the NEON pipeline have completed.
  • Avoid writing to the same area of memory, specifically the same cache line, from both ARM and NEON code.

Chap.7 NEON Code Examples with Mixed Operations
