XNNPACK - AshokBhat/ml GitHub Wiki

About

  • Highly optimized library of floating-point neural network inference operators
  • For ARM, WebAssembly, and x86 platforms
  • Provides low-level performance primitives to accelerate high-level machine learning frameworks
  • Usage - TensorFlow Lite, TensorFlow.js, PyTorch Mobile, Alibaba HALO, Samsung ONE (see the TensorFlow Lite delegate sketch after this list)
  • Based on QNNPACK
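
In practice most applications reach XNNPACK indirectly, through the TensorFlow Lite XNNPACK delegate. Below is a minimal C++ sketch of enabling it explicitly; it assumes a recent TensorFlow Lite build, and the header paths, option fields, and "model.tflite" path are illustrative and may differ between versions.

    #include <memory>

    #include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"
    #include "tensorflow/lite/interpreter.h"
    #include "tensorflow/lite/kernels/register.h"
    #include "tensorflow/lite/model.h"

    int main() {
      // Load a .tflite model from disk (placeholder path).
      auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
      tflite::ops::builtin::BuiltinOpResolver resolver;
      std::unique_ptr<tflite::Interpreter> interpreter;
      tflite::InterpreterBuilder(*model, resolver)(&interpreter);

      // Create the XNNPACK delegate with default options.
      TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
      options.num_threads = 4;  // run the delegated subgraphs on 4 CPU threads
      TfLiteDelegate* xnnpack_delegate = TfLiteXNNPackDelegateCreate(&options);

      // Hand supported operators to XNNPACK; anything unsupported stays on the
      // default TFLite CPU kernels.
      interpreter->ModifyGraphWithDelegate(xnnpack_delegate);

      interpreter->AllocateTensors();
      interpreter->Invoke();

      // The delegate must outlive the interpreter; destroy it afterwards.
      interpreter.reset();
      TfLiteXNNPackDelegateDelete(xnnpack_delegate);
      return 0;
    }

Recent TensorFlow Lite releases also apply this delegate by default for floating-point models, so the explicit wiring above is mainly useful when you want to set options such as the thread count.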

Developer's view (Marat's view)

XNNPACK is the inference engine for CPU.

CPU is the default backend in TensorFlow Lite, and CPU inference always works and produces the correct result.

GPU/DSP/NPU inference can be faster, particularly for large models on high-end SoCs, but generally you need to make sure that the model is supported on the IP block, that the result is correct, and that performance is better than the CPU baseline.

And that quickly gets very complicated:

  1. NN API and the TFLite GPU/DSP backends support a limited subset of all TensorFlow Lite operators, and if a model is only partially offloaded to GPU/DSP/NPU, the rest of it still runs on CPU; commonly the synchronization overhead kills any potential speedup from the specialized hardware (a sketch for checking how much of a model was actually delegated follows this list). The situation is even worse in CoreML, as CoreML doesn't provide an API to even learn which operators failed to offload to GPU/NPU.
  2. Bugs in GPU shader compilers and NN API drivers do happen, and unless your model is a standard MobileNet, you're likely to hit them on at least some mobile phones. You then need infrastructure to detect this situation and disable offloading the model to that IP block on the affected phones.
  3. Low-end SoCs usually lack a DSP and NPU entirely, and their GPU is often slower than the CPU even in nominal peak performance. This happens because CPU cores in low-end SoCs are typically just downclocked versions of the CPU cores in high-end SoCs, while low-end GPUs have 8-16 times fewer GPU cores than their high-end counterparts.
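
Because partial offload is so common (point 1 above), it helps to verify how much of the graph a delegate actually claimed before trusting its speedup. The sketch below uses the TensorFlow Lite C++ introspection API; ReportDelegation is an illustrative helper, each delegated partition appears as a single delegate kernel node in the execution plan, and the field/method names follow recent TFLite releases.

    #include <cstdio>

    #include "tensorflow/lite/interpreter.h"

    // Counts execution-plan nodes claimed by a delegate vs. left on the default
    // CPU kernels. Call after ModifyGraphWithDelegate(). Each delegated
    // partition shows up as one delegate kernel node, so a low "delegated"
    // count alongside a high "CPU" count means most of the model fell back to CPU.
    void ReportDelegation(const tflite::Interpreter& interpreter) {
      int delegated = 0, on_cpu = 0;
      for (int node_index : interpreter.execution_plan()) {
        const auto* node_and_reg = interpreter.node_and_registration(node_index);
        if (node_and_reg == nullptr) continue;
        // Nodes claimed by a delegate carry a pointer to that delegate.
        if (node_and_reg->first.delegate != nullptr) {
          ++delegated;
        } else {
          ++on_cpu;
        }
      }
      std::printf("delegate kernels: %d, CPU kernels: %d\n", delegated, on_cpu);
    }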

Primary developers

Data type support

  • FP32 - Yes
  • FP16 - Work-in-progress
  • Quantized operators - QS8 - Work-in-progress - Focus area
  • Quantized operators - QU8 - Work-in-progress - Not as efficient as QNNPACK
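
In the TensorFlow Lite integration, these data types map to flags on the XNNPACK delegate options. The sketch below is a hedged example: CreateXnnpackDelegateWithQuantization is an illustrative helper, the flag names are taken from the delegate header in recent TensorFlow versions, and older versions do not expose the quantized (QS8/QU8) or forced-FP16 paths at all.

    #include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"

    TfLiteDelegate* CreateXnnpackDelegateWithQuantization() {
      TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
      // Opt in to the quantized operator paths listed above (recent versions only).
      options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_QS8;  // signed 8-bit quantized operators
      options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_QU8;  // unsigned 8-bit quantized operators
      // options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16;  // FP16 inference (newer versions)
      return TfLiteXNNPackDelegateCreate(&options);
    }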

Repo

Usage

TensorFlow Lite Integration

  • All operators were optimized for ARM NEON.
  • Critical operators (convolution, depthwise convolution, transposed convolution, fully-connected) hand-coded
    • For ARM cores commonly used in mobile phones
    • Cortex-A53/A73 in Pixel 2
    • Cortex-A55/A75 in Pixel 3
  • Operator fusion - looks at the whole computational graph and optimizes it by fusing operators

See also