onnxruntime - YingkunZhou/EdgeTransformerBench GitHub Wiki
git clone https://github.com/microsoft/onnxruntime.git --depth=1
cd onnxruntime
git submodule update --init --recursive
Well, I prefer to use clang for the build:
export CC=clang
export CXX=clang++
export CPLUS_INCLUDE_PATH=$HOME/miniforge3/include # for zlib.h header file
export LIBRARY_PATH=$HOME/miniforge3/lib # for libiconv.so
# simple build for cpu without python
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests
# DESTDIR=../install make install -j32 # to get include dir
./build.sh --android --android_sdk_path ../android-sdk --android_ndk_path $ANDROID_NDK --android_abi arm64-v8a --android_api 30 --use_nnapi --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests
Note: you need to copy include/onnxruntime/core/providers/nnapi/nnapi_provider_factory.h into the target onnxruntime include directory.
- Qualcomm® Innovators Development Kit - QIDK
- Qualcomm AI Engine Direct SDK (1) download
- how to build Qualcomm AI Engine Direct SDK (Qualcomm Neural Network SDK) Linux/Android/Windows
- QNN Execution Provider


qpm-cli --login [email protected]
# https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/toolsinstallation/70818
qpm-cli --license-activate qualcomm_ai_engine_direct
qpm-cli --extract qualcomm_ai_engine_direct
export ANDROID_NDK=$HOME/work/android-ndk-r22b
./build.sh --build_shared_lib --skip_submodule_sync --android --config Release --use_qnn --qnn_home /opt/qcom --android_sdk_path ../android-sdk --android_ndk_path $ANDROID_NDK --android_abi arm64-v8a --android_api 30 --cmake_generator Ninja --build_dir build/Android
The following libraries are required:
/opt/qcom/aistack/qnn/2.14.2.230905/lib/aarch64-android/libQnnHtp.so
and /opt/qcom/aistack/qnn/2.14.2.230905/lib/aarch64-android/libQnnCpu.so
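As a quick sanity check that the QNN backends load at all, the same backend selection can be exercised from the Python API. This is a hedged sketch only: it assumes an onnxruntime wheel built with --use_qnn, and the model path and backend library path are placeholders.

```python
import onnxruntime as ort

# Ask for the QNN EP first (HTP backend; use libQnnCpu.so for the QNN CPU backend),
# falling back to the default CPU EP if QNN cannot be initialized.
session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=[
        ("QNNExecutionProvider", {"backend_path": "libQnnHtp.so"}),
        "CPUExecutionProvider",
    ],
)
# If QNN failed to load, it will be missing from this list.
print(session.get_providers())
```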
- compile command
g++ -O3 -o onnxruntime-perf-qnn onnxruntime_perf.cpp -I/data/data/com.termux/files/home/work/onnxruntime-qnn/include -L/data/data/com.termux/files/home/work/onnxruntime-qnn/lib utils.cpp -std=c++17 `pkg-config --cflags --libs opencv4` -lonnxruntime -DQNN -DNNAPI
- execute command demo
LD_LIBRARY_PATH=$HOME/work/onnxruntime-qnn/lib ./onnxruntime-perf-qnn --only-test=res --backend=q # run on qnn
LD_LIBRARY_PATH=$HOME/work/onnxruntime-qnn/lib ./onnxruntime-perf-qnn --only-test=res # run on cpu
LD_LIBRARY_PATH=$HOME/work/onnxruntime-qnn/lib ./onnxruntime-perf-qnn --only-test=res --backend=n # run on nnapi
In addition, on the Xiaomi MI 8, both the QNN and the CPU backends crash with an illegal-instruction error:
0x0000007f7725f310 MLAS_PLATFORM::MLAS_PLATFORM()+20 mrs x12, id_aa64isar0_el1
Thread 1 "onnxruntime-per" received signal SIGILL, Illegal instruction.
It is also unclear why, on the 8 Gen 2, the QNN backend keeps running entirely on the CPU, whether libQnnCpu.so or libQnnHtp.so is used, with no sign of ever touching the GPU or NPU?!
This tool can be used to quantize select ONNX models. Support is based on operators in the model. Please refer to https://onnxruntime.ai/docs/performance/quantization.html for usage details and https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization for examples.
There are two ways to represent quantized ONNX models:
- Operator-oriented (QOperator). All the quantized operators have their own ONNX definitions, like QLinearConv, MatMulInteger, etc.
- Tensor-oriented (QDQ; Quantize and DeQuantize). This format inserts DeQuantizeLinear(QuantizeLinear(tensor)) between the original operators to simulate the quantize and dequantize process. In static quantization, the QuantizeLinear and DeQuantizeLinear operators also carry the quantization parameters. In dynamic quantization, a function proto that computes the quantization parameters is inserted so they can be calculated on the fly. Models in QDQ format are produced in the following ways:
  - models quantized by the quantize_static or quantize_dynamic API, explained below, with quant_format=QuantFormat.QDQ;
  - quantization-aware training (QAT) models exported from PyTorch. ??? (For the latter case, there is no need to quantize the model with the quantization tool; ONNX Runtime can run it directly as a quantized model.)
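A quick way to see which of the two representations a quantized model ended up with is to count its node types (a sketch; the model path is a placeholder): QOperator models contain ops such as QLinearConv or MatMulInteger, while QDQ models are dominated by QuantizeLinear/DequantizeLinear pairs around the original float operators.

```python
import onnx
from collections import Counter

model = onnx.load("model_int8.onnx")  # placeholder path to a quantized model
# QOperator models show QLinearConv / MatMulInteger etc.; QDQ models show many
# QuantizeLinear / DequantizeLinear nodes wrapped around the original ops.
print(Counter(node.op_type for node in model.graph.node).most_common(10))
```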
Model optimization performs certain operator fusions that make the quantization tool's job easier. For instance, during optimization a Convolution operator followed by BatchNormalization can be fused into one, which can then be quantized very efficiently.
The pre-processing API is in the Python module onnxruntime.quantization.shape_inference, function quant_pre_process(). See shape_inference.py. To see the additional options and finer-grained controls available for pre-processing, run the following command:
python -m onnxruntime.quantization.shape_inference --help
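The same pre-processing can also be called directly from Python; a minimal sketch, with placeholder paths in the style of the .onnx/fp32 and .onnx/prep directories used elsewhere on this page:

```python
from onnxruntime.quantization.shape_inference import quant_pre_process

quant_pre_process(
    ".onnx/fp32/model.onnx",    # placeholder input model
    ".onnx/prep/model.onnx",    # placeholder pre-processed output
    skip_symbolic_shape=False,  # symbolic shape inference mainly helps transformer models
)
```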
There are two ways to quantize a model: dynamic and static.
- Dynamic quantization calculates the quantization parameters (scale and zero point) for activations on the fly. These computations increase the cost of inference, while usually achieving higher accuracy than static ones. The Python API for dynamic quantization is in the module onnxruntime.quantization.quantize, function quantize_dynamic() (a usage sketch for both APIs follows this list).
- Static quantization first runs the model using a set of inputs called calibration data. During these runs, the quantization parameters are computed for each activation. These quantization parameters are written to the quantized model as constants and used for all inputs. The quantization tool supports three calibration methods: MinMax, Entropy, and Percentile. See calibrate.py for details.
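A hedged usage sketch of both APIs follows. The model paths, the input tensor name, and the random calibration data are placeholders for illustration only; real calibration should feed preprocessed images.

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, CalibrationMethod, QuantFormat, QuantType,
    quantize_dynamic, quantize_static,
)

# Dynamic quantization: only weights are quantized offline; activation
# parameters are computed at inference time.
quantize_dynamic(
    ".onnx/prep/model.onnx",     # placeholder pre-processed fp32 model
    ".onnx/dynamic/model.onnx",  # placeholder output
    weight_type=QuantType.QInt8,
)

# Static quantization needs a CalibrationDataReader that yields input feeds.
class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a handful of random NCHW tensors; replace with real calibration images."""
    def __init__(self, input_name="input", n=16):
        self.feeds = iter(
            {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
            for _ in range(n)
        )

    def get_next(self):
        return next(self.feeds, None)

quantize_static(
    ".onnx/prep/model.onnx",    # placeholder pre-processed fp32 model
    ".onnx/static/model.onnx",  # placeholder output
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,              # the tensor-oriented format described above
    calibrate_method=CalibrationMethod.MinMax,
)
```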
The main difference between dynamic and static quantization is how the scale and zero point of activations are calculated. For static quantization, they are calculated in advance (offline) using a calibration dataset, so activations have the same scale and zero point during each forward pass. For dynamic quantization, they are calculated on the fly (online) and are specific to each forward pass, which makes them more accurate but introduces extra computational overhead.
In general, it is recommended to use dynamic quantization for RNNs and transformer-based models, and static quantization for CNN models.
NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name 'Conv_0_quant'
But as luck would have it, ONNX Runtime's dynamic quantization is designed specifically for transformers: convolution models simply do not run, failing with the error above!
If none of the post-training quantization methods can reach the accuracy target, you can try retraining the model with quantization-aware training (QAT). ONNX Runtime does not currently provide retraining, but you can retrain the model with the original framework and convert it back to ONNX.
Quantized values are 8 bits wide and can be either signed (int8) or unsigned (uint8). The signedness of activations and weights can be chosen independently, so the data format can be (activations: uint8, weights: uint8), (activations: uint8, weights: int8), and so on. Let's use U8U8 as shorthand for (activations: uint8, weights: uint8), U8S8 for (activations: uint8, weights: int8), and similarly S8U8 and S8S8 for the remaining two formats.
ONNX Runtime quantization on CPU can run U8U8, U8S8, and S8S8. S8S8 with QDQ is the default setting; it balances performance and accuracy and should be the first choice. Only if accuracy drops a lot should you try U8U8. Note that S8S8 with QOperator is slow on x86-64 CPUs and should generally be avoided. ONNX Runtime quantization on GPU only supports S8S8.
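For reference, this is how the U8S8 shorthand maps onto quantize_static arguments. A sketch only, reusing the RandomCalibrationReader defined in the static-quantization example above; paths are placeholders.

```python
from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

quantize_static(
    ".onnx/prep/model.onnx",
    ".onnx/static/model_u8s8.onnx",
    calibration_data_reader=RandomCalibrationReader(),  # defined in the sketch above
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,  # U8 activations
    weight_type=QuantType.QInt8,       # S8 weights -> "U8S8"
)
```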
There are specific optimizations for transformer-based models, such as QAttention for quantization of attention layers. In order to leverage these optimizations, you need to optimize your models using the Transformer Model Optimization Tool before quantizing the model.
This notebook demonstrates the process.
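A hedged sketch of running the Transformer Model Optimization Tool before quantization; the model type, head count, hidden size, and paths are assumptions to adapt to the actual model.

```python
from onnxruntime.transformers import optimizer

opt_model = optimizer.optimize_model(
    "model_fp32.onnx",   # placeholder path
    model_type="bert",   # choose the type that matches the architecture
    num_heads=12,
    hidden_size=768,
)
opt_model.save_model_to_file("model_opt.onnx")
```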
CPU FP32/FP16 (only for resnet50/LeViT)
Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
---|---|---|---|---|---|
efficientformerv2_s0 | - | 76.3 | 76.0 | 3.5M | 0.40G |
efficientformerv2_s1 | - | 78.8 | 79.6 | 6.1M | 0.65G |
efficientformerv2_s2 | - | 82.1 | 82.0 | 12.6M | 1.25G |
SwiftFormer_XS | - | 76.2 | 75.2 | 3.5M | 0.4G |
SwiftFormer_S | - | 78.4 | 78.2 | 6.1M | 1.0G |
SwiftFormer_L1 | - | 80.6 | 81.8 | 12.1M | 1.6G |
EMO_1M | - | 70.8 | 68.3 | 1.3M | 0.26G |
EMO_2M | - | 74.8 | 73.7 | 2.3M | 0.44G |
EMO_5M | - | 78.2 | 77.6 | 5.1M | 0.90G |
EMO_6M | - | 79.1 | 78.1 | 6.1M | 0.96G |
edgenext_xx_small | - | 70.8 | 70.7 | 1.3M | 0.26G |
edgenext_x_small | - | 74.8 | 74.8 | 2.3M | 0.54G |
edgenext_small/usi | - | 80.6 | 79.9 | 5.6M | 1.26G |
mobilevitv2_050 | - | 69.9 | 66.6 | 1.4M | 0.5G |
mobilevitv2_075 | - | 75.1 | 74.3 | 2.9M | 1.0G |
mobilevitv2_100 | - | 77.9 | 76.9 | 4.9M | 1.8G |
mobilevitv2_125 | - | 79.2 | 80.7 | 7.5M | 2.8G |
mobilevitv2_150 | - | 80.9 | 81.8 | 10.6M | 4.0G |
mobilevitv2_175 | - | 80.7 | 81.0 | 14.3M | 5.5G |
mobilevitv2_200 | - | 82.0 | 83.1 | 18.4M | 7.2G |
mobilevit_xx_small | - | 68.9 | 66.5 | 1.3M | 0.36G |
mobilevit_x_small | - | 74.0 | 73.7 | 2.3M | 0.89G |
mobilevit_small | - | 77.6 | 77.9 | 5.6M | 2.0G |
LeViT_128S | - | 75.9 | 76.1 | 7.8M | 0.30G |
LeViT_128 | - | 79.4 | 78.1 | 9.2M | 0.41G |
LeViT_192 | - | 79.6 | 79.6 | 11M | 0.66G |
LeViT_256 | - | 81.1 | 81.4 | 19M | 1.12G |
resnet50 | - | 79.6 | 81.3 | 25.6M | 4.1G |
mobilenetv3_large_100 | - | 75.6 | 75.3 | 5.5M | 0.29G |
tf_efficientnetv2_b0 | - | 78.2 | 76.7 | 7.1M | 0.72G |
tf_efficientnetv2_b1 | - | 79.4 | 79.2 | 8.1M | 1.2G |
tf_efficientnetv2_b2 | - | 81.7 | 80.4 | 10.1M | 1.7G |
tf_efficientnetv2_b3 | - | 81.8 | 82.3 | 14.4M | 3.0G |
CPU static int8
Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
---|---|---|---|---|---|
efficientformerv2_s0 | - | 10.4 | 6.5 | 3.5M | 0.40G |
efficientformerv2_s1 | - | 17.6 | 14.7 | 6.1M | 0.65G |
efficientformerv2_s2 | - | 23.8 | 20.7 | 12.6M | 1.25G |
SwiftFormer_XS | - | 62.7 | 58.8 | 3.5M | 0.4G |
SwiftFormer_S | - | 35.4 | 32.3 | 6.1M | 1.0G |
SwiftFormer_L1 | - | 63.2 | 55.9 | 12.1M | 1.6G |
EMO_1M | - | 17.8 | 17.0 | 1.3M | 0.26G |
EMO_2M | - | 50.5 | 42.9 | 2.3M | 0.44G |
EMO_5M | - | 7.8 | 2.6 | 5.1M | 0.90G |
EMO_6M | - | 0.1 | 0.0 | 6.1M | 0.96G |
edgenext_xx_small | - | 63.9 | 65.2 | 1.3M | 0.26G |
edgenext_x_small | - | 69.3 | 70.0 | 2.3M | 0.54G |
edgenext_small/usi | - | 58.6 | 54.0 | 5.6M | 1.26G |
mobilevitv2_050 | - | 7.2 | 6.7 | 1.4M | 0.5G |
mobilevitv2_075 | - | 2.2 | 0.9 | 2.9M | 1.0G |
mobilevitv2_100 | - | 6.9 | 3.0 | 4.9M | 1.8G |
mobilevitv2_125 | - | 27.2 | 25.0 | 7.5M | 2.8G |
mobilevitv2_150 | - | 32.9 | 29.7 | 10.6M | 4.0G |
mobilevitv2_175 | - | 18.2 | 14.6 | 14.3M | 5.5G |
mobilevitv2_200 | - | 36.2 | 31.4 | 18.4M | 7.2G |
mobilevit_xx_small | - | 0 | 0 | 1.3M | 0.36G |
mobilevit_x_small | - | 0.1 | 0.2 | 2.3M | 0.89G |
mobilevit_small | - | 3.2 | 6.0 | 5.6M | 2.0G |
LeViT_128S | - | 43.5 | 41.1 | 7.8M | 0.30G |
LeViT_128 | - | 77.9 | 76.1 | 9.2M | 0.41G |
LeViT_192 | - | 78.0 | 78.5 | 11M | 0.66G |
LeViT_256 | - | 80.0 | 80.9 | 19M | 1.12G |
resnet50 | - | 79.5 | 80.8 | 25.6M | 4.1G |
mobilenetv3_large_100 | - | 71.1 | 67.5 | 5.5M | 0.29G |
tf_efficientnetv2_b0 | - | 77.1 | 75.9 | 7.1M | 0.72G |
tf_efficientnetv2_b1 | - | 78.1 | 76.8 | 8.1M | 1.2G |
tf_efficientnetv2_b2 | - | 78.7 | 77.3 | 10.1M | 1.7G |
tf_efficientnetv2_b3 | - | 80.2 | 79.9 | 14.4M | 3.0G |
- EMO
python -m onnxruntime.quantization.preprocess --input .onnx/fp32/EMO_1M.onnx --output .onnx/prep/EMO_1M.onnx
Exception: Incomplete symbolic shape inference
- edgenext
python -m onnxruntime.quantization.preprocess --input .onnx/fp32/edgenext_xx_small.onnx --output .onnx/prep/edgenext_xx_small.onnx
assert cls_type in ["tensor_type", "sequence_type"]
- LeViT
python -m onnxruntime.quantization.preprocess --input .onnx/fp32/LeViT_128S.onnx --output .onnx/prep/LeViT_128S.onnx
assert int(map_to) == int(s)
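One thing that may be worth trying for the pre-processing failures above (untested here, so purely a hedged suggestion) is skipping the symbolic shape inference step, which is where at least the EMO error originates:

```python
from onnxruntime.quantization.shape_inference import quant_pre_process

# Skip symbolic shape inference and rely on ONNX shape inference plus the
# optimizer only; whether the resulting model still quantizes well is untested.
quant_pre_process(
    ".onnx/fp32/EMO_1M.onnx",
    ".onnx/prep/EMO_1M.onnx",
    skip_symbolic_shape=True,
)
```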
How do I enable the QNN delegate in TensorFlow Lite? Googling "tensorflow lite QNN delegate" yields nothing.
QNN is a new delegate developed by Qualcomm; it is targeted as a replacement for the TFLite Hexagon delegate. There is obviously some connection between QNN and SNPE, but we cannot say much more than that for now.