onnxruntime - YingkunZhou/EdgeTransformerBench GitHub Wiki
git clone https://github.com/microsoft/onnxruntime.git --depth=1
cd onnxruntime
git submodule update --init --recursive
Well, I prefer to use clang for the build:
export CC=clang
export CXX=clang++
export CPLUS_INCLUDE_PATH=$HOME/miniforge3/include # for zlib.h header file
export LIBRARY_PATH=$HOME/miniforge3/lib # for libiconv.so
# simple build for cpu without python
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests
# DESTDIR=../install make install -j32 # to get include dir
./build.sh --android --android_sdk_path ../android-sdk --android_ndk_path $ANDROID_NDK --android_abi arm64-v8a --android_api 30 --use_nnapi --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests
Note: you need to copy include/onnxruntime/core/providers/nnapi/nnapi_provider_factory.h into the target onnxruntime include directory.
- Qualcomm® Innovators Development Kit - QIDK
- Qualcomm AI Engine Direct SDK (1) download
- how to build Qualcomm AI Engine Direct SDK (Qualcomm Neural Network SDK) Linux/Android/Windows
- QNN Execution Provider


qpm-cli --login [email protected]
# https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/toolsinstallation/70818
qpm-cli --license-activate qualcomm_ai_engine_direct
qpm-cli --extract qualcomm_ai_engine_direct
export ANDROID_NDK=$HOME/work/android-ndk-r22b
./build.sh --build_shared_lib --skip_submodule_sync --android --config Release --use_qnn --qnn_home /opt/qcom --android_sdk_path ../android-sdk --android_ndk_path $ANDROID_NDK --android_abi arm64-v8a --android_api 30 --cmake_generator Ninja --build_dir build/Android
The following libraries are required:
/opt/qcom/aistack/qnn/2.14.2.230905/lib/aarch64-android/libQnnHtp.so
and /opt/qcom/aistack/qnn/2.14.2.230905/lib/aarch64-android/libQnnCpu.so
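As a quick sanity check that the QNN backends load at all, the same backend selection can be exercised from the Python API. This is a hedged sketch only: it assumes an onnxruntime wheel built with --use_qnn, and the model path and backend library path are placeholders.

```python
import onnxruntime as ort

# Ask for the QNN EP first (HTP backend; use libQnnCpu.so for the QNN CPU backend),
# falling back to the default CPU EP if QNN cannot be initialized.
session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=[
        ("QNNExecutionProvider", {"backend_path": "libQnnHtp.so"}),
        "CPUExecutionProvider",
    ],
)
# If QNN failed to load, it will be missing from this list.
print(session.get_providers())
```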
- compile command
g++ -O3 -o onnxruntime-perf-qnn onnxruntime_perf.cpp -I/data/data/com.termux/files/home/work/onnxruntime-qnn/include -L/data/data/com.termux/files/home/work/onnxruntime-qnn/lib utils.cpp -std=c++17 `pkg-config --cflags --libs opencv4` -lonnxruntime -DQNN -DNNAPI
- execute command demo
LD_LIBRARY_PATH=$HOME/work/onnxruntime-qnn/lib ./onnxruntime-perf-qnn --only-test=res --backend=q # run on qnn
LD_LIBRARY_PATH=$HOME/work/onnxruntime-qnn/lib ./onnxruntime-perf-qnn --only-test=res # run on cpu
LD_LIBRARY_PATH=$HOME/work/onnxruntime-qnn/lib ./onnxruntime-perf-qnn --only-test=res --backend=n # run on nnapi
In addition, on the Xiaomi MI 8, both the QNN and the CPU backends crash with an illegal-instruction error:
0x0000007f7725f310 MLAS_PLATFORM::MLAS_PLATFORM()+20 mrs x12, id_aa64isar0_el1
Thread 1 "onnxruntime-per" received signal SIGILL, Illegal instruction.
It is also unclear why, on the 8 Gen 2, the QNN backend keeps running entirely on the CPU, whether libQnnCpu.so or libQnnHtp.so is used, with no sign of ever touching the GPU or NPU?!
This tool can be used to quantize select ONNX models. Support is based on operators in the model. Please refer to https://onnxruntime.ai/docs/performance/quantization.html for usage details and https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization for examples.
There are two ways to represent quantized ONNX models:
- Operator-oriented (QOperator). All the quantized operators have their own ONNX definitions, like QLinearConv, MatMulInteger, etc.
- Tensor-oriented (QDQ; Quantize and DeQuantize). This format inserts DeQuantizeLinear(QuantizeLinear(tensor)) between the original operators to simulate the quantize and dequantize process. In static quantization, the QuantizeLinear and DeQuantizeLinear operators also carry the quantization parameters. In dynamic quantization, a function proto that computes the quantization parameters is inserted so they can be calculated on the fly. Models in QDQ format are produced in the following ways:
  - models quantized by the quantize_static or quantize_dynamic API, explained below, with quant_format=QuantFormat.QDQ;
  - quantization-aware training (QAT) models exported from PyTorch. ??? (For the latter case, there is no need to quantize the model with the quantization tool; ONNX Runtime can run it directly as a quantized model.)
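A quick way to see which of the two representations a quantized model ended up with is to count its node types (a sketch; the model path is a placeholder): QOperator models contain ops such as QLinearConv or MatMulInteger, while QDQ models are dominated by QuantizeLinear/DequantizeLinear pairs around the original float operators.

```python
import onnx
from collections import Counter

model = onnx.load("model_int8.onnx")  # placeholder path to a quantized model
# QOperator models show QLinearConv / MatMulInteger etc.; QDQ models show many
# QuantizeLinear / DequantizeLinear nodes wrapped around the original ops.
print(Counter(node.op_type for node in model.graph.node).most_common(10))
```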
Model optimization performs certain operator fusions that make the quantization tool's job easier. For instance, during optimization a Convolution operator followed by BatchNormalization can be fused into one, which can then be quantized very efficiently.
The pre-processing API is in the Python module onnxruntime.quantization.shape_inference, function quant_pre_process(). See shape_inference.py. To see the additional options and finer-grained controls available for pre-processing, run the following command:
python -m onnxruntime.quantization.shape_inference --help
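The same pre-processing can also be called directly from Python; a minimal sketch, with placeholder paths in the style of the .onnx/fp32 and .onnx/prep directories used elsewhere on this page:

```python
from onnxruntime.quantization.shape_inference import quant_pre_process

quant_pre_process(
    ".onnx/fp32/model.onnx",    # placeholder input model
    ".onnx/prep/model.onnx",    # placeholder pre-processed output
    skip_symbolic_shape=False,  # symbolic shape inference mainly helps transformer models
)
```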
There are two ways to quantize a model: dynamic and static.
- Dynamic quantization calculates the quantization parameters (scale and zero point) for activations on the fly. These computations increase the cost of inference, while usually achieving higher accuracy than static ones. The Python API for dynamic quantization is in the module onnxruntime.quantization.quantize, function quantize_dynamic() (a usage sketch for both APIs follows this list).
- Static quantization first runs the model using a set of inputs called calibration data. During these runs, the quantization parameters are computed for each activation. These quantization parameters are written to the quantized model as constants and used for all inputs. The quantization tool supports three calibration methods: MinMax, Entropy, and Percentile. See calibrate.py for details.
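A hedged usage sketch of both APIs follows. The model paths, the input tensor name, and the random calibration data are placeholders for illustration only; real calibration should feed preprocessed images.

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, CalibrationMethod, QuantFormat, QuantType,
    quantize_dynamic, quantize_static,
)

# Dynamic quantization: only weights are quantized offline; activation
# parameters are computed at inference time.
quantize_dynamic(
    ".onnx/prep/model.onnx",     # placeholder pre-processed fp32 model
    ".onnx/dynamic/model.onnx",  # placeholder output
    weight_type=QuantType.QInt8,
)

# Static quantization needs a CalibrationDataReader that yields input feeds.
class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a handful of random NCHW tensors; replace with real calibration images."""
    def __init__(self, input_name="input", n=16):
        self.feeds = iter(
            {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
            for _ in range(n)
        )

    def get_next(self):
        return next(self.feeds, None)

quantize_static(
    ".onnx/prep/model.onnx",    # placeholder pre-processed fp32 model
    ".onnx/static/model.onnx",  # placeholder output
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,              # the tensor-oriented format described above
    calibrate_method=CalibrationMethod.MinMax,
)
```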
The main difference between dynamic and static quantization is how the scale and zero point of activations are calculated. For static quantization, they are calculated in advance (offline) using a calibration dataset, so activations have the same scale and zero point during each forward pass. For dynamic quantization, they are calculated on the fly (online) and are specific to each forward pass, which makes them more accurate but introduces extra computational overhead.
In general, it is recommended to use dynamic quantization for RNNs and transformer-based models, and static quantization for CNN models.
NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name 'Conv_0_quant'
But as luck would have it, ONNX Runtime's dynamic quantization is designed specifically for transformers: convolution models simply do not run, failing with the error above!
If none of the post-training quantization methods can reach the accuracy target, you can try retraining the model with quantization-aware training (QAT). ONNX Runtime does not currently provide retraining, but you can retrain the model with the original framework and convert it back to ONNX.
Quantized values are 8 bits wide and can be either signed (int8) or unsigned (uint8). The signedness of activations and weights can be chosen independently, so the data format can be (activations: uint8, weights: uint8), (activations: uint8, weights: int8), and so on. Let's use U8U8 as shorthand for (activations: uint8, weights: uint8), U8S8 for (activations: uint8, weights: int8), and similarly S8U8 and S8S8 for the remaining two formats.
ONNX Runtime quantization on CPU can run U8U8, U8S8, and S8S8. S8S8 with QDQ is the default setting; it balances performance and accuracy and should be the first choice. Only if accuracy drops a lot should you try U8U8. Note that S8S8 with QOperator is slow on x86-64 CPUs and should generally be avoided. ONNX Runtime quantization on GPU only supports S8S8.
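For reference, this is how the U8S8 shorthand maps onto quantize_static arguments. A sketch only, reusing the RandomCalibrationReader defined in the static-quantization example above; paths are placeholders.

```python
from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

quantize_static(
    ".onnx/prep/model.onnx",
    ".onnx/static/model_u8s8.onnx",
    calibration_data_reader=RandomCalibrationReader(),  # defined in the sketch above
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,  # U8 activations
    weight_type=QuantType.QInt8,       # S8 weights -> "U8S8"
)
```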
There are specific optimizations for transformer-based models, such as QAttention for quantization of attention layers. In order to leverage these optimizations, you need to optimize your models using the Transformer Model Optimization Tool before quantizing the model.
This notebook demonstrates the process.
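A hedged sketch of running the Transformer Model Optimization Tool before quantization; the model type, head count, hidden size, and paths are assumptions to adapt to the actual model.

```python
from onnxruntime.transformers import optimizer

opt_model = optimizer.optimize_model(
    "model_fp32.onnx",   # placeholder path
    model_type="bert",   # choose the type that matches the architecture
    num_heads=12,
    hidden_size=768,
)
opt_model.save_model_to_file("model_opt.onnx")
```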
CPU FP32/FP16 (only for resnet50/LeViT)
Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
---|---|---|---|---|---|
efficientformerv2_s0 | - | 76.3 | 76.0 | 3.5M | 0.40G |
efficientformerv2_s1 | - | 78.8 | 79.6 | 6.1M | 0.65G |
efficientformerv2_s2 | - | 82.1 | 82.0 | 12.6M | 1.25G |
SwiftFormer_XS | - | 76.2 | 75.2 | 3.5M | 0.4G |
SwiftFormer_S | - | 78.4 | 78.2 | 6.1M | 1.0G |
SwiftFormer_L1 | - | 80.6 | 81.8 | 12.1M | 1.6G |
EMO_1M | - | 70.8 | 68.3 | 1.3M | 0.26G |
EMO_2M | - | 74.8 | 73.7 | 2.3M | 0.44G |
EMO_5M | - | 78.2 | 77.6 | 5.1M | 0.90G |
EMO_6M | - | 79.1 | 78.1 | 6.1M | 0.96G |
edgenext_xx_small | - | 70.8 | 70.7 | 1.3M | 0.26G |
edgenext_x_small | - | 74.8 | 74.8 | 2.3M | 0.54G |
edgenext_small/usi | - | 80.6 | 79.9 | 5.6M | 1.26G |
mobilevitv2_050 | - | 69.9 | 66.6 | 1.4M | 0.5G |
mobilevitv2_075 | - | 75.1 | 74.3 | 2.9M | 1.0G |
mobilevitv2_100 | - | 77.9 | 76.9 | 4.9M | 1.8G |
mobilevitv2_125 | - | 79.2 | 80.7 | 7.5M | 2.8G |
mobilevitv2_150 | - | 80.9 | 81.8 | 10.6M | 4.0G |
mobilevitv2_175 | - | 80.7 | 81.0 | 14.3M | 5.5G |
mobilevitv2_200 | - | 82.0 | 83.1 | 18.4M | 7.2G |
mobilevit_xx_small | - | 68.9 | 66.5 | 1.3M | 0.36G |
mobilevit_x_small | - | 74.0 | 73.7 | 2.3M | 0.89G |
mobilevit_small | - | 77.6 | 77.9 | 5.6M | 2.0G |
LeViT_128S | - | 75.9 | 76.1 | 7.8M | 0.30G |
LeViT_128 | - | 79.4 | 78.1 | 9.2M | 0.41G |
LeViT_192 | - | 79.6 | 79.6 | 11M | 0.66G |
LeViT_256 | - | 81.1 | 81.4 | 19M | 1.12G |
resnet50 | - | 79.6 | 81.3 | 25.6M | 4.1G |
mobilenetv3_large_100 | - | 75.6 | 75.3 | 5.5M | 0.29G |
tf_efficientnetv2_b0 | - | 78.2 | 76.7 | 7.1M | 0.72G |
tf_efficientnetv2_b1 | - | 79.4 | 79.2 | 8.1M | 1.2G |
tf_efficientnetv2_b2 | - | 81.7 | 80.4 | 10.1M | 1.7G |
tf_efficientnetv2_b3 | - | 81.8 | 82.3 | 14.4M | 3.0G |
CPU static int8
Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
---|---|---|---|---|---|
efficientformerv2_s0 | - | 10.4 | 6.5 | 3.5M | 0.40G |
efficientformerv2_s1 | - | 17.6 | 14.7 | 6.1M | 0.65G |
efficientformerv2_s2 | - | 23.8 | 20.7 | 12.6M | 1.25G |
SwiftFormer_XS | - | 62.7 | 58.8 | 3.5M | 0.4G |
SwiftFormer_S | - | 35.4 | 32.3 | 6.1M | 1.0G |
SwiftFormer_L1 | - | 63.2 | 55.9 | 12.1M | 1.6G |
EMO_1M | - | 17.8 | 17.0 | 1.3M | 0.26G |
EMO_2M | - | 50.5 | 42.9 | 2.3M | 0.44G |
EMO_5M | - | 7.8 | 2.6 | 5.1M | 0.90G |
EMO_6M | - | 0.1 | 0.0 | 6.1M | 0.96G |
edgenext_xx_small | - | 63.9 | 65.2 | 1.3M | 0.26G |
edgenext_x_small | - | 69.3 | 70.0 | 2.3M | 0.54G |
edgenext_small/usi | - | 58.6 | 54.0 | 5.6M | 1.26G |
mobilevitv2_050 | - | 7.2 | 6.7 | 1.4M | 0.5G |
mobilevitv2_075 | - | 2.2 | 0.9 | 2.9M | 1.0G |
mobilevitv2_100 | - | 6.9 | 3.0 | 4.9M | 1.8G |
mobilevitv2_125 | - | 27.2 | 25.0 | 7.5M | 2.8G |
mobilevitv2_150 | - | 32.9 | 29.7 | 10.6M | 4.0G |
mobilevitv2_175 | - | 18.2 | 14.6 | 14.3M | 5.5G |
mobilevitv2_200 | - | 36.2 | 31.4 | 18.4M | 7.2G |
mobilevit_xx_small | - | 0 | 0 | 1.3M | 0.36G |
mobilevit_x_small | - | 0.1 | 0.2 | 2.3M | 0.89G |
mobilevit_small | - | 3.2 | 6.0 | 5.6M | 2.0G |
LeViT_128S | - | 43.5 | 41.1 | 7.8M | 0.30G |
LeViT_128 | - | 77.9 | 76.1 | 9.2M | 0.41G |
LeViT_192 | - | 78.0 | 78.5 | 11M | 0.66G |
LeViT_256 | - | 80.0 | 80.9 | 19M | 1.12G |
resnet50 | - | 79.5 | 80.8 | 25.6M | 4.1G |
mobilenetv3_large_100 | - | 71.1 | 67.5 | 5.5M | 0.29G |
tf_efficientnetv2_b0 | - | 77.1 | 75.9 | 7.1M | 0.72G |
tf_efficientnetv2_b1 | - | 78.1 | 76.8 | 8.1M | 1.2G |
tf_efficientnetv2_b2 | - | 78.7 | 77.3 | 10.1M | 1.7G |
tf_efficientnetv2_b3 | - | 80.2 | 79.9 | 14.4M | 3.0G |
- EMO
python -m onnxruntime.quantization.preprocess --input .onnx/fp32/EMO_1M.onnx --output .onnx/prep/EMO_1M.onnx
Exception: Incomplete symbolic shape inference
- edgenext
python -m onnxruntime.quantization.preprocess --input .onnx/fp32/edgenext_xx_small.onnx --output .onnx/prep/edgenext_xx_small.onnx
assert cls_type in ["tensor_type", "sequence_type"]
- LeViT
python -m onnxruntime.quantization.preprocess --input .onnx/fp32/LeViT_128S.onnx --output .onnx/prep/LeViT_128S.onnx
assert int(map_to) == int(s)
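One thing that may be worth trying for the pre-processing failures above (untested here, so purely a hedged suggestion) is skipping the symbolic shape inference step, which is where at least the EMO error originates:

```python
from onnxruntime.quantization.shape_inference import quant_pre_process

# Skip symbolic shape inference and rely on ONNX shape inference plus the
# optimizer only; whether the resulting model still quantizes well is untested.
quant_pre_process(
    ".onnx/fp32/EMO_1M.onnx",
    ".onnx/prep/EMO_1M.onnx",
    skip_symbolic_shape=True,
)
```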
How do I enable the QNN delegate in TensorFlow Lite? Googling "tensorflow lite QNN delegate" yields nothing.
QNN is a new delegate developed by Qualcomm; it is targeted as a replacement for the TFLite Hexagon delegate. There is obviously some connection between QNN and SNPE, but we cannot say much more than that for now.