torch - YingkunZhou/EdgeTransformerBench GitHub Wiki

TODO:

  • gcc vs clang
  • ATen parallel backend -- OMP or NATIVE?
  • -DUSE_NNAPI=ON -DBUILD_LITE_INTERPRETER=ON ?
  • https://pytorch.org/cppdocs/
  • [1] id 25104 name speed_benchmark from 0x0000007ff3c216dc in xnn_f32_gemm_minmax_ukernel_6x8.asm_aarch64_neonfma_prfm_cortex_a75 at ~/pytorch/third_party/XNNPACK/src/f32-gemm/gen/f32-gemm-6x8-minmax-asm-aarch64-neonfma-prfm-cortex-a75.S:435 (even though it is clearly running on an A76 🤔️)
# conda create -n py3.8 python=3.8 pip ipython
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
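A quick sanity check after installing the CPU wheels (a minimal sketch; the model and input sizes are just illustrative):

```python
import torch
import torchvision

# Confirm the CPU-only wheels are importable and report their versions.
print(torch.__version__, torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())  # expected: False for the CPU wheel

# Run one tiny forward pass to make sure the ATen CPU backend works.
model = torchvision.models.mobilenet_v2(weights=None).eval()
with torch.no_grad():
    out = model(torch.rand(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 1000])
```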

reference

This API is in beta and may change in the near future.

pytorch mobile c++ minimal demo

reference

prepare the model file
import torch
import torchvision
from torch.utils.mobile_optimizer import optimize_for_mobile

model = torchvision.models.mobilenet_v2(pretrained=True)
model.eval()
example = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example)
torchscript_model_optimized = optimize_for_mobile(traced_script_module)
torchscript_model_optimized._save_for_lite_interpreter("model.pt", _use_flatbuffer=True)

How is a lite interpreter model different from a TorchScript model?

A lite interpreter model file is a regular TorchScript model file with mobile-specific bytecode added to it. You can always load a mobile model as a normal PyTorch TorchScript model, and you can also load it as a lite-interpreter model.

When saved for the lite interpreter (mobile platforms), PyTorch stores additional bytecode for the model's graph, which is more efficient to execute on device than plain TorchScript. The lite interpreter also yields a smaller binary size in the compiled app compared to running full TorchScript.
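To illustrate the point above, the file written by `_save_for_lite_interpreter` can be loaded back in Python in both ways (a minimal sketch; `model.pt` is the file produced by the snippet above, and whether `torch.jit.load` accepts the flatbuffer variant depends on the PyTorch version):

```python
import torch
from torch.jit.mobile import _load_for_lite_interpreter

# Load with the lite-interpreter (mobile) runtime, the same loader that
# speed_benchmark_torch and the C++ _load_for_mobile API use.
lite_module = _load_for_lite_interpreter("model.pt")

# Recent PyTorch versions can usually also load the same file as a regular
# TorchScript module.
full_module = torch.jit.load("model.pt")

x = torch.rand(1, 3, 224, 224)
print(lite_module(x).shape, full_module(x).shape)
```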

reference

use C++ API to load model
// demo.cpp
#include <torch/csrc/jit/mobile/import.h>
#include <torch/script.h>
#include <iostream>

int main(int argc, const char* argv[]) {
#if BUILD_LITE_INTERPRETER
  // FLAGS_model comes from gflags in the original benchmark source; use argv[1] in this standalone demo.
  auto module = torch::jit::_load_for_mobile(argv[1]);
#else
  // Disable graph optimizer to ensure list of unused ops are not changed for
  // custom mobile build.
  torch::jit::GraphOptimizerEnabledGuard no_optimizer_guard(false);
  auto module = torch::jit::load(argv[1]);
#endif
  std::cout << "ok\n";
}

reference

compile and execute
g++ demo.cpp \
  -I$PWD/pytorch/include \
  -L$PWD/pytorch/lib \
  -lc10 -ltorch_cpu -ltorch

LD_LIBRARY_PATH=$PWD/pytorch/lib ./a.out model.ptl
First, the PyTorch Vulkan build depends on glslc (built from shaderc):
git clone https://github.com/google/shaderc --depth=1
cd shaderc
./utils/git-sync-deps
# git clone https://github.com/KhronosGroup/glslang.git third_party/glslang
## https://github.com/KhronosGroup/glslang
cd third_party/glslang
git checkout 0c400f67fcf305869c5fb113dd296eca266c9725
cd ../..
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="$(pwd)/install" ..
make install -j32
build pytorch
# Binaries built with clang perform better than those built with GCC!! but why?!
# The wheels downloaded from conda are presumably built with clang, while the Jetson Orin custom PyTorch downloaded from Nvidia is presumably built with gcc
export CC=clang-14
export CXX=clang++-14
conda create -n pytorch python=3.8 pip
conda activate pytorch
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# don't install gcc/clang from conda; use the system gcc/g++, because we need the Vulkan SDK that is installed system-wide on the Jetson Orin
export PATH=$HOME/work/shaderc/build/install/bin:$PATH
python3 setup.py clean
pip install numpy # to enable USE_NUMPY by default
BUILD_BINARY=ON USE_CUDA=0 USE_VULKAN=1 python setup.py bdist_wheel 
The build configuration dump is as follows:
--   TORCH_VERSION         : 2.2.0
--   BUILD_CAFFE2          : OFF
--   BUILD_CAFFE2_OPS      : OFF
--   BUILD_STATIC_RUNTIME_BENCHMARK: OFF
--   BUILD_TENSOREXPR_BENCHMARK: OFF
--   BUILD_NVFUSER_BENCHMARK: OFF
--   BUILD_BINARY          : OFF
--   BUILD_CUSTOM_PROTOBUF : ON
--     Link local protobuf : ON
--   BUILD_DOCS            : OFF
--   BUILD_PYTHON          : True
--     Python version      : 3.8.10
--     Python executable   : /usr/bin/python
--     Pythonlibs version  : 3.8.10
--     Python library      : /usr/lib/libpython3.8.so.1.0
--     Python includes     : /usr/include/python3.8
--     Python site-packages: lib/python3.8/site-packages
--   BUILD_SHARED_LIBS     : ON
--   CAFFE2_USE_MSVC_STATIC_RUNTIME     : OFF
--   BUILD_TEST            : True
--   BUILD_JNI             : OFF
--   BUILD_MOBILE_AUTOGRAD : OFF
--   BUILD_LITE_INTERPRETER: OFF
--   INTERN_BUILD_MOBILE   : 
--   TRACING_BASED         : OFF
--   USE_BLAS              : 1
--     BLAS                : open
--     BLAS_HAS_SBGEMM     : 
--   USE_LAPACK            : 1
--     LAPACK              : open
--   USE_ASAN              : OFF
--   USE_TSAN              : OFF
--   USE_CPP_CODE_COVERAGE : OFF
--   USE_CUDA              : 0
--   USE_ROCM              : OFF
--   BUILD_NVFUSER         : OFF
--   USE_EIGEN_FOR_BLAS    : ON
--   USE_FBGEMM            : OFF
--     USE_FAKELOWP          : OFF
--   USE_KINETO            : ON
--   USE_FFMPEG            : OFF
--   USE_GFLAGS            : OFF
--   USE_GLOG              : OFF
--   USE_LEVELDB           : OFF
--   USE_LITE_PROTO        : OFF
--   USE_LMDB              : OFF
--   USE_METAL             : OFF
--   USE_PYTORCH_METAL     : OFF
--   USE_PYTORCH_METAL_EXPORT     : OFF
--   USE_MPS               : OFF
--   USE_FFTW              : OFF
--   USE_MKL               : OFF
--   USE_MKLDNN            : OFF
--   USE_UCC               : OFF
--   USE_ITT               : OFF
--   USE_NCCL              : OFF
--   USE_NNPACK            : ON
--   USE_NUMPY             : ON
--   USE_OBSERVERS         : ON
--   USE_OPENCL            : OFF
--   USE_OPENCV            : OFF
--   USE_OPENMP            : ON
--   USE_TBB               : OFF
--   USE_MIMALLOC          : OFF
--   USE_VULKAN            : 1
--     USE_VULKAN_FP16_INFERENCE    : OFF
--     USE_VULKAN_RELAXED_PRECISION : OFF
--   USE_PROF              : OFF
--   USE_QNNPACK           : ON
--   USE_PYTORCH_QNNPACK   : ON
--   USE_XNNPACK           : ON
--   USE_REDIS             : OFF
--   USE_ROCKSDB           : OFF
--   USE_ZMQ               : OFF
--   USE_DISTRIBUTED       : ON
--     USE_MPI               : OFF
--     USE_GLOO              : ON
--     USE_GLOO_WITH_OPENSSL : OFF
--     USE_TENSORPIPE        : ON
Vulkan backend runtime error❌
diff --git a/aten/src/ATen/native/vulkan/ops/Clamp.cpp b/aten/src/ATen/native/vulkan/ops/Clamp.cpp
index dc22b98..e9de33b 100644
--- a/aten/src/ATen/native/vulkan/ops/Clamp.cpp
+++ b/aten/src/ATen/native/vulkan/ops/Clamp.cpp
@@ -398,6 +398,7 @@ Tensor& activation_scalar_(
 }
 
 Tensor gelu(const Tensor& self_arg, c10::string_view approximate) {
+  approximate = "tanh";
   TORCH_CHECK(
       approximate == "tanh", "Vulkan: gelu only supported for tanh type");
   Scalar kBetaVec = M_SQRT2 * M_2_SQRTPI * 0.5;
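For context, a Vulkan model is produced by rewriting a traced float model with `optimize_for_mobile(backend="Vulkan")`, which inserts the `vulkan_prepack` ops that show up in the error traces below. A minimal sketch, assuming the Vulkan-enabled build from above (the model and file name are just examples):

```python
import torch
import torchvision
from torch.utils.mobile_optimizer import optimize_for_mobile

model = torchvision.models.mobilenet_v2(pretrained=True).eval()
traced = torch.jit.trace(model, torch.rand(1, 3, 224, 224))

# backend="Vulkan" rewrites supported ops (e.g. conv2d) into vulkan_prepack
# create/run context pairs and folds the packed weights into the module
# (the prepack_folding_* attributes seen in the traces below).
vulkan_module = optimize_for_mobile(traced, backend="Vulkan")
vulkan_module._save_for_lite_interpreter("mobilenet_v2-vulkan.ptl")
```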
nnapi backend convert error❌
diff --git a/torch/backends/_nnapi/serializer.py b/torch/backends/_nnapi/serializer.py
index 6ff3f5f..08ff29f 100644
--- a/torch/backends/_nnapi/serializer.py
+++ b/torch/backends/_nnapi/serializer.py
@@ -830,6 +830,9 @@ class _NnapiSerializer:
         "aten::add": lambda self, node: self.add_add_sub_op(
             node, NNAPI_OperationCode.ADD, NNAPI_FuseCode.FUSED_NONE
         ),
+        "aten::add_": lambda self, node: self.add_add_sub_op(
+            node, NNAPI_OperationCode.ADD, NNAPI_FuseCode.FUSED_NONE
+        ),
         "aten::sub": lambda self, node: self.add_add_sub_op(
             node, NNAPI_OperationCode.SUB, NNAPI_FuseCode.FUSED_NONE
         ),
@@ -842,6 +845,9 @@ class _NnapiSerializer:
         "aten::relu": lambda self, node: self.add_pointwise_simple_unary_op(
             node, NNAPI_OperationCode.RELU
         ),
+        "aten::relu_": lambda self, node: self.add_pointwise_simple_unary_op(
+            node, NNAPI_OperationCode.RELU
+        ),
         "aten::sigmoid": lambda self, node: self.add_pointwise_simple_unary_op(
             node, NNAPI_OperationCode.LOGISTIC
         ),
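The extra entries are needed because traced models keep the in-place variants of these ops (`aten::add_`, `aten::relu_`, and in the failures below `aten::silu_`), while the serializer dictionary only listed the out-of-place names. A quick way to see the in-place op appear in a traced graph (a minimal sketch):

```python
import torch
import torch.nn as nn

# inplace=True makes the traced graph contain aten::relu_ rather than aten::relu.
m = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(inplace=True)).eval()
traced = torch.jit.trace(m, torch.rand(1, 3, 32, 32))
print(traced.graph)  # look for "aten::relu_" in the printed nodes
```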
Linux mobile CPU backend error❌
  • EMO
Creating pytorch module: EMO_1M
(index: 363,  score: 7.098042), (index: 328,  score: 5.409191), (index: 769,  score: 5.257066), 
Creating pytorch module: EMO_2M
(index: 699,  score: 3.296871), (index: 520,  score: 3.237335), (index: 488,  score: 3.080106), 
Creating pytorch module: EMO_5M
(index: 741,  score: 5.842388), (index: 846,  score: 5.580930), (index: 885,  score: 5.049117), 
Creating pytorch module: EMO_6M
(index: 983,  score: 2.835817), (index: 2,  score: 2.807207), (index: 147,  score: 2.776361), 
  • mobilevitv2
Creating pytorch module: mobilevitv2_050
(index: 470,  score: 37.888622), (index: 43,  score: 36.020939), (index: 887,  score: 32.847816), 
Creating pytorch module: mobilevitv2_075
(index: 155,  score: 24.285681), (index: 353,  score: 22.135773), (index: 517,  score: 22.006750), 
Creating pytorch module: mobilevitv2_100
(index: 193,  score: 24.245552), (index: 160,  score: 23.736561), (index: 386,  score: 22.456692), 
Creating pytorch module: mobilevitv2_125
(index: 852,  score: 29.719034), (index: 36,  score: 26.571074), (index: 357,  score: 25.765173), 
Creating pytorch module: mobilevitv2_150
(index: 91,  score: 42.911991), (index: 994,  score: 39.168468), (index: 15,  score: 39.123428), 
Creating pytorch module: mobilevitv2_175
(index: 355,  score: 44.312561), (index: 341,  score: 43.000805), (index: 733,  score: 42.077049), 
Creating pytorch module: mobilevitv2_200
(index: 975,  score: 42.539143), (index: 374,  score: 35.604237), (index: 988,  score: 34.880028),
  • mobilevit
Creating pytorch module: mobilevit_xx_small
(index: 703,  score: 18.165537), (index: 646,  score: 14.215631), (index: 858,  score: 4.228817), 
Creating pytorch module: mobilevit_x_small
(index: 794,  score: 49.864689), (index: 838,  score: 40.894344), (index: 977,  score: 8.548670), 
Creating pytorch module: mobilevit_small
(index: 688,  score: 58.857357), (index: 545,  score: 1.859277), (index: 442,  score: -6.560894),
  • resnet50
Creating pytorch module: resnet50
(index: 227,  score: 26.693110), (index: 334,  score: 20.228640), (index: 278,  score: 17.633595),

Vulkan runtime situation

efficientformerv2❌

RuntimeError: Could not run 'aten::bmm.out' with arguments from the 'Vulkan' backend.

SwiftFormer❌

RuntimeError: Could not run 'aten::linalg_vector_norm.out' with arguments from the 'Vulkan' backend.

EMO❌
  File "__torch__.models.emo.___torch_mangle_1586", line 77, in forward
    prepack_folding_forward__jit_pass_packed_weight_1 = getattr(self, "prepack_folding_forward._jit_pass_packed_weight_1")
    _2 = ops.vulkan_prepack.run_conv2d_context(_1, prepack_folding_forward__jit_pass_packed_weight_1)
    _3 = torch.silu(_2)
         ~~~~~~~~~~ <--- HERE
    input = torch.mean(_3, [2, 3], True)
    prepack_folding_forward__jit_pass_packed_weight_2 = getattr(self, "prepack_folding_forward._jit_pass_packed_weight_2")
RuntimeError: Cannot access data pointer of Tensor that doesn't have storage
edgenext❌
  File "__torch__.models.edgenext.___torch_mangle_678", line 38, in forward
    prepack_folding_forward__jit_pass_packed_weight_0 = getattr(self, "prepack_folding_forward._jit_pass_packed_weight_0")
    _1 = ops.vulkan_prepack.run_conv2d_context(_0, prepack_folding_forward__jit_pass_packed_weight_0)
    u = torch.mean(_1, [1], True)
        ~~~~~~~~~~ <--- HERE
    _2 = torch.sub(_1, u)
    s = torch.mean(torch.pow(_2, 2), [1], True)
RuntimeError: Vulkan mean: currently only supports image-wide reduction!
mobilevitv2❌
  File "__torch__.models.mobilevit_v2.___torch_mangle_1452", line 79, in forward
    prepack_folding_forward__jit_pass_packed_weight_0 = getattr(self, "prepack_folding_forward._jit_pass_packed_weight_0")
    _1 = ops.vulkan_prepack.run_conv2d_context(_0, prepack_folding_forward__jit_pass_packed_weight_0)
    input = torch.silu(_1)
            ~~~~~~~~~~ <--- HERE
    prepack_folding_forward__jit_pass_packed_weight_1 = getattr(self, "prepack_folding_forward._jit_pass_packed_weight_1")
    _2 = ops.vulkan_prepack.run_conv2d_context(input, prepack_folding_forward__jit_pass_packed_weight_1)
RuntimeError: Cannot access data pointer of Tensor that doesn't have storage
mobilevit❌
  File "__torch__.models.mobilevit.___torch_mangle_1356", line 47, in forward
    prepack_folding_forward__jit_pass_packed_weight_0 = getattr(self, "prepack_folding_forward._jit_pass_packed_weight_0")
    _1 = ops.vulkan_prepack.run_conv2d_context(_0, prepack_folding_forward__jit_pass_packed_weight_0)
    input = torch.silu(_1)
            ~~~~~~~~~~ <--- HERE
    prepack_folding_forward__jit_pass_packed_weight_1 = getattr(self, "prepack_folding_forward._jit_pass_packed_weight_1")
    _2 = ops.vulkan_prepack.run_conv2d_context(input, prepack_folding_forward__jit_pass_packed_weight_1)
RuntimeError: Cannot access data pointer of Tensor that doesn't have storage
LeViT❌

RuntimeError: Could not run 'aten::as_strided' with arguments from the 'Vulkan' backend.

nnapi convert situation

efficientformerv2❌

Exception: Unsupported node kind ('aten::gelu') in node %input.5 : Tensor = aten::gelu(%input.1, %51), scope: __module.patch_embed/__module.patch_embed.2

SwiftFormer❌

Exception: Unsupported node kind ('aten::gelu') in node %input.19 : Tensor = aten::gelu(%input.17, %105), scope: __module.network.0/__module.network.0.0/__module.network.0.0.act

EMO❌

Exception: Unsupported node kind ('aten::silu_') in node %x.5 : Tensor = aten::silu_(%input.3), scope: __module.stage0.1/__module.stage0.1.conv_local/__module.stage0.1.conv_local.act

edgenext❌

Exception: Unsupported node kind ('aten::pow') in node %59 : Tensor = aten::pow(%58, %37), scope: __module.downsample_layers.0/__module.downsample_layers.0.1

mobilevitv2❌

Exception: Unsupported node kind ('aten::silu') in node %input.5 : Tensor = aten::silu(%input.1), scope: __module.conv_1/__module.conv_1.block/__module.conv_1.block.act

mobilevit❌

Exception: Unsupported node kind ('aten::silu') in node %input.5 : Tensor = aten::silu(%input.1), scope: __module.conv_1/__module.conv_1.block/__module.conv_1.block.act

LeViT❌

Exception: Unsupported node kind ('aten::hardswish') in node %input.5 : Tensor = aten::hardswish(%input.1), scope: __module.patch_embed/__module.patch_embed.1
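A possible (untested here) workaround for the unsupported activation modules is to swap them for compositions of ops the serializer does handle before tracing, e.g. expressing SiLU as `x * sigmoid(x)` (`aten::sigmoid` is in the serializer table shown above; this sketch assumes `aten::mul` is handled as well). Ops like `aten::pow` or `aten::gelu` have no equally simple substitution:

```python
import torch
import torch.nn as nn

class DecomposedSiLU(nn.Module):
    """silu(x) = x * sigmoid(x), written with ops the NNAPI serializer knows."""
    def forward(self, x):
        return x * torch.sigmoid(x)

def replace_silu(module: nn.Module) -> None:
    """Recursively swap every nn.SiLU for DecomposedSiLU before torch.jit.trace."""
    for name, child in module.named_children():
        if isinstance(child, nn.SiLU):
            setattr(module, name, DecomposedSiLU())
        else:
            replace_silu(child)
```

This only helps where the model uses the `nn.SiLU` module (as the `.act` scopes in the errors above suggest); direct `F.silu` calls would need patching in the model source.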

  • Code 1. Install procedure of PyTorch and torchvision (skip this step, but you still need to fetch the pytorch source code)
get pytorch source code
git clone https://github.com/pytorch/pytorch
cd pytorch
git submodule sync
git submodule update --init --recursive
pip install pyyaml
  • Code 2. Python script for model preparation & Code 3. Run the model preparation script
mkdir mobilenetv2-nnapi
python3 prepare_model.py
prepare_model.py
import sys
import os
import torch
import torch.utils.bundled_inputs
import torch.utils.mobile_optimizer
import torch.backends._nnapi.prepare
import torchvision.models.quantization.mobilenet
from pathlib import Path

# This script supports 3 modes of quantization:
# - "none": Fully floating-point model.
# - "core": Quantize the core of the model, but wrap it a quantizer/dequantizer pair, so the interface uses floating point.
# - "full": Quantize the model, and use quantized tensors for input and output.
#
# - "none" maintains maximum accuracy
# - "core" sacrifices some accuracy for performance, but maintains the same interface.
# - "full" maximized performance (with the same accuracy as "core"), but requires the application to use quantized tensors.
#
# There is a fourth option, not supported by this script,
# where we include the quant/dequant steps as NNAPI operators.
def make_mobilenetv2_nnapi(quantize_mode):
    quantize_core, quantize_iface = {
        "none": (False, False),
        "core": (True, False),
        "full": (True, True),
    }[quantize_mode]

    model = torchvision.models.quantization.mobilenet.mobilenet_v2(pretrained=True, quantize=quantize_core)
    model.eval()

    # Fuse BatchNorm operators in the floating point model.
    # (Quantized models already have this done.)
    # Remove dropout for this inference-only use case.
    if not quantize_core:
        model.fuse_model()
    model.classifier[0] = torch.nn.Identity()
 
    input_float = torch.zeros(1, 3, 224, 224)
    input_tensor = input_float

    # If we're doing a quantized model, we need to trace only the quantized core.
    # So capture the quantizer and dequantizer, use them to prepare the input,
    # and replace them with identity modules so we can trace without them.
    if quantize_core:
        quantizer = model.quant
        dequantizer = model.dequant
        model.quant = torch.nn.Identity()
        model.dequant = torch.nn.Identity()
        input_tensor = quantizer(input_float)

    # Many NNAPI backends prefer NHWC tensors, so convert our input to channels_last,
    # and set the "nnapi_nhwc" attribute for the converter.
    input_tensor = input_tensor.contiguous(memory_format=torch.channels_last)
    input_tensor.nnapi_nhwc = True

    # Trace the model.  NNAPI conversion only works with TorchScript models,
    # and traced models are more likely to convert successfully than scripted.
    with torch.no_grad():
        traced = torch.jit.trace(model, input_tensor)
    nnapi_model = torch.backends._nnapi.prepare.convert_model_to_nnapi(traced, input_tensor)
 
    # If we're not using a quantized interface, wrap a quant/dequant around the core.
    if quantize_core and not quantize_iface:
        nnapi_model = torch.nn.Sequential(quantizer, nnapi_model, dequantizer)
        model.quant = quantizer
        model.dequant = dequantizer
        # Switch back to float input for benchmarking.
        input_tensor = input_float.contiguous(memory_format=torch.channels_last)
 
    # Optimize the CPU model to make CPU-vs-NNAPI benchmarks fair.
    model = torch.utils.mobile_optimizer.optimize_for_mobile(torch.jit.script(model))
 
    # Bundle sample inputs with the models for easier benchmarking.
    # This step is optional.
    class BundleWrapper(torch.nn.Module):
        def __init__(self, mod):
            super().__init__()
            self.mod = mod
        def forward(self, arg):
            return self.mod(arg)

    nnapi_model = torch.jit.script(BundleWrapper(nnapi_model))
    torch.utils.bundled_inputs.augment_model_with_bundled_inputs(
        model, [(torch.utils.bundled_inputs.bundle_large_tensor(input_tensor),)])
    torch.utils.bundled_inputs.augment_model_with_bundled_inputs(
        nnapi_model, [(torch.utils.bundled_inputs.bundle_large_tensor(input_tensor),)])

    # Save both models.
    model._save_for_lite_interpreter("mobilenetv2-quant_{}-cpu.ptl".format(quantize_mode), _use_flatbuffer=True)
    nnapi_model._save_for_lite_interpreter("mobilenetv2-quant_{}-nnapi.ptl".format(quantize_mode), _use_flatbuffer=True)

if __name__ == "__main__":
    for quantize_mode in ["none", "core", "full"]:
        make_mobilenetv2_nnapi(quantize_mode)
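Before pushing the `.ptl` files to a device, they can be sanity-checked on the host with the same lite-interpreter runtime (a minimal sketch; the quantized and NNAPI variants need the matching backend/driver, so only the float CPU model is exercised here):

```python
import torch
from torch.jit.mobile import _load_for_lite_interpreter

m = _load_for_lite_interpreter("mobilenetv2-quant_none-cpu.ptl")

x = torch.zeros(1, 3, 224, 224).contiguous(memory_format=torch.channels_last)
with torch.no_grad():
    out = m(x)
print(out.shape)  # expected: torch.Size([1, 1000])
```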
  • Code 4. Build benchmarking program
export ANDROID_NDK=<xxxx>
ANDROID_NDK=$ANDROID_NDK ANDROID_NATIVE_API_LEVEL=30 ANDROID_ABI=arm64-v8a \
./scripts/build_android.sh -DBUILD_BINARY=ON -DBUILD_SHARED_LIBS=ON
ANDROID_NDK=$ANDROID_NDK ANDROID_NATIVE_API_LEVEL=30 ANDROID_ABI=arm64-v8a \
./scripts/build_android.sh -DBUILD_BINARY=ON -DBUILD_SHARED_LIBS=ON -DBUILD_LITE_INTERPRETER=OFF
# Installation completed, now you can copy the headers/libs from /home/albert/work/work/pytorch/build_android/install to your Android project directory.
  • Code 5. Run benchmarking program on a mobile device

terminating with uncaught exception of type c10::Error: PytorchStreamReader failed locating file bytecode.pkl: file not found ()

Hi, buddies) This information solved the problem for me. Hope this will be helpful for you as well: https://pytorch.org/tutorials/recipes/mobile_interpreter.html. Spoiler alert: you should use script._save_for_lite_interpreter() instead of script.save()

run under Android termux
./speed_benchmark_torch --model=ptl/mobilenetv2-quant_core-cpu.ptl --pthreadpool_size=1 --warmup=10 --iter=200 --use_bundled_input=0
# or
./speed_benchmark_torch --model=ptl/mobilenetv2-quant_core-cpu.ptl --pthreadpool_size=1 --warmup=10 --iter=200 --input_dims="1,3,224,224" --input_type="float"

But which of these two invocations is correct? The results differ a lot.

result diff
$ ./speed_benchmark_torch --model=ptl/mobilenetv2-quant_none-cpu.ptl --pthreadpool_size=1 --warmup=10 --iter=200 --input_dims="1,3,224,224" --input_type="float"
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 83254.1. Iters per second: 12.0114

$ ./speed_benchmark_torch --model=ptl/mobilenetv2-quant_none-cpu.ptl --pthreadpool_size=1 --warmup=10 --iter=200 --use_bundled_input=0                          
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 33761.5. Iters per second: 29.6195

Since running the nnapi model reports the following error:

terminating with uncaught exception of type c10::Error: The implementation of class __torch__.torch.classes._nnapi.Compilation cannot be found.
Exception raised from resolveType at ~/pytorch/torch/csrc/jit/mobile/flatbuffer_loader.cpp:196 (most recent call first):

Refer to About build_android.sh, LITE and NNAPI

Make the following change to the CMake file:
diff --git a/aten/src/ATen/CMakeLists.txt b/aten/src/ATen/CMakeLists.txt
index edd789d..b8c7b56 100644
--- a/aten/src/ATen/CMakeLists.txt
+++ b/aten/src/ATen/CMakeLists.txt
@@ -191,7 +191,7 @@ add_subdirectory(quantized)
 add_subdirectory(nnapi)
 
 if(BUILD_LITE_INTERPRETER)
-  set(all_cpu_cpp ${generated_sources} ${core_generated_sources} ${cpu_kernel_cpp})
+  set(all_cpu_cpp ${generated_sources} ${core_generated_sources} ${ATen_NNAPI_SRCS} ${cpu_kernel_cpp})
   append_filelist("jit_core_sources" all_cpu_cpp)
   append_filelist("aten_cpu_source_non_codegen_list" all_cpu_cpp)
   append_filelist("aten_native_source_non_codegen_list" all_cpu_cpp)

Attention! You must run rm -rf build_android and then rebuild from scratch; rebuilding directly on top of the old build does not work.

There is also a problem related to the type (see image).
Run statistics:
./speed_benchmark_torch --model=ptl/mobilenetv2-quant_none-nnapi.ptl --pthreadpool_size=1 --warmup=10 --iter=200 --input_dims="1,3,224,224" --input_type="float"
# khadas edge2
## 100% CPU utilization
Main run finished. Microseconds per iter: 108929. Iters per second: 9.18031
# xiaomi mi8 snapdragon 845
## run on hexagon gpu 58%
Main run finished. Microseconds per iter: 14310.9. Iters per second: 69.8769
# 8gen2
## run on cpu
Main run finished. Microseconds per iter: 49963.1. Iters per second: 20.0148
# samsung
## run on gpu
Main run finished. Microseconds per iter: 18037.9. Iters per second: 55.4387
./speed_benchmark_torch --model=ptl/mobilenetv2-quant_core-nnapi.ptl --pthreadpool_size=1 --warmup=10 --iter=200  --input_dims="1,3,224,224" --input_type="float"
# khadas edge2
## 170% CPU utilization
Main run finished. Microseconds per iter: 370434. Iters per second: 2.69953
# xiaomi mi8 snapdragon 845
## run on hexagon dsp
Main run finished. Microseconds per iter: 9050.75. Iters per second: 110.488
# 8gen2
## run on hexagon???
Main run finished. Microseconds per iter: 293571. Iters per second: 3.40633
# samsung
## run on gpu
Main run finished. Microseconds per iter: 13118.4. Iters per second: 76.2289

precision

reference

New users of quantization are encouraged to try out FX Graph Mode Quantization first; if it does not work, the user may try to follow the guidelines for using FX Graph Mode Quantization or fall back to eager mode quantization.

Attach a global qconfig, which contains information about what kind of observers to attach. Use x86 for server inference and qnnpack for mobile inference. Other quantization configurations, such as selecting symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques, can be specified here.

Note: the old fbgemm is still available but x86 is the recommended default for server inference.
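As an illustration of that flow (a minimal sketch; the model is just an example, and "qnnpack" selects the mobile backend while "x86" would be the server-side choice):

```python
import torch
import torchvision
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

float_model = torchvision.models.mobilenet_v2(pretrained=True).eval()
example_inputs = (torch.rand(1, 3, 224, 224),)

# Global qconfig: which observers to attach and which backend to target.
qconfig_mapping = get_default_qconfig_mapping("qnnpack")
torch.backends.quantized.engine = "qnnpack"

prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
with torch.no_grad():
    prepared_model(example_inputs[0])  # calibrate with representative data
quantized_model = convert_fx(prepared_model)
```

This is the same `prepare_fx` call that fails for several of the transformer models in the results below.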

# miniforge3/lib/python3.10/site-packages/torch/backends/quantized/__init__.py
# This function should correspond to the enums present in c10/core/QEngine.h
def _get_qengine_id(qengine: str) -> int:
    if qengine == "none" or qengine == "" or qengine is None:
        ret = 0
    elif qengine == "fbgemm":
        ret = 1
    elif qengine == "qnnpack":
        ret = 2
    elif qengine == "onednn":
        ret = 3
    elif qengine == "x86":
        ret = 4
    else:
        ret = -1
        raise RuntimeError(f"{qengine} is not a valid value for quantized engine")
    return ret


# This function should correspond to the enums present in c10/core/QEngine.h
def _get_qengine_str(qengine: int) -> str:
    all_engines = {0: "none", 1: "fbgemm", 2: "qnnpack", 3: "onednn", 4: "x86"}
    return all_engines.get(qengine, "*undefined")
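These helpers back the `torch.backends.quantized.engine` property, so picking the engine from Python looks like this (a quick sketch):

```python
import torch

# Engines compiled into this build, e.g. ['none', 'qnnpack', 'x86'] on a
# typical x86 CPU build (the exact list depends on the build flags).
print(torch.backends.quantized.supported_engines)

# Select QNNPACK for mobile/ARM inference; the string is mapped to the
# C++ QEngine enum via _get_qengine_id above.
torch.backends.quantized.engine = "qnnpack"
print(torch.backends.quantized.engine)
```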

result

  • SwiftFormer
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
    raise RuntimeError('Tensor type unknown to einops {}'.format(type(tensor)))
RuntimeError: Tensor type unknown to einops <class 'torch.fx.proxy.Proxy'>
  • EMO
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
    raise RuntimeError('Tensor type unknown to einops {}'.format(type(tensor)))
RuntimeError: Tensor type unknown to einops <class 'torch.fx.proxy.Proxy'>
  • edgenext
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
    mask = torch.zeros(B, H, W).bool().to(self.token_projection.weight.device)
TypeError: zeros() received an invalid combination of arguments - got (Proxy, Proxy, Proxy), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (tuple of ints size, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
  • mobilevitv2
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
    raise TraceError('symbolically traced variables cannot be used as inputs to control flow')
torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow
  • mobilevit
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
    new_h = int(math.ceil(orig_h / self.patch_h) * self.patch_h)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'Proxy'
  • tf_efficientnetv2
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
    raise TraceError('symbolically traced variables cannot be used as inputs to control flow')
torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow

misc
