tensorflow lite

how to build

# by default, we use the system's official clang
export CC=/usr/bin/clang-16
export CXX=/usr/bin/clang++-16
# if you encounter "uint8_t not defined", add the following header at the top of the C++ file
#include <cstdint>
build commands
# using the base env is enough
conda install gxx=12.3.0 clang clangxx bazel cmake pkg-config

git clone https://github.com/tensorflow/tensorflow.git --depth=1
# fef54a90b1c2aacd6ec8625be86ff45a51a290a0
cd tensorflow
./configure
# use clang to build
#You have Clang 16.0.6 installed.
#Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -Wno-sign-compare]: -O3

bazel build --verbose_failures -c opt //tensorflow/lite:tensorflowlite --define tflite_with_xnnpack=true # --jobs 4
bazel build --verbose_failures -c opt --config=monolithic tensorflow/lite/delegates/flex:tensorflowlite_flex --define tflite_with_xnnpack=true # --jobs 4

# prepare include directory
cp -r tensorflow/lite include/tensorflow
cp -r tensorflow/core include/tensorflow # for armnn
find . ! \( -name '*.h' \) -type f -exec rm -f {} +
cd -
git clone https://github.com/google/flatbuffers.git --depth=1
cp -r flatbuffers/include/flatbuffers include

The XNNPACK engine used by the TensorFlow Lite interpreter uses a single thread for inference by default.

Note that in this case an Interpreter::SetNumThreads invocation does not affect the number of threads used by the XNNPACK engine. In order to specify the number of threads available to the XNNPACK engine, you should pass the value when constructing the interpreter. The snippet below illustrates this, assuming you are using InterpreterBuilder to construct the interpreter:
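A minimal sketch, assuming the usual FlatBufferModel / BuiltinOpResolver setup; the BuildInterpreter helper name and the num_threads parameter are only illustrative:

#include <memory>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Build an interpreter whose default XNNPACK delegate also uses `num_threads`.
std::unique_ptr<tflite::Interpreter> BuildInterpreter(
    const tflite::FlatBufferModel &model, int num_threads) {
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder builder(model, resolver);
    builder.SetNumThreads(num_threads); // picked up by XNNPACK at construction time
    builder(&interpreter);
    return interpreter;
}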

Note that the include directories (tensorflow, flatbuffers, armnn) and the libraries must come from the same build; do not mix versions.

use an x86 machine

For this purpose, we uniformly use API == 30.

build commands
### prepare basic env
conda create -n py3.9 python=3.9 pip ipython
conda activate py3.9
conda install gxx=12.3.0 clang clangxx bazel cmake pkg-config

### prepare android sdk
# https://mirrors.cloud.tencent.com/AndroidSDK/
wget https://mirrors.cloud.tencent.com/AndroidSDK/commandlinetools-linux-8512546_latest.zip
unzip commandlinetools-linux-8512546_latest.zip
mkdir android-sdk && cd android-sdk
mkdir cmdline-tools
mv ../cmdline-tools/ cmdline-tools/latest
# https://developer.android.com/tools/sdkmanager
./cmdline-tools/latest/bin/sdkmanager "platform-tools" "platforms;android-33" "build-tools;34.0.0" 

### prepare android ndk
wget https://dl.google.com/android/repository/android-ndk-r22b-linux-x86_64.zip
unzip android-ndk-r22b-linux-x86_64.zip
# WARNING: The NDK version in android-ndk-r22b is 22, which is not supported by Bazel (officially supported versions: [19, 20, 21]). Please use another version. Compiling Android targets may result in confusing errors.
wget https://dl.google.com/android/repository/android-ndk-r21e-linux-x86_64.zip
unzip android-ndk-r21e-linux-x86_64.zip
# clang: error: the clang compiler does not support '-march=armv8.2-a+i8mm'

solution:
cd android-ndk-r21e
mv toolchains toolchains.bak
ln -s ../android-ndk-r22b/toolchains .

# export LD_LIBRARY_PATH=$HOME/miniforge3/envs/py3.9/lib
# tflite_with_xnnpack default is true
# can use following command to disable xnnpack
# bazel build -c opt --config=android_arm64 --cpu=arm64-v8a --define tflite_with_xnnpack=true //tensorflow/lite:tensorflowlite
bazel build --verbose_failures -c opt --config=android_arm64 //tensorflow/lite:tensorflowlite --define tflite_with_xnnpack=true # --jobs 4
bazel build --verbose_failures -c opt --config=android_arm64 --config=monolithic tensorflow/lite/delegates/flex:tensorflowlite_flex --define tflite_with_xnnpack=true # --jobs 4
cd bazel-bin
tar cf tensorflow-lite.tar tensorflow/lite


how to convert model

note: see the reference script onnx-tflite.sh

attention! To avoid weird issues caused by other Python versions, we uniformly use a Python 3.9 environment to convert the models.

convert commands
conda create -n py3.9 python=3.9 pip ipython
conda activate py3.9

pytorch --> onnx --> tf --> tflite

find an x86-64 machine

  • pip install tensorflow
  • pip install git+https://github.com/onnx/onnx-tensorflow.git
  • onnx-tf convert -i onnx/mobilevitv2_050.onnx -o tflite/mobilevitv2_050.pb
  • python ./tf-tflite.py --only-convert=mobilevitv2_050
  • efficientformerv2 ✅
  • SwiftFormer
    • onnx opset-version 12
      • cpu ✅
      • gpu ❌
    • onnx opset-version > 12: ❌ convert error
tensorflow.lite.python.convert_phase.ConverterError: <unknown>:0: error: loc(callsite(callsite(fused["Reshape:", "onnx_tf_prefix_/network.0/network.0.2/attn/Reshape_1@__inference___call___2760"] at fused["PartitionedCall:", "PartitionedCall@__inference_signature_wrapper_3178"]) at fused["PartitionedCall:", "PartitionedCall"])): 'tfl.reshape' op requires 'output' number of elements to match 'input' number of elements, but got 150528 and 48
  • EMO ✅ after removing the useless F.pad, see emo.patch
    • ❌ conversion error
    File "~/miniforge3/envs/py3.9/lib/python3.9/site-packages/onnx_tf/backend_tf_module.py", line 99, in __call__  *
        output_ops = self.backend._onnx_node_to_tensorflow_op(onnx_node,
    File "~/miniforge3/envs/py3.9/lib/python3.9/site-packages/onnx_tf/backend.py", line 347, in _onnx_node_to_tensorflow_op  *
        return handler.handle(node, tensor_dict=tensor_dict, strict=strict)
    File "~/miniforge3/envs/py3.9/lib/python3.9/site-packages/onnx_tf/handlers/handler.py", line 59, in handle  *
        return ver_handle(node, **kwargs)
    File "~/miniforge3/envs/py3.9/lib/python3.9/site-packages/onnx_tf/handlers/backend/pad.py", line 95, in version_13  *
        return cls._common(node, **kwargs)
    File "~/miniforge3/envs/py3.9/lib/python3.9/site-packages/onnx_tf/handlers/backend/pad.py", line 73, in _common  *
        constant_values = tensor_dict[node.inputs[2]] if len(

    KeyError: ''
      constant_values = tensor_dict[node.inputs[2]] if len(
          node.inputs) == 3 else 0
      # node.inputs[2] == ''


    • ❌ runtime error
Creating tflite runtime interpreter: EMO_1M
INFO: Created TensorFlow Lite delegate for select TF ops.
INFO: TfLiteFlexDelegate delegate: 100 nodes delegated out of 5709 nodes with 74 partitions.

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: tensorflow/lite/kernels/reshape.cc:92 num_input_elements != num_output_elements (0 != 8)
ERROR: Node number 0 (RESHAPE) failed to prepare.
ERROR: Node number 720 (IF) failed to prepare.
[1]    3647892 segmentation fault (core dumped)
  • edgenext
    • cpu ✅
    • gpu ❌
  • mobilevitv2 ✅
  • mobilevit
    • cpu ✅
    • gpu ❌
  • LeViT ✅
  • tf_efficientnetv2
    • onnx opset-version 12
      • cpu ✅
      • gpu ❌
    • onnx opset-version > 12: ❌ runtime error
ERROR: tensorflow/lite/kernels/reshape.cc:92 num_input_elements != num_output_elements (0 != 8)
ERROR: Node number 0 (RESHAPE) failed to prepare.
ERROR: Node number 82 (IF) failed to prepare.
Segmentation fault (core dumped)

how to use

https://blog.tensorflow.org/2020/07/accelerating-tensorflow-lite-xnnpack-integration.html
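Before the tflite_perf demo below, here is a bare-bones sketch of the C++ flow it follows (the model path, the float tensor types and the omitted preprocessing are assumptions; tflite_perf.cpp and utils.cpp in the repo are the real implementation):

#include <cstdio>
#include <memory>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
    // Load a float32 classification model converted earlier.
    auto model = tflite::FlatBufferModel::BuildFromFile("tflite/resnet50.tflite");
    if (!model) return 1;
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);
    if (!interpreter || interpreter->AllocateTensors() != kTfLiteOk) return 1;
    float *input = interpreter->typed_input_tensor<float>(0);
    // ... fill `input` with a preprocessed image (see load_image in utils.cpp) ...
    (void)input;
    if (interpreter->Invoke() != kTfLiteOk) return 1;
    float *output = interpreter->typed_output_tensor<float>(0);
    std::printf("first logit: %f\n", output[0]);
    return 0;
}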

execution demo
export TFLITE_LIB=$HOME/work/tensorflow/lib
export TFLITE_INC=$HOME/work/tensorflow/include
g++ -O3 -o tflite_perf tflite_perf.cpp utils.cpp  -std=c++17 -I$TFLITE_INC -L$TFLITE_LIB -ltensorflowlite -ltensorflowlite_flex `pkg-config --cflags --libs opencv4`
export LD_PRELOAD=$TFLITE_LIB/libtensorflowlite_flex.so
LD_LIBRARY_PATH=$TFLITE_LIB ./tflite_perf --only-test=eff #2>/dev/null

# without xnnpack
Creating tflite runtime interpreter: resnet50
INFO: Initialized TensorFlow Lite runtime.
(index: 985,  score: 7.986878), (index: 113,  score: -5.246380), (index: 310,  score: -5.445833), 
min =   320.72ms        max =   322.12ms        mean =  321.10ms        median =        321.04ms
Creating tflite runtime interpreter: mobilenetv3_large_100
INFO: Initialized TensorFlow Lite runtime.
(index: 985,  score: 9.726583), (index: 310,  score: 2.717167), (index: 308,  score: 2.388680), 
min =   46.04ms max =   47.13ms mean =  46.23ms median =        46.19ms

# with xnnpack
Creating tflite runtime interpreter: resnet50
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 125 out of 127 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 5 partitions for the whole graph.
(index: 985,  score: 7.986875), (index: 113,  score: -5.246378), (index: 310,  score: -5.445824), 
min =   274.13ms        max =   275.33ms        mean =  274.61ms        median =        274.57ms
Creating tflite runtime interpreter: mobilenetv3_large_100
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 295 out of 304 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 19 partitions for the whole graph.
(index: 985,  score: 9.726580), (index: 310,  score: 2.717167), (index: 308,  score: 2.388680), 
min =   28.27ms max =   30.86ms mean =  28.47ms median =        28.42ms

One very strange thing: I could not reproduce this with the model I previously converted on an Apple machine, but with the re-converted model, resnet50 is remarkably fast!

Creating tflite runtime interpreter: resnet50
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 71 out of 127 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 72 partitions for the whole graph.
(index: 985,  score: 8.152998), (index: 113,  score: -5.376539), (index: 310,  score: -5.619974), 
[179 iters] min = 111.98ms max = 113.36ms median = 112.13ms mean = 112.23ms

TFLite GPU for Android C/C++ uses the Bazel build system.

For example, the delegate can be built with the following commands:
bazel build -c opt --config=android_arm64 tensorflow/lite/delegates/gpu:delegate                           # for static library
bazel build -c opt --config=android_arm64 tensorflow/lite/delegates/gpu:libtensorflowlite_gpu_delegate.so  # for dynamic library
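Attaching the delegate built above to an existing interpreter looks roughly like the following sketch (the helper name and fallback handling are illustrative, not code from this repo):

#include "tensorflow/lite/delegates/gpu/delegate.h"
#include "tensorflow/lite/interpreter.h"

// Try to offload the graph to the GPU delegate; fall back to CPU on failure.
bool ApplyGpuDelegate(tflite::Interpreter *interpreter) {
    TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
    TfLiteDelegate *delegate = TfLiteGpuDelegateV2Create(&options);
    if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
        TfLiteGpuDelegateV2Delete(delegate);
        return false; // unsupported ops keep the graph on the CPU
    }
    // Note: on success the delegate must outlive the interpreter;
    // call TfLiteGpuDelegateV2Delete only after the interpreter is destroyed.
    return true;
}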

Running quantized models on the GPU: this part explains how the GPU delegate accelerates 8-bit quantized models, covering all flavors of quantization.

Note: Although NNAPI has been supported since API level 27 (Android Oreo MR1), support for operations improved significantly in API level 28 (Android Pie) and later. Therefore, for most scenarios we recommend that developers use the NNAPI delegate on Android Pie or higher.

For this reason, we uniformly use API == 30.

reference: https://blog.seeso.io/building-tensorflow-lite-c-in-android-1c8de1639e1d

build commands
bazel build -c opt --config=android_arm64 //tensorflow/lite/nnapi:nnapi_util
bazel build -c opt --config=android_arm64 //tensorflow/lite/nnapi:nnapi_implementation
bazel build -c opt --config=android_arm64 //tensorflow/lite/delegates/nnapi:nnapi_delegate_no_nnapi_implementation
---
# or
bazel build -c opt --config=android_arm64 //tensorflow/lite/delegates/nnapi:nnapi_delegate
bazel build -c opt --config=android_arm64 //tensorflow/lite/nnapi:nnapi_implementation
bazel build -c opt --config=android_arm64 //tensorflow/lite/nnapi:nnapi_util

You must use the static library for the delegate implementation (libnnapi_delegate_no_nnapi_implementation.a) and the shared libraries for the others (libnnapi_implementation.so, libnnapi_util.so).
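With those libraries linked into tflite_perf, attaching the NNAPI delegate from C++ might look like the sketch below (the option values are illustrative assumptions; adjust them to your device):

#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
#include "tensorflow/lite/interpreter.h"

// Hand the supported part of the graph to NNAPI; unsupported ops stay on the CPU.
bool ApplyNnapiDelegate(tflite::Interpreter *interpreter) {
    tflite::StatefulNnApiDelegate::Options options;
    options.execution_preference =
        tflite::StatefulNnApiDelegate::Options::kSustainedSpeed;
    // options.accelerator_name = "...";  // optionally pin a specific accelerator
    // static so the delegate outlives the interpreter in this sketch
    static tflite::StatefulNnApiDelegate delegate(options);
    return interpreter->ModifyGraphWithDelegate(&delegate) == kTfLiteOk;
}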

If the NNAPI delegate does not support some of the operations or parameter combinations in a model, the framework runs only the supported parts of the graph on the accelerator. The rest of the graph runs on the CPU, resulting in a split execution. Because CPU/accelerator synchronization is expensive, this can lead to lower performance than running the whole network on the CPU alone.

NNAPI performs best when models use only supported operations. The following models are known to be NNAPI-compatible:

So we download an officially supported MobileNet v1 (224x224) image classification model (float model download) (quantized model download), an image classification model designed for mobile- and embedded-device vision applications.

wget http://download.tensorflow.org/models/mobilenet_v1_2018_08_02/mobilenet_v1_1.0_224_quant.tgz
Note that tflite_perf.cpp also has to be modified as follows for the quantized model to run correctly:
diff --git a/tflite_perf.cpp b/tflite_perf.cpp
index 8b0bb8a..36698e3 100644
--- a/tflite_perf.cpp
+++ b/tflite_perf.cpp
@@ -77,8 +77,13 @@ void benchmark(
     std::unique_ptr<Interpreter> &interpreter)
 {
     // Measure latency
+#if 0
     float *input_tensor = interpreter->typed_input_tensor<float>(0);
     load_image("daisy.jpg", input_tensor, args.model, args.input_size, args.batch_size);
+#else
+    uint8_t *input_tensor = interpreter->typed_input_tensor<uint8_t>(0);
+    for (int i = 0; i < 3*224*224; i++) input_tensor[i] = 1;
+#endif
 
     struct timespec start, end;
     clock_gettime(CLOCK_REALTIME, &end);
@@ -92,8 +97,13 @@ void benchmark(
     }
 #endif
 
+#if 0
     float *output_tensor = interpreter->typed_output_tensor<float>(0);
     print_topk(output_tensor, 3);
+#else
+    uint8_t *output_tensor = interpreter->typed_output_tensor<uint8_t>(0);
+    std::cout << "yes:" << output_tensor[0] << std::endl;
+#endif
 #if defined(TEST)
     return;
 #endif

pip install tflite-runtime to inspect detailed information about the model's inputs and outputs:

import tflite_runtime.interpreter as tflite
interpreter = tflite.Interpreter(model_path='v3-large_224_1.0_uint8/v3-large_224_1.0_uint8.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
input_details['dtype']
input_details["quantization"]
output_details = interpreter.get_output_details()[0]
output_details['dtype']
  • On the Xiaomi 8's Snapdragon 845 and on the Samsung Exynos 990, NNAPI does show excellent performance! But on the khadas edge2, NNAPI runs everything on the CPU!
  • Also, with NNAPI on the Xiaomi 8, the resnet50 model produces severely wrong inference results ❌!
  • In addition, the resnet50 CPU results also show a severe deviation.

tensorflow lite with armnn

build commands
conda install bazel=6.3.0
export BASEDIR=$PWD
git clone https://github.com/tensorflow/tensorflow.git --depth=1
cd tensorflow/
vim BUILD
cc_binary(
     name = "libtensorflow_lite_all.so",
     linkshared = 1,
     deps = [
         "//tensorflow/lite:framework",
         "//tensorflow/lite/kernels:builtin_ops",
     ],
)
./configure
# choose clang, and use -O3 option
bazel build --config=opt --config=monolithic --strip=always libtensorflow_lite_all.so
cd $BASEDIR
git clone https://github.com/google/flatbuffers --depth=1
cd flatbuffers
mkdir build && cd build
cmake .. -D CMAKE_INSTALL_PREFIX=../install
make install -j32
cd $BASEDIR
git clone https://review.mlplatform.org/ml/ComputeLibrary --depth=1
cd ComputeLibrary/
# git checkout <tag_name> # e.g. v20.11
# The machine used for this guide only has a Neon CPU, which is why I only pass "neon=1"; but if
# your machine has an Arm GPU you can enable that by adding `opencl=1 embed_kernels=1` to the command below
diff --git a/SConstruct b/SConstruct
index 68c518a..05dfe9f 100644
--- a/SConstruct
+++ b/SConstruct
@@ -381,7 +381,7 @@ if 'x86' not in env['arch']:
             auto_toolchain_prefix = "armv7l-tizen-linux-gnueabi-"
     elif env['estate'] == '64' and 'v8' in env['arch']:
         if env['os'] == 'linux':
-            auto_toolchain_prefix = "aarch64-linux-gnu-"
+            auto_toolchain_prefix = ""
         elif env['os'] == 'bare_metal':
             auto_toolchain_prefix = "aarch64-elf-"
         elif env['os'] == 'android':
scons arch=arm64-v8a neon=1 extra_cxx_flags="-fPIC" benchmark_tests=0 validation_tests=0 -j 32
cd $BASEDIR
git clone "https://review.mlplatform.org/ml/armnn" --depth=1
cd armnn
# git checkout <branch_name> # e.g. branches/armnn_20_11
diff --git a/src/armnn/ExecutionFrame.cpp b/src/armnn/ExecutionFrame.cpp
index 92a7990..118fa7e 100644
--- a/src/armnn/ExecutionFrame.cpp
+++ b/src/armnn/ExecutionFrame.cpp
@@ -39,7 +39,7 @@ void ExecutionFrame::RegisterDebugCallback(const DebugCallbackFunction& func)
 
 void ExecutionFrame::AddWorkloadToQueue(std::unique_ptr<IWorkload> workload)
 {
-    m_WorkloadQueue.push_back(move(workload));
+    m_WorkloadQueue.push_back(std::move(workload));
 }
 
 void ExecutionFrame::SetNextExecutionFrame(IExecutionFrame* nextExecutionFrame)
mkdir build && cd build
# if you've got an arm Gpu add `-DARMCOMPUTECL=1` to the command below
cmake .. -DARMCOMPUTE_ROOT=$BASEDIR/ComputeLibrary \
         -DARMCOMPUTENEON=1 \
         -DBUILD_UNIT_TESTS=0 \
         -DBUILD_ARMNN_TFLITE_DELEGATE=1 \
         -DTENSORFLOW_ROOT=$BASEDIR/tensorflow \
         -DTFLITE_LIB_ROOT=$BASEDIR/tensorflow/bazel-bin \
         -DFLATBUFFERS_ROOT=$BASEDIR/flatbuffers/install \
         -D CMAKE_CXX_FLAGS="-Wno-error=missing-field-initializers -Wno-error=deprecated-declarations"
make -j32
cd $libtensorflow # the directory where the tflite include/ and lib/ were collected earlier
mkdir include/armnn
cp -r $BASEDIR/armnn/include  include/armnn
cp -r $BASEDIR/armnn/delegate include/armnn
cp ~/work/tmp/armnn/build/libarmnn.so.33.0 lib
cp ~/work/tmp/armnn/build/delegate/libarmnnDelegate.so.29.0 lib
cd lib; ln -s libarmnn.so.33.0 libarmnn.so; ln -s libarmnn.so.33.0 libarmnn.so.33; ln -s libarmnnDelegate.so.29.0 libarmnnDelegate.so; ln -s libarmnnDelegate.so.29.0 libarmnnDelegate.so.29
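With the headers and libraries in place, linking against libarmnnDelegate at build time and attaching it could look roughly like this sketch, adapted from Arm NN's delegate quick-start (check the armnn_delegate.hpp of your ArmNN release for exact signatures):

#include <vector>
#include <armnn_delegate.hpp>
#include "tensorflow/lite/interpreter.h"

// Prefer the CpuAcc (Neon) backend; ArmNN falls back per operator when unsupported.
bool ApplyArmnnDelegate(tflite::Interpreter *interpreter) {
    std::vector<armnn::BackendId> backends = {armnn::Compute::CpuAcc};
    armnnDelegate::DelegateOptions options(backends);
    TfLiteDelegate *delegate = armnnDelegate::TfLiteArmnnDelegateCreate(options);
    // The delegate must stay alive while the interpreter uses it; release it with
    // armnnDelegate::TfLiteArmnnDelegateDelete after the interpreter is destroyed.
    return interpreter->ModifyGraphWithDelegate(delegate) == kTfLiteOk;
}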

demo

mkdir $BASEDIR/benchmarking
cd $BASEDIR/benchmarking
# Get the benchmarking binary.
wget https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/linux_aarch64_benchmark_model -O benchmark_model
# Make it executable.
chmod +x benchmark_model
# and a sample model from model zoo.
wget https://github.com/ARM-software/ML-zoo/blob/master/models/image_classification/mobilenet_v2_1.0_224/tflite_uint8/mobilenet_v2_1.0_224_quantized_1_default_1.tflite?raw=true -O mobilenet_v2_1.0_224_quantized_1_default_1.tflite
cd $BASEDIR/benchmarking
LD_LIBRARY_PATH=../armnn/build ./benchmark_model --graph=mobilenet_v2_1.0_224_quantized_1_default_1.tflite --external_delegate_path="../armnn/build/delegate/libarmnnDelegate.so" --external_delegate_options="backends:CpuAcc;logging-severity:info"
This `external_delegate_path` mechanism is quite interesting: no link-time dependency is needed, the delegate is loaded at runtime. Impressive.
It has to be said that TensorFlow Lite supports quite a lot of hardware backends and inference backends.
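The same runtime-loading trick is available to any C++ program through TFLite's external delegate API, roughly as in this sketch (paths are placeholders, and the "backends" option key simply mirrors the armnn delegate options used above):

#include "tensorflow/lite/delegates/external/external_delegate.h"
#include "tensorflow/lite/interpreter.h"

// dlopen an out-of-tree delegate at runtime; no link-time dependency on it.
bool ApplyExternalDelegate(tflite::Interpreter *interpreter, const char *so_path) {
    TfLiteExternalDelegateOptions options = TfLiteExternalDelegateOptionsDefault(so_path);
    TfLiteExternalDelegateOptionsInsert(&options, "backends", "CpuAcc");
    TfLiteDelegate *delegate = TfLiteExternalDelegateCreate(&options);
    if (delegate == nullptr) return false;
    bool ok = interpreter->ModifyGraphWithDelegate(delegate) == kTfLiteOk;
    // Call TfLiteExternalDelegateDelete(delegate) after the interpreter is destroyed.
    return ok;
}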

how to build tensorflow lite benchmark_model

You can specify more optional parameters for running the benchmark.

build commands
bazel build -c opt tensorflow/lite/tools/benchmark:benchmark_model
bazel build -c opt --config=monolithic tensorflow/lite/tools/benchmark:benchmark_model_plus_flex --jobs 8
execution demo
LD_LIBRARY_PATH=../armnn/build ./benchmark_model_plus_flex --graph=tflite/resnet50.tflite
LD_LIBRARY_PATH=../armnn/build ./benchmark_model_plus_flex --graph=tflite/edgenext_small.tflite --use_xnnpack=false
LD_LIBRARY_PATH=../armnn/build ./benchmark_model_plus_flex --graph=tflite/resnet50.tflite --external_delegate_path="../armnn/build/delegate/libarmnnDelegate.so" --external_delegate_options="backends:CpuAcc"

python armnn demo

Image Classification with the Arm NN Tensorflow Lite Delegate

For the steps, refer to the link above 👆.

For convenience, the prebuilt libraries are used here: https://github.com/ARM-software/armnn/releases/tag/v23.08

Heavily modify `samples/ImageClassification/run_classifier.py`:
diff --git a/samples/ImageClassification/run_classifier.py b/samples/ImageClassification/run_classifier.py
index 4ce8b8b84..b7a5cb635 100644
--- a/samples/ImageClassification/run_classifier.py
+++ b/samples/ImageClassification/run_classifier.py
@@ -24,19 +24,6 @@ def check_args(args: argparse.Namespace):
       - FileNotFoundError: if passed files do not exist.
       - IOError: if files are of incorrect format.
     """
-    input_image_p = args.input_image
-    if not input_image_p.suffix in (".png", ".jpg", ".jpeg"):
-        raise IOError(
-            "--input_image option should point to an image file of the "
-            "format .jpg, .jpeg, .png"
-        )
-    if not input_image_p.exists():
-        raise FileNotFoundError("Cannot find ", input_image_p.name)
-    model_p = args.model_file
-    if not model_p.suffix == ".tflite":
-        raise IOError("--model_file should point to a tflite file.")
-    if not model_p.exists():
-        raise FileNotFoundError("Cannot find ", model_p.name)
     label_mapping_p = args.label_file
     if not label_mapping_p.suffix == ".txt":
         raise IOError("--label_file expects a .txt file.")
@@ -51,28 +38,64 @@ def check_args(args: argparse.Namespace):
 
     return None
 
-
-def load_image(image_path: Path, model_input_dims: Union[tuple, list], grayscale: bool):
-    """load an image and put into correct format for the tensorflow lite model
-
-    args:
-      - image_path: pathlib.Path
-      - model_input_dims: tuple (or array-like). (height,width)
-
-    returns:
-      - image: np.array
-    """
-    height, width = model_input_dims
-    # load and resize image
-    image = Image.open(image_path).resize((width, height))
-    # convert to greyscale if expected
-    if grayscale:
-        image = image.convert("LA")
-
-    image = np.expand_dims(image, axis=0)
-
-    return image
-
+import torch
+from torchvision import transforms
+from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
+
+def pil_loader_RGB(path: str) -> Image.Image:
+    with open(path, "rb") as f:
+        img = Image.open(f)
+        return img.convert("RGB")
+
+# For mobilevit, images are expected to be in BGR pixel order, not RGB.
+# https://www.adamsmith.haus/python/answers/how-to-rotate-image-colors-from-rgb-to-bgr-in-python
+def pil_loader_BGR(path: str) -> Image.Image:
+    with open(path, "rb") as f:
+        img = Image.open(f)
+        R, G, B = img.convert("RGB").split()
+        return Image.merge("RGB", (B, G, R))
+
+def get_transform(args):
+    is_resnet50  = "resnet50" in args.model
+    is_edgenext  = "edgenext" in args.model
+    is_mobilevit = "mobilevit" in args.model
+    is_efficientnetv2_b3 = "efficientnetv2_b3" in args.model
+
+    t = []
+
+    if is_resnet50 or args.usi_eval: # for EdgeNeXt
+        crop_pct = 0.95
+        size = int(args.input_size / crop_pct)
+    elif is_edgenext:
+        crop_pct = 224 / 256
+        size = int(args.input_size / crop_pct)
+    else:
+        size = args.input_size + 32
+
+    t.append(
+        # to maintain same ratio w.r.t. 224 images
+        transforms.Resize(size, interpolation=transforms.InterpolationMode.BICUBIC),
+    )
+    t.append(transforms.CenterCrop(args.input_size))
+    t.append(transforms.ToTensor())
+    if is_mobilevit:
+        pass
+    elif is_efficientnetv2_b3:
+        t.append(transforms.Normalize([0.5,0.5,0.5], [0.5,0.5,0.5]))
+    else:
+        t.append(transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD))
+
+    return transforms.Compose(t)
+
+def load_image(args):
+    data_transform = get_transform(args)
+    image = pil_loader_BGR('daisy.jpg') if "mobilevit_" in args.model \
+        else pil_loader_RGB('daisy.jpg')
+
+    # [N, C, H, W]
+    image = data_transform(image)
+    # expand batch dimension
+    return torch.unsqueeze(image, dim=0)

 def load_delegate(delegate_path: Path, backends: list):
     """load the armnn delegate.
@@ -95,7 +118,7 @@ def load_delegate(delegate_path: Path, backends: list):
     return armnn_delegate


-def load_tf_model(model_path: Path, armnn_delegate: tflite.Delegate):
+def load_tf_model(model_path, armnn_delegate: tflite.Delegate):
     """load a tflite model for use with the armnn delegate.

     args:
@@ -106,7 +129,7 @@ def load_tf_model(model_path: Path, armnn_delegate: tflite.Delegate):
      - interpreter: tflite.Interpreter
     """
     interpreter = tflite.Interpreter(
-        model_path=model_path.as_posix(), experimental_delegates=[armnn_delegate]
+        model_path=model_path, experimental_delegates=[armnn_delegate]
     )
     interpreter.allocate_tensors()

@@ -186,13 +209,15 @@ def main(args):
     # load in the armnn delegate
     armnn_delegate = load_delegate(args.delegate_path, args.preferred_backends)
     # load tflite model
-    interpreter = load_tf_model(args.model_file, armnn_delegate)
+    interpreter = load_tf_model('tflite/'+args.model+'.tflite', armnn_delegate)
     # get input shape for image resizing
     input_shape = interpreter.get_input_details()[0]["shape"]
     height, width = input_shape[1], input_shape[2]
     input_shape = (height, width)
     # load input image
-    input_image = load_image(args.input_image, input_shape, False)
+    args.input_size = 224
+    args.usi_eval = False
+    input_image = load_image(args)
     # get label mapping
     labelmapping = create_mapping(args.label_file)
     output_tensor = run_inference(interpreter, input_image)
@@ -208,12 +233,9 @@ if __name__ == "__main__":
         formatter_class=argparse.ArgumentDefaultsHelpFormatter
     )
     parser.add_argument(
-        "--input_image", help="File path of image file", type=Path, required=True
-    )
-    parser.add_argument(
-        "--model_file",
+        "--model",
         help="File path of the model tflite file",
-        type=Path,
+        type=str,
         required=True,
     )
     parser.add_argument(
run it!
python3 run_classifier.py \
--model resnet50 --label_file labelmappings.txt \
--delegate_path $PWD/libarmnnDelegate.so \
--preferred_backends CpuAcc CpuRef
# --delegate_path should point to your own local libarmnnDelegate.so library

precision

reference

# dynamic range quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]

At inference time, weights are converted from 8-bit precision to floating point and computed with floating-point kernels. This conversion is done once and cached to reduce latency. To further improve latency, "dynamic-range" operators dynamically quantize activations to 8 bits based on their range and perform the computation with 8-bit weights and activations. This optimization provides latencies close to fully fixed-point inference. However, the outputs are still stored in floating point, so the speedup of dynamic-range ops is smaller than that of full fixed-point computation.

# full integer quantization (with float fallback)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

# float16 quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

By default, a float16-quantized model "dequantizes" the weight values to float32 when run on the CPU. (Note that the GPU delegate does not perform this dequantization, since it can operate on float16 data.) You can also evaluate an fp16-quantized model on the GPU. To perform all arithmetic with reduced-precision values, make sure to create the TfLiteGPUDelegateOptions struct in your app and set precision_loss_allowed to 1:
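TfLiteGPUDelegateOptions is the older C API; with the V2 API used in the GPU sketch above, the equivalent knob (to the best of my knowledge) is the is_precision_loss_allowed field:

#include "tensorflow/lite/delegates/gpu/delegate.h"

// Allow the GPU delegate to do its arithmetic in reduced (fp16) precision.
TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
options.is_precision_loss_allowed = 1;
TfLiteDelegate *delegate = TfLiteGpuDelegateV2Create(&options);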

# integer-only: 16-bit activations with 8-bit weights
converter.representative_dataset = representative_dataset
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8, tf.lite.OpsSet.TFLITE_BUILTINS]

result

CPU FP32 by tinynn conversion

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 75.4 | 75.3 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 77.0 | 74.3 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 81.6 | 80.9 | 12.6M | 1.25G |
| SwiftFormer_XS | - | 75.6 | 74.1 | 3.5M | 0.4G |
| SwiftFormer_S | - | 77.4 | 77.4 | 6.1M | 1.0G |
| SwiftFormer_L1 | - | 80.0 | 81.2 | 12.1M | 1.6G |
| EMO_1M | - | 70.4 | 68.7 | 1.3M | 0.26G |
| EMO_2M | - | 74.5 | 73.9 | 2.3M | 0.44G |
| EMO_5M | - | 78.2 | 77.1 | 5.1M | 0.90G |
| EMO_6M | - | 77.9 | 78.5 | 6.1M | 0.96G |
| edgenext_xx_small | - | 70.5 | 71.2 | 1.3M | 0.26G |
| edgenext_x_small | - | 74.4 | 74.5 | 2.3M | 0.54G |
| edgenext_small/usi | - | 78.0 | 79.5 | 5.6M | 1.26G |
| mobilevitv2_050 | - | 69.9 | 66.6 | 1.4M | 0.5G |
| mobilevitv2_075 | - | 75.1 | 74.3 | 2.9M | 1.0G |
| mobilevitv2_100 | - | 77.9 | 76.9 | 4.9M | 1.8G |
| mobilevitv2_125 | - | 79.2 | 80.7 | 7.5M | 2.8G |
| mobilevitv2_150 | - | 80.9 | 81.8 | 10.6M | 4.0G |
| mobilevitv2_175 | - | 80.7 | 81.0 | 14.3M | 5.5G |
| mobilevitv2_200 | - | 82.0 | 83.1 | 18.4M | 7.2G |
| mobilevit_xx_small | - | 68.9 | 66.5 | 1.3M | 0.36G |
| mobilevit_x_small | - | 74.0 | 73.7 | 2.3M | 0.89G |
| mobilevit_small | - | 77.6 | 77.9 | 5.6M | 2.0 G |
| LeViT_128S | - | 75.9 | 76.1 | 7.8M | 0.30G |
| LeViT_128 | - | 79.4 | 78.1 | 9.2M | 0.41G |
| LeViT_192 | - | 79.6 | 79.6 | 11 M | 0.66G |
| LeViT_256 | - | 81.1 | 81.4 | 19 M | 1.12G |
| resnet50 | - | 79.6 | 81.3 | 25.6M | 4.1G |
| mobilenetv3_large_100 | - | 75.6 | 75.3 | 5.5M | 0.29G |

  • tf_efficientnetv2 convert error
ERROR (tinynn.converter.base) Unsupported ops: aten::ceil

CPU dynamic INT8 by tinynn conversion

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 41.7 | 38.8 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 48.4 | 40.0 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 57.4 | 56.0 | 12.6M | 1.25G |
| SwiftFormer_XS | - | 72.4 | 72.2 | 3.5M | 0.4G |
| SwiftFormer_S | - | 56.4 | 49.9 | 6.1M | 1.0G |
| SwiftFormer_L1 | - | 72.9 | 72.4 | 12.1M | 1.6G |
| EMO_1M | - | 69.0 | 66.1 | 1.3M | 0.26G |
| EMO_2M | - | 74.0 | 73.4 | 2.3M | 0.44G |
| EMO_5M | - | 76.6 | 75.0 | 5.1M | 0.90G |
| EMO_6M | - | 77.2 | 78.0 | 6.1M | 0.96G |
| edgenext_xx_small | - | 69.9 | 70.3 | 1.3M | 0.26G |
| edgenext_x_small | - | 74.2 | 75.0 | 2.3M | 0.54G |
| edgenext_small/usi | - | 79.9 | 79.7 | 5.6M | 1.26G |
| mobilevitv2_050 | - | 35.6 | 28.8 | 1.4M | 0.5G |
| mobilevitv2_075 | - | 49.3 | 41.6 | 2.9M | 1.0G |
| mobilevitv2_100 | - | 34.1 | 27.4 | 4.9M | 1.8G |
| mobilevitv2_125 | - | 0 | 0 | 7.5M | 2.8G |
| mobilevitv2_150 | - | 49.2 | 40.4 | 10.6M | 4.0G |
| mobilevitv2_175 | - | 54.3 | 46.2 | 14.3M | 5.5G |
| mobilevitv2_200 | - | 67.8 | 62.8 | 18.4M | 7.2G |
| mobilevit_xx_small | - | 1.2 | 3.5 | 1.3M | 0.36G |
| mobilevit_x_small | - | 35.1 | 30.3 | 2.3M | 0.89G |
| mobilevit_small | - | 54.0 | 47.9 | 5.6M | 2.0 G |
| LeViT_128S | - | 72.7 | 72.7 | 7.8M | 0.30G |
| LeViT_128 | - | 77.0 | 76.7 | 9.2M | 0.41G |
| LeViT_192 | - | 78.5 | 78.0 | 11 M | 0.66G |
| LeViT_256 | - | 78.7 | 78.3 | 19 M | 1.12G |
| resnet50 | - | 79.7 | 80.3 | 25.6M | 4.1G |
| mobilenetv3_large_100 | - | 74.4 | 73.2 | 5.5M | 0.29G |
  • tf_efficientnetv2_b0 convert error
ERROR (tinynn.converter.base) Unsupported ops: aten::ceil
  • mobilevitv2_125 runtime error
ERROR: Node number 147 (CONV_2D) failed to invoke.
ERROR: Invalid tensor index 65535 (not in [0, 4))
~~CPU ptq INT8 by tinynn conversion~~

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 0 | 0 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 0 | 0 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 0 | 0 | 12.6M | 1.25G |
| EMO_1M | - | 26.2 | 22.5 | 1.3M | 0.26G |
| EMO_2M | - | 48.3 | 40.3 | 2.3M | 0.44G |
| EMO_5M | - | 34.4 | 32.4 | 5.1M | 0.90G |
| EMO_6M | - | 17.0 | 16.3 | 6.1M | 0.96G |
| resnet50 | - | 60.0 | 60.7 | 25.6M | 4.1G |
  • SwiftFormer_XS convert error
RuntimeError: Given groups=56, weight of size [56, 56, 3, 3], expected input[4, 56, 28, 28] to have 3136 channels, but got 56 channels instead
  • edgenext_xx_small convert error
NotImplementedError: Could not run 'quantized::add_scalar' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::add_scalar' is only available for these backends: [QuantizedCPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
  • mobilevitv2_050 convert error
NotImplementedError: Could not run 'aten::_slow_conv2d_forward' with arguments from the 'QuantizedCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_slow_conv2d_forward' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
  • mobilevit_xx_small convert error
RuntimeError: shape '[3584, 2, 14, 2]' is invalid for input of size 262144
  • LeViT_128S convert error
RuntimeError: Output 0 of PermuteBackward0 is a view and its base or another view of its base has been modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
  • mobilenetv3_large_100 convert error
RuntimeError: Error(s) in loading state_dict for QMobileNetV3:
        Unexpected key(s) in state_dict: "blocks_0_0_bn1.running_mean", "blocks_0_0_bn1.running_var", "blocks_0_0_bn1.num_batches_tracked". 
        size mismatch for blocks_0_0_bn1.weight: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8, 8, 1, 1]).
        size mismatch for blocks_0_0_bn1.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
CPU fp32/fp16/bf16 by onnx-tf conversion

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 76.3 | 76.0 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 78.8 | 79.6 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 82.1 | 82.0 | 12.6M | 1.25G |
| SwiftFormer_XS | - | 76.2 | 75.2 | 3.5M | 0.4G |
| SwiftFormer_S | - | 78.4 | 78.2 | 6.1M | 1.0G |
| SwiftFormer_L1 | - | 80.6 | 81.8 | 12.1M | 1.6G |
| edgenext_xx_small | - | 70.8 | 70.7 | 1.3M | 0.26G |
| edgenext_x_small | - | 74.8 | 74.8 | 2.3M | 0.54G |
| edgenext_small/usi | - | 80.6 | 79.9 | 5.6M | 1.26G |
| mobilevitv2_050 | - | 69.9 | 66.6 | 1.4M | 0.5G |
| mobilevitv2_075 | - | 75.1 | 74.3 | 2.9M | 1.0G |
| mobilevitv2_100 | - | 77.9 | 76.9 | 4.9M | 1.8G |
| mobilevitv2_125 | - | 79.2 | 80.7 | 7.5M | 2.8G |
| mobilevitv2_150 | - | 80.9 | 81.8 | 10.6M | 4.0G |
| mobilevitv2_175 | - | 80.7 | 81.0 | 14.3M | 5.5G |
| mobilevitv2_200 | - | 82.0 | 83.1 | 18.4M | 7.2G |
| mobilevit_xx_small | - | 68.9 | 66.5 | 1.3M | 0.36G |
| mobilevit_x_small | - | 74.0 | 73.7 | 2.3M | 0.89G |
| mobilevit_small | - | 77.6 | 77.9 | 5.6M | 2.0 G |
| LeViT_128S | - | 79.4 | 76.1 | 7.8M | 0.30G |
| LeViT_128 | - | 79.6 | 78.1 | 9.2M | 0.41G |
| LeViT_192 | - | 78.5 | 79.6 | 11 M | 0.66G |
| LeViT_256 | - | 81.1 | 81.4 | 19 M | 1.12G |
| resnet50 | - | 79.6 | 81.3 | 25.6M | 4.1G |
| mobilenetv3_large_100 | - | 75.6 | 75.3 | 5.5M | 0.29G |
| tf_efficientnetv2_b0 | - | 78.2 | 76.7 | 7.1M | 0.72G |
| tf_efficientnetv2_b1 | - | 79.4 | 79.2 | 8.1M | 1.2G |
| tf_efficientnetv2_b2 | - | 81.7 | 80.4 | 10.1M | 1.7G |
| tf_efficientnetv2_b3 | - | 81.8 | 82.3 | 14.4M | 3.0G |
CPU dynamic int8 by onnx-tf conversion

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 45.8 | 41.0 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 54.6 | 50.7 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 59.1 | 56.9 | 12.6M | 1.25G |
| SwiftFormer_XS | - | 75.6 | 75.1 | 3.5M | 0.4G |
| SwiftFormer_S | - | 77.6 | 78.5 | 6.1M | 1.0G |
| SwiftFormer_L1 | - | 80.2 | 81.7 | 12.1M | 1.6G |
| edgenext_xx_small | - | 69.4 | 69.2 | 1.3M | 0.26G |
| edgenext_x_small | - | 74.0 | 74.6 | 2.3M | 0.54G |
| edgenext_small/usi | - | 80.0 | 80.6 | 5.6M | 1.26G |
| mobilevitv2_050 | - | 65.8 | 59.4 | 1.4M | 0.5G |
| mobilevitv2_075 | - | 69.9 | 69.1 | 2.9M | 1.0G |
| mobilevitv2_100 | - | 67.4 | 66.0 | 4.9M | 1.8G |
| mobilevitv2_125 | - | 78.3 | 78.2 | 7.5M | 2.8G |
| mobilevitv2_150 | - | 80.6 | 78.8 | 10.6M | 4.0G |
| mobilevitv2_175 | - | 80.4 | 79.6 | 14.3M | 5.5G |
| mobilevitv2_200 | - | 67.8 | 63.9 | 18.4M | 7.2G |
| mobilevit_xx_small | - | 68.0 | 66.9 | 1.3M | 0.36G |
| mobilevit_x_small | - | 48.3 | 43.8 | 2.3M | 0.89G |
| mobilevit_small | - | 73.7 | 74.7 | 5.6M | 2.0 G |
| LeViT_128S | - | 72.8 | 72.8 | 7.8M | 0.30G |
| LeViT_128 | - | 77.1 | 76.3 | 9.2M | 0.41G |
| LeViT_192 | - | 78.5 | 77.9 | 11 M | 0.66G |
| LeViT_256 | - | 78.6 | 78.6 | 19 M | 1.12G |
| resnet50 | - | 79.7 | 80.3 | 25.6M | 4.1G |
| mobilenetv3_large_100 | - | 75.1 | 73.9 | 5.5M | 0.29G |
| tf_efficientnetv2_b0 | - | 78.2 | 77.0 | 7.1M | 0.72G |
| tf_efficientnetv2_b1 | - | 79.2 | 79.2 | 8.1M | 1.2G |
| tf_efficientnetv2_b2 | - | 81.6 | 80.3 | 10.1M | 1.7G |
| tf_efficientnetv2_b3 | - | 81.3 | 82.6 | 14.4M | 3.0G |
CPU ptq static int8 by onnx-tf conversion

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 6.8 | 5.2 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 15.8 | 12.5 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 24.2 | 19.9 | 12.6M | 1.25G |
| SwiftFormer_XS | - | 70.1 | 68.9 | 3.5M | 0.4G |
| SwiftFormer_S | - | 49.5 | 43.8 | 6.1M | 1.0G |
| SwiftFormer_L1 | - | 61.4 | 60.7 | 12.1M | 1.6G |
| EMO_1M | - | 24.0 | 13.1 | 1.3M | 0.26G |
| EMO_2M | - | 67.2 | 62.3 | 2.3M | 0.44G |
| EMO_5M | - | 12.3 | 11.5 | 5.1M | 0.90G |
| EMO_6M | - | 28.1 | 17.5 | 6.1M | 0.96G |
| edgenext_xx_small | - | 60.6 | 57.6 | 1.3M | 0.26G |
| edgenext_x_small | - | 60.3 | 56.9 | 2.3M | 0.54G |
| edgenext_small/usi | - | 48.9 | 39.1 | 5.6M | 1.26G |
| mobilevitv2_050 | - | 2.4 | 2.6 | 1.4M | 0.5G |
| mobilevitv2_075 | - | 1.6 | 0.7 | 2.9M | 1.0G |
| mobilevitv2_100 | - | 3.2 | 1.1 | 4.9M | 1.8G |
| mobilevitv2_125 | - | 23.3 | 22.3 | 7.5M | 2.8G |
| mobilevitv2_150 | - | 15.7 | 11.5 | 10.6M | 4.0G |
| mobilevitv2_175 | - | 12.8 | 12.0 | 14.3M | 5.5G |
| mobilevitv2_200 | - | 28.8 | 25.0 | 18.4M | 7.2G |
| mobilevit_xx_small | - | 0.1 | 2.0 | 1.3M | 0.36G |
| mobilevit_x_small | - | 6.6 | 4.3 | 2.3M | 0.89G |
| mobilevit_small | - | 18.4 | 14.6 | 5.6M | 2.0 G |
| LeViT_128S | - | 58.4 | 54.0 | 7.8M | 0.30G |
| LeViT_128 | - | 75.5 | 75.0 | 9.2M | 0.41G |
| LeViT_192 | - | 77.2 | 77.2 | 11 M | 0.66G |
| LeViT_256 | - | 78.3 | 78.0 | 19 M | 1.12G |
| resnet50 | - | 0 | 0 | 25.6M | 4.1G |
| mobilenetv3_large_100 | - | 69.4 | 65.6 | 5.5M | 0.29G |
| tf_efficientnetv2_b0 | - | 77.1 | 76.0 | 7.1M | 0.72G |
| tf_efficientnetv2_b1 | - | 78.0 | 77.3 | 8.1M | 1.2G |
| tf_efficientnetv2_b2 | - | 78.4 | 78.0 | 10.1M | 1.7G |
| tf_efficientnetv2_b3 | - | 79.9 | 79.3 | 14.4M | 3.0G |