nvidia tensorrt

  • The latest commit requires torch 2.3~2.4, and it has to be a CUDA-enabled torch at that; NVIDIA officially only ships builds up to 2.2, so where exactly am I supposed to find one?
  • torch-tensorrt 2.2 installs with pip --no-deps, but importing it fails with unresolved symbols, and Google turns up no fix. Besides, the 2.x releases explicitly state they support CUDA 12.x, not the CUDA 11.4 that is all I have.

So the only option left is version 1.4.

Installation — Torch-TensorRT documentation (pytorch.org)

Both the GitHub repo and the PyTorch docs page list build commands; pick whichever looks reliable. The main task is editing the contents of WORKSPACE.

TensorRT/docsrc/user_guide/saving_models.rst at main · pytorch/TensorRT (github.com)

In Torch-TensorRT 1.X versions, the primary way to compile and run inference with Torch-TensorRT is using Torchscript IR. For ir=ts, this behavior stays the same in 2.X versions as well.

Torch-TensorRT 2.X

Dynamo IR

The output of ir=dynamo compilation in Torch-TensorRT is a torch.export.ExportedProgram object by default. In addition, there is a new output_format parameter in the CompilationSettings object provided before compilation. output_format can take the following options (a usage sketch follows the list):

  • exported_program (or) ep : This is the default. Returns an ExportedProgram
  • torchscript (or) ts : This returns a TorchScript module
  • graph_module (or) fx : This returns a torch.fx.GraphModule which can be traced into Torchscript to save to disk.
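
Since this wiki ends up on 1.4, the Torchscript path is the relevant one. Below is a minimal sketch of compiling and saving through ir="ts"; the resnet50 model choice, the 1×3×224×224 input shape, and the fp16 precision are assumptions for illustration, not something prescribed by the docs:

```python
import torch
import torch_tensorrt
import torchvision

# Assumption: any eval-mode CUDA model works; resnet50 is just an example.
model = torchvision.models.resnet50(pretrained=True).eval().cuda()

# ir="ts" compiles through the Torchscript IR -- the only path on 1.x.
trt_mod = torch_tensorrt.compile(
    model,
    ir="ts",
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},  # allow fp16 TensorRT kernels
)

# The result is an ordinary TorchScript module, so plain jit save/load works.
torch.jit.save(trt_mod, "resnet50_trt.ts")
loaded = torch.jit.load("resnet50_trt.ts").cuda()
print(loaded(torch.randn(1, 3, 224, 224).cuda()).shape)  # torch.Size([1, 1000])
```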

onnxruntime

```bash
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++
./build.sh --config Release --build_shared_lib --parallel --skip_tests \
  --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda \
  --use_tensorrt --tensorrt_home /usr/local/cuda \
  --cmake_extra_defines onnxruntime_DISABLE_FLOAT8_TYPES=ON
```
The build fails with:

```
/media/loongson/phd19/home/zhou/graduate9/work/update/onnxruntime/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu(78): error: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
```

What the heck? The patch below just drops the `__shared__` qualifier so the file compiles. Note that this turns `quant_map` into a per-thread local array, and the `threadIdx.x`-based init loop then leaves the elements below `threadIdx.x` uninitialized in every thread but thread 0, so this hack is only safe as long as the bnb4 kernel is never actually run:
```diff
diff --git a/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu b/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu
index 098e361..98290db 100644
--- a/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu
+++ b/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu
@@ -75,7 +75,7 @@ __global__ void kgemm_4bit_inference_naive(
   uint8_t local_B_4bit[num_values_8bit];
   T local_B[num_values_4bit / 4];
   T local_A[num_values_4bit / 4];
-  __shared__ T quant_map[16];
+  T quant_map[16];
   T local_absmax = T(0.0f);

   for (int i = threadIdx.x; i < 16; i++) quant_map[i] = T(datatype[i]);
```
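
With onnxruntime built and installed, the TensorRT execution provider is selected at session creation. A hedged sketch follows; the model file name is an assumption for illustration, while the option keys are standard TensorRT EP options:

```python
import numpy as np
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {
        # int8 engines as in the tables below; int8 additionally needs a
        # calibration table or a Q/DQ-quantized model
        "trt_int8_enable": True,
        "trt_engine_cache_enable": True,  # reuse built engines across runs
    }),
    "CUDAExecutionProvider",  # fallback for ops TensorRT does not take
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("resnet50.onnx", providers=providers)  # assumed file

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
logits = sess.run(None, {sess.get_inputs()[0].name: x})[0]
print(logits.argmax())
```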
DLA int8

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 28.1 | 25.7 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 0 | 0 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 73.3 | 70.3 | 12.6M | 1.25G |
| SwiftFormer_XS | - | 65.8 | 64.4 | 3.5M | 0.4G |
| SwiftFormer_S | - | 11.0 | 4.6 | 6.1M | 1.0G |
| SwiftFormer_L1 | - | 40.2 | 42.5 | 12.1M | 1.6G |
| EMO_1M | - | 51.5 | 46.1 | 1.3M | 0.26G |
| EMO_2M | - | 44.3 | 32.9 | 2.3M | 0.44G |
| EMO_5M | - | 64.1 | 56.5 | 5.1M | 0.90G |
| EMO_6M | - | 63.2 | 60.0 | 6.1M | 0.96G |
| edgenext_xx_small | - | 68.4 | 67.3 | 1.3M | 0.26G |
| edgenext_x_small | - | 75.1 | 73.7 | 2.3M | 0.54G |
| edgenext_small/usi | - | 80.3 | 80.2 | 5.6M | 1.26G |
| mobilevitv2_050* | - | 6.7 | 6.9 | 1.4M | 0.5G |
| mobilevitv2_075* | - | 56.8 | 49.0 | 2.9M | 1.0G |
| mobilevitv2_100* | - | 56.4 | 49.2 | 4.9M | 1.8G |
| mobilevitv2_125* | - | 63.2 | 60.3 | 7.5M | 2.8G |
| mobilevitv2_150* | - | 47.5 | 36.5 | 10.6M | 4.0G |
| mobilevitv2_175* | - | 68.6 | 64.2 | 14.3M | 5.5G |
| mobilevitv2_200* | - | 73.0 | 71.9 | 18.4M | 7.2G |
| mobilevit_xx_small | - | 19.2 | 20.7 | 1.3M | 0.36G |
| mobilevit_x_small | - | 56.8 | 57.7 | 2.3M | 0.89G |
| mobilevit_small | - | 64.6 | 66.7 | 5.6M | 2.0G |
| LeViT_128S | - | 76.2 | 75.9 | 7.8M | 0.30G |
| LeViT_128 | - | 78.4 | 77.4 | 9.2M | 0.41G |
| LeViT_192 | - | 79.8 | 80.0 | 11M | 0.66G |
| LeViT_256 | - | 80.6 | 81.6 | 19M | 1.12G |
| resnet50 | - | 5.8 | 5.8 | 25.6M | 4.1G |
| mobilenetv3_large_100 | - | 65.2 | 61.0 | 5.5M | 0.29G |
| tf_efficientnetv2_b0 | - | 75.8 | 75.8 | 7.1M | 0.72G |
| tf_efficientnetv2_b1 | - | 75.7 | 76.5 | 8.1M | 1.2G |
| tf_efficientnetv2_b2 | - | 76.6 | 76.3 | 10.1M | 1.7G |
| tf_efficientnetv2_b3 | - | 79.8 | 80.4 | 14.4M | 3.0G |

  • use the onnxruntime TensorRT backend to get accuracy
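
The DLA numbers above presumably come from pointing that same execution provider at a DLA core; a sketch of the relevant option changes follows (the option keys are real TensorRT EP options, the values are assumptions):

```python
# Extra provider options for a DLA int8 run (sketch).
trt_dla_options = {
    "trt_int8_enable": True,   # DLA only runs int8/fp16 precision
    "trt_dla_enable": True,    # offload supported layers to the DLA
    "trt_dla_core": 0,         # e.g. Jetson Xavier/Orin expose cores 0 and 1
}
```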

GPU int8

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 30.8 | 28.6 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 0.0 | 0.0 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 74.0 | 71.5 | 12.6M | 1.25G |
| SwiftFormer_XS | - | 66.2 | 65.2 | 3.5M | 0.4G |
| SwiftFormer_S | - | 25.6 | 20.5 | 6.1M | 1.0G |
| SwiftFormer_L1 | - | 49.2 | 46.4 | 12.1M | 1.6G |
| EMO_1M | - | 61.4 | 56.8 | 1.3M | 0.26G |
| EMO_2M | - | 63.4 | 59.3 | 2.3M | 0.44G |
| EMO_5M | - | 71.6 | 71.5 | 5.1M | 0.90G |
| EMO_6M | - | 72.7 | 70.9 | 6.1M | 0.96G |
| edgenext_xx_small | - | 71.1 | 70.5 | 1.3M | 0.26G |
| edgenext_x_small | - | 74.5 | 74.7 | 2.3M | 0.54G |
| edgenext_small/usi | - | 80.6 | 79.7 | 5.6M | 1.26G |
| mobilevitv2_050* | - | 11.5 | 9.0 | 1.4M | 0.5G |
| mobilevitv2_075* | - | 61.4 | 54.7 | 2.9M | 1.0G |
| mobilevitv2_100* | - | 58.2 | 51.1 | 4.9M | 1.8G |
| mobilevitv2_125* | - | 65.8 | 59.2 | 7.5M | 2.8G |
| mobilevitv2_150* | - | 42.5 | 33.3 | 10.6M | 4.0G |
| mobilevitv2_175* | - | 70.0 | 63.3 | 14.3M | 5.5G |
| mobilevitv2_200* | - | 74.1 | 73.1 | 18.4M | 7.2G |
| mobilevit_xx_small | - | 27.9 | 24.6 | 1.3M | 0.36G |
| mobilevit_x_small | - | 54.2 | 56.3 | 2.3M | 0.89G |
| mobilevit_small | - | 71.2 | 72.2 | 5.6M | 2.0G |
| LeViT_128S | - | 76.1 | 75.7 | 7.8M | 0.30G |
| LeViT_128 | - | 78.5 | 77.4 | 9.2M | 0.41G |
| LeViT_192 | - | 79.9 | 79.7 | 11M | 0.66G |
| LeViT_256 | - | 80.5 | 81.3 | 19M | 1.12G |
| resnet50 | - | 77.7 | 79.6 | 25.6M | 4.1G |
| mobilenetv3_large_100 | - | 67.6 | 65.1 | 5.5M | 0.29G |
| tf_efficientnetv2_b0 | - | 76.1 | 75.0 | 7.1M | 0.72G |
| tf_efficientnetv2_b1 | - | 75.8 | 77.1 | 8.1M | 1.2G |
| tf_efficientnetv2_b2 | - | 77.0 | 76.8 | 10.1M | 1.7G |
| tf_efficientnetv2_b3 | - | 79.9 | 81.3 | 14.4M | 3.0G |
  • use the onnxruntime TensorRT backend to get accuracy
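
My reading of the //20 and //50 columns is a Top-1 estimate over every 20th (resp. 50th) validation image rather than the full set; that reading is an assumption, but such an estimate would look roughly like the hypothetical helper below, reusing the `sess` from the earlier sketch:

```python
def subset_top1(sess, dataset, stride=20):
    """Estimate Top-1 on every `stride`-th sample (hypothetical helper)."""
    correct = total = 0
    for i in range(0, len(dataset), stride):
        image, label = dataset[i]  # image: float32 NCHW numpy array
        logits = sess.run(None, {sess.get_inputs()[0].name: image})[0]
        correct += int(logits.argmax() == label)
        total += 1
    return 100.0 * correct / total
```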