nvidia tensorrt

  • The latest commit requires torch 2.3~2.4, and it has to be a CUDA-enabled torch at that; NVIDIA officially only ships builds up to 2.2, so where exactly am I supposed to find one?
  • torch-tensorrt 2.2 installs with pip --no-deps, but importing it fails with unresolved symbols, and Google turns up no fix. Besides, the 2.x releases explicitly state they support CUDA 12.x, not the CUDA 11.4 that is all I have.

So the only option left is version 1.4.

Installation — Torch-TensorRT documentation (pytorch.org)

Both the GitHub repo and the PyTorch docs page list build commands; pick whichever looks reliable. The main task is editing the contents of WORKSPACE.

TensorRT/docsrc/user_guide/saving_models.rst at main · pytorch/TensorRT (github.com)

In Torch-TensorRT 1.X versions, the primary way to compile and run inference with Torch-TensorRT is using Torchscript IR. For ir=ts, this behavior stays the same in 2.X versions as well.

Torch-TensorRT 2.X

Dynamo IR

The output of ir=dynamo compilation in Torch-TensorRT is a torch.export.ExportedProgram object by default. In addition, there is a new output_format parameter in the CompilationSettings object provided before compilation. output_format can take the following options (a usage sketch follows the list):

  • exported_program (or) ep : This is the default. Returns an ExportedProgram
  • torchscript (or) ts : This returns a TorchScript module
  • graph_module (or) fx : This returns a torch.fx.GraphModule which can be traced into Torchscript to save to disk.
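
Since this wiki ends up on 1.4, the Torchscript path is the relevant one. Below is a minimal sketch of compiling and saving through ir="ts"; the resnet50 model choice, the 1×3×224×224 input shape, and the fp16 precision are assumptions for illustration, not something prescribed by the docs:

```python
import torch
import torch_tensorrt
import torchvision

# Assumption: any eval-mode CUDA model works; resnet50 is just an example.
model = torchvision.models.resnet50(pretrained=True).eval().cuda()

# ir="ts" compiles through the Torchscript IR -- the only path on 1.x.
trt_mod = torch_tensorrt.compile(
    model,
    ir="ts",
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},  # allow fp16 TensorRT kernels
)

# The result is an ordinary TorchScript module, so plain jit save/load works.
torch.jit.save(trt_mod, "resnet50_trt.ts")
loaded = torch.jit.load("resnet50_trt.ts").cuda()
print(loaded(torch.randn(1, 3, 224, 224).cuda()).shape)  # torch.Size([1, 1000])
```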

onnxruntime

```bash
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++
./build.sh --config Release --build_shared_lib --parallel --skip_tests \
  --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda \
  --use_tensorrt --tensorrt_home /usr/local/cuda \
  --cmake_extra_defines onnxruntime_DISABLE_FLOAT8_TYPES=ON
```
The build fails with:

```
/media/loongson/phd19/home/zhou/graduate9/work/update/onnxruntime/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu(78): error: dynamic initialization is not supported for a function-scope static __shared__ variable within a __device__/__global__ function
```

What the heck? The patch below just drops the `__shared__` qualifier so the file compiles. Note that this turns `quant_map` into a per-thread local array, and the `threadIdx.x`-based init loop then leaves the elements below `threadIdx.x` uninitialized in every thread but thread 0, so this hack is only safe as long as the bnb4 kernel is never actually run:
```diff
diff --git a/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu b/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu
index 098e361..98290db 100644
--- a/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu
+++ b/onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cu
@@ -75,7 +75,7 @@ __global__ void kgemm_4bit_inference_naive(
   uint8_t local_B_4bit[num_values_8bit];
   T local_B[num_values_4bit / 4];
   T local_A[num_values_4bit / 4];
-  __shared__ T quant_map[16];
+  T quant_map[16];
   T local_absmax = T(0.0f);

   for (int i = threadIdx.x; i < 16; i++) quant_map[i] = T(datatype[i]);
```
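
With onnxruntime built and installed, the TensorRT execution provider is selected at session creation. A hedged sketch follows; the model file name is an assumption for illustration, while the option keys are standard TensorRT EP options:

```python
import numpy as np
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {
        # int8 engines as in the tables below; int8 additionally needs a
        # calibration table or a Q/DQ-quantized model
        "trt_int8_enable": True,
        "trt_engine_cache_enable": True,  # reuse built engines across runs
    }),
    "CUDAExecutionProvider",  # fallback for ops TensorRT does not take
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("resnet50.onnx", providers=providers)  # assumed file

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
logits = sess.run(None, {sess.get_inputs()[0].name: x})[0]
print(logits.argmax())
```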
DLA int8

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 28.1 | 25.7 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 0 | 0 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 73.3 | 70.3 | 12.6M | 1.25G |
| SwiftFormer_XS | - | 65.8 | 64.4 | 3.5M | 0.4G |
| SwiftFormer_S | - | 11.0 | 4.6 | 6.1M | 1.0G |
| SwiftFormer_L1 | - | 40.2 | 42.5 | 12.1M | 1.6G |
| EMO_1M | - | 51.5 | 46.1 | 1.3M | 0.26G |
| EMO_2M | - | 44.3 | 32.9 | 2.3M | 0.44G |
| EMO_5M | - | 64.1 | 56.5 | 5.1M | 0.90G |
| EMO_6M | - | 63.2 | 60.0 | 6.1M | 0.96G |
| edgenext_xx_small | - | 68.4 | 67.3 | 1.3M | 0.26G |
| edgenext_x_small | - | 75.1 | 73.7 | 2.3M | 0.54G |
| edgenext_small/usi | - | 80.3 | 80.2 | 5.6M | 1.26G |
| mobilevitv2_050* | - | 6.7 | 6.9 | 1.4M | 0.5G |
| mobilevitv2_075* | - | 56.8 | 49.0 | 2.9M | 1.0G |
| mobilevitv2_100* | - | 56.4 | 49.2 | 4.9M | 1.8G |
| mobilevitv2_125* | - | 63.2 | 60.3 | 7.5M | 2.8G |
| mobilevitv2_150* | - | 47.5 | 36.5 | 10.6M | 4.0G |
| mobilevitv2_175* | - | 68.6 | 64.2 | 14.3M | 5.5G |
| mobilevitv2_200* | - | 73.0 | 71.9 | 18.4M | 7.2G |
| mobilevit_xx_small | - | 19.2 | 20.7 | 1.3M | 0.36G |
| mobilevit_x_small | - | 56.8 | 57.7 | 2.3M | 0.89G |
| mobilevit_small | - | 64.6 | 66.7 | 5.6M | 2.0G |
| LeViT_128S | - | 76.2 | 75.9 | 7.8M | 0.30G |
| LeViT_128 | - | 78.4 | 77.4 | 9.2M | 0.41G |
| LeViT_192 | - | 79.8 | 80.0 | 11M | 0.66G |
| LeViT_256 | - | 80.6 | 81.6 | 19M | 1.12G |
| resnet50 | - | 5.8 | 5.8 | 25.6M | 4.1G |
| mobilenetv3_large_100 | - | 65.2 | 61.0 | 5.5M | 0.29G |
| tf_efficientnetv2_b0 | - | 75.8 | 75.8 | 7.1M | 0.72G |
| tf_efficientnetv2_b1 | - | 75.7 | 76.5 | 8.1M | 1.2G |
| tf_efficientnetv2_b2 | - | 76.6 | 76.3 | 10.1M | 1.7G |
| tf_efficientnetv2_b3 | - | 79.8 | 80.4 | 14.4M | 3.0G |

  • use the onnxruntime TensorRT backend to get accuracy
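
The DLA numbers above presumably come from pointing that same execution provider at a DLA core; a sketch of the relevant option changes follows (the option keys are real TensorRT EP options, the values are assumptions):

```python
# Extra provider options for a DLA int8 run (sketch).
trt_dla_options = {
    "trt_int8_enable": True,   # DLA only runs int8/fp16 precision
    "trt_dla_enable": True,    # offload supported layers to the DLA
    "trt_dla_core": 0,         # e.g. Jetson Xavier/Orin expose cores 0 and 1
}
```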

GPU int8

| Model | Top-1 | Top-1 //20 est. | Top-1 //50 est. | #params | GMACs |
|---|---|---|---|---|---|
| efficientformerv2_s0 | - | 30.8 | 28.6 | 3.5M | 0.40G |
| efficientformerv2_s1 | - | 0.0 | 0.0 | 6.1M | 0.65G |
| efficientformerv2_s2 | - | 74.0 | 71.5 | 12.6M | 1.25G |
| SwiftFormer_XS | - | 66.2 | 65.2 | 3.5M | 0.4G |
| SwiftFormer_S | - | 25.6 | 20.5 | 6.1M | 1.0G |
| SwiftFormer_L1 | - | 49.2 | 46.4 | 12.1M | 1.6G |
| EMO_1M | - | 61.4 | 56.8 | 1.3M | 0.26G |
| EMO_2M | - | 63.4 | 59.3 | 2.3M | 0.44G |
| EMO_5M | - | 71.6 | 71.5 | 5.1M | 0.90G |
| EMO_6M | - | 72.7 | 70.9 | 6.1M | 0.96G |
| edgenext_xx_small | - | 71.1 | 70.5 | 1.3M | 0.26G |
| edgenext_x_small | - | 74.5 | 74.7 | 2.3M | 0.54G |
| edgenext_small/usi | - | 80.6 | 79.7 | 5.6M | 1.26G |
| mobilevitv2_050* | - | 11.5 | 9.0 | 1.4M | 0.5G |
| mobilevitv2_075* | - | 61.4 | 54.7 | 2.9M | 1.0G |
| mobilevitv2_100* | - | 58.2 | 51.1 | 4.9M | 1.8G |
| mobilevitv2_125* | - | 65.8 | 59.2 | 7.5M | 2.8G |
| mobilevitv2_150* | - | 42.5 | 33.3 | 10.6M | 4.0G |
| mobilevitv2_175* | - | 70.0 | 63.3 | 14.3M | 5.5G |
| mobilevitv2_200* | - | 74.1 | 73.1 | 18.4M | 7.2G |
| mobilevit_xx_small | - | 27.9 | 24.6 | 1.3M | 0.36G |
| mobilevit_x_small | - | 54.2 | 56.3 | 2.3M | 0.89G |
| mobilevit_small | - | 71.2 | 72.2 | 5.6M | 2.0G |
| LeViT_128S | - | 76.1 | 75.7 | 7.8M | 0.30G |
| LeViT_128 | - | 78.5 | 77.4 | 9.2M | 0.41G |
| LeViT_192 | - | 79.9 | 79.7 | 11M | 0.66G |
| LeViT_256 | - | 80.5 | 81.3 | 19M | 1.12G |
| resnet50 | - | 77.7 | 79.6 | 25.6M | 4.1G |
| mobilenetv3_large_100 | - | 67.6 | 65.1 | 5.5M | 0.29G |
| tf_efficientnetv2_b0 | - | 76.1 | 75.0 | 7.1M | 0.72G |
| tf_efficientnetv2_b1 | - | 75.8 | 77.1 | 8.1M | 1.2G |
| tf_efficientnetv2_b2 | - | 77.0 | 76.8 | 10.1M | 1.7G |
| tf_efficientnetv2_b3 | - | 79.9 | 81.3 | 14.4M | 3.0G |
  • use the onnxruntime TensorRT backend to get accuracy
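
My reading of the //20 and //50 columns is a Top-1 estimate over every 20th (resp. 50th) validation image rather than the full set; that reading is an assumption, but such an estimate would look roughly like the hypothetical helper below, reusing the `sess` from the earlier sketch:

```python
def subset_top1(sess, dataset, stride=20):
    """Estimate Top-1 on every `stride`-th sample (hypothetical helper)."""
    correct = total = 0
    for i in range(0, len(dataset), stride):
        image, label = dataset[i]  # image: float32 NCHW numpy array
        logits = sess.run(None, {sess.get_inputs()[0].name: image})[0]
        correct += int(logits.argmax() == label)
        total += 1
    return 100.0 * correct / total
```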