TensorRT

Introduction: TensorRT 3: Faster TensorFlow Inference and Volta Support

  • Key highlights of TensorRT 3 include:
    • TensorFlow Model Importer: a convenient API to import, optimize and generate inference runtime engines from TensorFlow trained models
    • Python API: an easy-to-use Python interface for improved productivity
    • Volta Tensor Core Support: delivers up to 3.7x faster inference performance on Tesla V100 vs. Tesla P100 GPUs

Why Does Inference Need a Dedicated Solution?

  • High throughput
  • Low response time
  • Power efficient
  • Deployment-grade solution

Example: Deploying a TensorFlow model with TensorRT (example code)

  1. Import and optimize trained models to generate inference engines. We perform this step only once, prior to deployment. We use TensorRT to parse a trained model and perform optimizations for specified parameters such as batch size, precision, and workspace memory for the target deployment GPU. The output of this step is an optimized inference execution engine, which we serialize to a file on disk called a plan file.
  2. Deploy the generated runtime inference engine for inference. This is the deployment step. We load and deserialize a saved plan file to create a TensorRT engine object, and use it to run inference on new data on the target deployment platform.

1. Importing a trained model
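
A minimal sketch of this step with the TensorRT 3 Python API (as used in the blog post above), assuming a TensorFlow graph frozen to model_frozen.pb with an input node named input of shape 3x224x224 and an output node named output; the node names, shape, and exact signatures are illustrative and may differ across TensorRT versions.

import uff
from tensorrt.parsers import uffparser

# Convert the frozen TensorFlow graph to UFF, TensorRT's import format.
# "output" is the name of the graph's output node (model-specific).
uff_model = uff.from_tensorflow_frozen_model("model_frozen.pb", ["output"])

# Describe the network inputs and outputs for the UFF parser.
parser = uffparser.create_uff_parser()
parser.register_input("input", (3, 224, 224), 0)  # name, CHW shape, input order
parser.register_output("output")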

2. TensorRT Optimizations

  • Layer and tensor fusion and elimination of unused layers
  • FP16 and INT8 reduced precision calibration
  • Target-specific autotuning
  • Efficient memory reuse
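
These optimizations are applied when the inference engine is built. A hedged sketch, continuing from the uff_model and parser created in step 1; the build parameters (max batch size, workspace size, precision) are the knobs listed above, and the call shape follows the TensorRT 3 Python API, so it may differ in later releases.

import tensorrt as trt

G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.ERROR)

# Build an optimized engine: layer/tensor fusion, precision selection,
# kernel autotuning and memory planning all happen inside this call.
engine = trt.utils.uff_to_trt_engine(
    G_LOGGER,
    uff_model,                 # UFF model from step 1
    parser,                    # UFF parser with registered inputs/outputs
    1,                         # max batch size
    1 << 20,                   # max workspace size in bytes
    trt.infer.DataType.HALF)   # FP16 build; use DataType.FLOAT for FP32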

3. Serializing Optimized TensorRT Engines
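
A short sketch of writing the optimized engine to a plan file, assuming the engine object built in the previous step; trt.utils.write_engine_to_file is the TensorRT 3 helper for this, so the call may look different in newer versions.

# Serialize the optimized engine and save it as a plan file for deployment.
trt.utils.write_engine_to_file("model.plan", engine.serialize())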

4. TensorRT Run-Time Inference
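
At deployment time the plan file is deserialized and used to run inference on new data. A minimal sketch using the lightweight engine wrapper from the TensorRT 3 Python API; input_batch is a stand-in for preprocessed input data, and this interface changed substantially in later TensorRT releases.

import numpy as np
from tensorrt.lite import Engine

# Deserialize the plan file into a runtime engine object.
engine = Engine(PLAN="model.plan")

# Run inference on a preprocessed NCHW float32 batch (random data here).
input_batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
results = engine.infer(input_batch)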

RESTful Inference with the TensorRT Container and NVIDIA GPU Cloud

Getting the TensorRT Container

  1. Install Docker and nvidia-docker
  2. Sign up for NVIDIA GPU Cloud and generate an API key
$ docker login nvcr.io
  Username:
  Password:
$ docker pull nvcr.io/nvidia/tensorrt:17.12

Trying It Out

The container includes all the TensorRT C++ and Python examples.

$ nvidia-docker run -it --rm nvcr.io/nvidia/tensorrt:17.12

To build the C++ samples you just need to run make. The sample binaries are placed in /workspace/tensorrt/bin.

$ cd /workspace/tensorrt/samples
$ make
$ cd /workspace/tensorrt/bin
$ ./sample_mnist

The TensorRT Python examples are also available and equally easy to execute.

$ cd /workspace/tensorrt/python/examples
$ python mnist_api.py ../data

Using Your Own Model

First you must freeze your TensorFlow model to make it suitable for inference. Let's assume you've saved the frozen model as /home/dev/mymodels/model_frozen.pb. You must also create a labels file; save it as /home/dev/mymodels/model_labels.txt. You can see examples of labels files in the container (/workspace/tensorrt_server/*_labels.txt).
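
If you still need to produce the frozen graph, a minimal TensorFlow 1.x sketch looks like the following; the checkpoint paths and the output node name myoutput are placeholders for your own model.

import tensorflow as tf

with tf.Session() as sess:
    # Restore the trained model from its checkpoint (paths are placeholders).
    saver = tf.train.import_meta_graph("/home/dev/mymodels/model.ckpt.meta")
    saver.restore(sess, "/home/dev/mymodels/model.ckpt")

    # Fold variables into constants so the graph is self-contained for inference.
    frozen_graph = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ["myoutput"])

    # Write the frozen graph where the container will expect it.
    tf.train.write_graph(frozen_graph, "/home/dev/mymodels",
                         "model_frozen.pb", as_text=False)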

Now that you have your trained model and labels you need to make those files available within the container. Do this by using the docker --mount flag to map /home/dev/mymodels on your host system to /tmp/mymodels within the container.

$ nvidia-docker run -p 8000:8000 -it --rm --name mytensorrt --mount type=bind,source=/home/dev/mymodels,target=/tmp/mymodels nvcr.io/nvidia/tensorrt:17.12

/workspace# ls /tmp/mymodels
model_frozen.pb
model_labels.txt

Copy one of the provided scripts (tensorflow/caffe/onnx) and modify it for your own model. Specify the names of the input and output nodes in your model and also the format of the input data.

/workspace/tensorrt_server# cp tensorflow_resnet tensorflow_mymodel
/workspace/tensorrt_server# cat tensorflow_mymodel
#!/bin/bash

SERVER_EXEC=tensorrt_server
MODEL=/tmp/mymodels/model_frozen.pb
LABELS=/tmp/mymodels/model_labels.txt
INPUT_NAME=myinput
INPUT_FORMAT=float32,3,224,224
OUTPUT_NAME=myoutput
INFER_DTYPE=float16
CMD="$SERVER_EXEC -t tensorflow -d $INFER_DTYPE -i $INPUT_NAME -f $INPUT_FORMAT -o $OUTPUT_NAME -m $MODEL -l $LABELS"

$CMD

Make sure you set INPUT_NAME, INPUT_FORMAT and OUTPUT_NAME appropriately for your model. Now you’re ready to run the REST server using your model. Simply execute the script you just created to start the server.

/workspace/tensorrt_server# bash ./tensorflow_mymodel
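
Once the server is running you can send requests to it from the host on port 8000. A hypothetical client sketch in Python: the endpoint path and payload format below are assumptions, so check the tensorrt_server documentation inside the container for the actual REST interface.

import requests  # third-party HTTP client, install with pip if needed

# Hypothetical endpoint and payload; the real URL path and request format
# are defined by the tensorrt_server documentation in the container.
with open("test_image.jpg", "rb") as f:
    resp = requests.post("http://localhost:8000/classify", data=f.read())
print(resp.status_code, resp.text)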

NVIDIA TensorRT™ is a C++ library that enables high-performance inference on NVIDIA GPUs (a high-performance deep learning inference optimizer and runtime for deep learning applications).

TensorRT takes a network definition and optimizes it by fusing tensors and layers, transforming weights, choosing efficient intermediate data formats, and selecting from a large kernel catalog based on layer parameters and measured performance.

With TensorRT you can import models trained in any deep learning framework. After applying the optimizations, TensorRT selects platform-specific kernels that maximize performance on Tesla GPUs in the data center, on Jetson embedded platforms, and on NVIDIA DRIVE autonomous driving platforms. TensorRT works on all NVIDIA GPUs that support the CUDA platform; NVIDIA recommends Tesla V100, P100, P4, and P40 GPUs for production deployment.

Installing TensorRT 3.0.4

https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html

https://github.com/tensorflow/tensorflow/tree/r1.7/tensorflow/contrib/tensorrt

The notes below follow the installation guide provided with TensorRT.

  • Getting Started

    • TensorRT Python API: requires PyCUDA (see Chapter 7, Installing PyCUDA)
    • CUDA Toolkit 8.0 or 9.0
    • TensorFlow v1.3 with GPU acceleration enabled
    • If you only need the C++ API, there is no need to install the Debian packages labeled Python or the whl files.
  • Downloading TensorRT
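
If you use the TensorFlow integration linked above (tensorflow/contrib/tensorrt in TensorFlow r1.7), a frozen graph can also be optimized without leaving TensorFlow. A minimal sketch; the model path, output node name, and sizes are placeholders.

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load the frozen GraphDef produced earlier (path is a placeholder).
frozen_graph_def = tf.GraphDef()
with open("/home/dev/mymodels/model_frozen.pb", "rb") as f:
    frozen_graph_def.ParseFromString(f.read())

# TensorRT-compatible subgraphs are replaced by TRTEngineOp nodes; the rest
# of the graph keeps running as ordinary TensorFlow ops.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph_def,
    outputs=["myoutput"],                # output node names (model-specific)
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode="FP16")               # "FP32", "FP16" or "INT8"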

Reference

https://developers-kr.googleblog.com/2018/05/tensorrt-integration-with-tensorflow.html
https://developer.nvidia.com/tensorrt