Triton Inference Server

About

  • Open-source inference serving software from NVIDIA
  • Used to deploy trained AI models in production (minimal client sketch below)
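
A minimal sketch of querying a model already deployed on Triton, using the real `tritonclient` Python package over the HTTP endpoint. The model name `resnet50` and the tensor names `input__0`/`output__0` are illustrative assumptions, not part of this wiki.

```python
# Minimal sketch: send one inference request to a running Triton server.
# Assumes a server at localhost:8000 serving a hypothetical "resnet50"
# model whose I/O tensors are named input__0 / output__0.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request: one FP32, image-shaped input tensor.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# Ask for the output tensor and run the request.
result = client.infer(
    model_name="resnet50",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output__0")],
)
print(result.as_numpy("output__0").shape)
```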

Hardware support

  • Per the README, Triton is optimized for the best inferencing performance on GPUs, but it also works on CPU-only systems (per-model config sketch below)
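
CPU vs GPU placement is chosen per model via `instance_group` in that model's `config.pbtxt`. The sketch below generates such a config from Python; the model name `densenet_onnx`, the backend choice, and the instance count are illustrative assumptions.

```python
# Hypothetical sketch: write a config.pbtxt that pins a model to CPU.
# Swapping KIND_CPU for KIND_GPU targets GPUs instead.
from pathlib import Path

config = """\
name: "densenet_onnx"
backend: "onnxruntime"
instance_group [
  {
    kind: KIND_CPU   # run on CPU; use KIND_GPU on GPU-equipped systems
    count: 2         # two parallel execution instances of this model
  }
]
"""

model_dir = Path("model_repository/densenet_onnx")
model_dir.mkdir(parents=True, exist_ok=True)
(model_dir / "config.pbtxt").write_text(config)
```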

Backends

TensorRT

PyTorch

ONNX Runtime

TensorFlow

  • On Jetson, Docker images are not used
  • On non-Jetson systems, Docker images are used (the NGC TensorFlow image by default); see the SavedModel sketch below
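
To serve a TensorFlow model, Triton's TensorFlow backend loads it from the model repository layout `<repo>/<model_name>/<version>/model.savedmodel`. A hedged sketch of exporting a model into that layout; the model architecture and repository path are illustrative assumptions.

```python
# Hypothetical sketch: export a TensorFlow model into the directory layout
# the TensorFlow backend reads: <repo>/<model_name>/<version>/model.savedmodel
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

# "1" is the model version directory Triton will serve.
tf.saved_model.save(model, "model_repository/simple_tf/1/model.savedmodel")
```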

TFLite backend (armnn_tflite)

OpenVINO

Block diagram

  • [Image: Triton Inference Server block diagram]
