AWS Inferentia

AWS Neuron SDK

  • SDK for deploying ML inference on Amazon EC2 Inf1 instances
  • Consists of a compiler, a runtime, and profiling tools (see the compilation sketch below)
  • Pre-installed in AWS Deep Learning AMIs, AWS Deep Learning Containers, and Amazon SageMaker
  • Can also be installed in a custom environment without a framework
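A minimal compilation sketch, assuming the torch-neuron package from the Neuron SDK (the 1.x line that targets Inf1) and a stock torchvision ResNet-50; the model choice and output file name are illustrative, not part of the SDK:

```python
import torch
import torch_neuron  # registers the torch.neuron namespace
from torchvision import models

# Load a standard pretrained model and switch it to inference mode.
model = models.resnet50(pretrained=True)
model.eval()

# Compile for Inferentia by tracing with a representative input.
image = torch.zeros([1, 3, 224, 224], dtype=torch.float32)
model_neuron = torch.neuron.trace(model, example_inputs=[image])

# Save the compiled TorchScript artifact for deployment on Inf1.
model_neuron.save("resnet50_neuron.pt")
```

The compiled artifact is ordinary TorchScript, so the deployment side only needs torch plus the Neuron runtime (see the inference sketch in the workflow section below).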

AWS EC2 Inf1

  • Up to 16 AWS Inferentia chips
  • 2nd generation Intel Xeon Scalable processors
  • Up to 100 Gbps networking (a launch sketch follows this list)
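As an illustration of getting onto this hardware, here is a minimal boto3 sketch that launches an Inf1 instance. The AMI ID and key-pair name are placeholders; inf1.24xlarge is used because it is the size that carries 16 Inferentia chips.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one inf1.24xlarge, the Inf1 size with 16 Inferentia chips.
# ImageId and KeyName are placeholders: substitute a current AWS Deep
# Learning AMI (Neuron SDK pre-installed) and your own key pair.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
    InstanceType="inf1.24xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",            # placeholder key pair
)

print(response["Instances"][0]["InstanceId"])
```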

Users

  • Airbnb - chatbot using PyTorch-based BERT NLP models - 2x improvement in throughput
  • Autodesk - AI-powered virtual assistant - 4.9x higher throughput than G4dn for NLU models
  • Snap - recommendation models
  • Sprinklr - natural language processing (NLP) and computer vision models

Amazon users

Amazon Advertising

  • Text ad processing - PyTorch-based BERT models, migrated from GPUs
  • Image ad processing models

Amazon Alexa

  • Text-to-speech - lower inference latency and cost per inference
  • Web-based question answering (WBQA) workloads
    • TensorFlow-based model
    • Migrated from GPU-based P3 instances
    • Reduced inference costs by 60%
    • Reduced end-to-end latency by more than 40%

Amazon Rekognition

  • Object classification models
  • 8x lower latency and 2x higher throughput compared to GPUs

Machine learning workflow

  • Build your model in one of the popular machine learning frameworks
  • Train your model on GPU instances such as P3 or P3dn
  • Deploy the trained model on Inf1 instances using the AWS Neuron SDK (see the inference sketch after this list)
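A matching deployment sketch, assuming an Inf1 instance with the Neuron runtime installed and the resnet50_neuron.pt artifact produced by the compilation sketch above:

```python
import torch
import torch_neuron  # registers the Neuron ops with TorchScript

# Load the Neuron-compiled TorchScript artifact from build time.
model_neuron = torch.jit.load("resnet50_neuron.pt")

# A plain TorchScript call; the compiled subgraph executes on the
# Inferentia chip(s) of the Inf1 instance.
image = torch.zeros([1, 3, 224, 224], dtype=torch.float32)
output = model_neuron(image)
print(output.shape)
```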

See also

  • Groq | Habana Labs | Graphcore
  • Google TPU
  • AWS EC2 | AWS Elastic Inference
  • AWS Graviton | AWS Inferentia