Inference Optimization

If you would like to increase inference speed, some options are:

Quantization

Quantization converts a model's 32-bit floating-point weights (and often activations) to 8-bit integers and performs some or all operations in 8-bit integer arithmetic, which can reduce model size and memory requirements by roughly a factor of 4.

However, this comes at a cost: to reduce model size and improve execution time, we sacrifice some numerical precision. There is therefore a trade-off between model accuracy on one side and size/latency on the other.
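As a minimal sketch, PyTorch's post-training dynamic quantization can be applied to a trained model in a few lines. The model below is a placeholder, not code from this repository:

```python
import torch

# Placeholder FP32 model standing in for the project's trained network.
model = torch.nn.Sequential(
    torch.nn.Linear(640, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
model.eval()

# Post-training dynamic quantization: weights of the listed layer types are
# stored as int8 and their matmuls run in int8; activations are quantized
# on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is called exactly like the original one.
with torch.no_grad():
    out = quantized_model(torch.randn(1, 640))
```

The int8 weights are what drive the roughly 4x reduction in model size; the accuracy impact has to be measured on your own validation data.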

Sparsification

Sparsification is the process of taking a trained deep learning model and removing redundant information from the over-precise and over-parameterized network, resulting in a faster and smaller model. Techniques for sparsification are all-encompassing, including everything from inducing sparsity using pruning and quantization to enabling naturally occurring activation sparsity. When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics.

For example, as you will see shortly in our benchmarking exercise, pruning plus quantization can give over 7.3x improvement in performance while recovering to nearly the same baseline accuracy. Sparsification also reduces the model footprint: in the ResNet-50 example below, the model size was reduced from the original 90.3 MB to 3.3 MB while retaining 99% of the baseline accuracy!
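As a rough illustration of the pruning side of sparsification, PyTorch's built-in pruning utilities can zero out low-magnitude weights. The layer and sparsity level below are placeholders for illustration only, not the settings used by Neural Magic or this project:

```python
import torch
import torch.nn.utils.prune as prune

# Toy layer standing in for one convolution of a trained network.
layer = torch.nn.Conv2d(3, 16, kernel_size=3)

# Unstructured magnitude pruning: zero out the 80% of weights with the
# smallest absolute value, one common way of inducing sparsity.
prune.l1_unstructured(layer, name="weight", amount=0.8)

# Fraction of weights that are now exactly zero.
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}")
```

Sparse weights on their own mostly shrink the stored model; the speedups quoted above additionally rely on a runtime (such as DeepSparse, described next) that can exploit the zeros during inference.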

The DeepSparse Platform builds on top of sparsification, enabling you to easily apply these techniques to your datasets and models using recipe-driven approaches. Recipes are YAML or Markdown files that SparseML uses to define and control the sparsification of a model. Recipes consist of a series of Modifiers that can influence the training process in different ways.
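A minimal sketch of that recipe-driven flow is shown below, following the pattern from SparseML's documented examples. The modifier names come from SparseML, but the epoch and sparsity values are illustrative placeholders, and exact import paths can vary between SparseML versions:

```python
import torch
from sparseml.pytorch.optim import ScheduledModifierManager

# Placeholder recipe: the epoch range and sparsity targets are illustrative
# values, not tuned settings from this project.
recipe = """
modifiers:
    - !EpochRangeModifier
        start_epoch: 0.0
        end_epoch: 10.0

    - !GMPruningModifier
        start_epoch: 1.0
        end_epoch: 8.0
        init_sparsity: 0.05
        final_sparsity: 0.85
        params: __ALL_PRUNABLE__
        update_frequency: 0.5
"""
with open("recipe.yaml", "w") as f:
    f.write(recipe)

model = torch.nn.Sequential(torch.nn.Linear(640, 10))  # stand-in for your model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# The manager wraps the optimizer so the recipe's modifiers are applied
# during the normal training loop's optimizer.step() calls.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=100)

# ... run the usual training loop here ...

manager.finalize(model)
```

The resulting sparsified model can then be exported to ONNX and run with the DeepSparse engine to realize the speedups in practice.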