# Inference Optimization
If you would like to increase your inference speed, some options are:
- Use batched inference with YOLOv5 PyTorch Hub (https://github.com/ultralytics/yolov5/issues/36); see the sketch after this list
- Reduce --img-size, i.e. 1280 -> 640 -> 320
- Reduce model size, i.e. YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s -> YOLOv5n
- Use half-precision FP16 inference with `python detect.py --half` and `python val.py --half`
- Use a faster GPU, i.e. P100 -> V100 -> A100
- Export to ONNX or OpenVINO (https://github.com/ultralytics/yolov5/issues/251) for up to 3x CPU speedup (https://github.com/ultralytics/yolov5/pull/6613)
- Export to TensorRT (https://github.com/ultralytics/yolov5/issues/251) for up to 5x GPU speedup
- Use a free GPU backend with up to 16 GB of CUDA memory, e.g. Google Colab or Kaggle notebooks
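
Several of these options can be combined. As a rough illustration, the sketch below runs batched YOLOv5 PyTorch Hub inference (issue #36) with a small model variant at a reduced image size; the image filenames are placeholders, not files from this project.

```python
# Hedged sketch: batched PyTorch Hub inference with a small model and a
# reduced inference size. Image filenames are hypothetical placeholders.
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # smaller variant, e.g. instead of yolov5x

imgs = ['img1.jpg', 'img2.jpg', 'img3.jpg']  # all images are processed in a single batch
results = model(imgs, size=320)              # smaller inference size, i.e. 640 -> 320
results.print()
```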
## Quantization
Quantization converts 32-bit floating point numbers to 8-bit integers. It performs some or all of the operations on 8-bit integers, which can reduce the model size and memory requirements by a factor of 4.
However, this comes at a cost: to reduce the model size and improve execution time, we sacrifice some precision, so there is a trade-off between model accuracy and size/latency.
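
As a minimal illustration of the idea (not this project's pipeline), PyTorch's post-training dynamic quantization converts selected layers to 8-bit integers; the toy model below is an assumption for demonstration only.

```python
# Hedged sketch: post-training dynamic quantization of a toy FP32 model.
import torch
import torch.nn as nn

fp32_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8  # store Linear weights as 8-bit integers
)

# The quantized layers are roughly 4x smaller; accuracy may drop slightly.
print(int8_model)
```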
## Sparsification
Sparsification is the process of taking a trained deep learning model and removing redundant information from the overprecise and over-parameterized network, resulting in a faster and smaller model. Techniques for sparsification are all-encompassing, including everything from inducing sparsity using pruning and quantization to enabling naturally occurring activation sparsity. When implemented correctly, these techniques produce significantly more performant and smaller models with limited to no effect on the baseline metrics.

For example, pruning plus quantization can give over 7.3x improvement in performance while recovering to nearly the same baseline accuracy. Sparsification also reduces the model footprint: in a ResNet-50 example, the model size shrinks from the original 90.3 MB to 3.3 MB while retaining 99% of the baseline accuracy.
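
For intuition only, the sketch below applies magnitude pruning (one of the sparsification techniques mentioned above) to a single layer with `torch.nn.utils.prune`; the layer size and the 50% sparsity level are arbitrary assumptions, not values used by this project.

```python
# Hedged sketch: L1 (magnitude) unstructured pruning of one layer.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)
prune.l1_unstructured(layer, name='weight', amount=0.5)  # zero the 50% smallest-magnitude weights
prune.remove(layer, 'weight')                            # bake the pruning mask into the weight tensor

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")
```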
The DeepSparse Platform builds on top of sparsification, enabling you to easily apply these techniques to your datasets and models using recipe-driven approaches. Recipes are YAML or Markdown files that SparseML uses to define and control the sparsification of a model. Recipes consist of a series of Modifiers that can influence the training process in different ways.
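
Below is a rough sketch of how a SparseML recipe is typically wired into a PyTorch training setup, following SparseML's documented ScheduledModifierManager pattern; the recipe path, toy model, and training details are placeholders and assumptions, not this project's exact code.

```python
# Hedged sketch: applying a SparseML recipe to a PyTorch training setup.
# 'recipe.yaml' is a placeholder path to a recipe containing Modifiers.
import torch
from sparseml.pytorch.optim import ScheduledModifierManager

model = torch.nn.Sequential(torch.nn.Linear(16, 2))  # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

manager = ScheduledModifierManager.from_yaml('recipe.yaml')
optimizer = manager.modify(model, optimizer, steps_per_epoch=100)  # modifiers now run as training steps

# ... normal training loop goes here ...

manager.finalize(model)  # remove sparsification hooks once training is done
```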