Design choices & particular implementation - andriilitvynchuk/patch-parallel-detection-framework GitHub Wiki
## Design
The general pipeline logic is to parallelize everything we can, so that the time to process one frame is determined by the slowest node (the maximum over the per-node times) rather than by the sum of all nodes. `SimpleRunner` and `SimpleRunnerManager` are the main objects that help achieve this goal. They will be explained in detail later (including how you can use them in other projects); for now, here is the main concept:
- Each `Runner` is an independent node that does its job and then, optionally, shares its results with other nodes. All data is exchanged via a dictionary, so a `Runner` that takes nothing from that dictionary but puts results into it can be called a pure producer. Usually this is the first link, which reads data from the videos/cameras, but it could also wrap other sensors.
- All `Runner`s are connected via `Queue(1)`, which means the system is bounded by the performance of its slowest node. This is expected: usually the slowest nodes are the ML models.
- We have GPU and CPU operations. All GPU operations can be batched and computed in a single process, because the GPU itself does the heavy parallel work, so one process is generally enough for them. That is where `SimpleRunner` comes in: it can run both GPU and CPU operations but is designed specifically for GPU ones. Running something CPU-heavy inside a `SimpleRunner` makes no sense, since we would have to process the batch with a for loop and lose performance.
- For CPU operations, `SimpleRunnerManager` is specifically designed. It manages B processes, where B equals the number of cameras and therefore our batch size. For example, if we send batched images `(B, C, H, W)` from a `SimpleRunner` to a `SimpleRunnerManager`, each internal process gets one `(C, H, W)` image. Here comes the neat part: we need to set up the pipes for `SimpleRunnerManager` carefully and make sure no GPU tensor is ever sent to a subprocess. If a subprocess receives a GPU tensor, it has to initialize its own CUDA context, which takes extra VRAM and can blow up GPU memory.
- We define the whole pipeline once, at initialization. We set up all the data flow from producers through processing nodes to pure consumers (usually final nodes that visualize things / send events).
- Shared memory for everything we can. It is mandatory for images in GPU memory; without it the pipeline becomes too slow. `PyTorch` implements this for GPU tensors; for CPU tensors we can use the `shared_memory` backport library (Python < 3.8) or the built-in `multiprocessing.shared_memory` (Python >= 3.8).
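To make the `Runner`/`Queue(1)` idea concrete, here is a minimal sketch of the concept (not the framework's actual classes — the function and key names below are hypothetical): each node runs in its own process, nodes are connected by queues with `maxsize=1` so backpressure comes from the slowest node, and data travels as a plain dictionary that downstream nodes read from and add to.

```python
import multiprocessing as mp

def producer(out_q):
    # Pure producer: puts data into the shared dict, reads nothing from it.
    for frame_id in range(3):
        out_q.put({"frame_id": frame_id, "image": [frame_id] * 4})
    out_q.put(None)  # sentinel: end of stream

def worker(in_q, out_q):
    # Processing node: reads the dict, adds its own result key, forwards it.
    while (data := in_q.get()) is not None:
        data["sum"] = sum(data["image"])
        out_q.put(data)
    out_q.put(None)

def run_pipeline(results):
    q1 = mp.Queue(maxsize=1)  # Queue(1): the slowest node throttles everyone
    q2 = mp.Queue(maxsize=1)
    procs = [mp.Process(target=producer, args=(q1,)),
             mp.Process(target=worker, args=(q1, q2))]
    for p in procs:
        p.start()
    # Pure consumer in the main process: drains the final queue.
    while (data := q2.get()) is not None:
        results.append(data["sum"])
    for p in procs:
        p.join()

if __name__ == "__main__":
    results = []
    run_pipeline(results)
    print(results)
```

Because every queue holds at most one item, a fast producer simply blocks until the slow node is ready, which is exactly the "bounded by the slowest node" behavior described above.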
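For the CPU side of the shared-memory point, a sketch of what passing an image through `multiprocessing.shared_memory` (Python >= 3.8) looks like — only the block's name travels between processes, never the pixel data. The helper names here are illustrative, not the framework's API:

```python
from multiprocessing import shared_memory
import numpy as np

def create_shared_image(name, shape, dtype=np.uint8):
    """Allocate a named shared block and return (shm, ndarray view on it)."""
    size = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = shared_memory.SharedMemory(name=name, create=True, size=size)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)

def attach_shared_image(name, shape, dtype=np.uint8):
    """Attach to an existing block by name (what a subprocess would do)."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)

if __name__ == "__main__":
    shm, img = create_shared_image("demo_frame", (4, 4, 3))
    img[:] = 255  # the producer writes pixels in place
    shm2, view = attach_shared_image("demo_frame", (4, 4, 3))
    print(view[0, 0, 0])  # the consumer sees the same memory
    shm2.close()
    shm.close()
    shm.unlink()
```

For GPU tensors PyTorch handles the equivalent mechanism itself (CUDA tensors sent through `torch.multiprocessing` are shared via IPC handles rather than copied), which is why the pipeline only has to be careful about *which* process first touches a CUDA tensor.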
## Particular implementation

### ReadImagesToBatchRunner

We currently read the images with OpenCV, although this could be done with TensorStream, which is much faster. The frames from each camera are read in a separate thread: for video files we process every frame, for streams only the latest accessible one. If the connection to a video/camera breaks, we try to reconnect once per `reconnect_time`. We batch the images and put them into GPU and CPU shared memory. Then we cut each image into N crops so that high-resolution images (where objects are small) can be processed.
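The crop-cutting step can be sketched like this — a hypothetical `cut_into_crops` that tiles a frame and records each crop's offset in a `crop_meta` list, so detections can later be mapped back onto the full image (the real framework may pad or overlap crops; this sketch assumes the frame divides evenly):

```python
import numpy as np

def cut_into_crops(image, crop_h, crop_w):
    """Split an (H, W, C) image into non-overlapping crops plus offsets.

    Assumes H % crop_h == 0 and W % crop_w == 0 for simplicity.
    """
    h, w, _ = image.shape
    crops, crop_meta = [], []
    for y in range(0, h, crop_h):
        for x in range(0, w, crop_w):
            crops.append(image[y:y + crop_h, x:x + crop_w])
            crop_meta.append({"x_offset": x, "y_offset": y})
    # Stack into a batch so the detector can process all crops at once.
    return np.stack(crops), crop_meta

if __name__ == "__main__":
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    crops, meta = cut_into_crops(frame, 240, 320)
    print(crops.shape)  # (4, 240, 320, 3)
    print(meta[3])      # {'x_offset': 320, 'y_offset': 240}
```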
### DetectionBatchRunner

We get GPU tensors from `ReadImagesToBatchRunner` and filter out black images (from broken connections) and "lazy" cameras (a camera with no predictions is marked as blank for the next `lazy_mode_time` seconds). We run detection on each crop, then use `crop_meta` to merge the crop predictions back onto the full image. The detector is YOLOv5, trained on 8 classes.
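The merge-back step described above amounts to shifting each crop-local box by its crop's offset. A minimal sketch, assuming the `(x1, y1, x2, y2, score, cls)` box layout and `crop_meta` keys used here (the framework's exact structures may differ, and the real pipeline would likely also run NMS across crop boundaries):

```python
def merge_crop_detections(per_crop_boxes, crop_meta):
    """Map crop-local detections back to full-image coordinates.

    per_crop_boxes[i] is a list of (x1, y1, x2, y2, score, cls) tuples
    in crop i's local coordinates; crop_meta[i] holds that crop's offsets.
    """
    merged = []
    for boxes, meta in zip(per_crop_boxes, crop_meta):
        dx, dy = meta["x_offset"], meta["y_offset"]
        for x1, y1, x2, y2, score, cls in boxes:
            # Translate the box by the crop's position in the big image.
            merged.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy, score, cls))
    return merged
```

For example, a box at `(5, 5, 15, 15)` inside the crop whose `x_offset` is 320 becomes `(325, 5, 335, 15)` in full-image coordinates.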
### TrackerRunnerManager

Each process runs the SORT algorithm to track the objects. We attach a track to each bbox, and using the tracks we can also smooth the label of each box for more stable predictions. Smoothing is simply the median of the predictions over the last `buffer_size` observations.
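The median smoothing over the last `buffer_size` observations can be sketched with a per-track ring buffer — the class and method names here are illustrative, not the framework's actual API:

```python
from collections import defaultdict, deque
from statistics import median

class LabelSmoother:
    """Keep the last buffer_size class predictions per track,
    report the median as the smoothed label."""

    def __init__(self, buffer_size=5):
        # deque(maxlen=...) drops the oldest observation automatically.
        self.history = defaultdict(lambda: deque(maxlen=buffer_size))

    def update(self, track_id, class_id):
        """Record the newest prediction and return the smoothed label."""
        buf = self.history[track_id]
        buf.append(class_id)
        return int(median(buf))

if __name__ == "__main__":
    smoother = LabelSmoother(buffer_size=5)
    for cls in [2, 2, 7, 2, 2]:  # one noisy detection among class 2
        label = smoother.update(track_id=1, class_id=cls)
    print(label)  # the single outlier is suppressed by the median
```

The median makes one-off misclassifications on a stable track disappear, at the cost of a short delay before a genuine class change is reflected.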
### VisualizationRunnerManager

Each process takes `bboxes_with_tracks` and `images_cpu` (the batch is split so that each process gets a single item) and draws each bbox on the image in its class color, with the track on top. The result can then be written to a video if the option is specified.
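A numpy-only sketch of the drawing step: paint a bbox border in a per-class color directly into the pixel array. The real runner presumably uses OpenCV's drawing calls (`cv2.rectangle`, `cv2.putText` for the track id); the color palette and helper below are assumptions for illustration:

```python
import numpy as np

# Hypothetical per-class colors (BGR); the actual palette is framework-defined.
CLASS_COLORS = {0: (0, 255, 0), 1: (0, 0, 255)}

def draw_bbox(image, bbox, class_id, thickness=2):
    """Draw a rectangle border for (x1, y1, x2, y2) in place on an (H, W, 3) image."""
    x1, y1, x2, y2 = bbox
    color = CLASS_COLORS.get(class_id, (255, 255, 255))
    image[y1:y1 + thickness, x1:x2] = color  # top edge
    image[y2 - thickness:y2, x1:x2] = color  # bottom edge
    image[y1:y2, x1:x1 + thickness] = color  # left edge
    image[y1:y2, x2 - thickness:x2] = color  # right edge
    return image

if __name__ == "__main__":
    frame = np.zeros((100, 100, 3), dtype=np.uint8)
    draw_bbox(frame, (10, 10, 50, 50), class_id=0)
    print(frame[10, 20].tolist())  # green: the top edge of the box
```

Writing the annotated frames out is then a matter of feeding them to something like `cv2.VideoWriter` when the video-output option is enabled.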