Introduction - trap-fish/uav-human-detection GitHub Wiki

Background - Object Detection

At its simplest, object detection is a branch of computer vision in which a target object in an image is located and its position bounded by four coordinates making up a bounding box, together with a confidence score indicating how likely the detection is to be the class the model has predicted.
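As a minimal sketch of what one such detection contains (the field names here are illustrative, not taken from any particular framework):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x1: float  # left edge of the bounding box, in pixels
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge
    class_name: str    # the class the model predicted, e.g. "person"
    confidence: float  # score in [0, 1] for how likely the box holds that class

# one hypothetical detection: a person found at these pixel coordinates
det = Detection(x1=120.0, y1=80.0, x2=180.0, y2=260.0,
                class_name="person", confidence=0.87)
print(det.class_name, det.confidence)
```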

CNN-based detectors split into two key branches: two-stage and one-stage approaches. A two-stage detector first predicts a set of region proposals and then refines each proposal into a more accurate region and class prediction. In contrast, one-stage networks such as YOLOv3 [11] ditch the computationally expensive region proposals and predict over regular, dense locations instead, for example using anchor boxes. These predefined anchor boxes of different scales are used to detect an object, and each anchor box is then adjusted to better fit the ground-truth object.
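The anchor-adjustment step can be sketched as follows. This is a YOLO-style decoding (centre offsets squashed into the grid cell, width/height scaling the anchor multiplicatively); the variable names are illustrative, not a specific framework's API:

```python
import math

def decode_anchor(anchor_w, anchor_h, cell_x, cell_y,
                  tx, ty, tw, th, stride):
    """Refine a predefined anchor box with the network's predicted offsets."""
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    # centre offsets are constrained to the cell, then scaled to pixels
    bx = (cell_x + sig(tx)) * stride
    by = (cell_y + sig(ty)) * stride
    # width/height scale the anchor multiplicatively
    bw = anchor_w * math.exp(tw)
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh

# with zero offsets the anchor stays at its original size,
# centred in grid cell (10, 10) of a stride-32 feature map
bx, by, bw, bh = decode_anchor(30, 60, 10, 10, 0.0, 0.0, 0.0, 0.0, stride=32)
```

The sigmoid keeps the predicted centre inside its own grid cell, which is what makes the dense grid of anchors stable to train.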

However, anchor boxes are applied to feature maps, not to the input image directly. It is in the backbone that features are extracted through several convolutional layers until a feature map is generated, with different layers of the backbone producing feature maps of different sizes (scales). In YOLOv5, the backbone consists of Convolutional, CSPDarknet53 (C3) and Spatial Pyramid Pooling (SPP) modules; together, these modules extract features and generate the feature maps. A detailed overview of the YOLOv5 model can be seen below, reproduced from the HIC-YOLO paper [18].

*Figure: YOLOv5 model overview (yolov5v6), from the HIC-YOLO paper [18]*
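To make the "anchors live on feature maps" point concrete, the sketch below tiles a set of anchor shapes over a coarse feature map; each grid cell gets one copy of every anchor, mapped back to input-image pixels via the stride. The anchor sizes are illustrative, not YOLOv5's exact configuration:

```python
def tile_anchors(fmap_size, stride, anchors):
    """anchors: list of (w, h) in input pixels; returns (cx, cy, w, h) boxes."""
    boxes = []
    for gy in range(fmap_size):
        for gx in range(fmap_size):
            for (w, h) in anchors:
                # centre of this grid cell in input-image pixel coordinates
                cx = (gx + 0.5) * stride
                cy = (gy + 0.5) * stride
                boxes.append((cx, cy, w, h))
    return boxes

# a 20x20 feature map (stride 32) with 3 anchor shapes -> 1200 candidate boxes
boxes = tile_anchors(20, 32, [(116, 90), (156, 198), (373, 326)])
print(len(boxes))
```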

As seen in the image below, the P2 layer generates a higher-resolution feature map (160x160 for a 640x640 input). This increases computational cost, but it also provides a finer layer of residual features that allows smaller objects to be detected. In fact, if the application focuses mostly on small objects, the P5 layer can be removed entirely [9].

*Figure: YOLOv5 with an added P2 detection head (p2head)*
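The resolution/cost trade-off follows directly from the strides. A quick calculation (standard YOLOv5 strides of 4/8/16/32 for P2-P5, assumed here) shows why adding a P2 head is expensive: its grid alone has more cells, and therefore more predictions, than the other three heads combined.

```python
input_size = 640
strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

for name, stride in strides.items():
    side = input_size // stride
    print(f"{name}: stride {stride} -> {side}x{side} grid, {side * side} cells")

# P2 alone contributes 160*160 = 25600 cells, versus
# 6400 + 1600 + 400 = 8400 for P3 + P4 + P5 combined.
```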