Literature Review

Background and Related Work

The integration of lightweight object detection models with autonomous UAVs has emerged as a critical enabler for a range of smart IoT applications, including disaster response and search and rescue (SAR) operations. In such scenarios, rapid and accurate identification of human victims from aerial imagery is vital, yet complicated by factors such as small object size, partial occlusion, cluttered backgrounds, and environmental noise. These challenges have led to the widespread adoption and continued refinement of real-time object detectors, particularly those in the YOLO family, due to their architectural simplicity, low latency, and suitability for edge deployment.

Numerous studies have adapted YOLO architectures for UAV-based detection tasks. Early modifications to YOLOv3 [11] targeted onboard deployment through hardware-aware optimizations [13]. More recent efforts have focused on YOLOv8-based models, such as DEAL-YOLO [9], which introduced multi-objective loss functions and scaled feature fusion mechanisms to improve performance on small aerial targets. Notably, DEAL-YOLO reduced model complexity by eliminating the P5 detection head, enhancing inference efficiency without significantly compromising accuracy. However, these improvements primarily address algorithmic accuracy and ignore deployment constraints on embedded platforms.

Enhancements like the inclusion of a P2 detection head have become common in efforts to improve small-object resolution. Models such as HIC-YOLO [18], based on YOLOv5 [19], and LEAF-YOLO [20], based on YOLOv7 [24], leverage advanced convolutional techniques and specialized feature extractors (e.g., Ghost Convolutions [23], MaxPooling blocks) to address visual density and clutter in UAV imagery. Despite strong detection results, these studies rarely offer insights into real-time deployment metrics such as energy consumption and latency, particularly on low-power devices.

The SeaDronesSee dataset has further inspired research into maritime small object detection, leading to novel architectural refinements [22]. However, these studies often stop short of validating performance on embedded platforms, leaving practical deployment potential unexamined. Similarly, LEAF-YOLO demonstrated promising results on the NVIDIA Jetson AGX Xavier, but did not evaluate energy efficiency, a critical parameter for UAV operations.

Research into edge deployment has largely focused on GPU-based accelerators, which, while performant, impose significant power demands unsuitable for UAV missions. Interest is growing in alternatives such as Neural Processing Units (NPUs), which provide high-throughput, low-power inference capabilities. Work by Achmadiah et al. [27] demonstrated the deployment of YOLOX [28] on the Hailo-8 NPU for train platform monitoring. However, this and similar efforts [29] neither addressed aerial detection challenges nor provided comparative performance analyses across other edge devices.

Low-power alternatives like FPGA-based inference have also been explored. For example, an optimized SkyNet model [31] was successfully deployed on UAVs using FPGAs [30], achieving energy-efficient operation. Nevertheless, such approaches typically emphasize model compression and quantization over generalizability to complex SAR environments.

In summary, while the body of research reflects significant advances in both model design and edge deployment for object detection, few works provide a comprehensive evaluation of lightweight detectors under the real-world constraints of UAV-based SAR operations. There is a notable absence of studies comparing performance, latency, and energy efficiency across widely accessible platforms such as the Raspberry Pi (with CPU-based inference) and NPU-accelerated systems like the Hailo-8. This work aims to address this gap through rigorous benchmarking of YOLO-based models in SAR-relevant settings.

DEAL-YOLO [9]

This study adapted the YOLOv8 architecture [17] using a multi-objective loss function that combines Wise-IoU with Normalised Wasserstein Distance (NWD): bounding boxes are modelled as 2D Gaussian distributions, and NWD measures the similarity between predicted boxes and ground-truth annotations as a distance between those distributions. This improved object localisation and reduced abrupt deviations by weighting pixels towards the centre of the bounding box, so smaller objects were better accounted for. Additionally, Linear Deformable convolutions were added between the C2f blocks in the backbone, helping the model handle distorted and irregular shapes through convolutional kernels that adapt dynamically to local feature variations. Since the more computationally expensive P2 head was added to improve small object detection, the P5 detection head, typically used for large object detection, was removed, as large objects are less common in aerial imagery. SPPF (Spatial Pyramid Pooling-Fast) was then applied at the P4 head (since P5 was removed), which reduced the number of channels fed into the SPPF block while retaining only the most relevant feature maps, helping to balance the accuracy vs. speed trade-off. Overall, DEAL-YOLO improved mAP50 on the BuckTales dataset over YOLOv8n, recording 48.5% vs. 42.8% with single-stage inference. However, on this dataset the standard YOLOv5n recorded 48.7% mAP50, and while other models achieved higher accuracies, the total parameter count for DEAL-YOLO was $<1M$. A low parameter count does not always translate to faster inference, but the work did give consideration to latency and deployment in a real-world setting.
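
As a concrete reference for how such a similarity can be computed, below is a minimal sketch of NWD for axis-aligned boxes, following the common Gaussian formulation from the tiny-object detection literature; the normalising constant `C` and its value are assumptions here, not taken from [9]:

```python
import torch

def nwd(pred, target, C=12.8):
    """Normalised Wasserstein Distance between axis-aligned boxes.

    Boxes are (cx, cy, w, h) tensors of shape (N, 4). Each box is
    modelled as a 2D Gaussian N(mu, Sigma) with mu = (cx, cy) and
    Sigma = diag(w^2/4, h^2/4); for diagonal covariances the squared
    2-Wasserstein distance reduces to a plain L2 distance. C is a
    dataset-dependent normalising constant (treated here as tunable).
    """
    dcx = pred[:, 0] - target[:, 0]
    dcy = pred[:, 1] - target[:, 1]
    dw = (pred[:, 2] - target[:, 2]) / 2
    dh = (pred[:, 3] - target[:, 3]) / 2
    w2 = dcx**2 + dcy**2 + dw**2 + dh**2   # squared Wasserstein distance
    return torch.exp(-torch.sqrt(w2) / C)  # similarity in (0, 1]
```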

HIC-YOLO [18]

HIC-YOLO was proposed in [18]: a YOLOv5-based architecture with an added P2 detection head, enabling detection on higher-resolution, finer-detailed feature maps. Additionally, an involution block (Channel Feature Fusion with Involution, CFFI) was added at the beginning of the neck to improve the performance of the Path Aggregation Network (PANet). PANet plays a key role in feature fusion when generating feature maps and has been shown to enhance spatial localisation by passing signals back up the network, fusing finer details from shallow layers; this module was therefore added to improve feature extraction for small objects. Finally, a Convolutional Block Attention Module (CBAM) was added, though unlike in other works it was placed in the backbone instead of the neck, which had the benefit of a reduced parameter count. Consisting of a Channel Attention module and a Spatial Attention module, CBAM works during feature extraction by highlighting significant features along the channel and spatial axes while suppressing irrelevant ones [18].
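
For illustration, a minimal sketch of a standard CBAM block as described in the original CBAM paper; the reduction ratio and spatial kernel size are common defaults, not values confirmed in [18]:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))  # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))   # global max pooling
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # channel-wise average
        mx = x.amax(dim=1, keepdim=True)    # channel-wise max
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```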

LEAF-YOLO [20]

This was an adapted YOLOv7-T model [24] addressing the observation that adding a specialised detection head (such as P2) can fail to deliver substantial accuracy gains when feature extraction in the backbone is unrefined, leading to poor results in the crowded scenes and complex, noisy backgrounds typical of UAV imagery. LEAF-YOLO also uses a specialised small object detection head via the P2 layer, though unlike DEAL-YOLO it did not remove the P5 head. Max-Pooling combined with Ghost Convolutions (MGC blocks) was added to refine feature extraction in the backbone and increase efficiency. The main contributions were the addition of LEAF and LEAF-T blocks to the head and neck respectively: LEAF blocks create multiscale feature maps better suited to small object detection, while the LEAF-T block extracts and enhances these features before they are input to the detection head. Both were inspired by YOLOv7-T's ELAN block. Overall, the work was extensive, comparing several custom and baseline models against their own, showing that their 4.28M-parameter model outperforms other recent models on the VisDrone2019-DET validation set on the AP50 metric. Taking this work further, their model was deployed onto a GPU-based edge device.
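
To illustrate why Ghost Convolutions are cheap, below is a minimal sketch of a ghost convolution block; the 50/50 channel split, kernel sizes, and activation follow the original GhostNet design rather than LEAF-YOLO's exact MGC configuration:

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution: produce half of the output channels with a
    regular convolution, then derive the remaining "ghost" channels
    with a cheap depthwise operation. Assumes out_channels is even."""
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1):
        super().__init__()
        primary = out_channels // 2
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_channels, primary, kernel_size, stride,
                      kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary),
            nn.SiLU(inplace=True),
        )
        # Depthwise 5x5 conv generates the ghost features at low cost.
        self.cheap_op = nn.Sequential(
            nn.Conv2d(primary, primary, 5, 1, 2, groups=primary, bias=False),
            nn.BatchNorm2d(primary),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary_conv(x)
        return torch.cat([y, self.cheap_op(y)], dim=1)
```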

SWIN Integrated Models [5], [39]

The SWIN transformer model, proposed in [38], demonstrated the potential of transformer-based backbones in vision applications. Vision transformers extract features through a computationally expensive self-attention mechanism that captures global context and spatial relationships across each input image. In [38], a shifted window scheme was proposed in which self-attention is limited to non-overlapping windows that are shifted between successive blocks, creating cross-connections between windows so that global context can still be built up. The result is a model that captures global context and spatial relationships while handling multi-scale features, all without convolutional layers, and more efficiently than standard vision transformers. PV-SWIN [39] modified a YOLOv8s architecture to incorporate a SWIN stage late in the backbone, just before the SPPF module; as in [18], a CBAM block was added, but in the model's neck rather than the backbone. Overall, this architecture obtained an improvement of almost 5% in average precision, showing good results on partially occluded, small objects. However, latency was not reported, beyond a note that resource usage would need to be addressed.
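
To make the windowing idea concrete, here is a minimal sketch of the partition and shift operations at the core of shifted-window attention; the attention computation itself and the attention mask for shifted windows are omitted, and feature map dimensions are assumed divisible by the window size:

```python
import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows of
    shape (num_windows*B, window_size, window_size, C), so that
    self-attention can be computed within each window independently."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def shift_windows(x, shift):
    """Cyclically shift the feature map before re-partitioning, so that
    successive blocks attend across the previous block's window borders."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
```
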
STF-YOLO [5] also utilized a YOLOv8 architecture, but replaced the C3 convolutional blocks in the backbone with SWIN transformer blocks to enhance feature fusion. Several other components were added throughout the model, and collectively these modifications improved mAP50 by 4% vs. YOLOv8s. However, this came at a great expense in latency, with throughput recorded at 61 FPS vs. 277 FPS (roughly 16.4 ms vs. 3.6 ms per frame), suggesting that these models require further research before they are capable of real-time inference.

Summary

Overall, there is a wealth of research covering model architectures specialised for small object detection, typically in UAV imagery. However, a gap remains in the deployment of these models, which are often tested only on equipment similar to that used for training, or at best on a GPU-based edge device, which can consume energy at rates unsuited to integration with battery-powered drones. As such, this work will consider some of the above-mentioned architecture modifications, but deployed onto a neural processing unit (NPU) based device in addition to some CPU-optimised formats, with the goal of assessing the suitability of such models for longer-range search and rescue operations.
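
As a sketch of the kind of measurement this comparison requires, below is a rough CPU latency benchmark for an ONNX-exported detector; the model path, input resolution, and iteration counts are placeholders, and energy measurement on an NPU such as the Hailo-8 would require platform-specific tooling:

```python
import time
import numpy as np
import onnxruntime as ort

def benchmark(model_path, input_shape=(1, 3, 640, 640), warmup=10, runs=100):
    """Average single-image inference latency for an ONNX model on CPU."""
    session = ort.InferenceSession(model_path,
                                   providers=["CPUExecutionProvider"])
    name = session.get_inputs()[0].name
    x = np.random.rand(*input_shape).astype(np.float32)
    for _ in range(warmup):              # let threads and caches settle
        session.run(None, {name: x})
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, {name: x})
    latency_ms = (time.perf_counter() - start) / runs * 1000
    print(f"{latency_ms:.1f} ms/frame ({1000 / latency_ms:.1f} FPS)")

benchmark("yolo_model.onnx")  # hypothetical exported model path
```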