Critical Appraisal - trap-fish/uav-human-detection GitHub Wiki

System Design

Hardware Selection

A Raspberry Pi was selected for this project due to its accessibility, low cost and large online support community. The Hailo-8L NPU was selected for its simple integration with the Raspberry Pi. In a production setting, it would be possible to deploy this system as designed here; however, other devices, such as FPGAs or other System on Chip development boards, could work with lower power consumption. Such devices are therefore mentioned in future work, though the overall system design made sense at the time and, in hindsight, worked quite well.

Software Selection

A large fraction of the model training was performed on the Ultralytics platform, partly because its ease of use allowed for rapid prototyping of models. In hindsight, however, this ecosystem was quite restrictive and did not fit well with other frameworks or stand-alone models. For example, to compare models trained in Ultralytics against those trained with the RT-DETR repository, the Ultralytics YOLO model evaluation JSON results needed to be manually edited to work with Faster COCO Eval. Furthermore, some bugs were found in the code base during the experimental phase, the most critical of which was in how mean Average Precision was calculated, which resulted in inflated metrics. Overall, the experience with Ultralytics was positive. A key benefit was the YAML configuration style for defining models, which many other frameworks and researchers could adopt. That said, if I were to start over, I would seek to decrease the reliance on this framework.
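The kind of JSON editing described above can be sketched in a few lines. This is only an illustration: the source does not list which fields had to be changed, so the sketch assumes the predictions identify images by file-name stem and use zero-based class indices, while the ground truth uses numeric COCO image ids. The function name `align_predictions` and the `category_map` argument are hypothetical.

```python
from pathlib import Path


def align_predictions(predictions, gt_images, category_map):
    """Rewrite detection records so their ids match a COCO ground-truth file.

    predictions:  list of dicts with "image_id", "category_id", "bbox", "score"
    gt_images:    the "images" list from the COCO ground-truth annotations
    category_map: hypothetical mapping from predicted class index to COCO category id
    """
    # Map file-name stem -> numeric image id used by the ground truth.
    stem_to_id = {Path(img["file_name"]).stem: img["id"] for img in gt_images}

    aligned = []
    for det in predictions:
        stem = str(det["image_id"])
        if stem not in stem_to_id:
            continue  # skip detections for images absent from the ground truth
        aligned.append({
            "image_id": stem_to_id[stem],
            "category_id": category_map[det["category_id"]],
            "bbox": det["bbox"],  # assumed to already be [x, y, width, height]
            "score": det["score"],
        })
    return aligned
```

The aligned list could then be written back out with `json.dump` and passed to a COCO-style evaluator alongside the ground-truth annotations.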

The Hailo Platform proved to be very user friendly in the end, with relatively good documentation, though with some gaps when working with customised models. That said, the software suite had a steep learning curve: beyond the basics, it took the best part of a month to get into a rhythm in which a model could be trained and compiled to work on the Hailo-8L. As such, it is recommended to start using this software at the earliest possible stage of a project, whereas I only started after some initial models had been selected through experimentation.

Limitations of this solution

Low Light and Noisy/Crowded Scenes

The proposed solution demonstrates that a lightweight and efficient device can detect humans when mounted on a UAV. However, the models tended to perform worse in low-light scenes, and in daylight the accuracy decreased as image backgrounds grew more complex. The decision to combine the pedestrian and person classes led to some parked bikes in image backgrounds appearing as false positives, since a distant motorbike looks similar whether or not a rider is on it. Additionally, occlusions present a challenge, not just here but in most other works, with several partially occluded objects being missed entirely. This would be highly disadvantageous in settings such as forested search regions. Overall, many of the models were impacted by the issues highlighted in the LEAF-YOLO paper [20], in which the detection heads for small objects cannot capitalise on feature maps generated for complex scenes without further feature extraction refinement. As such, this should be a focus for future work.

Focus on YOLO Series

This work explored some avenues not previously investigated, namely deployment of an aerial object detection model on CPU (using OpenVINO) alongside a counterpart model for NPU deployment on the same device. Even so, focusing only on YOLO-based models was a limiting factor. RT-DETRv2 was trained, but this model had to be abandoned despite promising results, due to excessive GPU consumption, long training times and the prospect of an overly complex compilation for the Hailo platform had it reached that stage. In hindsight, compiling custom models for the Hailo platform was more complicated than expected, particularly when first getting accustomed to the tool. If things were to be done differently, I would have started work on the Hailo compilation much earlier, and not only after acceptable training results were obtained.

Power Monitoring

Power consumption on UAVs is highly important and a limiting factor in whether these systems can be deployed in SAR operations. The power consumption was measured using a USB monitor, which served as a rough estimate of consumption during inference. However, more robust methods should be used for this, such as an INA231 monitor. Additionally, power would need to be recorded for the entire system, including camera video capture and telecommunications. It should also be recorded over a longer period of inference, to account for performance at higher temperatures.
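As a rough illustration of how logged readings from such a monitor could be turned into an energy figure for a longer run, the sketch below integrates timestamped power samples with the trapezoidal rule. The function names and the sample format (seconds and watts) are assumptions made for this example and are not taken from the project code.

```python
def energy_wh(timestamps_s, power_w):
    """Integrate power samples (watts) over time (seconds) into watt-hours."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    joules = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Trapezoidal rule: mean of adjacent samples times the interval length.
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return joules / 3600.0  # 1 Wh = 3600 J


def mean_power_w(timestamps_s, power_w):
    """Average power over the whole logging window, in watts."""
    duration_s = timestamps_s[-1] - timestamps_s[0]
    if duration_s <= 0:
        raise ValueError("logging window must have positive duration")
    return energy_wh(timestamps_s, power_w) * 3600.0 / duration_s
```

For example, a constant 5 W draw logged over one hour gives `energy_wh([0, 1800, 3600], [5, 5, 5])` equal to 5 Wh. Logging the whole system (camera, communications and accelerator) through the same function would give a figure that can be compared directly against battery capacity.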

Average Precision Metric

Average Precision (AP) at 50% IoU is a typical metric used in object detection, as is the stricter AP at 50-95% IoU, which averages AP over all IoU thresholds from 50% to 95% in increments of 5%. For a given threshold, AP is defined as the area under the precision-recall (PR) curve. However, in search and rescue, while both precision and recall are important, recall should hold more weight. A less precise model will misclassify non-human objects as humans more often, but in SAR applications this is an inconvenience or, at worst, a waste of time. If recall is poor, there is a higher chance that a human in the frame goes undetected, and such a missed detection could cost lives. Therefore, the reliance on AP50 or AP50-95 for this task should be reconsidered.
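To make the distinction concrete, the sketch below computes AP at a single IoU threshold as the area under the PR curve, next to an F-beta score with beta = 2, which weights recall more heavily than precision. Two assumptions should be noted: the AP here uses all-point interpolation, whereas COCO-style evaluators sample the curve at 101 recall points, and F-beta is offered only as one possible recall-weighted alternative, since the text above does not name a replacement metric. Detections are assumed to have already been matched to ground truth at the chosen IoU threshold.

```python
def average_precision(matches, num_gt):
    """Area under the PR curve at one IoU threshold.

    matches: list of (confidence, is_true_positive) pairs, one per detection
    num_gt:  number of ground-truth objects
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)

    # Make precision monotonically non-increasing (the usual PR envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 2, a model with precision 0.5 and recall 1.0 scores higher than one with precision 1.0 and recall 0.5, which matches the priority argued for above.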

Future Work

RT-DETR Series

This model performed exceptionally well for an 'out-of-the-box' model, with only fine-tuning on VisDrone-Humans required to achieve 56% AP50 for the smaller RTDETR-R18 version.

With some architecture adaptations that would better suit this model to small object detection, it could be much more promising than the YOLO series models. One obstacle is that the Hailo platform involves a relatively complex process to get custom models running, even when there is some support or documentation. As of May 2025, RT-DETR does not appear to have ever been deployed on a Hailo device (at least not in published work), nor do Hailo intend to support it. This is, however, something I would like to investigate further.

Future work should pick this up with other transformer-based models such as Vision Transformers (ViTs), which also have support in Hailo's Model Zoo.

Thermal Imagery

Likewise, as mentioned in the methodology, very high results were observed using IR-thermal imagery. This is certainly a promising area that could be explored in more detail. One concern with the results observed was the similarity between many frames across the datasets and the lack of noisy or crowded scenes. A more challenging IR-thermal dataset could confirm the advantages of using these specialised cameras.

Alternative Devices

The Hailo NPU is an impressive piece of hardware, as demonstrated by the latency results shown in the report. However, other devices such as FPGAs can allow more commonly used neural network model formats, such as NCNN, ONNX and now also OpenVINO (if Intel based), to be deployed. This would remove a lot of the manual effort required in compiling a custom model to work with a Hailo device. Not only would it be worth considering deploying onto other devices such as FPGAs or ASICs but other