Keypoint Detection Research

Common Use Cases of Keypoint Detection

Keypoint detection is a machine learning task that aims to detect specific points on an object within an image or video. The most common use cases for keypoint detection are pose detection and face landmarks detection.

  • Pose detection involves detecting human poses and is used to track movements.
  • Face landmarks detection involves locating facial keypoints such as the eyes, nose, and mouth.

Models

Developing a model from scratch requires a tremendous amount of data and computational power. Therefore, we chose to explore existing models and tune them for our use case. Here are some examples of models for keypoint detection. We compared these different models according to the following parameters:

  • AP50 COCO is the average precision of the model on the COCO dataset with an Object Keypoint Similarity (OKS) threshold of 0.5: a predicted keypoint is considered correct when its OKS with the ground truth is higher than 0.5, and the average precision is then computed.
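For reference, the COCO evaluation protocol defines OKS roughly as follows, where d_i is the distance between predicted and ground-truth keypoint i, s is the object scale, k_i is a per-keypoint constant, and v_i is the ground-truth visibility flag:

```math
\mathrm{OKS} = \frac{\sum_i \exp\!\left(-\frac{d_i^2}{2 s^2 k_i^2}\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}
```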
| Model | AP50 COCO | Scientific Paper | Ease of Training | Dataset Format | Documentation or Available Tutorials | Other Resources | Comments |
|---|---|---|---|---|---|---|---|
| YOLOv8 | 91.2 | // | Very easy | YOLO | Very complete | https://docs.ultralytics.com/tasks/pose/ | |
| PyTorch RCNN | 87.3 | https://arxiv.org/pdf/1703.06870.pdf | Easy | YOLO-derived | Very complete | https://medium.com/@alexppppp/how-to-train-a-custom-keypoint-detection-model-with-pytorch-d9af90e111da | |
| RSN | 94.4 | https://arxiv.org/pdf/2003.04030.pdf | Average | COCO | Average | https://github.com/caiyuanhao1998/RSN | No pretrained model available |
| HRNet | 90.8 | https://arxiv.org/pdf/1902.09212.pdf | Average | COCO | Average | https://github.com/leoxiaobin/deep-high-resolution-net.pytorch | GPU needed even for inference |
| Simple Base+ | 89.6 | https://arxiv.org/pdf/1804.06208.pdf | Average | COCO | Average | https://github.com/Microsoft/human-pose-estimation.pytorch | GPU needed even for inference |
| MSPN | 91.8 | https://arxiv.org/pdf/1901.00148.pdf | Average | COCO | Average | https://github.com/megvii-research/MSPN | No pretrained model available |
| Poseur | 91.6 | https://arxiv.org/pdf/2201.07412.pdf | Average | COCO | Quite complete | https://github.com/aim-uofa/poseur | |

After considering various models and conducting several inference trials, we have selected YOLOv8 as the model for our use case.

A Deep Dive into YOLOv8

How Does It Work?

YOLOv8, or "You Only Look Once V8," is a computer vision algorithm used for object detection in images and videos. The name "You Only Look Once" signifies that the algorithm processes the entire image in a single forward pass, rather than dividing it into smaller regions or windows. This makes it very fast, and therefore a popular choice for detection on live video feeds from, for example, autonomous vehicles or security cameras.

YOLO offers various tasks such as:

  • Object detection
  • Object segmentation
  • Object classification
  • Object pose detection

For our use case, which involves keypoint detection on weapons, the most relevant task is the last one. The Ultralytics library provides access to YOLO models and methods for training them, evaluating them, and even preparing datasets, which is very convenient; see the Ultralytics documentation (https://docs.ultralytics.com/) for more information. The pose detection model predicts a bounding box for the object and the keypoints associated with it. It also performs classification if it is a multiclass problem.
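As a minimal sketch of what this looks like in practice (assuming the `ultralytics` package is installed; `weapon.jpg` is a placeholder image path), loading a pretrained pose model and running inference takes only a few lines:

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 pose model (nano variant).
model = YOLO("yolov8n-pose.pt")

# Run inference on a single image ("weapon.jpg" is a placeholder path).
results = model("weapon.jpg")

for result in results:
    print(result.boxes.xyxy)    # bounding box per detected object
    print(result.keypoints.xy)  # (x, y) coordinates of each keypoint
```

Note that the pretrained pose weights target human poses; for weapons they only serve as a starting point for fine-tuning on a custom dataset.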

How to Train a YOLOv8 Pose Detection Model on a Custom Dataset?

Given our need for speed, we will work with the nano and small model sizes.

Dataset Preparation

To train the model on a custom dataset, the dataset must adhere to a specific structure and format.

Each image in the dataset has a corresponding label file with the following format:

  1. Filename matching that of the image
  2. Class ID (only one class in our use case)
  3. Bounding box coordinates (Xcenter, Ycenter, Width, Height)
  4. Keypoint coordinates (X, Y)
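As an illustrative sketch, a label file for an image containing one object with three keypoints would hold a single line: the class ID, then the normalized bounding box (Xcenter, Ycenter, Width, Height), then the normalized (X, Y) pair of each keypoint. All values below are made up:

```
0 0.512 0.430 0.620 0.215 0.301 0.415 0.498 0.402 0.743 0.455
```

The number of keypoints per object is declared in the dataset's YAML configuration (the `kpt_shape` field in the Ultralytics format).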

Data Augmentation

With YOLO, you can apply data augmentation transformations to the training dataset. The available transformations include:

  • Color transformations such as hue, saturation, and value (HSV) shifts.
  • Geometric transformations such as rotation, translation, zoom, shearing, and flipping.

YOLO also offers specific transformations such as Mosaic and Mix-up; a sketch of how to enable these augmentations is shown below.
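A minimal sketch of enabling these augmentations through the Ultralytics training arguments (the values are illustrative, not recommendations; `data.yaml` is a placeholder dataset config):

```python
from ultralytics import YOLO

model = YOLO("yolov8n-pose.pt")

# Augmentation settings are passed as training arguments.
model.train(
    data="data.yaml",  # placeholder dataset config
    hsv_h=0.015,       # hue shift
    hsv_s=0.7,         # saturation shift
    hsv_v=0.4,         # value (brightness) shift
    degrees=10.0,      # rotation
    translate=0.1,     # translation
    scale=0.5,         # zoom / scaling
    shear=2.0,         # shearing
    fliplr=0.5,        # horizontal flip probability
    mosaic=1.0,        # Mosaic augmentation probability
    mixup=0.1,         # Mix-up augmentation probability
)
```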

Model Size

For each task, there are different model sizes, named after T-shirt sizes. For pose detection, for instance, there are:

  • YOLOv8-pose Nano, Small, Medium, Large, and Extra Large.

The larger the model, the more computational resources it requires for training and the longer its inference time, but its precision also increases, as shown in the model comparison tables of the Ultralytics documentation.

Loss Functions and Metrics

The YOLOv8 model calculates different losses and performs a weighted sum to obtain a global loss used for backpropagation. The different losses include:

  • Box_Loss: used for bounding box detection, based on IoU
  • DFL_Loss: Distribution Focal Loss, used for the bounding box (not mandatory)
  • Cls_Loss: classification loss, based on BCE
  • Pose_Loss: keypoint detection loss, usually based on OKS
  • Kobj_Loss: keypoint objectness loss, used when keypoints are predicted with a confidence score

The weights of the different losses can be adjusted depending on each use case, as in the sketch below.
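A sketch of tuning these weights through the Ultralytics training arguments (the values shown are illustrative; `data.yaml` is a placeholder dataset config):

```python
from ultralytics import YOLO

model = YOLO("yolov8n-pose.pt")

# Relative weight of each loss term in the global weighted sum.
model.train(
    data="data.yaml",  # placeholder dataset config
    box=7.5,           # Box_Loss weight
    cls=0.5,           # Cls_Loss weight
    dfl=1.5,           # DFL_Loss weight
    pose=12.0,         # Pose_Loss weight
    kobj=1.0,          # Kobj_Loss weight
)
```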

Hyperparameters

For the training, you can tune YOLOv8 hyperparameters such as:

  • Optimizer
  • Batch size
  • Number of epochs
  • Image size

Other parameters can be found in the Ultralytics documentation (https://docs.ultralytics.com/modes/train/).
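A minimal training sketch combining these hyperparameters (all values are illustrative; `data.yaml` is again a placeholder dataset config):

```python
from ultralytics import YOLO

model = YOLO("yolov8s-pose.pt")  # small variant, per our speed requirement

model.train(
    data="data.yaml",   # placeholder dataset config
    optimizer="AdamW",  # optimizer
    batch=16,           # batch size
    epochs=100,         # number of epochs
    imgsz=640,          # training image size
)
```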

Model Outputs

The training outputs of the YOLO library are very convenient. They include:

  • Training curves in various formats: CSV, PNG, TensorBoard
  • A YAML file containing all the hyperparameters used for the training
  • Weights of the trained model: the best weights on the validation dataset and the weights from the last epoch
  • Predicted images of the validation dataset
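By default, these outputs are written under a run directory such as `runs/pose/train/`. A sketch of reloading the best checkpoint afterwards (the exact path depends on your run settings):

```python
from ultralytics import YOLO

# Reload the best checkpoint saved during training
# (the run directory name depends on your project/name settings).
model = YOLO("runs/pose/train/weights/best.pt")

# Re-evaluate on the validation split declared in the dataset YAML.
metrics = model.val()
print(metrics.pose.map50)  # keypoint AP50 (attribute name per the current Ultralytics API)
```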

For the training of YOLOv8, you can check out this page.