Animal Pose Estimation Using Vision Transformer in Computer Vision - 180D-FW-2023/Knowledge-Base-Wiki GitHub Wiki

This Wiki page provides an overview of using vision transformers for animal pose estimation, including the fundamental principles, existing tools and resources, a discussion of advantages and challenges, and sample code showing how to develop fine-tuned pose estimation for animals.

Overview of Animal Pose Estimation

Pose estimation is a computer vision technique that detects and identifies the spatial position and pose of objects. It works by locating body joints or body parts in a captured image frame. Unlike object recognition, it produces a skeleton of keypoints with precise coordinates rather than a bounding box. The main types of pose estimation include human pose estimation and object pose estimation; human pose estimation in particular is used in many virtual reality and human-computer interaction technologies.
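
To make the distinction concrete, here is a minimal sketch of the difference between a detector's bounding-box output and a pose estimator's per-joint keypoint output; the class and field names are purely illustrative and not taken from any particular library:

from dataclasses import dataclass
from typing import List

@dataclass
class BoundingBox:          # typical object-detection output: one box per object
    x: float
    y: float
    width: float
    height: float

@dataclass
class Keypoint:             # typical pose-estimation output: one entry per body joint
    name: str               # e.g. "left_knee"
    x: float
    y: float
    confidence: float

# The same detected dog as a detector vs. a pose estimator would describe it
dog_box = BoundingBox(x=120, y=80, width=200, height=150)
dog_pose: List[Keypoint] = [
    Keypoint("nose", 150.0, 95.0, 0.97),
    Keypoint("left_front_paw", 135.0, 210.0, 0.88),
    # ... one keypoint per joint in the skeleton
]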

Animal pose estimation is an emerging area derived from human pose estimation. It poses challenges to researchers because there is a wide variety of animal types, each with its own distinct body size and skeleton. Nevertheless, developing this technique for animals benefits humans: it aids agriculture, animal research, and animal health by monitoring and analyzing animal behavior. Machine learning researchers have been actively advancing it by building animal pose datasets and fine-tuning models tailored to specific animals.

Fundamental Technology Behind Pose Recognition: Transformer and Vision Transformer (ViT)

One of the main technologies behind pose recognition is built on top of the idea of a transformer.

The transformer is a deep learning architecture originally developed for natural language processing (NLP); some of its widely known applications include the language models ChatGPT and BERT. It is often trained in a semi-supervised fashion (large-scale pre-training followed by task-specific fine-tuning), and it works by transforming one sequence into another. The model takes a sequence of tokens as input and processes it with a stack of encoder layers, each passing its refined representation to the next, after which a decoder generates the output sequence. Notably, because the attention mechanism draws global dependencies between input and output rather than processing tokens one by one, the model can handle all positions of a sequence in parallel, which also improves scalability.
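
As a rough illustration of the attention mechanism described above, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and variable names are illustrative assumptions rather than any specific transformer implementation:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d_k) query/key/value matrices
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # compare every token with every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                         # weighted sum of values: global dependencies in one step

# Three tokens with four-dimensional embeddings; all pairs are compared in parallel.
tokens = np.random.rand(3, 4)
output = scaled_dot_product_attention(tokens, tokens, tokens)
print(output.shape)   # (3, 4)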

Transformers are now also widely used in computer vision. Specifically, the Vision Transformer (ViT) underlies most recent pose estimation techniques. For image processing, instead of splitting text into tokens as language models do, ViT splits an image into fixed-size patches and feeds the resulting sequence of vectors to a standard transformer encoder. It also prepends a "classification token" whose output is used for any classification task. The ViT pipeline consists of the following steps (steps 1-4 are sketched in code after the list):

  1. Split the image into patches
  2. Flatten the patches
  3. Create lower-dimensional embeddings from the patches
  4. Feed the embeddings to a transformer encoder
  5. Pre-train the model with image labels (supervised) on a large dataset
  6. Fine-tune on the downstream dataset for classification
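
To make steps 1-4 concrete, the following is a minimal PyTorch sketch of ViT-style patch embedding followed by a standard transformer encoder; the patch size, embedding dimension, and the use of a strided convolution to cut patches are illustrative choices, not a specific library's implementation:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Steps 1-3: split the image into patches, flatten them, and project to embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting non-overlapping patches
        # and applying one linear projection to each flattened patch.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))               # "classification token"
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (batch, 3, 224, 224)
        x = self.proj(x)                        # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (batch, 196, embed_dim) - one vector per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend the classification token
        return x + self.pos_embed               # add positional information

# Step 4: feed the patch embeddings to a standard transformer encoder.
embed = PatchEmbedding()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)
tokens = embed(torch.randn(1, 3, 224, 224))     # (1, 197, 768)
features = encoder(tokens)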

The self-attention mechanism of the transformer allows ViT to focus on particular regions of the image depending on the task. Thanks to this mechanism, ViT can match or exceed the accuracy of CNNs, the traditional approach to image processing, while requiring roughly four times less training compute when trained on a large dataset. One caveat is that ViT depends more heavily on regularization and on large amounts of training data than CNNs do. Fortunately, pose estimation has readily available datasets that support training ViT models robustly.

Datasets

To train a model, we first need a dataset. The Common Objects in Context (COCO) dataset serves as the benchmark for developing human pose estimation models. COCO is a large-scale dataset that contains 330 thousand images and 1.5 million object instances, providing a substantial amount of annotated human posture data, and it is widely used for human pose estimation. For animal pose estimation, AP-10K is typically used. It consists of 10,015 images covering 23 animal families and 60 species and was built as a benchmark for animal pose estimation in the wild.
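
Both COCO and AP-10K distribute keypoint annotations in the COCO JSON format, so they can be explored with pycocotools. Below is a minimal sketch; the annotation file name is a placeholder and should point to wherever the dataset archive is unpacked:

from pycocotools.coco import COCO

ann_file = "ap10k-train-split1.json"     # placeholder path to an AP-10K (COCO-style) annotation file
coco = COCO(ann_file)

img_ids = coco.getImgIds()
print(f"{len(img_ids)} images in this split")

# Inspect the keypoint annotations of the first image
ann_ids = coco.getAnnIds(imgIds=img_ids[:1])
for ann in coco.loadAnns(ann_ids):
    category = coco.loadCats(ann["category_id"])[0]
    # keypoints are stored as a flat [x1, y1, v1, x2, y2, v2, ...] list,
    # where v is a visibility flag (0 = not labeled, 1 = occluded, 2 = visible)
    kpts = ann["keypoints"]
    print(category["name"], "->", ann["num_keypoints"], "labeled keypoints")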

Utilization of Technology in Pose Estimation

With a better understanding of the fundamental technologies, now it is easier to connect the dots and look at the applications of those technologies in animal pose estimation.

Human Pose Recognition

For human pose recognition, common models and tools include YOLOv7 Pose and MediaPipe Pose, both of which are trained on the COCO dataset. YOLOv7 Pose is built on the seventh version of the You Only Look Once model and is a single-stage, multi-person pose estimation model. MediaPipe Pose is a single-person recognition framework; compared to YOLOv7 Pose, it is better at detecting poses of far-away subjects. Sample code for both is attached at the end of this page.

Particularly noteworthy is one of the newest models: ViTPose. ViTPose was developed to maximize the key benefit of the transformer model, the attention mechanism, whose parallelism allows the model to scale up to one billion parameters. It also uses a plain vision transformer, which differs from predecessors that still depend on CNNs for feature extraction. ViTPose extends the capacity of plain vision transformers and incorporates a streamlined decoder that balances inference speed and performance: decoder layers added after the transformer backbone output an estimated heatmap for each keypoint.
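
Heatmap-based models such as ViTPose predict one low-resolution heatmap per keypoint, and the keypoint coordinates are recovered from the peak of each map. The following is a minimal sketch of that decoding step; the heatmap and input sizes are illustrative assumptions:

import numpy as np

def decode_heatmaps(heatmaps, input_size):
    """heatmaps: (num_keypoints, H, W) array of per-joint score maps; input_size: (width, height)."""
    num_keypoints, h, w = heatmaps.shape
    keypoints = []
    for k in range(num_keypoints):
        flat_idx = heatmaps[k].argmax()                # location of the strongest response
        y, x = divmod(flat_idx, w)
        score = heatmaps[k, y, x]
        # rescale from heatmap resolution back to the input image resolution
        keypoints.append((x * input_size[0] / w, y * input_size[1] / h, float(score)))
    return keypoints

# e.g. 17 keypoints predicted on a 64x48 heatmap for a 256x192 input crop
dummy = np.random.rand(17, 64, 48)
print(decode_heatmaps(dummy, input_size=(192, 256))[:3])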

Animal Pose Recognition

Most animal pose estimation methods are developed on top of human pose estimation. Possible approaches include ViTPose combined with YOLOv8, and the Animal Body Pose API in Apple's Vision framework.

YOLOv8 is an updated version of YOLO; it can act as the detector in front of ViTPose, which is then fine-tuned on an animal-specific dataset such as the Stanford Dogs Dataset to enable animal pose estimation (a sketch of this pipeline follows below). The Animal Body Pose API in Apple's Vision framework is supported in iOS 17 and macOS Sonoma. Apple Vision identifies 19 body joints for humans and 25 for animals. It is a complete API that can identify animal types and animal poses, and it also identifies groups of body parts.
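
Below is a minimal sketch of such a two-stage pipeline: a pretrained YOLOv8 model from the ultralytics package detects the animal, and the cropped detection is handed to a pose model. The pose step is only a placeholder function here, since a fine-tuned ViTPose checkpoint would have to be loaded separately, and the image path is hypothetical:

import cv2
from ultralytics import YOLO      # pip install ultralytics

detector = YOLO("yolov8n.pt")     # pretrained YOLOv8 detector (COCO classes include "dog", "cat", etc.)

def estimate_animal_pose(crop):
    # Placeholder for a pose model (e.g. a ViTPose checkpoint fine-tuned on AP-10K
    # or the Stanford Dogs Dataset); it would return a list of (x, y, score) keypoints.
    raise NotImplementedError

frame = cv2.imread("dog.jpg")                      # placeholder input image
results = detector(frame)[0]                       # run detection on one image
for box in results.boxes:
    cls_name = detector.names[int(box.cls)]
    if cls_name not in {"dog", "cat", "horse", "sheep", "cow"}:
        continue                                   # keep only animal detections
    x1, y1, x2, y2 = map(int, box.xyxy[0])         # bounding box in pixel coordinates
    crop = frame[y1:y2, x1:x2]
    keypoints = estimate_animal_pose(crop)         # second stage: pose on the cropped animal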

Application of Animal Pose Estimation

  • Agriculture: observing livestock behavior and monitoring health
  • Engineering and Technology: capturing animal movement as inspiration for biomimetic design
  • Pet Mental Health: pet owners gain a better understanding of their pets' behavior and mental state, and can build better relationships with them

Future Directions and Challenges

As we can see, the use of computer vision, and specifically the vision transformer, in animal pose estimation provides scalability, flexibility, and transferability. To further maximize the benefits of this technology, the results of pose estimation can be combined with object recognition so that a system both identifies the animal and recognizes its pose. However, current research on animal pose estimation is still limited and at an early stage, and the rapid pace of deep learning development poses a challenge: the field must advance at a similar rate to keep up.

Conclusion

This Wiki page explains the fundamentals of animal pose estimation along with its uses and applications. The technology allows us to understand the movement of animals for agricultural or healthcare purposes, so it is crucial to put more research and resources into this field.

Sample Code

ViTPose with PyTorch: https://github.com/ViTAE-Transformer/ViTPose

YOLOv7

git clone https://github.com/RizwanMunawar/yolov7-pose-estimation.git
cd yolov7-pose-estimation
pip install -r requirements.txt

#download the YOLOv7 pose weights into the repository folder (see the repository README) before running
python pose-estimate.py

#if you want to change source file
python pose-estimate.py --source "your custom video.mp4"

#For CPU
python pose-estimate.py --source "your custom video.mp4" --device cpu

#For GPU
python pose-estimate.py --source "your custom video.mp4" --device 0

#For View-Image
python pose-estimate.py --source "your custom video.mp4" --device 0 --view-img

#For LiveStream (IP stream URL format, e.g. "rtsp://username:pass@ipaddress:portno/video/video.amp")
python pose-estimate.py --source "your IP Camera Stream URL" --device 0 --view-img

#For WebCam
python pose-estimate.py --source 0 --view-img

#For External Camera
python pose-estimate.py --source 1 --view-img

MediaPipe (object detection example on a cat-and-dog test image)

!pip install -q mediapipe==0.10.0

# Download the EfficientDet-Lite0 object detection model
!wget -q -O efficientdet.tflite https://storage.googleapis.com/mediapipe-models/object_detector/efficientdet_lite0/int8/1/efficientdet_lite0.tflite

# Download test image
!wget -q -O image.jpg https://storage.googleapis.com/mediapipe-tasks/object_detector/cat_and_dog.jpg

IMAGE_FILE = 'image.jpg'

import cv2
from google.colab.patches import cv2_imshow

img = cv2.imread(IMAGE_FILE)
cv2_imshow(img)

# Running inference and visualizing the results
# STEP 1: Import the necessary modules.
import numpy as np
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# STEP 2: Create an ObjectDetector object.
base_options = python.BaseOptions(model_asset_path='efficientdet.tflite')
options = vision.ObjectDetectorOptions(base_options=base_options,
                                       score_threshold=0.5)
detector = vision.ObjectDetector.create_from_options(options)

# STEP 3: Load the input image.
image = mp.Image.create_from_file(IMAGE_FILE)

# STEP 4: Detect objects in the input image.
detection_result = detector.detect(image)

# STEP 5: Process the detection result. In this case, draw the bounding boxes and labels.
image_copy = np.copy(image.numpy_view())
for detection in detection_result.detections:
    bbox = detection.bounding_box
    start_point = (int(bbox.origin_x), int(bbox.origin_y))
    end_point = (int(bbox.origin_x + bbox.width), int(bbox.origin_y + bbox.height))
    cv2.rectangle(image_copy, start_point, end_point, (0, 255, 0), 2)
    category = detection.categories[0]
    label = f"{category.category_name} ({category.score:.2f})"
    cv2.putText(image_copy, label, (int(bbox.origin_x), int(bbox.origin_y) - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

rgb_annotated_image = cv2.cvtColor(image_copy, cv2.COLOR_BGR2RGB)
cv2_imshow(rgb_annotated_image)

References

  1. Vaswani et al., "Attention Is All You Need," 2017.
  2. Yu et al., "AP-10K: A Benchmark for Animal Pose Estimation in the Wild," 2021.
  3. Xu et al., "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation," 2022.
  4. https://machinelearningmastery.com/the-transformer-model/
  5. https://supervisely.com/blog/animal-pose-estimation/
  6. https://supervisely.com/blog/vitpose-state-of-the-art-pose-estimation-model-in-supervisely/
  7. https://supervisely.com/blog/train-yolov8-on-custom-data-no-code/
  8. https://learnopencv.com/yolov7-pose-vs-mediapipe-in-human-pose-estimation/
  9. https://www.youtube.com/watch?v=kb03ufEkOdA
  10. https://developers.google.com/mediapipe/solutions/vision/pose_landmarker/python