Pose Recognition using OpenCV and MediaPipe
OpenCV is an open-source computer vision library, used here through its Python bindings. It provides tools for image and video processing such as object detection and tracking. For example, it can track a blue object, as was done in Lab 1.
MediaPipe is an open-source framework developed by Google which offers cross-platform, customizable Machine Learning solutions for live and streaming media. It offers a number of pre-trained models that can recognize face and body landmarks as well as track objects.
This tutorial presents how OpenCV and MediaPipe can be used together to recognize different poses made with the body and face.
Steps to Follow
- Setting up OpenCV and MediaPipe
- Collecting Data, Training and Evaluating the ML model
- Classify the poses
1. Setting up OpenCV and MediaPipe
First, we need to install the dependencies:
!pip install mediapipe opencv-python pandas scikit-learn
(For M1/M2: replace mediapipe with mediapipe-silicon)
The Python libraries pandas and scikit-learn will later be used to prepare the data and train the ML model.
import mediapipe as mp
import cv2
mp_drawing = mp.solutions.drawing_utils
mp_holistic = mp.solutions.holistic
MediaPipe’s mp_drawing utilities draw the landmarks on the video frame, while mp_holistic provides the landmark detections we will need to recognize the pose.
import csv
import os
import numpy as np
The csv module is used to read and write CSV files.
The os module is used to work with files and directories.
NumPy is used to work with arrays and matrices.
2. Collecting Data, Training and Evaluating the ML model
1. Collecting Data
# 'results' holds the MediaPipe Holistic detections; this assumes the capture loop
# below has been run at least once so that pose and face landmarks are available.
num_coords = len(results.pose_landmarks.landmark) + len(results.face_landmarks.landmark)

landmarks = ['class']
for val in range(1, num_coords+1):
    landmarks += ['x{}'.format(val), 'y{}'.format(val), 'z{}'.format(val), 'v{}'.format(val)]

with open('coords.csv', mode='w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(landmarks)
This block of code creates a new CSV file, “coords.csv”, which has 2005 columns:
- 1 column for the class name
- 501*4 = 2004 columns for the 501 landmarks of the body (pose) and face:
  - We multiply by 4 for the x, y, z coordinates and the visibility value v (a score between 0 and 1 indicating how likely it is that the landmark is visible in the frame).
“coords.csv” stores all the landmarks we collect and will be used as the input to train the ML model.
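As a quick sanity check, the 2005-column figure can be reproduced from the landmark counts (a minimal sketch, assuming the standard MediaPipe Holistic counts of 33 pose landmarks and 468 face landmarks):

```python
# Sanity check on the expected CSV layout (assumes 33 pose + 468 face landmarks).
POSE_LANDMARKS = 33
FACE_LANDMARKS = 468

num_coords = POSE_LANDMARKS + FACE_LANDMARKS   # 501 landmarks in total
num_columns = 1 + num_coords * 4               # 1 class column + (x, y, z, v) per landmark
print(num_columns)                             # 2005
```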
Repeat n times
n represents the number of classes/poses your model will be able to recognize.
class_name = "Wakanda Forever"
This represents the class you will be collecting data points for. (You may change “Wakanda Forever” to any class you’d like).
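If you would rather not edit and re-run this cell for every pose, one possible variation (a sketch, not part of the original tutorial) is to keep your class names in a list and collect them one after the other:

```python
# Hypothetical variation: collect data for each pose in turn instead of editing class_name by hand.
class_names = ["Happy", "Sad", "Wakanda Forever"]  # replace with your own poses

for class_name in class_names:
    print("Collecting data for '{}' - strike the pose and press 'q' when done".format(class_name))
    # ... run the capture loop below with this class_name ...
```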
cap = cv2.VideoCapture(0)
# Initiate holistic model
with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:

    while cap.isOpened():
        ret, frame = cap.read()

        # Recolor Feed
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False

        # Make Detections
        results = holistic.process(image)
        # print(results.face_landmarks)
        # face_landmarks, pose_landmarks, left_hand_landmarks, right_hand_landmarks

        # Recolor image back to BGR for rendering
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        # 1. Draw face landmarks
        mp_drawing.draw_landmarks(image, results.face_landmarks, mp_holistic.FACEMESH_TESSELATION,
                                  mp_drawing.DrawingSpec(color=(80,110,10), thickness=1, circle_radius=1),
                                  mp_drawing.DrawingSpec(color=(80,256,121), thickness=1, circle_radius=1)
                                  )

        # 2. Right hand
        mp_drawing.draw_landmarks(image, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS,
                                  mp_drawing.DrawingSpec(color=(80,22,10), thickness=2, circle_radius=4),
                                  mp_drawing.DrawingSpec(color=(80,44,121), thickness=2, circle_radius=2)
                                  )

        # 3. Left Hand
        mp_drawing.draw_landmarks(image, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS,
                                  mp_drawing.DrawingSpec(color=(121,22,76), thickness=2, circle_radius=4),
                                  mp_drawing.DrawingSpec(color=(121,44,250), thickness=2, circle_radius=2)
                                  )

        # 4. Pose Detections
        mp_drawing.draw_landmarks(image, results.pose_landmarks, mp_holistic.POSE_CONNECTIONS,
                                  mp_drawing.DrawingSpec(color=(245,117,66), thickness=2, circle_radius=4),
                                  mp_drawing.DrawingSpec(color=(245,66,230), thickness=2, circle_radius=2)
                                  )

        # Export coordinates
        try:
            # Extract Pose landmarks
            pose = results.pose_landmarks.landmark
            pose_row = list(np.array([[landmark.x, landmark.y, landmark.z, landmark.visibility] for landmark in pose]).flatten())

            # Extract Face landmarks
            face = results.face_landmarks.landmark
            face_row = list(np.array([[landmark.x, landmark.y, landmark.z, landmark.visibility] for landmark in face]).flatten())

            # Concatenate rows
            row = pose_row + face_row

            # Append class name
            row.insert(0, class_name)

            # Export to CSV
            with open('coords.csv', mode='a', newline='') as f:
                csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                csv_writer.writerow(row)

        except:
            pass

        cv2.imshow('Raw Webcam Feed', image)

        if cv2.waitKey(10) & 0xFF == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()
In the block of code above, we capture frames from our video feed. We set the detection and tracking confidence thresholds of our mp_holistic solution to 0.5, which means MediaPipe only treats a detection or tracking result as successful when it is at least 50% confident. We then use OpenCV to convert the colors of the frame from BGR (Blue, Green, Red) to RGB (Red, Green, Blue), since OpenCV works with BGR while MediaPipe expects RGB. We also set the image’s writeable flag to False so MediaPipe can process the frame more efficiently, run the Holistic model on it, and store the output in results. Once detection is done, we reverse those steps (make the image writeable again and convert it back to BGR) for rendering.
The next step is to draw the face, hand, and pose landmarks (here, “pose” refers to the main body landmarks such as the shoulders and hips). This is done using mp_drawing, which we defined earlier, and it lets us see the landmark mesh (tessellation) of our face and body overlaid on the live feed.
The final step is to extract those same landmarks, flatten them into one continuous row, and append them to our “coords.csv” file.
📝 In order to close the window, press “q”.
📝 The longer you keep the window open, the more data you will collect. Training will take longer, but the results will be more accurate. In my experience, about 15 seconds of capturing per pose gives acceptable results.
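Before moving on to training, it can be worth checking how many samples were collected for each class (a quick sketch using pandas, which is imported in the next step):

```python
import pandas as pd

# How many rows were collected for each pose/class?
df = pd.read_csv('coords.csv')
print(df['class'].value_counts())
```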
2. Training the ML model
import pandas as pd
from sklearn.model_selection import train_test_split
The train_test_split function is used to split the data into a training set and a testing set.
df = pd.read_csv('coords.csv')
X = df.drop('class', axis=1)
y = df['class']
Here, X contains all the features (the landmark coordinates) from “coords.csv” excluding the class column, while y contains the class labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
This line randomly splits X and y into a training set (70% of the rows) and a testing set (30%).
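A quick check that the split looks reasonable (a minimal sketch reusing the variables above):

```python
# Confirm the 70/30 split and that every class appears in both sets.
print(X_train.shape, X_test.shape)
print(y_train.value_counts())
print(y_test.value_counts())
```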
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
These are the classification model dependencies.
- The make_pipeline function chains preprocessing and a classifier into a single ML pipeline.
- The StandardScaler class standardizes the features (zero mean, unit variance).
- The last two lines import four different classification algorithms.
We will pick the best of the four algorithms at the end to classify the poses.
pipelines = {
    'lr': make_pipeline(StandardScaler(), LogisticRegression()),
    'rc': make_pipeline(StandardScaler(), RidgeClassifier()),
    'rf': make_pipeline(StandardScaler(), RandomForestClassifier()),
    'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier()),
}
Here, we create four different pipelines; another way to look at it is that we are creating four different machine learning models.
fit_models = {}
for algo, pipeline in pipelines.items():
    model = pipeline.fit(X_train, y_train)
    fit_models[algo] = model
We train our four different models with our training data.
3. Evaluating the ML model
from sklearn.metrics import accuracy_score
import pickle
The accuracy_score function is used to evaluate the accuracy of a classifier.
The pickle library is used to save ML models to disk.
for algo, model in fit_models.items():
    yhat = model.predict(X_test)
    print(algo, accuracy_score(y_test, yhat))
We loop over the four ML models and predict the classes of our testing set X_test. We then print the accuracy score of each model, a number in [0, 1] where 1 means the predictions are 100% accurate (y_test contains the actual classes, while yhat contains the predicted classes).
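Accuracy alone can hide per-class problems; for a closer look, scikit-learn’s classification report and confusion matrix are useful (a sketch, not part of the original tutorial, shown here for the 'rf' model):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall and a confusion matrix for one of the fitted models.
yhat = fit_models['rf'].predict(X_test)
print(classification_report(y_test, yhat))
print(confusion_matrix(y_test, yhat))
```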
with open('body_language.pkl', 'wb') as f:
    pickle.dump(fit_models['rf'], f)
We use the pickle library to save our best model (in my case rf, which had the highest accuracy score) to the file body_language.pkl. If one of your other models has a higher accuracy score, change 'rf' to 'lr', 'rc', or 'gb' accordingly.
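Instead of hard-coding the key, you could also pick the best model programmatically (a minimal sketch reusing the variables defined above):

```python
# Choose the model with the highest test accuracy and save that one instead.
best_algo = max(fit_models,
                key=lambda algo: accuracy_score(y_test, fit_models[algo].predict(X_test)))
print('Best model:', best_algo)

with open('body_language.pkl', 'wb') as f:
    pickle.dump(fit_models[best_algo], f)
```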
with open('body_language.pkl', 'rb') as f:
    model = pickle.load(f)
Although you don’t have to save the model if you are running all the code above in one go, it is good practice to do so: you don’t want to waste computing power retraining the model every time if the data hasn’t changed.
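One common pattern (a sketch using the os module imported earlier) is to only retrain when no saved model exists:

```python
# Load the saved model if it exists; otherwise fall back to (re)training and saving it.
if os.path.exists('body_language.pkl'):
    with open('body_language.pkl', 'rb') as f:
        model = pickle.load(f)
else:
    model = make_pipeline(StandardScaler(), RandomForestClassifier()).fit(X_train, y_train)
    with open('body_language.pkl', 'wb') as f:
        pickle.dump(model, f)
```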
3. Classify the poses
cap = cv2.VideoCapture(0)
# Initiate holistic model
# Note: the 'colors' list and 'prob_viz' function defined below must be run before this cell.
with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:

    while cap.isOpened():
        ret, frame = cap.read()

        # Recolor Feed
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False

        # Make Detections
        results = holistic.process(image)
        # face_landmarks, pose_landmarks, left_hand_landmarks, right_hand_landmarks

        # Recolor image back to BGR for rendering
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        # 1. Draw face landmarks
        mp_drawing.draw_landmarks(image, results.face_landmarks, mp_holistic.FACEMESH_TESSELATION,
                                  mp_drawing.DrawingSpec(color=(80,110,10), thickness=1, circle_radius=1),
                                  mp_drawing.DrawingSpec(color=(80,256,121), thickness=1, circle_radius=1)
                                  )

        # 2. Right hand
        mp_drawing.draw_landmarks(image, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS,
                                  mp_drawing.DrawingSpec(color=(80,22,10), thickness=2, circle_radius=4),
                                  mp_drawing.DrawingSpec(color=(80,44,121), thickness=2, circle_radius=2)
                                  )

        # 3. Left Hand
        mp_drawing.draw_landmarks(image, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS,
                                  mp_drawing.DrawingSpec(color=(121,22,76), thickness=2, circle_radius=4),
                                  mp_drawing.DrawingSpec(color=(121,44,250), thickness=2, circle_radius=2)
                                  )

        # 4. Pose Detections
        mp_drawing.draw_landmarks(image, results.pose_landmarks, mp_holistic.POSE_CONNECTIONS,
                                  mp_drawing.DrawingSpec(color=(245,117,66), thickness=2, circle_radius=4),
                                  mp_drawing.DrawingSpec(color=(245,66,230), thickness=2, circle_radius=2)
                                  )

        # Export coordinates
        try:
            # Extract Pose landmarks
            pose = results.pose_landmarks.landmark
            pose_row = list(np.array([[landmark.x, landmark.y, landmark.z, landmark.visibility] for landmark in pose]).flatten())

            # Extract Face landmarks
            face = results.face_landmarks.landmark
            face_row = list(np.array([[landmark.x, landmark.y, landmark.z, landmark.visibility] for landmark in face]).flatten())

            # Concatenate rows
            row = pose_row + face_row

            # Make Detections
            X = pd.DataFrame([row])
            body_language_class = model.predict(X)[0]
            body_language_prob = model.predict_proba(X)[0]
            print(body_language_class, body_language_prob)

            # Grab ear coords
            coords = tuple(np.multiply(
                            np.array(
                                (results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_EAR].x,
                                 results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_EAR].y))
                            , [640,480]).astype(int))

            cv2.rectangle(image,
                          (coords[0], coords[1]+5),
                          (coords[0]+len(body_language_class)*20, coords[1]-30),
                          colors[np.argmax(body_language_prob)], -1)
            cv2.putText(image, body_language_class, coords,
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)

            image = prob_viz(body_language_prob, ["Happy", "Sad", "Wakanda Forever"], image, colors)

        except:
            pass

        cv2.imshow('Raw Webcam Feed', image)

        if cv2.waitKey(10) & 0xFF == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()
The beginning of this code is the same as in the data-collection step.
The second # Make Detections section (after the landmark extraction) is the most important, as it classifies the pose:
- X = pd.DataFrame([row]) — the input row contains all of our coordinates (x, y, z, visibility)
- body_language_class = model.predict(X)[0] — predicts what class we have
- body_language_prob = model.predict_proba(X)[0] — predicts the probability of each class
- print(body_language_class, body_language_prob)
The # Grab ear coords section grabs the left ear’s coordinates and scales them to the frame size (hard-coded here as 640×480). We use those coordinates to display the detected class next to your ear.
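The 640×480 values assume the default webcam resolution; if your camera outputs a different size, you could scale by the actual frame dimensions instead (a small fragment meant to replace the coords computation inside the loop above):

```python
# Scale the normalized left-ear landmark to the actual frame size instead of assuming 640x480.
frame_height, frame_width = image.shape[:2]
left_ear = results.pose_landmarks.landmark[mp_holistic.PoseLandmark.LEFT_EAR]
coords = (int(left_ear.x * frame_width), int(left_ear.y * frame_height))
```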
The prob_viz function is described below and represents another fun way to display the classes!
colors = [(245,117,16), (117,245,16), (16,117,245)]

def prob_viz(res, actions, input_frame, colors):
    output_frame = input_frame.copy()
    for num, prob in enumerate(res):
        cv2.rectangle(output_frame, (0, 60+num*40), (int(prob*300), 90+num*40), colors[num], -1)
        cv2.putText(output_frame, actions[num] + ' ' + str(res[num]), (0, 85+num*40), cv2.FONT_HERSHEY_SIMPLEX, 1, (255,255,255), 2, cv2.LINE_AA)
    return output_frame
The function outputs all three classes (in my case “Happy”, “Sad”, and “Wakanda Forever”) to the video feed with their respective probabilities in an interactive way (the rectangle widens and shrinks depending on the probability of each class). Feel free to modify the colors :)
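If the displayed label flickers between classes, one simple tweak (not part of the original tutorial) is to only draw the label when the model is reasonably confident; this fragment would replace the unconditional cv2.putText call inside the try block above:

```python
# Only show the predicted class when the top probability clears a (tunable, assumed) threshold.
CONFIDENCE_THRESHOLD = 0.7

if body_language_prob[np.argmax(body_language_prob)] > CONFIDENCE_THRESHOLD:
    cv2.putText(image, body_language_class, coords,
                cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)
```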
This is the final result:
References:
AI Body Language Decoder with MediaPipe and Python in 90 Minutes by Nicholas Renotte