An Introduction to Image Processing, Gesture Recognition, and Hand Tracking
This wiki article aims to be a good starting point for those interested in using computer vision and image processing to recognize gestures and track hands. In particular, it explores these topics through the OpenCV and MediaPipe libraries for Python.
1. OpenCV, Thresholding, and Blurring
An important first step in gesture recognition, when opting for a computer vision approach, is filtering, or thresholding. Certain operations and algorithms commonly used to detect gestures work best on a filtered, monochromatic image in which the hand is the only unfiltered part. To achieve this, one should become familiar with color representation models and with the OpenCV library, particularly the cv.threshold, cv.inRange, and cv.medianBlur functions.
Firstly, a color representation model is, simply put, a way to characterize any given color as a series of numbers that describe aspects of that color. Common color representation models include RGB (red, green, blue), HSL (hue, saturation, lightness), and HSV (hue, saturation, value).
(1) A visual aid to the representation of color in the HSL and HSV spaces.
(2) A visual aid to the representation of color in the RGB space.
Typically, when working in computer vision, and with thresholding in particular, HSV is chosen because the separation of hue from saturation allows a more sophisticated threshold to be set, making color filtering more robust against changing lighting conditions and more accurate in general.
Now that we know what a color representation model is, we can begin using one to our advantage with the OpenCV library. Since we know what we will be looking for in our images and video feeds, hands, we can define a set of HSV values we expect for a skin color. There are ways to make this process more precise, such as basing the range on a known skin color taken from a facial recognition algorithm, but at its core it revolves around defining a range of HSV values deemed acceptable as "skin". Using cv.inRange (the range-based counterpart to cv.threshold), we can then filter the individual pixels of an image based on whether or not their HSV values fall within our defined ranges, replacing them with white if they do and black if they do not.
(3) A set of before-and-after pictures of a hand showcasing the results of thresholding the first image. The range used was 143 < H < 180, 0 < S < 255, 0 < V < 255. This is a very rudimentary choice and was used for educational purposes. For more information about cv.threshold, see https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html .
As can be seen, thresholding does a good job of eliminating the background from an image; but under non-ideal conditions, or with a poorly chosen threshold, the output can be rather noisy. This is where the next step comes into play: blurring using OpenCV.
OpenCV offers many blurring techniques for a variety of noise interference. In our case, we see lots of specks in the image, commonly referred to as salt-and-pepper noise, so we should apply median blurring using cv.medianBlur.
(4) The result of using cv.medianBlur on the previous images; notice the reduction of black specks on and around the hand.
There are additional steps that can be taken to refine this image, such as tightening the threshold values, or applying a Gaussian blur through cv.GaussianBlur to help fill in the larger black splotches between the fingers and the palm. For more information on this process, see https://docs.opencv.org/4.x/d4/d13/tutorial_py_filtering.html . For the most part, however, such an image is now suitable for image processing and gesture recognition.
2. The Hand Skeleton Model and MediaPipe
First, let's introduce the hand skeleton model. In essence, the hand skeleton model approximates the shape a hand is currently in by assigning points to certain joints and drawing paths between those assigned points.
(5) An example of how the hand skeleton model can be overlaid onto a monochromatic image.
How exactly is this useful? By comparing the x and y positions of these points against one another, or against their own past positions, we can both detect certain gestures, such as a curled finger, and track hand motion and velocity.
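As a concrete illustration of comparing landmark positions, the sketch below detects a curled index finger from toy data. The landmark numbering (5 for the index knuckle, 8 for the index fingertip) matches MediaPipe's convention shown in figure (6); the coordinate values themselves are made up, and the "tip below knuckle" rule is a deliberately simple heuristic that assumes an upright hand.

```python
def index_finger_curled(landmarks):
    """Return True if the index fingertip sits below its knuckle.

    `landmarks` maps a landmark index to a normalized (x, y) pair,
    where y grows downward as in image coordinates. Landmark 5 is the
    index-finger knuckle and landmark 8 is the index fingertip.
    """
    knuckle_y = landmarks[5][1]
    tip_y = landmarks[8][1]
    return tip_y > knuckle_y

# Toy data: a fingertip above its knuckle (extended) vs. below (curled).
extended = {5: (0.5, 0.6), 8: (0.5, 0.3)}
curled = {5: (0.5, 0.6), 8: (0.5, 0.7)}
print(index_finger_curled(extended), index_finger_curled(curled))  # → False True
```

The same idea extends to motion: differencing a landmark's position across consecutive frames yields a velocity estimate for hand tracking.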
Now that we understand the usefulness of the skeleton model, we must implement it. Luckily, many libraries already exist with such functionality built in! One such library is MediaPipe.
(6) The MediaPipe representation of the hand skeleton model.
Using the MediaPipe library and its mp.solutions.hands family of functions, we can easily detect hands in an image and overlay MediaPipe's hand landmark system on top of it. As mentioned previously, we can track the pixel positions of these landmarks and compare them with one another for both hand tracking and gesture recognition.
MediaPipe has an excellent tutorial for python found here: https://developers.google.com/mediapipe/solutions/vision/hand_landmarker/python#image_2 .
3. Conclusion
By using the OpenCV and MediaPipe libraries as a foundation, a developer can easily create more sophisticated hand tracking and gesture recognition modules: first performing image pre-processing in OpenCV through the threshold and blur functions, and then performing post-processing using the landmark functionality of MediaPipe's mp.solutions.hands functions.
References
(1) https://commons.wikimedia.org/wiki/File:Hsl-hsv_models.svg
(2) https://commons.wikimedia.org/wiki/File:RGB_color_solid_cube.png
(3) Own work
(4) Own work
(5) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321080/ figure 9
(6) https://developers.google.com/mediapipe/solutions/vision/hand_landmarker