3. Tutorials
:construction: UNDER CONSTRUCTION :construction:
Basic Concepts
Facial landmark points
Facial landmarks are a set of points on the face that correspond to facial features such as the brows, eyes, nose, and mouth. By tracking these points throughout a video, we can quantify facial expressions and behavior at each frame, as the landmarks correspond to parts of the face that move with facial expressions. Bitbox provides facial landmarks in 2D and in 3D. The latter include a variant called canonicalized 3D landmarks, which are particularly useful for the analysis of expressions, as they remove the effect of head or body movements, which often occur in naturalistic videos. Canonicalized 3D landmarks also eliminate the effect of person-specific facial morphology, which is advantageous for expression analysis, as certain personal characteristics (e.g., lower-than-usual eyebrows, wider-than-usual mouth) can be mistaken by algorithms as expression-related facial deformations (see also Separation of pose, expression, and identity).
The default landmark template used in bitbox is the iBUG-51 template, which tracks the brows, eyes, nose and mouth with 51 landmarks.
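As a rough illustration of how such landmarks might be handled, the sketch below assumes the 2D landmarks for a video are available as a NumPy array of shape (num_frames, 51, 2), ordered as in the iBUG-51 template (brows, nose, eyes, mouth). The array name, the placeholder data, and the exact index ranges are assumptions for illustration, not the actual bitbox output format.

```python
import numpy as np

# Hypothetical array of 2D landmarks: (num_frames, 51, 2).
# The grouping below assumes the iBUG 68-point ordering with the 17 jaw
# points removed; verify these index ranges against bitbox's actual output.
landmarks_2d = np.random.rand(300, 51, 2)  # placeholder data

GROUPS = {
    "brows": slice(0, 10),   # 10 points
    "nose":  slice(10, 19),  # 9 points
    "eyes":  slice(19, 31),  # 12 points
    "mouth": slice(31, 51),  # 20 points
}

# Mean vertical position of the brow points in every frame,
# e.g., to inspect brow raising/lowering over time.
brow_height = landmarks_2d[:, GROUPS["brows"], 1].mean(axis=1)
print(brow_height.shape)  # (300,)
```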
Head Pose
Head pose refers to the 3 angles that represent the rotation of the head and the 3 coordinates (x, y, z) that represent the location of the head with respect to the camera. Tracking the head pose throughout a video is important for analyzing facial behavior, as head movements are essential indicators of social communication, used frequently for backchanneling and other purposes. Static head pose is also of interest, as it relates to social orienting and attention; some studies have used it as a proxy for eye contact or social gaze.
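A minimal sketch of a head movement summary is shown below, assuming the per-frame pose is stored as three rotation angles in degrees followed by three translation coordinates. The array layout, column order, and frame rate are assumptions for illustration only.

```python
import numpy as np

# Hypothetical per-frame head pose: 3 rotation angles (degrees) followed
# by 3 translation coordinates (x, y, z) relative to the camera.
# Shape and column order are assumptions, not the bitbox output spec.
pose = np.random.rand(300, 6)

angles = pose[:, :3]    # rotation part
location = pose[:, 3:]  # translation part

fps = 30.0
# Frame-to-frame angular speed (degrees per second), a simple summary
# of how much the head is moving, e.g., for backchanneling analyses.
angular_speed = np.linalg.norm(np.diff(angles, axis=0), axis=1) * fps
print(angular_speed.mean())
```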
Facial expressions
Facial expressions and actions are the most heavily studied component of facial behavior, as they are related to social and emotional processes and are modified by mental health conditions, mood, and personality states. Expressions can be emotional, but also communicative. They can be spontaneous or intentional. Bitbox provides several methods to quantify facial expressions.
The method currently recommended for studying expressions is to use the per-frame expression vectors provided by 3DMM fitting. These vectors contain 79 expression coefficients that describe the expression-related facial deformation over the entire face. These expression vectors have been validated on clinical data: they have been used to classify autism vs. neurotypical participants from 3-5 minute videos, and to quantify social coordination in a clinical sample with mixed psychiatric presentations (autism, anxiety, depression, ADHD) as well as neurotypical participants. The main advantage of these expression vectors is that they describe expression after removing the effect of pose and person-specific facial morphology, which are the two main sources of nuisance in expression analysis (see Separation of pose, expression, and identity). Moreover, the coefficients represent the facial deformation densely, through a mesh of ~20,000 points, and thus are capable of capturing movements in a granular manner.
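As a rough sketch of how these vectors could feed into a time-series analysis, the example below assumes the coefficients are available as a (num_frames, 79) array; only the dimensionality (79 coefficients per frame) comes from the text, and the array name and loading step are assumptions.

```python
import numpy as np

# Hypothetical per-frame 3DMM expression coefficients: (num_frames, 79).
expressions = np.random.rand(300, 79)  # placeholder data

# Overall expression intensity per frame (L2 norm of the coefficient
# vector), a simple starting point for an expressiveness time series.
intensity = np.linalg.norm(expressions, axis=1)

# Light temporal smoothing with a moving average to suppress frame-level jitter.
window = 15  # frames
kernel = np.ones(window) / window
intensity_smooth = np.convolve(intensity, kernel, mode="same")
print(intensity.shape, intensity_smooth.shape)
```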
An alternative approach provided by bitbox is to use the canonicalized 3D landmarks (see Facial landmark points). These landmarks represent the expression-related facial deformation, as encoded by the above-described 3DMM expression vectors. An advantage of these landmarks is that they provide movement in a metric space, in terms of millimeters, and therefore can be useful in applications where geometric distances matter. Moreover, users can conduct analyses that require tracking specific landmark points on the face (e.g., lip corners).
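Because these landmarks live in a metric space, distances between specific points can be read off directly in millimeters. The sketch below assumes the canonicalized 3D landmarks are stored as a (num_frames, 51, 3) array and uses lip-corner indices based on the iBUG-51 ordering; both the layout and the indices are assumptions to be checked against bitbox's actual output.

```python
import numpy as np

# Hypothetical canonicalized 3D landmarks in millimeters: (num_frames, 51, 3).
landmarks_3d = np.random.rand(300, 51, 3) * 100.0  # placeholder data

LEFT_LIP_CORNER = 31   # assumed index in the iBUG-51 ordering
RIGHT_LIP_CORNER = 37  # assumed index in the iBUG-51 ordering

# Mouth width in mm per frame, e.g., to track mouth widening over time.
mouth_width = np.linalg.norm(
    landmarks_3d[:, LEFT_LIP_CORNER] - landmarks_3d[:, RIGHT_LIP_CORNER],
    axis=1,
)
print(mouth_width.mean())
```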
Finally, users can use the standard (i.e., non-canonicalized) 3D landmarks in applications where the isolation of the expression component from pose or person-specific morphology is not required. Similarly, the 2D landmarks can be used in applications where tracking the overall movement of facial landmarks is of interest, particularly if there is a need to locate the landmark points on the image frame.
Separation of pose, expression, and identity
A fundamental goal in video-based analysis of facial behavior is to quantify facial pose and expression from 2D frames. This task is not trivial, as facial pose and expression, as well as person-specific identity, are all entangled in a 2D image. For example, if a person who is being recorded from a frontal angle turns their head slightly downwards, the distance between their eyes and their brows will slightly decrease, generating the impression of a frown. Person-specific identity is also an important factor, because each person has a unique facial morphology in which the shape of the facial features (eyes, brows, mouth, nose) as well as the distances between them differ. Thus, if the facial morphology is not parsed out, it can lead to an incorrect estimation of facial expressions. For example, if a person's eyebrows are naturally closer to their eyes, this can generate an incorrect impression of frowning.
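The head-turn example above can be made concrete with a small geometric sketch: pitching a rigid face downward shrinks the projected brow-to-eye distance even though the face itself has not changed, which a purely 2D analysis could misread as a frown. The point coordinates and angles below are made up purely for illustration.

```python
import numpy as np

# Toy demonstration of pose/expression entanglement in 2D images.
brow = np.array([0.0, 55.0, 0.0])  # mm, directly above the eye (made-up coordinates)
eye  = np.array([0.0, 40.0, 0.0])

def pitch(point, degrees):
    """Rotate a 3D point about the x-axis (head nodding down/up)."""
    t = np.radians(degrees)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(t), -np.sin(t)],
                  [0.0, np.sin(t),  np.cos(t)]])
    return R @ point

def projected_gap(degrees):
    """Vertical brow-eye distance after orthographic projection onto the image plane."""
    return abs(pitch(brow, degrees)[1] - pitch(eye, degrees)[1])

print(projected_gap(0.0))   # ~15.0 mm in a frontal view
print(projected_gap(20.0))  # ~14.1 mm with the head pitched down: looks like a frown
```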
A critical advantage of Bitbox is that it quantifies facial behavior based on 3DMM fitting, which is designed to disentangle the three main factors discussed above, namely, facial pose, expression, and identity. As such, the default expression coefficients that we provide (see Facial expressions) aim to describe expression after removing the effect of facial pose and person-specific identity.
3D face model
The facial behavior analysis in bitbox is based on 3D reconstruction of the face frame by frame, as pose, expression, and person-specific morphology are separable in 3D space (see Separation of pose, expression, and identity). More specifically, every face is represented with a dense 3D mesh of ~20k points. This mesh is universal in that each point corresponds to the same specific part of the face in every reconstruction. For example, the 10665th point in the mesh corresponds to the left lip corner in any 3D reconstruction. This allows expressions to be analyzed consistently across people and conditions.
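Because the mesh topology is shared across all reconstructions, the same vertex index refers to the same anatomical point in every frame and for every person. The sketch below assumes the reconstructed meshes are stored as a (num_frames, num_vertices, 3) array; the layout is an assumption, and the vertex index simply follows the text's example (10665th point, i.e., index 10664 when 0-indexed).

```python
import numpy as np

# Hypothetical per-frame reconstructed mesh: (num_frames, num_vertices, 3).
mesh = np.random.rand(300, 20000, 3)  # placeholder data

LEFT_LIP_CORNER_VERTEX = 10664  # 10665th point in the text's example, 0-indexed

# Trajectory of the same anatomical point across the whole video.
lip_corner_traj = mesh[:, LEFT_LIP_CORNER_VERTEX, :]
print(lip_corner_traj.shape)  # (300, 3)
```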
Body joints
Body joints are the analogues of facial landmark points for the body. Specifically, each body joint corresponds to a specific part of the body (e.g., left wrist, right ankle, left shoulder), and tracking the body joints throughout the video allows us to quantify the body movements of a person. We detect *** joints on the body, namely: [***]. Currently we provide tracking only in 2D.
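As a rough sketch of how 2D joint tracks might be summarized, the example below assumes a (num_frames, num_joints, 2) array of pixel coordinates; the number and ordering of joints are not specified here, so the value 18 is only a placeholder, as is the array name.

```python
import numpy as np

# Hypothetical 2D body joints in pixels: (num_frames, num_joints, 2).
joints_2d = np.random.rand(300, 18, 2) * 1000.0  # placeholder data

# Frame-to-frame displacement of every joint (pixels per frame),
# averaged over joints as a simple overall measure of body movement.
displacement = np.linalg.norm(np.diff(joints_2d, axis=0), axis=2)
movement_per_frame = displacement.mean(axis=1)
print(movement_per_frame.shape)  # (299,)
```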