DMD gaze-related action annotation criteria

The DMD dataset contains events of different kinds: distraction, drowsiness, hands and gaze. In this section, we present the criteria used to produce the currently available annotations of the DMD. This page covers only the temporal gaze-related annotations.

DMD video streams

The DMD dataset is composed of synchronized video streams from 3 different cameras. Each camera was placed to capture the activity in a certain region of the vehicle's cabin; in particular, each focuses on a part of the driver. Namely, there is one stream that captures the body activity, one for the face and head, and another that captures the hands' activity. We therefore refer to these streams as the body, face and hands cameras, respectively.

To annotate the recording sessions we created a mosaic video which synchronously merges the body, face and hands camera streams. This mosaic video is passed to the temporal annotation tool (TaTo) to annotate a sequence or to correct a previously annotated session.
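For illustration, a mosaic like this could be assembled with OpenCV by reading the three synchronized streams frame by frame and tiling them. This is a minimal sketch, assuming hypothetical file names, a side-by-side layout and a fixed frame rate; the actual DMD mosaics may be produced differently.

```python
import cv2

# Hypothetical file names; real DMD sessions follow their own naming scheme.
streams = {name: cv2.VideoCapture(f"{name}.mp4") for name in ("body", "face", "hands")}

writer = None
while True:
    tiles = []
    for cap in streams.values():
        ok, frame = cap.read()
        if not ok:
            tiles = None
            break
        # Resize every stream to a common tile size so they can be concatenated.
        tiles.append(cv2.resize(frame, (640, 480)))
    if tiles is None:
        break
    mosaic = cv2.hconcat(tiles)  # body | face | hands, side by side
    if writer is None:
        h, w = mosaic.shape[:2]
        # 30 fps is an assumption, not necessarily the dataset's actual frame rate.
        writer = cv2.VideoWriter("mosaic.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 30.0, (w, h))
    writer.write(mosaic)

for cap in streams.values():
    cap.release()
if writer is not None:
    writer.release()
```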

Annotation Levels

The defined levels describe temporal actions or events that occur while the driver performs gaze-related actions. To annotate temporal actions, we defined 3 levels of annotation. These 3 types of annotation can be present simultaneously to describe a single frame. Each annotation level has its own set of labels, and within each level the labels are mutually exclusive: for each level, at most one label is allowed per frame.

The gaze-related annotation levels are:

(Figure: overview of the gaze-related annotation levels)

Some annotation levels require a label for every frame in the video; these are represented with a full cell in the figure above (Level 1). Levels that can contain intervals without any label are represented with a shorter filled cell (Levels 0 and 2).
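To make these rules concrete, here is a minimal sketch of a per-frame annotation record and a validator. The representation is hypothetical (the dataset's actual export format may differ); storing one value per level makes the labels mutually exclusive by construction.

```python
# Label sets per level (the full lists are given in the sections below).
LEVEL_LABELS = {
    0: {"face_occlusion", "body_occlusion", "hands_occlusion"},                    # optional
    1: {"left_mirror", "left", "front", "center_mirror", "front_right",
        "right_mirror", "right", "infotainment", "steering_wheel", "not_valid"},   # mandatory
    2: {"blinking"},                                                               # optional
}

def validate_frame(annotation: dict) -> None:
    """Check one frame: at most one label per level, Level 1 always present."""
    if annotation.get(1) is None:
        raise ValueError("Level 1 (gaze zone) must be labeled in every frame")
    for level, label in annotation.items():
        if label is not None and label not in LEVEL_LABELS[level]:
            raise ValueError(f"unknown label {label!r} for level {level}")

# Example: driver looking at the front, no occlusion, currently blinking.
validate_frame({0: None, 1: "front", 2: "blinking"})
```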

Annotation instructions

The following sections describe the criteria to be taken when annotating gaze-related actions in the DMD dataset.


Level 0: Occlusion in cameras

An occlusion is an event in which more than 50-60% of the camera view is covered by the driver's own body or any other object, so that the scene is not recognizable. Since the dataset contains streams from 3 different cameras and each camera focuses on specific parts of the driver (i.e. face, body and hands), special attention should be given to the part of the driver that the camera targets. This means, for instance, that if the hands and the wheel cannot be recognized in the hands video, then there is an occlusion.
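Although occlusions are judged manually, a rough pre-screening heuristic could flag candidate frames for review, for example by measuring how much of the view differs from a known unoccluded reference frame. Everything below (the helper names, the pixel threshold, the 55% cut-off mirroring the 50-60% criterion) is an assumption for the sketch, not part of the annotation pipeline.

```python
import cv2
import numpy as np

def occlusion_fraction(frame, reference, diff_thresh=40):
    """Fraction of pixels that differ strongly from an unoccluded reference frame."""
    current = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    clean = cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY)
    changed = cv2.absdiff(current, clean) > diff_thresh
    return float(np.mean(changed))

def looks_occluded(frame, reference):
    # Flag a candidate occlusion when more than ~55% of the view is covered,
    # in line with the 50-60% rule of thumb used by the annotators.
    return occlusion_fraction(frame, reference) > 0.55
```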

Streams used for annotation

To annotate this level, all three streams (face camera, body camera and hands camera) should be considered equally to assign the corresponding labels.

Labels

If a frame shows an occlusion in one of the cameras, label it with one of the following labels:

| Key | Label | Description |
| --- | --- | --- |
| 0 | Face occlusion | The stream from the face camera is occluded and the action the driver is performing cannot be recognized |
| 1 | Body occlusion | The stream from the body camera is occluded and the action the driver is performing cannot be recognized |
| 2 | Hands occlusion | The stream from the hands camera is occluded and the action the driver is performing cannot be recognized |
Examples

Occlusion

(Image: example of an occluded camera view)

No Occlusion

(Image: example of an unoccluded camera view)

Special remarks

✔️ Deciding whether a frame is occluded can be ambiguous, especially in the hands camera, since actions such as talking on the phone, touching hair or applying makeup can cover part of the scene. However, if the driver's actions can still be clearly recognized, the frame should not be considered occluded.

✔️ In this level, only one camera can be annotated as occluded at a time. We have not observed any case of simultaneous occlusion in two or three video streams.

⚠️ If the current frame has no occlusion in any of the cameras, leave it without a label. To clear a label, press the corresponding clear key.


Level 1: Gaze Zone

In this level, it is required to identify the gaze zone at which the driver is looking. In every video, the driver looks for several seconds at each of a set of predefined gaze regions in the car, and the order of the gaze regions is the same for every video. Short blinks during a gazing interval receive the same annotation as the other frames of the same region. See the special remarks below for how to handle transitions between regions.

Streams used for annotation

To annotate this level, the face camera is primarily used, although the body camera can be useful to validate the annotation.

Labels

| Key | Label | Description |
| --- | --- | --- |
| 0 | left_mirror | The driver is looking at the left outer mirror of the vehicle. |
| 1 | left | The driver is looking at the left window of the vehicle. |
| 2 | front | The driver is looking directly ahead, through the front window. |
| 3 | center_mirror | The driver is looking at the center mirror, seeing the back of the vehicle. |
| 4 | front_right | The driver is looking at the right side of the front window, to the right of the center mirror. |
| 5 | right_mirror | The driver is looking at the right outer mirror of the vehicle. |
| 6 | right | The driver is looking at the windows on the right side of the vehicle. |
| 7 | infotainment | The driver is looking at the infotainment area of the vehicle, i.e. the location of the radio, temperature settings and shift knob. |
| 8 | steering_wheel | The driver is looking at the steering wheel. |
| 9 | not_valid | The driver is looking at a region that does not correspond to any of the previous locations and is not performing a special gaze movement (blinking or transitioning between two zones). This label is also used while hand actions are performed, as the gaze is not relevant at that point. |
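For downstream processing, these key/label pairs map naturally onto an enumeration. This is a hypothetical representation for tooling built on top of the annotations, not the dataset's export format:

```python
from enum import IntEnum

class GazeZone(IntEnum):
    """Level 1 gaze-zone labels, numbered by their TaTo keys."""
    LEFT_MIRROR = 0
    LEFT = 1
    FRONT = 2
    CENTER_MIRROR = 3
    FRONT_RIGHT = 4
    RIGHT_MIRROR = 5
    RIGHT = 6
    INFOTAINMENT = 7
    STEERING_WHEEL = 8
    NOT_VALID = 9

print(GazeZone(2).name.lower())  # -> "front"
```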
Diagram

(Diagram: gaze zones in the vehicle cabin)

Special remarks

✔️ At a transition between regions, try to annotate the current gazing region for as long as possible. If this is not possible (for example, because the driver blinks during the transition), annotate the frames with the next region to be looked at, as in the sketch below.

⚠️ There must be a label of this level in every frame.
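The transition rule can also be applied mechanically when cleaning a label sequence: any frame left without a zone (for example, an ambiguous blink during a transition) inherits the next annotated region. A minimal sketch, assuming labels are stored as one string (or None) per frame:

```python
def fill_transitions(labels):
    """Backward-fill unlabeled frames with the next annotated gaze zone.

    Implements the rule: if the current region cannot be determined during a
    transition, use the region the driver looks at next. A gap at the very end
    of the sequence falls back to the last known region, so that every frame
    ends up with a Level 1 label.
    """
    filled = list(labels)
    next_label = None
    for i in range(len(filled) - 1, -1, -1):   # backward pass: take the next region
        if filled[i] is None:
            filled[i] = next_label
        else:
            next_label = filled[i]
    last = None
    for i, label in enumerate(filled):         # forward pass: only trailing gaps remain
        if label is None:
            filled[i] = last
        else:
            last = label
    return filled

print(fill_transitions(["front", None, None, "left", "left"]))
# -> ['front', 'left', 'left', 'left', 'left']
```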


Level 2: Blinks

This annotation should be present while the driver is blinking. Blinks usually occur when the driver changes gaze zone.

Streams used for annotation

To annotate this level, the face camera is primarily used.

Labels

| Key | Label | Description |
| --- | --- | --- |
| 0 | Blinking | The annotation should span from the moment the driver starts closing the eyes until they are completely open again. Take into account that some people do not fully close their eyes when blinking; they only close them halfway. |
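As an aside, automatic blink detection often applies the same start/end criterion to a per-frame eye-openness signal (e.g. the eye aspect ratio): the interval runs from the frame where openness drops below a threshold until it recovers. The sketch below is an illustration under those assumptions, not part of the manual procedure; a relative threshold helps with drivers who never fully close their eyes.

```python
def blink_intervals(openness, threshold=0.8):
    """Yield (start, end) frame ranges where the eyes are partially closed.

    `openness` is a per-frame measure normalized so that 1.0 means fully open.
    A relative threshold (here, 80% of fully open) accounts for drivers who
    only close their eyes halfway when blinking.
    """
    intervals, start = [], None
    for i, value in enumerate(openness):
        if value < threshold and start is None:
            start = i                        # eyes begin to close
        elif value >= threshold and start is not None:
            intervals.append((start, i))     # eyes completely open again
            start = None
    if start is not None:                    # blink still ongoing at the last frame
        intervals.append((start, len(openness)))
    return intervals

print(blink_intervals([1.0, 0.9, 0.4, 0.1, 0.5, 0.95, 1.0]))
# -> [(2, 5)]
```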

Special remarks

⚠️ If the driver is not blinking in the current frame, leave it without a label. To clear a label, press the corresponding clear key.
