1604.02115 - hassony2/inria-research-wiki GitHub Wiki

[Arxiv 1604.02115] Trajectory Aligned Features For First Person Action Recognition [project page] [PDF] [notes]

Suriya Singh, Chetan Arora, C. V. Jawahar

Related paper

First Person Action Recognition Using Deep Learned Descriptors (same team, CVPR 2016)

Objective

Action recognition in first-person videos without explicit hand or object detection/segmentation, using dense point trajectories and features computed along (centered on) those trajectories.

Synthesis

Contributions

Novel representation of egocentric actions based on simple feature trajectories computed using tracking:

  • Extract dense trajectories at multiple spatial scales

  • Trajectory descriptor: the sequence of displacement vectors of one tracked point, normalized by the sum of the norms of the displacements
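
A minimal Python sketch of this normalization, following the dense-trajectory formulation the paper builds on (the function name and the (L+1, 2) point layout are my assumptions, not the paper's code):

```python
import numpy as np

def trajectory_descriptor(points):
    """Shape descriptor of one tracked point: the sequence of
    frame-to-frame displacement vectors, normalized by the sum of
    displacement magnitudes. `points` is an (L+1, 2) array of (x, y)."""
    displacements = np.diff(points, axis=0)            # (L, 2) frame-to-frame motion
    total = np.linalg.norm(displacements, axis=1).sum()
    return (displacements / max(total, 1e-8)).ravel()  # flatten to a 2L-dim vector
```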

  • HOG and HOF descriptors computed in a space-time volume around each trajectory

  • MBH: motion boundary histograms, which use spatial derivatives of the optical flow and can therefore suppress the (locally constant) background flow
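
A rough sketch of the MBH idea, assuming a dense (H, W, 2) flow field: the histograms are built on spatial gradients of each flow component, so a locally constant (camera-induced) flow has zero gradient and contributes nothing. The paper computes these in space-time cells around each trajectory; this frame-level version is only illustrative:

```python
import numpy as np

def mbh_histograms(flow, n_bins=8):
    """Motion Boundary Histograms: HOG-like orientation histograms of
    the spatial gradients of each optical-flow component. `flow` is
    a dense (H, W, 2) flow field."""
    descriptors = []
    for c in range(2):                        # MBHx on flow[..., 0], MBHy on flow[..., 1]
        gy, gx = np.gradient(flow[..., c])    # spatial derivatives of one flow component
        mag = np.hypot(gx, gy)                # gradient magnitude (vote weight)
        ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
        bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
        hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
        descriptors.append(hist / (hist.sum() + 1e-8))
    return np.concatenate(descriptors)        # MBHx followed by MBHy
```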

  • Reversed videos are also used (justification: humans can recognize an action played in reverse order); reversed trajectories are computed independently from the forward ones ==> improves frame-level action recognition from 51% to 54.5% on the GTEA dataset

  • Head motion cancellation on the flow: head movement is modeled as a 2D affine transform ==> 54.5% to 56.5%
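
A possible implementation of this cancellation step, assuming a dense flow field and OpenCV's `cv2.estimateAffine2D` (the sampling grid and the RANSAC fitting are my choices; the paper does not specify them):

```python
import numpy as np
import cv2

def cancel_head_motion(flow):
    """Fit a global 2D affine model to a dense optical-flow field
    (treated as head motion) and subtract its prediction, keeping only
    the residual object/hand motion. `flow` has shape (H, W, 2)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h:8, 0:w:8]                    # sparse sample grid (assumed step)
    src = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    dst = src + flow[ys.ravel(), xs.ravel()]           # where each sampled point moved
    A, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
    if A is None:                                      # fit failed: leave flow untouched
        return flow
    grid = np.stack(np.meshgrid(np.arange(w), np.arange(h)), axis=-1).astype(np.float32)
    pred = grid @ A[:, :2].T + A[:, 2]                 # affine-predicted positions
    return flow - (pred - grid)                        # residual, head-motion-cancelled flow
```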

  • Temporal pyramids (histograms over the full video, over the two halves of the video, etc.); in practice 3 levels (1, 2 and 4 splits) ==> 58.5%
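
A small sketch of the 3-level temporal pyramid, assuming per-frame descriptors stacked in a (T, D) array; mean pooling per segment stands in for the paper's per-segment histograms:

```python
import numpy as np

def temporal_pyramid(frame_descs, levels=(1, 2, 4)):
    """Concatenate descriptors pooled over the whole video, its two
    halves and its four quarters. `frame_descs` has shape (T, D)."""
    parts = []
    for n_splits in levels:
        for chunk in np.array_split(frame_descs, n_splits):
            parts.append(chunk.mean(axis=0))   # pool each temporal segment
    return np.concatenate(parts)               # (1 + 2 + 4) * D values
```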

  • Statistics on trajectories at the full-video temporal scale (mean and std of the x and y coordinates, number of trajectories, ...) ==> 60%
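
One plausible reading of these video-level statistics (the exact statistic set used in the paper may differ; this is illustrative only):

```python
import numpy as np

def trajectory_statistics(trajectories):
    """Video-level statistics over all trajectories: mean and std of the
    x and y coordinates, plus the trajectory count. `trajectories` is a
    list of (L_i, 2) arrays of (x, y) points."""
    pts = np.concatenate(trajectories, axis=0)   # all tracked points, shape (N, 2)
    return np.concatenate([pts.mean(axis=0),     # mean x, mean y
                           pts.std(axis=0),      # std x, std y
                           [len(trajectories)]]) # number of trajectories
```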

  • Camera displacement modeled as a global frame-to-frame 2D translation, represented by a normalized sequence of displacement vectors ==> 61%
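
A sketch of this descriptor, assuming the same sum-of-norms normalization as the trajectory descriptor above (the notes only say the sequence is normalized):

```python
import numpy as np

def camera_displacement_descriptor(translations):
    """Global camera motion as the sequence of frame-to-frame 2D
    translations, normalized by the sum of their magnitudes.
    `translations` has shape (T-1, 2)."""
    t = np.asarray(translations, dtype=float)
    return (t / max(np.linalg.norm(t, axis=1).sum(), 1e-8)).ravel()
```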

  • Markov Random Field for temporal smoothing, with a likelihood derived from the classifier scores and a smoothness prior: low pairwise cost between frames with the same action label, high cost between frames with different labels. Neighbours: the 5 adjacent frames in the past and in the future. Edge weights set to the Euclidean distance between the global (full-frame) HOF histograms of the 2 neighboring vertices. The minimum-energy labeling is estimated with the α-expansion algorithm ==> 62.5% temporal segmentation accuracy
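
A sketch of the corresponding energy, written as a plain evaluation function; the paper minimizes this with α-expansion (e.g., via a graph-cut library), which is not reproduced here. The argument layout and the λ weighting are my assumptions:

```python
import numpy as np

def mrf_energy(labels, unary, hof, window=5, lam=1.0):
    """Energy of a frame labeling: per-frame classifier cost (`unary`,
    shape (T, n_classes)) plus a smoothness term over the 5 past/future
    neighbours, where disagreeing labels are penalized by the Euclidean
    distance between the frames' global HOF histograms (`hof`, (T, D))."""
    T = len(labels)
    energy = sum(unary[t, labels[t]] for t in range(T))          # data term
    for t in range(T):                                           # pairwise term
        for dt in range(1, window + 1):
            if t + dt < T and labels[t] != labels[t + dt]:
                energy += lam * np.linalg.norm(hof[t] - hof[t + dt])
    return energy
```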

Datasets

GTEA

ADL

UTE, cropped and manually annotated (annotations available on the [project page](http://cvit.iiit.ac.in/research/projects/cvit-projects/first-person-action-recognition))

Extreme Sports: actions not involving hand motion (bumpy forward, walk, roll, flip, ...), 60 videos, available on the [project page](http://cvit.iiit.ac.in/research/projects/cvit-projects/first-person-action-recognition)

All videos processed at 15 fps

Definitions

Activity: a high-level description of what the person is doing (e.g., making tea)

Actions: short-term actions, closer to the gestures performed by the person (taking sugar, opening the tea box, ...)