1604.02115 - hassony2/inria-research-wiki GitHub Wiki

[Arxiv 1604.02115] Trajectory Aligned Features For First Person Action Recognition [project page] [PDF] [notes]

Suriya Singh, Chetan Arora, C. V. Jawahar

Related paper

First Person Action Recognition Using Deep Learned Descriptors (same team, CVPR 2016)

Objective

Action recognition in first-person videos without explicit hand or object detection/segmentation, using dense point trajectories and features computed along (centered on) those trajectories.

Synthesis

Contributions

Novel representation of egocentric actions based on simple feature trajectories computed using tracking:

  • Extract dense trajectories at multiple spatial scales

  • Trajectory descriptor: the sequence of displacement vectors of one tracked point, normalized by the sum of the norms of the displacements
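
A minimal Python sketch of this normalization, following the dense-trajectory formulation the paper builds on (the function name and the (L+1, 2) point layout are my assumptions, not the paper's code):

```python
import numpy as np

def trajectory_descriptor(points):
    """Shape descriptor of one tracked point: the sequence of
    frame-to-frame displacement vectors, normalized by the sum of
    displacement magnitudes. `points` is an (L+1, 2) array of (x, y)."""
    displacements = np.diff(points, axis=0)            # (L, 2) frame-to-frame motion
    total = np.linalg.norm(displacements, axis=1).sum()
    return (displacements / max(total, 1e-8)).ravel()  # flatten to a 2L-dim vector
```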

  • HOG and HOF descriptors computed in a space-time volume around each trajectory

  • MBH: motion boundary histograms, which use spatial derivatives of the optical flow and can therefore suppress the (locally constant) background flow
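
A rough sketch of the MBH idea, assuming a dense (H, W, 2) flow field: the histograms are built on spatial gradients of each flow component, so a locally constant (camera-induced) flow has zero gradient and contributes nothing. The paper computes these in space-time cells around each trajectory; this frame-level version is only illustrative:

```python
import numpy as np

def mbh_histograms(flow, n_bins=8):
    """Motion Boundary Histograms: HOG-like orientation histograms of
    the spatial gradients of each optical-flow component. `flow` is
    a dense (H, W, 2) flow field."""
    descriptors = []
    for c in range(2):                        # MBHx on flow[..., 0], MBHy on flow[..., 1]
        gy, gx = np.gradient(flow[..., c])    # spatial derivatives of one flow component
        mag = np.hypot(gx, gy)                # gradient magnitude (vote weight)
        ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
        bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
        hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
        descriptors.append(hist / (hist.sum() + 1e-8))
    return np.concatenate(descriptors)        # MBHx followed by MBHy
```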

  • Reversed videos are also used (justification: humans can recognize an action played in reverse order); reversed trajectories are computed independently from the forward ones ==> improves frame-level action recognition from 51% to 54.5% on the GTEA dataset

  • Head motion cancellation on the flow: head movement is modeled as a 2D affine transform ==> 54.5% to 56.5%
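
A possible implementation of this cancellation step, assuming a dense flow field and OpenCV's `cv2.estimateAffine2D` (the sampling grid and the RANSAC fitting are my choices; the paper does not specify them):

```python
import numpy as np
import cv2

def cancel_head_motion(flow):
    """Fit a global 2D affine model to a dense optical-flow field
    (treated as head motion) and subtract its prediction, keeping only
    the residual object/hand motion. `flow` has shape (H, W, 2)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h:8, 0:w:8]                    # sparse sample grid (assumed step)
    src = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    dst = src + flow[ys.ravel(), xs.ravel()]           # where each sampled point moved
    A, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
    if A is None:                                      # fit failed: leave flow untouched
        return flow
    grid = np.stack(np.meshgrid(np.arange(w), np.arange(h)), axis=-1).astype(np.float32)
    pred = grid @ A[:, :2].T + A[:, 2]                 # affine-predicted positions
    return flow - (pred - grid)                        # residual, head-motion-cancelled flow
```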

  • Temporal pyramids (histograms over the full video, over the two halves of the video, etc.); in practice 3 levels (1, 2 and 4 splits) ==> 58.5%
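
A small sketch of the 3-level temporal pyramid, assuming per-frame descriptors stacked in a (T, D) array; mean pooling per segment stands in for the paper's per-segment histograms:

```python
import numpy as np

def temporal_pyramid(frame_descs, levels=(1, 2, 4)):
    """Concatenate descriptors pooled over the whole video, its two
    halves and its four quarters. `frame_descs` has shape (T, D)."""
    parts = []
    for n_splits in levels:
        for chunk in np.array_split(frame_descs, n_splits):
            parts.append(chunk.mean(axis=0))   # pool each temporal segment
    return np.concatenate(parts)               # (1 + 2 + 4) * D values
```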

  • Statistics on trajectories at the full-video temporal scale (mean and std of the x and y coordinates, number of trajectories, ...) ==> 60%
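
One plausible reading of these video-level statistics (the exact statistic set used in the paper may differ; this is illustrative only):

```python
import numpy as np

def trajectory_statistics(trajectories):
    """Video-level statistics over all trajectories: mean and std of the
    x and y coordinates, plus the trajectory count. `trajectories` is a
    list of (L_i, 2) arrays of (x, y) points."""
    pts = np.concatenate(trajectories, axis=0)   # all tracked points, shape (N, 2)
    return np.concatenate([pts.mean(axis=0),     # mean x, mean y
                           pts.std(axis=0),      # std x, std y
                           [len(trajectories)]]) # number of trajectories
```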

  • Camera displacement modeled as a global frame-to-frame 2D translation, represented by a normalized sequence of displacement vectors ==> 61%
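
A sketch of this descriptor, assuming the same sum-of-norms normalization as the trajectory descriptor above (the notes only say the sequence is normalized):

```python
import numpy as np

def camera_displacement_descriptor(translations):
    """Global camera motion as the sequence of frame-to-frame 2D
    translations, normalized by the sum of their magnitudes.
    `translations` has shape (T-1, 2)."""
    t = np.asarray(translations, dtype=float)
    return (t / max(np.linalg.norm(t, axis=1).sum(), 1e-8)).ravel()
```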

  • Markov Random Field for temporal smoothing, with a likelihood derived from the classifier scores and a smoothness prior: low pairwise cost between frames with the same action label, high cost between frames with different labels. Neighbours: the 5 adjacent frames in the past and in the future. Edge weights set to the Euclidean distance between the global (full-frame) HOF histograms of the 2 neighboring vertices. The minimum-energy labeling is estimated with the α-expansion algorithm ==> 62.5% temporal segmentation accuracy
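
A sketch of the corresponding energy, written as a plain evaluation function; the paper minimizes this with α-expansion (e.g., via a graph-cut library), which is not reproduced here. The argument layout and the λ weighting are my assumptions:

```python
import numpy as np

def mrf_energy(labels, unary, hof, window=5, lam=1.0):
    """Energy of a frame labeling: per-frame classifier cost (`unary`,
    shape (T, n_classes)) plus a smoothness term over the 5 past/future
    neighbours, where disagreeing labels are penalized by the Euclidean
    distance between the frames' global HOF histograms (`hof`, (T, D))."""
    T = len(labels)
    energy = sum(unary[t, labels[t]] for t in range(T))          # data term
    for t in range(T):                                           # pairwise term
        for dt in range(1, window + 1):
            if t + dt < T and labels[t] != labels[t + dt]:
                energy += lam * np.linalg.norm(hof[t] - hof[t + dt])
    return energy
```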

Datasets

GTEA

ADL

UTE, cropped and manually annotated (annotations available on the [project page](http://cvit.iiit.ac.in/research/projects/cvit-projects/first-person-action-recognition))

Extreme Sports: actions not involving hand motion (bumpy forward, walk, roll, flip, ...), 60 videos, available on the [project page](http://cvit.iiit.ac.in/research/projects/cvit-projects/first-person-action-recognition)

All videos processed at 15 fps

Definitions

Activity: a high-level description of what the person is doing (e.g., making tea)

Actions: short-term actions, closer to the gestures performed by the person (taking sugar, opening the tea box, ...)