1705.02953 - hassony2/inria-research-wiki GitHub Wiki
2017 Arxiv Paper
[Arxiv 1705.02953] Temporal Segment Networks for Action Recognition in Videos [project page] [PDF] [code (caffe)] [notes]
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool
read 25/07/2017
Objective
Create an action recognition framework that captures long-term temporal information (potentially over the entire video)
The framework, which they name TSN (Temporal Segment Networks), can be applied to untrimmed videos and used in real time
Synthesis
Initial observation: previous work often relies on dense sampling which has the shortcoming that consecutive frames are highly redundant
Pipeline
- divide the video into a fixed number of windows (segments) of equal duration
- perform action recognition over each window independently
  - a snippet is randomly sampled from the segment
  - a ConvNet operates on each snippet and produces class scores
- aggregate the window scores (equivalently the snippet scores, since one snippet is sampled per window) using various aggregation functions
  - aggregation first computes consensus class scores, then applies a softmax to obtain the final class scores
  - this produces video-level recognition results (see the sketch below)
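A minimal sketch of this pipeline (sparse snippet sampling, per-snippet scoring, consensus, softmax). The function and argument names are illustrative, not the authors' Caffe code; `snippet_scores_fn` stands for any ConvNet that maps one snippet to unnormalized class scores, and average pooling is used here as the consensus.

```python
import numpy as np

def tsn_predict(video_frames, snippet_scores_fn, num_segments=7, rng=None):
    """Split the video into num_segments windows of equal duration, sample one
    snippet per window, score each snippet with the ConvNet, then aggregate."""
    rng = rng or np.random.default_rng()
    num_frames = len(video_frames)
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)

    snippet_scores = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        idx = int(rng.integers(start, max(end, start + 1)))  # random snippet inside the window
        snippet_scores.append(snippet_scores_fn(video_frames[idx]))
    snippet_scores = np.stack(snippet_scores)                # (num_segments, num_classes)

    consensus = snippet_scores.mean(axis=0)                  # average-pooling consensus
    exp = np.exp(consensus - consensus.max())
    return exp / exp.sum()                                   # softmax -> video-level probabilities
```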
Inputs
Several input modalities are compared (a sketch of how they can be assembled follows this list)
- single RGB image
- stacked RGB differences
- stacked optical flow field
- stacked warped optical flow field
- warped optical flow aims to suppress camera motion: a homography between consecutive frames is estimated from non-human points, the next frame is warped accordingly, and the optical flow is computed between the initial frame and the warped next frame, see "Action recognition with improved trajectories" by Wang and Schmid
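A rough sketch of how the stacked modalities can be assembled from consecutive frames. The helper names are made up, and OpenCV's Farneback flow is used only as a stand-in for whichever flow estimator is actually used; warped flow is omitted since it additionally needs the homography-based warping step described above.

```python
import cv2
import numpy as np

def rgb_diff_stack(frames):
    """Stack frame-to-frame RGB differences along the channel axis."""
    diffs = [frames[i + 1].astype(np.int16) - frames[i].astype(np.int16)
             for i in range(len(frames) - 1)]
    return np.concatenate(diffs, axis=-1)          # (H, W, 3 * (len(frames) - 1))

def flow_stack(frames):
    """Stack the x/y optical-flow fields between consecutive frames."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)                         # (H, W, 2)
    return np.concatenate(flows, axis=-1)          # (H, W, 2 * (len(frames) - 1))
```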
Aggregation functions
Various aggregation functions are tested (a sketch of all of them follows this list)
- max pooling: picks the most discriminative snippet (the one with the strongest score) and selects the video-level class accordingly; this scheme lacks the capacity to model several snippets jointly
- average pooling: leverages the responses of all snippets and uses the maximum mean activation as the video-level prediction; it performs no temporal modelling
- top-k pooling, introduced in this paper as a new aggregation function, intends to strike a balance between max pooling and average pooling:
  - select the k most discriminative snippets
  - perform average pooling over these snippets
  - note: it is a generalized formulation of average and max pooling, which correspond respectively to top-N pooling (N = number of snippets) and top-1 pooling
- linear weighting: the final score is a weighted linear combination of the individual snippet scores, with learned weights; this function is expected to learn importance weights for the different phases of an action class
  - a limitation is that these weights are shared among all samples, and are therefore not video-specific
- attention weighting
  - goal: automatically assign an adaptive weight to each snippet according to the video content
  - for each snippet, the class scores are weighted by an attention score that depends on the snippet itself
  - in the current implementation, the attention weight is the softmax of a linear combination of the activations of the last hidden layer of the ConvNet; the attention weights are learned during training
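A compact sketch of the five consensus functions, operating on a `scores` array of shape `(num_snippets, num_classes)`. The `feats` / `w` parameters of the attention variant and all names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def max_pool(scores):
    return scores.max(axis=0)

def average_pool(scores):
    return scores.mean(axis=0)

def topk_pool(scores, k=3):
    # Per class, average the k largest snippet scores (k=1 -> max, k=num_snippets -> average).
    top = np.sort(scores, axis=0)[-k:]
    return top.mean(axis=0)

def linear_weighting(scores, weights):
    # weights: shape (num_snippets,); learned, shared across all videos.
    return weights @ scores

def attention_weighting(scores, feats, w):
    # feats: last-hidden-layer activations (num_snippets, feat_dim); w: learned (feat_dim,).
    att = feats @ w                                   # one attention logit per snippet
    att = np.exp(att - att.max()); att /= att.sum()   # softmax over snippets
    return att @ scores                               # video-specific weighted combination
```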
ConvNets
Cross modality initialization
Idea: reuse ImageNet-pretrained RGB weights and apply them to optical flow images (or other modalities)
For this:
- discretize the optical flow values into the interval [0, 255]
- average the first-layer RGB weights across the RGB channels, and replicate this mean for each input channel of the new modality (e.g. the number of stacked flow frames), as sketched below
- keep all other weights as they are
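A tiny sketch of the weight transformation for the first convolution layer, assuming weights stored in `(out_channels, in_channels, kH, kW)` layout; the function name is made up.

```python
import numpy as np

def cross_modality_init(rgb_first_conv_w, num_flow_channels):
    """rgb_first_conv_w: (out_ch, 3, kH, kW) ImageNet-pretrained first-layer weights.
    Returns weights of shape (out_ch, num_flow_channels, kH, kW) for the new modality."""
    mean_w = rgb_first_conv_w.mean(axis=1, keepdims=True)    # average over the RGB channels
    return np.repeat(mean_w, num_flow_channels, axis=1)      # replicate for each flow channel
    # All layers after the first keep their ImageNet weights unchanged.
```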
Strategies to avoid overfitting
Partial Batch Normalization:
- freeze the mean and variance parameters of all batch-normalization layers except the first one, because the target dataset does not contain enough training samples to re-estimate them without overfitting
Add an extra dropout layer
with a high dropout ratio (0.8) after the global pooling layer (both strategies are sketched below)
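An illustrative PyTorch-style sketch of both strategies (partial BN and the extra high-ratio dropout); the model structure, layer names and helper are assumptions, not the authors' Caffe implementation.

```python
import torch.nn as nn

def freeze_bn_except_first(model):
    """Keep the first BatchNorm layer trainable; freeze the running mean/var of the rest.
    Call this after model.train() so the frozen layers stay in eval mode."""
    bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    for bn in bn_layers[1:]:
        bn.eval()                         # stop updating running statistics
        for p in bn.parameters():
            p.requires_grad = False       # optionally freeze the affine parameters too

class TSNHead(nn.Module):
    """Global pooling -> high-ratio dropout -> linear classifier."""
    def __init__(self, feat_dim, num_classes, dropout=0.8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                 # x: (batch, feat_dim, H, W) feature maps
        x = self.pool(x).flatten(1)
        return self.fc(self.dropout(x))
```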
Data augmentation
- random cropping
- horizontal flipping
- corner cropping (crops are taken from the image corners or center, so that the network does not focus exclusively on the image center)
- scale jittering
- cropped region sizes are randomly selected from {256, 224, 192, 168} and resized to 224×224 (see the sketch below)
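A small sketch of corner cropping combined with scale jittering, using the crop sizes listed above; the square-crop simplification and the helper name are illustrative assumptions.

```python
import random

CROP_SIZES = [256, 224, 192, 168]

def sample_crop(img_w, img_h):
    """Pick a jittered square crop size and one of the 5 positions (4 corners + center)."""
    size = random.choice(CROP_SIZES)
    positions = {
        "top_left": (0, 0),
        "top_right": (img_w - size, 0),
        "bottom_left": (0, img_h - size),
        "bottom_right": (img_w - size, img_h - size),
        "center": ((img_w - size) // 2, (img_h - size) // 2),
    }
    x, y = positions[random.choice(list(positions))]
    # The (x, y, x + size, y + size) region is then resized to 224x224,
    # and horizontal flipping is applied with probability 0.5.
    return x, y, size
```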
Experiments
Trimmed videos
At test time, the 4 corners and the center are cropped, together with their horizontal flips, and predictions are aggregated by average pooling
When multiple modalities are present, their scores are fused by a weighted average, with fusion weights determined empirically (see the sketch below)
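A sketch of the test-time aggregation and modality fusion; the example fusion weights in the docstring are placeholders, not the empirically chosen values.

```python
import numpy as np

def video_scores(crop_scores_per_snippet):
    """crop_scores_per_snippet: (num_snippets, 10, num_classes) scores for the
    4 corners + center and their horizontal flips; average over crops and snippets."""
    return np.asarray(crop_scores_per_snippet).mean(axis=(0, 1))

def fuse_modalities(scores_by_modality, weights):
    """Weighted average of per-modality video-level scores, e.g.
    fuse_modalities({"rgb": s_rgb, "flow": s_flow}, {"rgb": 1.0, "flow": 1.5})."""
    total = sum(weights.values())
    return sum(weights[m] * s for m, s in scores_by_modality.items()) / total
```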
Untrimmed videos
Section 4.2 details how to apply TSN to untrimmed videos
Results
Input comparison
RGB + Flow performs well (92% accuracy)
Adding TSN (sparse snippet sampling and consensus aggregation) produces a ~2% boost
RGB + Flow + Warped flow with TSN performs better (94% -> 95% on UCF101), but is 3 times slower
RGB + RGB differences gives 86% without TSN and 91% with TSN, but it has the advantage of being 25 times faster than RGB + Flow!
TSN evaluation
For TSN, the number of segments matters: 7 segments is close to optimal (performance seems to saturate beyond that); with a single segment, TSN degenerates to the plain two-stream architecture
Aggregation function comparison
On trimmed videos, better performance is obtained with simple aggregation (average pooling)
On ActivityNet, top-K and attention weighting outperform basic aggregation functions
The intuition is that on untrimmed, complex action videos, more complex aggregation functions outperform simple heuristics
ConvNet architecture comparison
Inception V3 + TSN (top-3 pooling) works best (compared with GoogLeNet, VGGNet-16, ResNet, ...)