1705.02953 - hassony2/inria-research-wiki GitHub Wiki
2017 Arxiv Paper
[Arxiv 1705.02953] Temporal Segment Networks for Action Recognition in Videos [project page] [PDF] [code (caffe)] [notes]
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool
read 25/07/2017
Objective
Create an action recognition framework that captures long-term temporal information (potentially over the entire video)
The framework, which they name TSN (Temporal Segment Networks), can be applied to untrimmed videos and used in real time
Synthesis
Initial observation: previous work often relies on dense sampling which has the shortcoming that consecutive frames are highly redundant
Pipeline
- divide the video into a fixed number of windows (segments) of equal duration
- perform action recognition over each window independently
  - a snippet is randomly sampled from the segment
  - a ConvNet operates on each snippet and produces class scores
- aggregate the window scores (equivalently the snippet scores, since one snippet is sampled per window) using various aggregation functions
  - aggregation first computes consensus class scores, then applies a softmax to obtain the final class scores
  - this produces video-level recognition results (see the sketch below)
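A minimal sketch of this pipeline (sparse snippet sampling, per-snippet scoring, consensus, softmax). The function and argument names are illustrative, not the authors' Caffe code; `snippet_scores_fn` stands for any ConvNet that maps one snippet to unnormalized class scores, and average pooling is used here as the consensus.

```python
import numpy as np

def tsn_predict(video_frames, snippet_scores_fn, num_segments=7, rng=None):
    """Split the video into num_segments windows of equal duration, sample one
    snippet per window, score each snippet with the ConvNet, then aggregate."""
    rng = rng or np.random.default_rng()
    num_frames = len(video_frames)
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)

    snippet_scores = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        idx = int(rng.integers(start, max(end, start + 1)))  # random snippet inside the window
        snippet_scores.append(snippet_scores_fn(video_frames[idx]))
    snippet_scores = np.stack(snippet_scores)                # (num_segments, num_classes)

    consensus = snippet_scores.mean(axis=0)                  # average-pooling consensus
    exp = np.exp(consensus - consensus.max())
    return exp / exp.sum()                                   # softmax -> video-level probabilities
```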
Inputs
Several input modalities are compared (a sketch of how they can be assembled follows this list)
- single RGB image
- stacked RGB differences
- stacked optical flow field
- stacked warped optical flow field
- warped optical flow aims to suppress camera motion: a homography between consecutive frames is estimated from non-human points, the next frame is warped accordingly, and the optical flow is computed between the initial frame and the warped next frame, see "Action recognition with improved trajectories" by Wang and Schmid
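A rough sketch of how the stacked modalities can be assembled from consecutive frames. The helper names are made up, and OpenCV's Farneback flow is used only as a stand-in for whichever flow estimator is actually used; warped flow is omitted since it additionally needs the homography-based warping step described above.

```python
import cv2
import numpy as np

def rgb_diff_stack(frames):
    """Stack frame-to-frame RGB differences along the channel axis."""
    diffs = [frames[i + 1].astype(np.int16) - frames[i].astype(np.int16)
             for i in range(len(frames) - 1)]
    return np.concatenate(diffs, axis=-1)          # (H, W, 3 * (len(frames) - 1))

def flow_stack(frames):
    """Stack the x/y optical-flow fields between consecutive frames."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)                         # (H, W, 2)
    return np.concatenate(flows, axis=-1)          # (H, W, 2 * (len(frames) - 1))
```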
Aggregation functions
Various aggregation functions are tested (a sketch of all of them follows this list)
- max pooling: picks the most discriminative snippet (the one with the strongest score) and selects the video-level class accordingly; this scheme lacks the capacity to model several snippets jointly
- average pooling: leverages the responses of all snippets and uses the maximum mean activation as the video-level prediction; it performs no temporal modelling
- top-k pooling, introduced in this paper as a new aggregation function, intends to strike a balance between max pooling and average pooling:
  - select the k most discriminative snippets
  - perform average pooling over these snippets
  - note: it is a generalized formulation of average and max pooling, which correspond respectively to top-N pooling (N = number of snippets) and top-1 pooling
- linear weighting: the final score is a weighted linear combination of the individual snippet scores, with learned weights; this function is expected to learn importance weights for the different phases of an action class
  - a limitation is that these weights are shared among all samples, and are therefore not video-specific
- attention weighting
  - goal: automatically assign an adaptive weight to each snippet according to the video content
  - for each snippet, the class scores are weighted by an attention score that depends on the snippet itself
  - in the current implementation, the attention weight is the softmax of a linear combination of the activations of the last hidden layer of the ConvNet; the attention weights are learned during training
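A compact sketch of the five consensus functions, operating on a `scores` array of shape `(num_snippets, num_classes)`. The `feats` / `w` parameters of the attention variant and all names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def max_pool(scores):
    return scores.max(axis=0)

def average_pool(scores):
    return scores.mean(axis=0)

def topk_pool(scores, k=3):
    # Per class, average the k largest snippet scores (k=1 -> max, k=num_snippets -> average).
    top = np.sort(scores, axis=0)[-k:]
    return top.mean(axis=0)

def linear_weighting(scores, weights):
    # weights: shape (num_snippets,); learned, shared across all videos.
    return weights @ scores

def attention_weighting(scores, feats, w):
    # feats: last-hidden-layer activations (num_snippets, feat_dim); w: learned (feat_dim,).
    att = feats @ w                                   # one attention logit per snippet
    att = np.exp(att - att.max()); att /= att.sum()   # softmax over snippets
    return att @ scores                               # video-specific weighted combination
```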
ConvNets
Cross modality initialization
Idea: reuse ImageNet-pretrained RGB weights and apply them to optical flow images (or other modalities)
For this:
- discretize the optical flow values into the interval [0, 255]
- average the first-layer RGB weights across the RGB channels, and replicate this mean for each input channel of the new modality (e.g. the number of stacked flow frames), as sketched below
- keep all other weights as they are
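A tiny sketch of the weight transformation for the first convolution layer, assuming weights stored in `(out_channels, in_channels, kH, kW)` layout; the function name is made up.

```python
import numpy as np

def cross_modality_init(rgb_first_conv_w, num_flow_channels):
    """rgb_first_conv_w: (out_ch, 3, kH, kW) ImageNet-pretrained first-layer weights.
    Returns weights of shape (out_ch, num_flow_channels, kH, kW) for the new modality."""
    mean_w = rgb_first_conv_w.mean(axis=1, keepdims=True)    # average over the RGB channels
    return np.repeat(mean_w, num_flow_channels, axis=1)      # replicate for each flow channel
    # All layers after the first keep their ImageNet weights unchanged.
```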
Strategies to avoid overfitting
Partial Batch Normalization:
- freeze the mean and variance parameters of all batch-normalization layers except the first one, because the target dataset does not contain enough training samples to re-estimate them without overfitting
Add an extra dropout layer
with a high dropout ratio (0.8) after the global pooling layer (both strategies are sketched below)
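An illustrative PyTorch-style sketch of both strategies (partial BN and the extra high-ratio dropout); the model structure, layer names and helper are assumptions, not the authors' Caffe implementation.

```python
import torch.nn as nn

def freeze_bn_except_first(model):
    """Keep the first BatchNorm layer trainable; freeze the running mean/var of the rest.
    Call this after model.train() so the frozen layers stay in eval mode."""
    bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    for bn in bn_layers[1:]:
        bn.eval()                         # stop updating running statistics
        for p in bn.parameters():
            p.requires_grad = False       # optionally freeze the affine parameters too

class TSNHead(nn.Module):
    """Global pooling -> high-ratio dropout -> linear classifier."""
    def __init__(self, feat_dim, num_classes, dropout=0.8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                 # x: (batch, feat_dim, H, W) feature maps
        x = self.pool(x).flatten(1)
        return self.fc(self.dropout(x))
```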
Data augmentation
- random cropping
- horizontal flipping
- corner cropping (crops are taken from the image corners or center, so that the network does not focus exclusively on the image center)
- scale jittering
- cropped region sizes are randomly selected from {256, 224, 192, 168} and resized to 224×224 (see the sketch below)
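A small sketch of corner cropping combined with scale jittering, using the crop sizes listed above; the square-crop simplification and the helper name are illustrative assumptions.

```python
import random

CROP_SIZES = [256, 224, 192, 168]

def sample_crop(img_w, img_h):
    """Pick a jittered square crop size and one of the 5 positions (4 corners + center)."""
    size = random.choice(CROP_SIZES)
    positions = {
        "top_left": (0, 0),
        "top_right": (img_w - size, 0),
        "bottom_left": (0, img_h - size),
        "bottom_right": (img_w - size, img_h - size),
        "center": ((img_w - size) // 2, (img_h - size) // 2),
    }
    x, y = positions[random.choice(list(positions))]
    # The (x, y, x + size, y + size) region is then resized to 224x224,
    # and horizontal flipping is applied with probability 0.5.
    return x, y, size
```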
Experiments
Trimmed videos
At test time, the 4 corners and the center are cropped, together with their horizontal flips, and predictions are aggregated by average pooling
When multiple modalities are present, their scores are fused by a weighted average, with fusion weights determined empirically (see the sketch below)
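A sketch of the test-time aggregation and modality fusion; the example fusion weights in the docstring are placeholders, not the empirically chosen values.

```python
import numpy as np

def video_scores(crop_scores_per_snippet):
    """crop_scores_per_snippet: (num_snippets, 10, num_classes) scores for the
    4 corners + center and their horizontal flips; average over crops and snippets."""
    return np.asarray(crop_scores_per_snippet).mean(axis=(0, 1))

def fuse_modalities(scores_by_modality, weights):
    """Weighted average of per-modality video-level scores, e.g.
    fuse_modalities({"rgb": s_rgb, "flow": s_flow}, {"rgb": 1.0, "flow": 1.5})."""
    total = sum(weights.values())
    return sum(weights[m] * s for m, s in scores_by_modality.items()) / total
```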
Untrimmed videos
Section 4.2 details how to apply TSN to untrimmed videos
Results
Input comparison
RGB + Flow performs well (92% accuracy)
Adding TSN (sparse snippet sampling and consensus aggregation) produces a ~2% boost
RGB + Flow + Warped flow with TSN performs better (94% -> 95% on UCF101), but is 3 times slower
RGB + RGB differences gives 86% without TSN and 91% with TSN, but it has the advantage of being 25 times faster than RGB + Flow!
TSN evaluation
For TSN, the number of segments matters: 7 segments is close to optimal (performance seems to saturate beyond that); with a single segment, TSN degenerates to the plain two-stream architecture
Aggregation function comparison
On trimmed videos, better performance is obtained with simple aggregation (average pooling)
On ActivityNet, top-K and attention weighting outperform basic aggregation functions
The intuition is that on untrimmed, complex action videos, more complex aggregation functions outperform simple heuristics
ConvNet architecture comparison
Inception V3 + TSN (top-3 pooling) works best (compared with GoogLeNet, VGGNet-16, ResNet, ...)