ACM 2968890

NIPS 2014

[ACM 2968890] Two-Stream Convolutional Networks for Action Recognition in Videos [PDF] [notes]

Karen Simonyan, Andrew Zisserman

read 28/07/2017

Objective

Capitalize on both motion and appearance to predict action classes. First implementation for which deep features are competitive with hand-crafted/shallow ones (bag of visual words, ...)

Synthesis

Motivation

Some actions are strongly associated with specific objects or environments/outfits ==> recognition from a single image is relevant

Motion is the other important cue

Pipeline

Two streams combined through late fusion:

  • stacked optical flow images (motion stream)
  • single RGB frames (spatial stream)

Both streams implemented as ConvNets

Spatial stream

  • pretrained on ImageNet

Motion stream

The horizontal and vertical flow components of L consecutive frames are stacked into a block of optical flow images (the input therefore has 2L channels, i.e. 2 * the number of flow frames)

Several alternative flow-based inputs are compared:

  • optical flow stacking (simply stack the optical flow images into an optical flow volume; see the sketch after this list)
  • trajectory stacking (for each subsequent frame, sample the optical flow at the positions to which the points from the previous frame have moved)
  • bidirectional flow
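A minimal NumPy sketch of the plain flow-stacking variant (array names and shapes are my own assumptions, not the authors' code):

```python
import numpy as np

def stack_flow(flow_frames):
    """Stack L consecutive optical flow fields into a 2L-channel volume.

    flow_frames: list of L arrays of shape (H, W, 2), where channel 0 holds
    the horizontal (u) component and channel 1 the vertical (v) component.
    Returns an array of shape (H, W, 2L): u_1, v_1, u_2, v_2, ...
    """
    channels = []
    for flow in flow_frames:
        channels.append(flow[..., 0])  # horizontal flow of this frame
        channels.append(flow[..., 1])  # vertical flow of this frame
    return np.stack(channels, axis=-1)

# Example: L = 10 flow frames of size 224x224 -> a (224, 224, 20) network input
L = 10
frames = [np.random.randn(224, 224, 2).astype(np.float32) for _ in range(L)]
assert stack_flow(frames).shape == (224, 224, 2 * L)
```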

Normalization

For each frame, subtract the mean motion vector (a loose attempt to compensate for camera motion).

This is experimentally useful.
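A rough sketch of this per-frame mean subtraction, reusing the (H, W, 2) flow layout assumed above:

```python
import numpy as np

def subtract_mean_flow(flow):
    """Remove the mean displacement vector from a single flow field.

    A crude way to compensate for global camera motion: the average (u, v)
    over all pixels is subtracted from every pixel.
    """
    mean_uv = flow.reshape(-1, 2).mean(axis=0)  # mean horizontal/vertical flow
    return flow - mean_uv
```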

Network

Pretty much the same for the spatial and the motion streams: a CNN-M-2048-style architecture (stride-2 convolutions in the early layers, 3x3 convolutions deeper in the network, two fully connected layers, softmax). A normalization layer was removed from the flow network to limit memory consumption.
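A rough PyTorch sketch of one such stream; the layer sizes are assumed from the CNN-M-2048 description in Chatfield et al. rather than taken from these notes, and local response normalization is left out (as mentioned for the flow network):

```python
import torch.nn as nn

class StreamConvNet(nn.Module):
    """CNN-M-2048-style stream; in_channels = 3 (RGB) or 2L (stacked flow)."""

    def __init__(self, in_channels=3, num_classes=101, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True), nn.Dropout(dropout),    # full6
            nn.Linear(4096, 2048), nn.ReLU(inplace=True), nn.Dropout(dropout),  # full7
            nn.Linear(2048, num_classes),                                       # class logits
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

The spatial stream would use in_channels=3, the motion stream in_channels=2L (e.g. 20 for L=10).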

Multi-task training

Shared weights with two different softmax layers on top, one per dataset, each matching that dataset's number of classes
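A minimal sketch of this two-head setup; the 2048-d fc7 feature size and the head/argument names are assumptions:

```python
import torch.nn as nn

class MultiTaskStream(nn.Module):
    """Shared trunk up to fc7, plus one classification layer per dataset."""

    def __init__(self, backbone, feat_dim=2048):
        super().__init__()
        self.backbone = backbone                     # shared layers producing fc7 features
        self.head_ucf101 = nn.Linear(feat_dim, 101)  # UCF-101 softmax layer
        self.head_hmdb51 = nn.Linear(feat_dim, 51)   # HMDB-51 softmax layer

    def forward(self, x, dataset):
        feats = self.backbone(x)                     # (N, feat_dim)
        return self.head_ucf101(feats) if dataset == "ucf101" else self.head_hmdb51(feats)
```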

SGD with momentum 0.9, learning rate 10^-2 decreased over a fixed schedule, mini-batches of 256 samples

For spatial training: 224x224 RGB images

For motion training: 224x224x2L flow inputs
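A small sketch of these optimizer settings in PyTorch (the model stand-in and the schedule milestones are placeholders; only the SGD hyper-parameters come from the notes):

```python
import torch
import torch.nn as nn

# Stand-in for the spatial or motion ConvNet
model = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224 * 3, 101))

# From the notes: SGD, momentum 0.9, initial learning rate 1e-2, mini-batches of 256
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# "Decreased over a fixed schedule": the milestones below are placeholders
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50_000, 70_000], gamma=0.1)
batch_size = 256
```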

Data augmentation

  • random cropping
  • horizontal flipping
  • RGB jittering (a torchvision sketch follows below)
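A possible torchvision version of this augmentation for the spatial stream (the initial resize and the jitter strengths are assumptions):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),             # assumed rescale so that a 224x224 crop fits
    transforms.RandomCrop(224),         # random cropping
    transforms.RandomHorizontalFlip(),  # horizontal flipping
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # stand-in for RGB jittering
    transforms.ToTensor(),
])
```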

Testing

Sample 25 evenly spaced frames from each video

For each frame, extract 10 ConvNet inputs (for both streams) by cropping the 4 corners and the center and horizontally flipping each crop

Compute class scores for the whole video by averaging scores across sampled frames and crops
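A minimal sketch of this test-time averaging, assuming the per-crop class scores have already been computed:

```python
import numpy as np

def video_scores(crop_scores):
    """Average class scores over all sampled frames and crops.

    crop_scores: array of shape (25, 10, num_classes) holding the ConvNet
    scores for 25 evenly spaced frames x 10 crops per frame.
    Returns an array of shape (num_classes,).
    """
    return crop_scores.reshape(-1, crop_scores.shape[-1]).mean(axis=0)
```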

Fusion

Compute softmax scores for each stream (spatial and motion), then either average them or train a linear SVM on the stacked scores.
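A sketch of both fusion options, using scikit-learn for the SVM (the feature layout and SVM settings are assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse_by_averaging(spatial_scores, motion_scores):
    """Average the two streams' softmax scores; both arrays have shape (N, C)."""
    return (spatial_scores + motion_scores) / 2.0

def fuse_by_svm(spatial_scores, motion_scores, labels):
    """Train a linear SVM on the stacked softmax scores of the two streams."""
    features = np.concatenate([spatial_scores, motion_scores], axis=1)  # (N, 2C)
    return LinearSVC().fit(features, labels)
```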

Technical details

4 GPUs (3.2x speed-up over training on a single GPU)

1 day of training

Flow is computed in advance and stored after linearly mapping values to [0, 255] (the mapping is inverted at decompression time)
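A sketch of such a linear 0-255 mapping; the clipping bound is an assumption (the notes only state that the mapping is linear and inverted at decompression time):

```python
import numpy as np

def compress_flow(flow, bound=20.0):
    """Clip flow to [-bound, bound] and linearly map it to uint8 in [0, 255]."""
    flow = np.clip(flow, -bound, bound)
    return np.round((flow + bound) * (255.0 / (2.0 * bound))).astype(np.uint8)

def decompress_flow(stored, bound=20.0):
    """Invert the linear mapping back to approximate flow values."""
    return stored.astype(np.float32) * (2.0 * bound / 255.0) - bound
```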

Results

Experiments on UCF-101 and HMDB-51

Motion information is more discriminative than spatial information

For the spatial network:

  • Pretraining is useful: 52 --> 72% accuracy!
  • Fine-tuning only the last layer produces roughly the same results as fine-tuning the entire network
  • More dropout is useful when fine-tuning the whole network or training from scratch (71 --> 71% for 0.5 --> 0.9 dropout)
  • Less dropout is useful when fine-tuning just the last layer (60 --> 73% for 0.9 --> 0.5 dropout)

For the motion network:

  • Optical flow stacking is marginally better than trajectory stacking for 10 stacked frames (80 --> 81% on UCF-101)
  • A larger number of stacked frames is useful, but only marginally beyond 5 (1 --> 5 --> 10 frames yields 74 --> 80 --> 81% accuracy for optical flow stacking on UCF-101)

Multi-task training (capitalizing on both datasets) significantly increases HMDB-51 accuracy (55%) compared to just adding some UCF-101 classes to the training set (53%) or training on HMDB-51 alone

Fusion performs better with a linear SVM on the softmax scores (88% on UCF-101) than with averaging (86.2%), which matches shallow representations (87%)

For HMDB-51, shallow representations outperform the deep ones (67% vs 59%)