NIPS 2014
[ACM 2968890] Two-Stream Convolutional Networks for Action Recognition in Videos [PDF] [notes]
Karen Simonyan, Andrew Zisserman
read 28/07/2017
Objective
Capitalize on both motion and appearance to predict action classes. First implementation for which deep features compare with hand-crafted/shallow ones (bag of visual words, ...).
Synthesis
Motivation
Some actions are strongly associated with particular objects or environments/outfits ==> recognition from a single image is relevant
Motion is the other important clue
Pipeline
Two streams combined through late fusion:
- stacked optical flow images (motion stream)
- single RGB frames (spatial stream)
Both streams implemented as ConvNets
Spatial stream
- pretrained on ImageNet
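As a rough illustration of this step (not from the notes): start the spatial stream from an ImageNet-pretrained backbone and replace its classifier with one matching the action classes. The sketch below uses a torchvision ResNet-18 as a stand-in for the paper's CNN-M-2048 (not available in torchvision); `num_classes = 101` assumes UCF-101.

```python
# Minimal sketch (assumption: ResNet-18 stands in for the paper's CNN-M-2048):
# load ImageNet-pretrained weights and swap the final layer for an action classifier.
import torch.nn as nn
import torchvision.models as models

num_classes = 101  # UCF-101

spatial_net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
spatial_net.fc = nn.Linear(spatial_net.fc.in_features, num_classes)
```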
Motion stream
Horizontal and vertical flow components are stacked consecutively into a block of optical flow images (which therefore has 2L channels for L frames)
Several alternative features
- flow stacking (just stack the optical flow images into an optical flow volume)
- trajectory stacking (for each subsequent frame, sample the optical flow at the points to which the previous frame's points have moved)
- bidirectional flow
Normalization
For each frame, subtract the mean motion vector (a loose attempt to compensate for camera motion)
This is experimentally useful (see the sketch below)
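A minimal numpy sketch of the optical-flow stacking plus per-frame mean subtraction described above; the function name and the (H, W, 2) flow layout are illustrative assumptions.

```python
import numpy as np

def stack_flows(flows):
    """flows: list of L flow fields of shape (H, W, 2) with (u, v) components.
    Returns a (2L, H, W) input block: [u_1, v_1, u_2, v_2, ...]."""
    channels = []
    for flow in flows:
        # loose camera-motion compensation: subtract the frame's mean displacement
        flow = flow - flow.mean(axis=(0, 1), keepdims=True)
        channels.append(flow[..., 0])  # horizontal component
        channels.append(flow[..., 1])  # vertical component
    return np.stack(channels, axis=0)
```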
Network
Pretty much the same for the spatial and the motion streams: CNN-M-2048 structure with 3x3 convolutions, stride 2. A normalization layer is removed from the flow network in order to limit memory consumption.
Multi-task training
Shared weights with two different softmax layers on top, one per dataset, each matching that dataset's number of classes
SGD with momentum 0.9, lr 10^-2 decreased on a fixed schedule, mini-batches of 256 samples
For spatial training: 224x224 images
For motion: 224x224x2L inputs
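A hedged PyTorch sketch of the multi-task setup: a shared trunk (the tiny `trunk` module below is a toy stand-in for the CNN-M-2048 layers) with one classification head per dataset, trained with the SGD settings quoted above.

```python
import torch.nn as nn
from torch.optim import SGD

class TwoHeadClassifier(nn.Module):
    """Shared trunk with one softmax head per dataset (multi-task training)."""
    def __init__(self, trunk, feat_dim, n_ucf=101, n_hmdb=51):
        super().__init__()
        self.trunk = trunk                          # shared layers (stand-in module)
        self.fc_ucf = nn.Linear(feat_dim, n_ucf)    # UCF-101 head
        self.fc_hmdb = nn.Linear(feat_dim, n_hmdb)  # HMDB-51 head

    def forward(self, x, dataset):
        feats = self.trunk(x)
        return self.fc_ucf(feats) if dataset == "ucf101" else self.fc_hmdb(feats)

# Toy trunk just to make the sketch runnable; the real one is the shared ConvNet.
trunk = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 2048))
model = TwoHeadClassifier(trunk, feat_dim=2048)
optimizer = SGD(model.parameters(), lr=1e-2, momentum=0.9)  # settings from the notes
```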
Data augmentation
- Random cropping
- Horizontal flipping
- RGB jittering
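For illustration only, the listed augmentations roughly correspond to the following torchvision pipeline for the spatial stream (ColorJitter and its parameters are a stand-in for the paper's RGB jittering):

```python
import torchvision.transforms as T

spatial_augment = T.Compose([
    T.RandomCrop(224),             # random 224x224 cropping
    T.RandomHorizontalFlip(),      # horizontal flipping
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # stand-in for RGB jittering
    T.ToTensor(),
])
```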
Testing
Choose 25 evenly sampled frames from a video
For each frame, extract 10 ConvNet inputs (for both the motion and the spatial networks) by cropping the 4 corners and the center and flipping the crops horizontally
Compute class scores for the whole video by averaging scores across sampled frames and crops
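A rough sketch of this test-time protocol for the spatial stream (the motion stream is analogous with 2L-channel flow blocks); `TenCrop` gives the 4 corner and center crops plus their horizontal flips, and the helper name is hypothetical.

```python
import numpy as np
import torch
import torchvision.transforms as T

def video_scores(frames, net, n_samples=25):
    """frames: list of PIL frames of one video; net: a trained spatial ConvNet.
    Averages class scores over 25 evenly sampled frames x 10 crops per frame."""
    idx = np.linspace(0, len(frames) - 1, n_samples).astype(int)
    ten_crop = T.TenCrop(224)   # 4 corners + center, plus their horizontal flips
    to_tensor = T.ToTensor()
    scores = []
    with torch.no_grad():
        for i in idx:
            crops = torch.stack([to_tensor(c) for c in ten_crop(frames[i])])
            scores.append(net(crops).mean(dim=0))  # average over the 10 crops
    return torch.stack(scores).mean(dim=0)         # average over sampled frames
```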
Fusion
Compute softmax scores for each network (spatial and motion), then either average them or train a linear SVM on them.
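A minimal sketch of both fusion variants, assuming the per-video softmax scores of each stream are already computed; scikit-learn's LinearSVC stands in for the paper's linear SVM.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse_by_averaging(spatial_scores, motion_scores):
    """Both inputs: (n_videos, n_classes) softmax scores. Returns fused scores."""
    return (spatial_scores + motion_scores) / 2.0

def fuse_by_svm(spatial_scores, motion_scores, labels):
    """Train a multi-class linear SVM on the stacked softmax scores."""
    features = np.hstack([spatial_scores, motion_scores])
    return LinearSVC(C=1.0).fit(features, labels)
```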
Technical details
4 GPUs (3.2x speed-up over training on a single GPU)
1 day of training
Flow is computed in advance and stored by linearly mapping values to [0, 255] (the mapping is inverted at decompression time)
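A sketch of the flow storage trick; the clipping bound of 20 pixels is an assumption (the notes only mention the linear rescaling to [0, 255]).

```python
import numpy as np

BOUND = 20.0  # assumed clipping range for flow values, in pixels

def flow_to_uint8(flow):
    """Linearly map flow values in [-BOUND, BOUND] to [0, 255] for compressed storage."""
    flow = np.clip(flow, -BOUND, BOUND)
    return np.round((flow + BOUND) * 255.0 / (2 * BOUND)).astype(np.uint8)

def uint8_to_flow(img):
    """Invert the mapping when decompressing the stored flow images."""
    return img.astype(np.float32) * (2 * BOUND) / 255.0 - BOUND
```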
Results
Experiments on UCF-101 and HMDB-51
Motion information is more discriminative than spatial information
For the spatial network:
- Pre-training is useful: 52% --> 72% accuracy!
- finetuning only the last layer produces roughly the same results as finetuning the entire network
- more dropout is useful when finetuning all layers or training from scratch (71 --> 71% for 0.5 --> 0.9 dropout)
- smaller dropout is useful when finetuning just the last layer (60 --> 73% for 0.9 --> 0.5 dropout)
For the motion network:
- Optical flow stacking is marginally better than trajectory stacking for 10 stacked frames (80% --> 81% on UCF-101)
- A larger number of stacked frames is useful, but only marginally beyond 5 (1 --> 5 --> 10 frames yields an accuracy increase for optical flow stacking of 74% --> 80% --> 81% on UCF-101)
Multi-task training (capitalizing on both datasets) significantly increases HMDB-51 accuracy (55%) compared to just adding some UCF-101 classes to the HMDB-51 training set (53%) or training on HMDB-51 alone
Fusion performs better with an SVM on softmax scores (88% on UCF-101) than with averaging (86.2%), which matches shallow representations (87%)
For HMDB-51, shallow representations still outperform the deep ones (67% vs 59%)