1705.07750

CVPR 2017

[arXiv 1705.07750] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [PDF] [code] [notes]

Joao Carreira, Andrew Zisserman

Objective

Introduce network inflation, which transposes 2d networks into 3d networks and makes it possible to take advantage of efficient architectures developed for 2d neural networks

Use several tricks to initialize the 3d network's learned parameters from pretrained ImageNet weights

They also introduce a new dataset, the Kinetics dataset, with 400 human action classes covering person-person actions (shaking hands, kissing, ...) and person-object actions (washing dishes, mowing the lawn, ...), for a total of 240k training videos

Synthesis

Neural network inflation

  • 2d networks take images as input; their 2d convolution kernels have 3 dimensions (height, width, input_channels) and are applied with a spatial stride

  • 3d networks take spatio-temporal blocks (successive frames from a video) as input; their 3d convolution kernels have 4 dimensions (height, width, time, input_channels) and are slid across both the spatial and temporal dimensions of the input.

Since neural network architectures have been thoroughly explored for 2d inputs, the idea is to capitalize on this knowledge by inflating the 2d architectures into 3d ones, adding a temporal dimension.

For convolutional layers, whose kernels are typically square, NxN kernel sizes become NxNxN. Pooling layers are inflated in the same way.
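A minimal sketch of this inflation rule, assuming PyTorch; the layer sizes below are illustrative, not those of the actual I3D architecture (whose temporal strides are not all symmetric):

```python
import torch.nn as nn

# An NxN 2D kernel becomes an NxNxN 3D kernel; note that PyTorch orders
# the 3D kernel dimensions as (time, height, width).
conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
conv3d = nn.Conv3d(3, 64, kernel_size=(7, 7, 7), stride=2, padding=3)

# Pooling layers are inflated the same way.
pool2d = nn.MaxPool2d(kernel_size=3, stride=2)
pool3d = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=(2, 2, 2))
```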

Weight transfer

They introduce the 'boring-video' fixed point, which preserves predictions when going from a 2d network to a 3d network on an ImageNet sample. To be fed into the 3d network, the image is replicated along the temporal dimension.

To obtain the fixed point (i.e. identical values in the spatial dimensions of each layer's output activations between the 2d network and its 3d counterpart):

  • for convolutions, the kernel weights are replicated along the time dimension and rescaled by 1/N, where N is the temporal kernel size (see the sketch after this list)
  • no modifications are needed for pooling layers (beyond the kernel dimension inflation) or for non-linear activations
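A minimal sketch of this weight transfer, assuming PyTorch; inflate_conv and the layer sizes are illustrative, not the paper's code:

```python
import torch
import torch.nn as nn

def inflate_conv(conv2d: nn.Conv2d, time_kernel: int) -> nn.Conv3d:
    """Inflate a 2D conv to 3D: repeat its kernel time_kernel times along a
    new temporal axis and rescale by 1/time_kernel (the 1/N factor above)."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel,) + conv2d.kernel_size,
        stride=(1,) + conv2d.stride,
        padding=(time_kernel // 2,) + conv2d.padding,
        bias=conv2d.bias is not None,
    )
    w3d = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
    conv3d.weight.data.copy_(w3d / time_kernel)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)  # biases transfer unchanged
    return conv3d

# Fixed-point check on a 'boring video' (the same frame repeated over time):
conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv3d = inflate_conv(conv2d, time_kernel=3)
image = torch.randn(1, 3, 32, 32)
boring = image.unsqueeze(2).repeat(1, 1, 3, 1, 1)  # replicate along time
# The center temporal slice (full kernel overlap) matches the 2D output.
assert torch.allclose(conv2d(image), conv3d(boring)[:, :, 1], atol=1e-5)
```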

Experiments

They pretrain their network on the Kinetics dataset using SGD with a momentum of 0.9, taking random 224x224 crops

Note that they use inputs of 64 frames in the temporal dimension (vs 16 frames at 112x112 for C3D)
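A hedged sketch of this training setup, assuming PyTorch; the stand-in model and learning rate are placeholders, not the paper's values (the actual I3D inflates Inception-v1):

```python
import torch
import torch.nn as nn

# One batch of clips: 64 RGB frames at 224x224, laid out as
# (batch, channels, time, height, width).
clips = torch.randn(8, 3, 64, 224, 224)

# Stand-in for the full inflated network.
model = nn.Conv3d(3, 64, kernel_size=7, stride=2, padding=3)

# SGD with momentum 0.9, as in the notes; the learning rate is a placeholder.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```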

They finetune on UCF-101 and HMDB-51

Results

Compared to the original two-stream network, two-stream I3D gains roughly 10 percentage points of accuracy on UCF-101 and roughly 20 on HMDB-51