CVPR 2019
[Arxiv 1904.07846] Temporal Cycle-Consistency Learning [PDF] [project page] [notes]
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
read 2019/07/03
Motivation
Key (visual) moments are common across several instantiations of the same action (for instance, the hand getting in contact with a cup in the action of pouring).
Objective
Obtain good per-frame video embeddings using self-supervision, leveraging a large number of videos of the same action.
Leverage temporal cycle consistency to learn about the relationship between different intermediate steps of an action.
Demonstrate the expressivity of the captured features on tasks such as tracking the progress of an action or action phase classification.
Annotate two datasets, [Penn Action] and [Pouring], at the frame level for the evaluation of fine-grained video understanding tasks.
Synthesis
Learn an embedding that maximizes cycle consistency across videos, so that frames corresponding to the same moment of an action are matched to each other.
The surrogate task is the alignment of video sequences of the same action.
Method
Embedding
Each frame is encoded using a neural network; whole video sequences are encoded frame by frame this way.
Cycle consistency
Take a frame in one video, find its nearest neighbor in a second video in the embedding space, then map that neighbor back to the first video: the cycle is consistent if it returns to the original frame. For the embedding to be learnt using cycle consistency, the cycle-consistency loss must be differentiable, hence the two soft formulations below.
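A minimal sketch of the hard (non-differentiable) cycle-consistency check, just to fix the idea; the function name and the embedding matrices `u` (video U) and `v` (video V) are hypothetical:

```python
import numpy as np

def is_cycle_consistent(u, v, i):
    """Check whether frame i of video U cycles back to itself through video V.

    u: (N, D) per-frame embeddings of video U, v: (M, D) per-frame embeddings of video V.
    """
    j = np.argmin(np.linalg.norm(v - u[i], axis=1))  # nearest neighbor of u_i in V
    k = np.argmin(np.linalg.norm(u - v[j], axis=1))  # nearest neighbor of v_j back in U
    return k == i  # consistent if the cycle returns to the starting frame
```

The training losses below replace the hard argmin with softmax weights so that the whole cycle becomes differentiable.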
- Classification consistency
  - logits are used to classify whether two frames are neighbors or not; this gives soft weights
  - these weights are used to estimate a weighted-average candidate embedding in the target video sequence
  - this forces embeddings of neighbors to be close while embeddings of non-neighbors should be far
  - however, by defining a classification loss, the notion of temporal proximity is lost
- Regression consistency (both soft losses are sketched in code after this list)
  - use the same proximity weights, obtained from a softmax over the (negative squared) norms of the embedding distances
  - compute a soft argmax over the frame indices
  - impose a Gaussian prior on the distribution of weights by minimizing the squared distance between the soft argmax and the correct time step, normalized by the variance
  - regularize by penalizing the log of the variance (penalizes high variance; log for numerical stability)
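A rough sketch of the two soft losses above, assuming PyTorch and per-frame embedding matrices `u` (video U, N frames) and `v` (video V, M frames); the function names, the squared-distance logits and the value of `lam` are my assumptions, not taken verbatim from the paper:

```python
import torch
import torch.nn.functional as F

def soft_nearest_neighbor(u_i, v):
    """Soft nearest neighbor of frame embedding u_i (D,) in video V (M, D)."""
    alpha = F.softmax(-((v - u_i) ** 2).sum(dim=1), dim=0)   # proximity weights over V
    return alpha.unsqueeze(1).mul(v).sum(dim=0)              # weighted-average embedding

def cycle_back_classification(u, v, i):
    """Cycle u_i through V and classify which frame of U it comes back to (label = i)."""
    v_tilde = soft_nearest_neighbor(u[i], v)
    logits = -((u - v_tilde) ** 2).sum(dim=1)                # similarity to each frame of U
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([i]))

def cycle_back_regression(u, v, i, lam=0.001):
    """Variance-aware regression: the soft argmax of the cycled-back index should land on i."""
    n = u.shape[0]
    v_tilde = soft_nearest_neighbor(u[i], v)
    beta = F.softmax(-((u - v_tilde) ** 2).sum(dim=1), dim=0)  # proximity weights over U
    idx = torch.arange(n, dtype=u.dtype)
    mu = (beta * idx).sum()                                    # soft argmax
    var = (beta * (idx - mu) ** 2).sum()                       # soft variance
    return (i - mu) ** 2 / var + lam * torch.log(var)          # normalized error + log-variance penalty
```

The regression variant keeps the notion of temporal proximity that the classification variant loses, which matches the ablation results below.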
Training
- Optimize over randomly selected pairs of sequences
- For each pair, randomly select frames within each sequence and optimize over them
- ResNet-50 backbone; features are taken from Conv4c (feature size 14x14x1024)
- Features of a frame and of its k context frames are stacked along the time dimension
- 3D max-pooling is used, followed by 2 fc layers and a linear projection, to get a 128-dimensional embedding for each frame (see the sketch after this list)
- Features are trained using cycle consistency
- The encoder is then frozen
- SVM classifiers and linear regressors are then trained on top of the frozen features
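A possible per-frame embedder along the lines of the notes above (ResNet-50 up to Conv4c, k context frames stacked in time, 3D max-pooling, two fc layers and a linear projection to 128-d). This is a simplification and an assumption on my part: the fc width, the use of `torchvision`, and the exact head may not match the paper.

```python
import torch
import torch.nn as nn
import torchvision

class FrameEmbedder(nn.Module):
    """Sketch of the per-frame encoder described in the notes (layer sizes are guesses)."""

    def __init__(self, emb_dim=128, fc_dim=512):
        super().__init__()
        # ImageNet-pretrained backbone (weights API depends on the torchvision version)
        resnet = torchvision.models.resnet50(weights="DEFAULT")
        # Keep everything up to the Conv4 block: 1024 x 14 x 14 features for 224x224 inputs
        self.backbone = nn.Sequential(*list(resnet.children())[:-3])
        self.head = nn.Sequential(
            nn.Linear(1024, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, emb_dim),  # linear projection to the 128-d embedding
        )

    def forward(self, frames):
        # frames: (T, k, 3, H, W) -- each of the T frames comes with its k context frames
        t, k, c, h, w = frames.shape
        feats = self.backbone(frames.view(t * k, c, h, w))  # (T*k, 1024, 14, 14)
        fh, fw = feats.shape[-2:]
        feats = feats.view(t, k, 1024, fh, fw)
        feats = feats.amax(dim=(1, 3, 4))                   # 3D max-pool over time and space
        return self.head(feats)                             # (T, emb_dim) per-frame embeddings
```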
Datasets
- Pouring (interactions with objects): key events + phases (in-between key events)
  - key events: Start, Hand touches bottle, Liquid starts exiting, Pouring complete, Bottle back on table, End
- Penn Action: humans doing sports or exercise
Evaluation metrics
- Phase classification accuracy: per-frame phase classification accuracy (with SVM classifier on phase labels)
- Phase progression: difference in time-stamps between a given frame and each key event, normalized by the number of frames in the video
- Kendall's Tau:
  - pick two frames in a video
  - retrieve the nearest neighbor of each of the two frames in another video
  - check whether the ordering of the original frames and of their nearest-neighbor frames is preserved (concordant pair) or inverted (discordant pair)
  - tau = (number of concordant pairs - number of discordant pairs) / (n(n - 1) / 2)
  - tau is in [-1, 1]
  - note that this is problematic in the presence of repetitive frames (periodic motion such as jumping rope, or slow motion)
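A straightforward implementation of the metric as described above, assuming two per-frame embedding matrices; the function name is mine:

```python
import numpy as np

def kendalls_tau(u, v):
    """Kendall's Tau between two videos given their per-frame embeddings,
    following the formula in the notes (no tie correction).

    u: (N, D) embeddings of video U, v: (M, D) embeddings of video V.
    """
    n = len(u)
    dists = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=2)  # (N, M) pairwise distances
    nn = dists.argmin(axis=1)                                      # nearest-neighbor index in V
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if nn[i] < nn[j]:
                concordant += 1    # ordering preserved
            elif nn[i] > nn[j]:
                discordant += 1    # ordering inverted
    return (concordant - discordant) / (n * (n - 1) / 2)  # in [-1, 1]
```

In practice `scipy.stats.kendalltau(np.arange(n), nn)` gives the same quantity up to its handling of ties.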
Experiments
Ablation
Compare cycle-consistency losses (regression, classification, plain MSE):
- MSE: squared difference between the cycled-back index and the true index (without variance awareness)
- classification loss (CC)
- variance-aware cycle regression (CR)
CR > CC > MSE
Cycle regression is best by a significant margin on all 3 evaluation metrics!
Comparison with other self-supervision baselines
- SaL: Shuffle and Learn; shuffle a triplet of frames and predict whether it is shuffled or not
- TCN: Time-Contrastive Networks
- better results than the baselines when training from scratch (experimental bias?)
- similar results to TCN when fine-tuning