CVPR 2019
[Arxiv 1904.07846] Temporal Cycle-Consistency Learning [PDF] [project page] [notes]
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
read 2019/07/03
Motivation
Key (visual) moments are common across several instantiations of the same action (for instance, the hand getting in contact with a cup in the action of pouring).
Objective
Obtain good per-frame video embeddings using self-supervision, leveraging a large number of videos of the same action.
Leverage temporal cycle consistency to learn about the relationship between different intermediate steps of an action.
Demonstrate the expressivity of the captured features on tasks such as tracking the progress of an action or action phase classification.
Annotate two datasets, [Penn Action] and [Pouring], at the frame level for the evaluation of fine-grained video understanding tasks.
Synthesis
Learn an embedding that maximizes cycle consistency across videos, so that frames corresponding to the same moment of an action are matched to each other.
The surrogate task is the alignment of video sequences of the same action.
Method
Embedding
Each frame is encoded using a neural network; whole video sequences are encoded frame by frame this way.
Cycle consistency
Take a frame in one video, find its nearest neighbor in a second video in the embedding space, then map that neighbor back to the first video: the cycle is consistent if it returns to the original frame. For the embedding to be learnt using cycle consistency, the cycle-consistency loss must be differentiable, hence the two soft formulations below.
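A minimal sketch of the hard (non-differentiable) cycle-consistency check, just to fix the idea; the function name and the embedding matrices `u` (video U) and `v` (video V) are hypothetical:

```python
import numpy as np

def is_cycle_consistent(u, v, i):
    """Check whether frame i of video U cycles back to itself through video V.

    u: (N, D) per-frame embeddings of video U, v: (M, D) per-frame embeddings of video V.
    """
    j = np.argmin(np.linalg.norm(v - u[i], axis=1))  # nearest neighbor of u_i in V
    k = np.argmin(np.linalg.norm(u - v[j], axis=1))  # nearest neighbor of v_j back in U
    return k == i  # consistent if the cycle returns to the starting frame
```

The training losses below replace the hard argmin with softmax weights so that the whole cycle becomes differentiable.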
- Classification consistency
  - logits are used to classify whether two frames are neighbors or not; this gives soft weights
  - these weights are used to estimate a weighted-average candidate embedding in the target video sequence
  - this forces embeddings of neighbors to be close while embeddings of non-neighbors should be far
  - however, by defining a classification loss, the notion of temporal proximity is lost
- Regression consistency (both soft losses are sketched in code after this list)
  - use the same proximity weights, obtained from a softmax over the (negative squared) norms of the embedding distances
  - compute a soft argmax over the frame indices
  - impose a Gaussian prior on the distribution of weights by minimizing the squared distance between the soft argmax and the correct time step, normalized by the variance
  - regularize by penalizing the log of the variance (penalizes high variance; log for numerical stability)
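A rough sketch of the two soft losses above, assuming PyTorch and per-frame embedding matrices `u` (video U, N frames) and `v` (video V, M frames); the function names, the squared-distance logits and the value of `lam` are my assumptions, not taken verbatim from the paper:

```python
import torch
import torch.nn.functional as F

def soft_nearest_neighbor(u_i, v):
    """Soft nearest neighbor of frame embedding u_i (D,) in video V (M, D)."""
    alpha = F.softmax(-((v - u_i) ** 2).sum(dim=1), dim=0)   # proximity weights over V
    return alpha.unsqueeze(1).mul(v).sum(dim=0)              # weighted-average embedding

def cycle_back_classification(u, v, i):
    """Cycle u_i through V and classify which frame of U it comes back to (label = i)."""
    v_tilde = soft_nearest_neighbor(u[i], v)
    logits = -((u - v_tilde) ** 2).sum(dim=1)                # similarity to each frame of U
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([i]))

def cycle_back_regression(u, v, i, lam=0.001):
    """Variance-aware regression: the soft argmax of the cycled-back index should land on i."""
    n = u.shape[0]
    v_tilde = soft_nearest_neighbor(u[i], v)
    beta = F.softmax(-((u - v_tilde) ** 2).sum(dim=1), dim=0)  # proximity weights over U
    idx = torch.arange(n, dtype=u.dtype)
    mu = (beta * idx).sum()                                    # soft argmax
    var = (beta * (idx - mu) ** 2).sum()                       # soft variance
    return (i - mu) ** 2 / var + lam * torch.log(var)          # normalized error + log-variance penalty
```

The regression variant keeps the notion of temporal proximity that the classification variant loses, which matches the ablation results below.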
Training
- Optimize over randomly selected pairs of sequences
- For each pair, randomly select frames within each sequence and optimize over them
- ResNet-50 backbone; features are taken from Conv4c (feature size 14x14x1024)
- Features of a frame and of its k context frames are stacked along the time dimension
- 3D max-pooling is used, followed by 2 fc layers and a linear projection, to get a 128-dimensional embedding for each frame (see the sketch after this list)
- Features are trained using cycle consistency
- The encoder is then frozen
- SVM classifiers and linear regressors are then trained on top of the frozen features
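A possible per-frame embedder along the lines of the notes above (ResNet-50 up to Conv4c, k context frames stacked in time, 3D max-pooling, two fc layers and a linear projection to 128-d). This is a simplification and an assumption on my part: the fc width, the use of `torchvision`, and the exact head may not match the paper.

```python
import torch
import torch.nn as nn
import torchvision

class FrameEmbedder(nn.Module):
    """Sketch of the per-frame encoder described in the notes (layer sizes are guesses)."""

    def __init__(self, emb_dim=128, fc_dim=512):
        super().__init__()
        # ImageNet-pretrained backbone (weights API depends on the torchvision version)
        resnet = torchvision.models.resnet50(weights="DEFAULT")
        # Keep everything up to the Conv4 block: 1024 x 14 x 14 features for 224x224 inputs
        self.backbone = nn.Sequential(*list(resnet.children())[:-3])
        self.head = nn.Sequential(
            nn.Linear(1024, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, emb_dim),  # linear projection to the 128-d embedding
        )

    def forward(self, frames):
        # frames: (T, k, 3, H, W) -- each of the T frames comes with its k context frames
        t, k, c, h, w = frames.shape
        feats = self.backbone(frames.view(t * k, c, h, w))  # (T*k, 1024, 14, 14)
        fh, fw = feats.shape[-2:]
        feats = feats.view(t, k, 1024, fh, fw)
        feats = feats.amax(dim=(1, 3, 4))                   # 3D max-pool over time and space
        return self.head(feats)                             # (T, emb_dim) per-frame embeddings
```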
Datasets
- Pouring (interactions with objects): key events + phases (in-between key events)
  - key events: Start, Hand touches bottle, Liquid starts exiting, Pouring complete, Bottle back on table, End
- Penn Action: humans doing sports or exercise
Evaluation metrics
- Phase classification accuracy: per-frame phase classification accuracy (with SVM classifier on phase labels)
- Phase progression: difference in time-stamps between a given frame and each key event, normalized by the number of frames in the video
- Kendall's Tau:
  - pick two frames in a video
  - retrieve the nearest neighbor of each of the two frames in another video
  - check whether the ordering of the original frames and of their nearest-neighbor frames is preserved (concordant pair) or inverted (discordant pair)
  - tau = (number of concordant pairs - number of discordant pairs) / (n(n - 1) / 2)
  - tau is in [-1, 1]
  - note that this is problematic in the presence of repetitive frames (periodic motion such as jumping rope, or slow motion)
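A straightforward implementation of the metric as described above, assuming two per-frame embedding matrices; the function name is mine:

```python
import numpy as np

def kendalls_tau(u, v):
    """Kendall's Tau between two videos given their per-frame embeddings,
    following the formula in the notes (no tie correction).

    u: (N, D) embeddings of video U, v: (M, D) embeddings of video V.
    """
    n = len(u)
    dists = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=2)  # (N, M) pairwise distances
    nn = dists.argmin(axis=1)                                      # nearest-neighbor index in V
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if nn[i] < nn[j]:
                concordant += 1    # ordering preserved
            elif nn[i] > nn[j]:
                discordant += 1    # ordering inverted
    return (concordant - discordant) / (n * (n - 1) / 2)  # in [-1, 1]
```

In practice `scipy.stats.kendalltau(np.arange(n), nn)` gives the same quantity up to its handling of ties.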
Experiments
Ablation
Compare cycle-consistency losses (regression, classification, plain MSE):
- MSE: squared difference between the cycled-back index and the true index (without variance awareness)
- classification loss (CC)
- variance-aware cycle regression (CR)
CR > CC > MSE
Cycle regression is best by a significant margin on all 3 evaluation metrics!
Comparison with other self-supervision baselines
- SaL: Shuffle and Learn; shuffle a triplet of frames and predict whether it is shuffled or not
- TCN: Time-Contrastive Networks
- better results than the baselines when training from scratch (experimental bias?)
- similar results to TCN when fine-tuning