pascale_mettes_09_02_2018 - hassony2/inria-research-wiki GitHub Wiki
## Objective
Get spatio-temporal action tubes for actions with progressively less supervision
## Pointly-supervised action localization
Started with dense trajectories
Video is represented by ~10,000 dense trajectories
Grouped and filtered into ~1,000 proposals
Classify such proposals --> this is the difficult part
But this is a supervised task, which needs detailed bounding-box annotations.
Hypothesis: having temporal annotations is enough to recover spatial cues
They therefore use the proposals at training time and try to select the best proposal (maximum overlap with the ground truth).
In practice it is difficult to identify the best proposal.
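A minimal sketch of the "best proposal" selection step above: score each candidate tube by its mean per-frame box IoU against the ground-truth tube and keep the top one. The tube representation (a dict mapping frame index to a box) and the averaging scheme are my own assumptions for illustration, not the authors' exact formulation.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def tube_overlap(proposal, ground_truth):
    """Mean per-frame IoU between two tubes (dicts: frame -> box).

    Frames covered by only one tube count as zero overlap."""
    frames = set(proposal) | set(ground_truth)
    return sum(box_iou(proposal[f], ground_truth[f])
               for f in frames if f in proposal and f in ground_truth) / len(frames)

def best_proposal(proposals, ground_truth):
    """The proposal with maximum spatio-temporal overlap with the ground truth."""
    return max(proposals, key=lambda p: tube_overlap(p, ground_truth))
```

With full box supervision this selection is easy; the point of the talk is precisely that such dense box annotations are expensive, which motivates the point supervision below.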
They introduce point supervision (one point per frame) and define a measure of how well a point matches a proposal.
Initial idea: measure how centered the point is within the proposal. Problem: actions tend to be centered in the frame, so the full-frame proposal usually scored high.
--> add a size regularization term
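The centeredness-plus-size-regularization idea can be sketched as follows. A box scores well when its center is close to the annotated point, minus a penalty proportional to its area, so the trivial full-frame box stops winning. The exact functional form and the `size_weight` parameter are assumptions for illustration; the authors' formulation may differ.

```python
import math

def match_score(point, box, frame_w, frame_h, size_weight=1.0):
    """Score how well an annotated point matches a proposal box.

    point: (x, y); box: (x1, y1, x2, y2); frame_w/frame_h: frame size.
    Higher is better."""
    px, py = point
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    diag = math.hypot(frame_w, frame_h)
    # How close the point is to the box center, normalized to [0, 1].
    centeredness = 1.0 - math.hypot(px - cx, py - cy) / diag
    # Penalize large boxes: the full-frame box pays the maximum penalty of 1.
    size_penalty = ((x2 - x1) * (y2 - y1)) / (frame_w * frame_h)
    return centeredness - size_weight * size_penalty
```

With this score, a tight box around the annotated point beats the full-frame box even when the point sits near the frame center.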
This allows them to almost always find the best possible proposal.
Works as well as training on ground-truth bounding boxes
## Action localization with pseudo-annotations
Get pseudo-annotations from several cues:
- action detection
- independent motion (center of mass of pixels that do not move like the majority of pixels)
- action proposals
- center bias (point at the center)
- object heatmaps
Combine the cues via multiple instance learning?
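The "independent motion" cue above can be sketched concretely: estimate the dominant (background/camera) motion from an optical-flow field, then take the center of mass of the pixels whose flow deviates from it. The median-based dominant-motion estimate and the threshold are assumptions for illustration.

```python
import numpy as np

def independent_motion_point(flow, threshold=1.0):
    """Pseudo-annotation point from an optical-flow field.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    Returns the center of mass (x, y) of independently moving pixels,
    or None if no pixel deviates from the dominant motion."""
    # Dominant motion: per-channel median over all pixels (robust to outliers).
    dominant = np.median(flow.reshape(-1, 2), axis=0)
    # Per-pixel deviation from the dominant motion.
    residual = np.linalg.norm(flow - dominant, axis=-1)
    ys, xs = np.nonzero(residual > threshold)
    if len(xs) == 0:
        return None
    return float(xs.mean()), float(ys.mean())
```

Such a point can then feed the same point-to-proposal matching used in the pointly-supervised setting, without any human annotation.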
## Zero-shot action learning with object embeddings
Traditional zero-shot learning: transfer knowledge from seen actions to unseen actions.
Use objects for action localization?
Factorize the relative position of person and objects into a 3x3 grid of cells, and score highly the videos where the spatial relationship between object and person is consistent with previously seen examples.
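The 3x3 factorization can be sketched as a simple bucketing of the object center relative to the person box: the middle cell means the object overlaps the person, and the eight surrounding cells cover above/below/left/right and the diagonals. Cell indexing and boundary conventions here are my own assumptions.

```python
def spatial_cell(person_box, object_center):
    """Bucket an object's position relative to a person into a 3x3 grid.

    person_box: (x1, y1, x2, y2); object_center: (x, y).
    Returns (row, col) in {0, 1, 2} x {0, 1, 2}; (1, 1) means the
    object center lies inside the person box."""
    x1, y1, x2, y2 = person_box
    ox, oy = object_center
    col = 0 if ox < x1 else (1 if ox <= x2 else 2)
    row = 0 if oy < y1 else (1 if oy <= y2 else 2)
    return row, col
```

For example, a horse under a rider lands in the bottom-middle cell; a video of an unseen action scores highly when its object occupies the same cell that the relevant object occupied in the seen examples.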