pascale_mettes_09_02_2018 - hassony2/inria-research-wiki GitHub Wiki

Objective

Get spatio-temporal action tubes for actions with progressively less supervision

Pointly-supervised action localization

Started with dense trajectories

A video is represented by ~10,000 action trajectories

These are grouped and filtered into ~1,000 proposals

Classify these proposals --> this is the difficult part

But this is a supervised task, which needs detailed bounding-box annotations.

Hypothesis: having temporal annotations is enough to get spatial cues

They therefore use the proposals at training time and try to select the best one (maximum overlap with the ground truth)

In practice it is difficult to get the best proposal.
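Selecting the best proposal amounts to picking the tube with maximum spatio-temporal overlap with the ground-truth tube. A minimal sketch of that selection, assuming tubes are represented as dicts mapping frame index to a box and using mean per-frame IoU as the overlap measure (the function names here are illustrative, not the talk's):

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def tube_overlap(proposal, ground_truth):
    """Spatio-temporal overlap: mean per-frame IoU over shared frames.

    Both tubes are dicts mapping frame index -> (x1, y1, x2, y2)."""
    frames = set(proposal) & set(ground_truth)
    if not frames:
        return 0.0
    return sum(box_iou(proposal[f], ground_truth[f]) for f in frames) / len(frames)

def best_proposal(proposals, ground_truth):
    """Pick the proposal whose tube overlaps the ground truth the most."""
    return max(proposals, key=lambda p: tube_overlap(p, ground_truth))
```

The difficulty noted above is that at training time the ground-truth tube is exactly what is missing, so this oracle selection is not directly available.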

They introduce point supervision (one point per frame) and try to define a measure of how well this point matches a proposal.

Initial idea: measure how centered the point is within the proposal. Problem: actions tend to be centered in the frame, so a proposal covering the whole frame usually scores high.

--> add some size regularization to penalize large proposals?

This allows them to almost always find the best possible proposal.
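The two ideas above (reward centeredness, penalize size) can be combined into a simple per-frame match score. This is a hypothetical sketch of that combination, not the talk's exact measure: the centeredness term is normalized by the box diagonal, the size penalty by the frame area, and `alpha` is an assumed trade-off weight.

```python
import math

def point_match_score(proposal, points, frame_area, alpha=0.5):
    """Score how well per-frame annotation points match a proposal tube.

    proposal: dict frame -> (x1, y1, x2, y2); points: dict frame -> (x, y).
    Per frame: reward points near the box center, and penalize large
    boxes so a full-frame box does not win just because actions tend to
    be centered. Illustrative form; the actual measure may differ.
    """
    frames = set(proposal) & set(points)
    if not frames:
        return 0.0
    score = 0.0
    for f in frames:
        x1, y1, x2, y2 = proposal[f]
        px, py = points[f]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        diag = math.hypot(x2 - x1, y2 - y1)
        # 1 when the point sits on the center, decaying toward the corners
        center_term = max(0.0, 1 - 2 * math.hypot(px - cx, py - cy) / diag)
        # size regularization: fraction of the frame the box covers
        size_penalty = (x2 - x1) * (y2 - y1) / frame_area
        score += center_term - alpha * size_penalty
    return score / len(frames)
```

With this score, a tight box around the annotated point beats a full-frame box even though both are perfectly centered on the point.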

Works as well as training on ground truth bounding boxes

Action localization with pseudo-annotations

Get some pseudo-annotation:

  • action detection
  • independent motion (center of mass of pixels that do not move like the majority of pixels)
  • action proposals
  • center bias (point at the center)
  • object heatmaps

Combine these pseudo-annotations via multiple instance learning?

Zero-shot action learning with object embeddings

Traditional zero-shot learning: transfer knowledge from seen actions to unseen actions.

Use objects for action localization?

Factorize the relative position of the person and objects into a 3x3 grid of cells, and score highly the videos in which the spatial relationship between object and person is consistent with previously seen examples.
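The 3x3 factorization can be sketched as below: the person box is the middle cell, and the eight surrounding cells cover the regions to its left/right and above/below. The frequency-based scoring function is a deliberately simple stand-in for the scoring described above; both function names are illustrative.

```python
from collections import Counter

def relative_position_cell(person_box, object_center):
    """Index (0..8, row-major) of the 3x3 cell the object falls into,
    relative to the person box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = person_box
    ox, oy = object_center
    col = 0 if ox < x1 else (1 if ox <= x2 else 2)
    row = 0 if oy < y1 else (1 if oy <= y2 else 2)
    return row * 3 + col

def spatial_score(train_cells, test_cells):
    """Score a test video by how often its person-object configurations
    were seen in training (simple frequency model)."""
    freq = Counter(train_cells)
    total = sum(freq.values())
    return sum(freq[c] / total for c in test_cells) / len(test_cells)
```

For example, an object inside the person box maps to cell 4 (the center), while an object below the box maps to cell 7; test videos whose cells match the training distribution score higher.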