
Grounded Human-Object Interaction Hotspots from Video, ICCV'19 {notes} {paper} {project page} {code PyTorch}

Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

Objective

Learn “interaction hotspots”: object regions that anticipate and explain human-object interactions

Datasets
  • OPRA

  • EPIC-Kitchens

    • ground-truth object affordance heatmaps are annotated for evaluation

Method

Unlike methods that rely on human-curated annotations (keypoints or segmentation masks), the approach learns object affordances directly from videos of people interacting with objects.

  • Predict object affordance classes (see the classification sketch after this list)

    • frames are encoded into spatially-preserving feature maps, then spatially pooled
    • an LSTM aggregates the pooled frame features and predicts the action class for the full video
    • predict object-agnostic classes (pour-X rather than pour-cup) ==> generalization to unseen object classes
  • Add anticipation module (see the anticipation sketch after this list)

    • learn how to map the embedding of an inactive (not being interacted with) object image to that of its active counterpart
    • gather inactive object images:
      • from the beginning of sequences
      • from a catalog of static images
    • get the active object image:
      • the frame for which the action classifier is maximally confident about the true action class
    • enforce constraints on the predicted active features:
      • match features with an L2 loss between the mapped feature and the ground-truth active feature
      • an additional classification loss on top of the predicted feature, which should predict the same action class
  • Heatmaps at test time (see the hotspot sketch after this list)

    • activation-mapping based: given the embedding of a particular inactive image and an action a, compute the gradient of the score for that action class with respect to each channel of the embedding; these gradients weight the activations to produce the hotspot heatmap.
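
A minimal PyTorch sketch of the classification branch above, assuming a ResNet-18 backbone, global average pooling, and a single-layer LSTM (module names, dimensions, and the backbone choice are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torchvision.models as models


class ActionClassifier(nn.Module):
    """Spatially-preserving frame encoder + spatial pooling + LSTM classifier."""

    def __init__(self, num_actions, feat_dim=512, hidden_dim=512):
        super().__init__()
        # ResNet-18 truncated before global pooling, so per-frame features
        # keep their spatial layout: (B, 512, H', W')
        resnet = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)  # spatial pooling per frame
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_actions)

    def forward(self, clip):
        # clip: (B, T, 3, H, W) video frames
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1))      # (B*T, 512, H', W')
        pooled = self.pool(feats).flatten(1)          # (B*T, 512)
        pooled = pooled.view(b, t, -1)                # (B, T, 512)
        _, (h_n, _) = self.lstm(pooled)
        return self.classifier(h_n[-1])               # (B, num_actions) logits
```

Training uses standard cross-entropy on the object-agnostic action labels (e.g. "pour" rather than "pour-cup").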
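
A sketch of the anticipation module under the same assumptions (`AnticipationModule`, `anticipation_loss`, and `lambda_cls` are hypothetical names): a small MLP maps the pooled inactive-object feature to an anticipated active feature, supervised by the L2 matching loss and the auxiliary classification loss listed above.

```python
import torch.nn as nn
import torch.nn.functional as F


class AnticipationModule(nn.Module):
    """Maps inactive-object embeddings to anticipated active embeddings."""

    def __init__(self, feat_dim=512, num_actions=20):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # auxiliary head: the predicted active feature should still be
        # classifiable as the true action
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, inactive_feat):
        return self.transform(inactive_feat)


def anticipation_loss(module, inactive_feat, active_feat, action_label, lambda_cls=1.0):
    # inactive_feat, active_feat: (B, D) pooled embeddings; active_feat is taken
    # from the frame where the classifier is most confident about the true class
    pred_active = module(inactive_feat)
    l2 = F.mse_loss(pred_active, active_feat)                         # feature matching
    cls = F.cross_entropy(module.action_head(pred_active), action_label)
    return l2 + lambda_cls * cls
```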
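
A sketch of the test-time hotspot extraction: the gradient of the chosen action's score with respect to each channel of the spatial embedding is used to weight the activations, Grad-CAM style, and the result is upsampled to image resolution. `encoder`, `pool`, and `action_head` stand for the trained modules above; the exact gradient-weighting scheme in the paper may differ.

```python
import torch
import torch.nn.functional as F


def interaction_hotspot(encoder, pool, action_head, image, action_idx):
    # image: (1, 3, H, W) inactive-object image; action_idx: target action class
    image = image.requires_grad_(True)            # ensure a gradient path exists
    feats = encoder(image)                        # (1, C, H', W'), spatial map
    feats.retain_grad()                           # keep grads of this non-leaf tensor
    pooled = pool(feats).flatten(1)               # (1, C)
    score = action_head(pooled)[0, action_idx]    # scalar score for the action
    score.backward()

    # channel weights = spatially averaged gradients of the action score
    weights = feats.grad.mean(dim=(2, 3), keepdim=True)      # (1, C, 1, 1)
    heatmap = F.relu((weights * feats).sum(dim=1))           # (1, H', W')
    heatmap = F.interpolate(heatmap.unsqueeze(1), size=image.shape[-2:],
                            mode="bilinear", align_corners=False).squeeze(1)
    return heatmap / (heatmap.max() + 1e-8)       # normalized hotspot map
```
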
Experiments
  • Evaluation: IoU between predicted hotspots and the ground-truth annotations (see the sketch below)
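
The note above reports IoU against the annotated heatmaps; a possible way to compute it, assuming both the predicted and ground-truth heatmaps are binarized at a fixed threshold (the paper's exact protocol may differ):

```python
import numpy as np


def heatmap_iou(pred, gt, thresh=0.5):
    # pred, gt: 2-D arrays in [0, 1] with the same shape
    pred_mask = pred >= thresh
    gt_mask = gt >= thresh
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 1.0  # both masks empty
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return inter / union
```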