{notes} {paper} {project page} {code PyTorch}
Grounded Human-Object Interaction Hotspots from Video, ICCV'19
Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman
Objective
Learn “interaction hotspots”: object regions that anticipate and explain human-object interactions
Datasets
- OPRA
- EPIC-Kitchens
  - ground-truth object affordance heatmaps are annotated for evaluation
Method
Unlike methods that rely on human-curated annotations (keypoints or segmentation masks), the approach learns object affordances directly from videos of people interacting with objects, with only video-level action labels as supervision.
- Predict action (affordance) classes from video (sketch below)
  - frames are encoded into spatially-preserving features, then spatially pooled
  - an LSTM aggregates the pooled features and predicts the action class for the full video
  - predict object-agnostic classes (pour-X vs pour-cup) ==> generalization to unseen object classes
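A minimal PyTorch sketch of this classification branch, assuming a torchvision ResNet-18 backbone, average spatial pooling and a single-layer LSTM (module names, dimensions and the number of actions are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torchvision.models as models


class ActionClassifier(nn.Module):
    def __init__(self, num_actions, feat_dim=512, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)  # any spatial CNN works here
        # keep layers up to the last conv block so spatial resolution is preserved
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_actions)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) video clip
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))      # (B*T, 512, h, w), spatial
        pooled = feats.mean(dim=(2, 3)).view(B, T, -1)  # spatial average pooling
        out, _ = self.lstm(pooled)                      # temporal aggregation
        logits = self.fc(out[:, -1])                    # clip-level action scores
        return logits, feats


# object-agnostic action labels ("pour", "open", ...), not object-specific ones
model = ActionClassifier(num_actions=7)
video = torch.randn(2, 8, 3, 224, 224)
logits, spatial_feats = model(video)
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 3]))
```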
- Add an anticipation module (loss sketch below)
  - learns to map the feature of an inactive object image (not being interacted with) to that of its active counterpart
  - inactive object images are gathered:
    - from the beginning of sequences
    - from a catalog of static images
  - active object image:
    - the frame for which the action classifier is maximally confident about the true action class
  - constraints on the predicted active feature:
    - L2 loss between the mapped feature and the ground-truth active feature
    - an additional classification loss on top of the predicted feature, which should predict the same action class
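A sketch of the two anticipation losses, assuming pooled features from the branch above and a small MLP as the anticipation function (`anticipation`, `classifier` and the dimensions are illustrative placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_actions = 512, 7

# small MLP mapping an inactive-object feature to a hypothesized active feature
anticipation = nn.Sequential(
    nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
)
classifier = nn.Linear(feat_dim, num_actions)  # action classification head


def anticipation_losses(inactive_feat, active_feat, action_label):
    # inactive_feat: pooled feature of the object before interaction, (B, D)
    # active_feat:   pooled feature of the most confidently classified frame, (B, D)
    predicted_active = anticipation(inactive_feat)
    # 1) match the hallucinated feature to the ground-truth active feature
    l2_loss = F.mse_loss(predicted_active, active_feat)
    # 2) the hallucinated feature should still predict the correct action class
    cls_loss = F.cross_entropy(classifier(predicted_active), action_label)
    return l2_loss + cls_loss


# usage with dummy features
loss = anticipation_losses(
    torch.randn(4, feat_dim), torch.randn(4, feat_dim),
    torch.randint(0, num_actions, (4,))
)
```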
- Heatmaps at test time (sketch below)
  - activation-mapping based: given a particular inactive image embedding and an action a, compute the gradient of the score for that action class with respect to each channel of the embedding, and use the gradient-weighted activations as the hotspot map
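A Grad-CAM-style sketch of this hotspot computation, assuming the `ActionClassifier` sketched above and passing the inactive image as a one-frame clip; this simplified `interaction_hotspot` helper only shows gradient-weighted activation mapping on the classification branch, not the authors' exact procedure:

```python
import torch
import torch.nn.functional as F


def interaction_hotspot(model, image, action_idx, out_size=224):
    # image: (3, H, W) inactive object image; model: the classification branch above
    frames = image.unsqueeze(0).unsqueeze(0)                   # (1, 1, 3, H, W)
    logits, feats = model(frames)                              # feats: (1, C, h, w)
    score = logits[0, action_idx]                              # score for action a
    grads, = torch.autograd.grad(score, feats)                 # d(score)/d(embedding)
    weights = grads.mean(dim=(2, 3), keepdim=True)             # one weight per channel
    cam = F.relu((weights * feats).sum(dim=1, keepdim=True))   # gradient-weighted map
    cam = F.interpolate(cam, size=(out_size, out_size),
                        mode="bilinear", align_corners=False)
    cam = cam.squeeze()
    return cam / (cam.max() + 1e-8)                            # normalized hotspot map


# hotspot = interaction_hotspot(model, torch.randn(3, 224, 224), action_idx=2)
```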
Experiments
- Evaluation: IoU with ground-truth annotations (sketch below)
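A minimal sketch of such an IoU score, assuming both heatmaps are normalized to [0, 1] and thresholded into binary masks (the threshold and normalization are assumptions; the exact evaluation protocol is not detailed in these notes):

```python
import torch


def heatmap_iou(pred, gt, threshold=0.5):
    # pred, gt: (H, W) heatmaps normalized to [0, 1]; threshold is an assumption
    pred_mask = pred >= threshold
    gt_mask = gt >= threshold
    intersection = (pred_mask & gt_mask).sum().float()
    union = (pred_mask | gt_mask).sum().float()
    return (intersection / union.clamp(min=1)).item()
```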