
Grounded Human-Object Interaction Hotspots from Video, ICCV'19 {notes} {paper} {project page} {code PyTorch}

Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

Objective

Learn “interaction hotspots”: object regions that anticipate and explain human-object interactions

Datasets
  • OPRA

  • EPIC-Kitchens

    • ground-truth object affordance heatmaps are annotated for evaluation

Method

Unlike methods that rely on human-curated annotations (keypoints or segmentation masks), the approach learns object affordances directly from videos of people interacting with objects.

  • Predict object affordance classes (see the classification sketch after this list)

    • frames are encoded into spatially-preserving feature maps, then spatially pooled
    • an LSTM aggregates the pooled frame features and predicts the action class for the full video
    • predict object-agnostic classes (pour-X rather than pour-cup) ==> generalization to unseen object classes
  • Add anticipation module (see the anticipation sketch after this list)

    • learn how to map the embedding of an inactive (not being interacted with) object image to that of its active counterpart
    • gather inactive object images:
      • from the beginning of sequences
      • from a catalog of static images
    • get the active object image:
      • the frame for which the action classifier is maximally confident about the true action class
    • enforce constraints on the predicted active features:
      • match features with an L2 loss between the mapped feature and the ground-truth active feature
      • an additional classification loss on top of the predicted feature, which should predict the same action class
  • Heatmaps at test time (see the hotspot sketch after this list)

    • activation-mapping based: given the embedding of a particular inactive image and an action a, compute the gradient of the score for that action class with respect to each channel of the embedding; these gradients weight the activations to produce the hotspot heatmap.
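
A minimal PyTorch sketch of the classification branch above, assuming a ResNet-18 backbone, global average pooling, and a single-layer LSTM (module names, dimensions, and the backbone choice are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torchvision.models as models


class ActionClassifier(nn.Module):
    """Spatially-preserving frame encoder + spatial pooling + LSTM classifier."""

    def __init__(self, num_actions, feat_dim=512, hidden_dim=512):
        super().__init__()
        # ResNet-18 truncated before global pooling, so per-frame features
        # keep their spatial layout: (B, 512, H', W')
        resnet = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)  # spatial pooling per frame
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_actions)

    def forward(self, clip):
        # clip: (B, T, 3, H, W) video frames
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1))      # (B*T, 512, H', W')
        pooled = self.pool(feats).flatten(1)          # (B*T, 512)
        pooled = pooled.view(b, t, -1)                # (B, T, 512)
        _, (h_n, _) = self.lstm(pooled)
        return self.classifier(h_n[-1])               # (B, num_actions) logits
```

Training uses standard cross-entropy on the object-agnostic action labels (e.g. "pour" rather than "pour-cup").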
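
A sketch of the anticipation module under the same assumptions (`AnticipationModule`, `anticipation_loss`, and `lambda_cls` are hypothetical names): a small MLP maps the pooled inactive-object feature to an anticipated active feature, supervised by the L2 matching loss and the auxiliary classification loss listed above.

```python
import torch.nn as nn
import torch.nn.functional as F


class AnticipationModule(nn.Module):
    """Maps inactive-object embeddings to anticipated active embeddings."""

    def __init__(self, feat_dim=512, num_actions=20):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # auxiliary head: the predicted active feature should still be
        # classifiable as the true action
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, inactive_feat):
        return self.transform(inactive_feat)


def anticipation_loss(module, inactive_feat, active_feat, action_label, lambda_cls=1.0):
    # inactive_feat, active_feat: (B, D) pooled embeddings; active_feat is taken
    # from the frame where the classifier is most confident about the true class
    pred_active = module(inactive_feat)
    l2 = F.mse_loss(pred_active, active_feat)                         # feature matching
    cls = F.cross_entropy(module.action_head(pred_active), action_label)
    return l2 + lambda_cls * cls
```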
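
A sketch of the test-time hotspot extraction: the gradient of the chosen action's score with respect to each channel of the spatial embedding is used to weight the activations, Grad-CAM style, and the result is upsampled to image resolution. `encoder`, `pool`, and `action_head` stand for the trained modules above; the exact gradient-weighting scheme in the paper may differ.

```python
import torch
import torch.nn.functional as F


def interaction_hotspot(encoder, pool, action_head, image, action_idx):
    # image: (1, 3, H, W) inactive-object image; action_idx: target action class
    image = image.requires_grad_(True)            # ensure a gradient path exists
    feats = encoder(image)                        # (1, C, H', W'), spatial map
    feats.retain_grad()                           # keep grads of this non-leaf tensor
    pooled = pool(feats).flatten(1)               # (1, C)
    score = action_head(pooled)[0, action_idx]    # scalar score for the action
    score.backward()

    # channel weights = spatially averaged gradients of the action score
    weights = feats.grad.mean(dim=(2, 3), keepdim=True)      # (1, C, 1, 1)
    heatmap = F.relu((weights * feats).sum(dim=1))           # (1, H', W')
    heatmap = F.interpolate(heatmap.unsqueeze(1), size=image.shape[-2:],
                            mode="bilinear", align_corners=False).squeeze(1)
    return heatmap / (heatmap.max() + 1e-8)       # normalized hotspot map
```
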
Experiments
  • Evaluation: IoU between predicted hotspots and the ground-truth annotations (see the sketch below)
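
The note above reports IoU against the annotated heatmaps; a possible way to compute it, assuming both the predicted and ground-truth heatmaps are binarized at a fixed threshold (the paper's exact protocol may differ):

```python
import numpy as np


def heatmap_iou(pred, gt, thresh=0.5):
    # pred, gt: 2-D arrays in [0, 1] with the same shape
    pred_mask = pred >= thresh
    gt_mask = gt >= thresh
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 1.0  # both masks empty
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return inter / union
```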