
Arxiv 2019

[arxiv 1907.01481] HO-3D: A Multi-User, Multi-Object Dataset for Joint 3D Hand-Object Pose Estimation [PDF] [code] [notes]

Shreyas Hampali, Markus Oberweger, Mahdi Rad, and Vincent Lepetit

read 02/08/2019

Objective

  • generate a dataset with semi-automatic annotations
  • a method that estimates annotations from an RGB-D sequence, optimized globally over the whole sequence
  • a pose estimation algorithm that takes a single RGB frame as input

Dataset

  • 15 sequences (14 training / 1 testing)
  • 8 different people manipulating 24 objects (8×3 according to the text?)

Setup

  • 1 RGB-D camera
  • 1 registered side-view RGB camera, used mostly for evaluation against manual annotations in this view

The 3D hand pose is predicted relative to the object centroid.
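
A minimal sketch of this convention, with illustrative values (the centroid, joint count, and units are assumptions):

```python
import numpy as np

# Illustrative values: 21 hand joints predicted relative to the object
# centroid; absolute camera-space joints are recovered by adding the
# centroid back.
object_centroid = np.array([0.05, -0.02, 0.60])  # meters, camera frame (assumed)
joints_relative = np.zeros((21, 3))              # centroid-relative prediction
joints_absolute = joints_relative + object_centroid
```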

Annotation

Optimization

  • depth term: thresholded L2 loss (to suppress sensor noise) over the per-pixel minimum of the rendered object and hand depth maps
  • silhouette discrepancy: against hand and object masks obtained from DeepLabv3, trained for the object-mask / hand-mask / depth-mask task
  • non-interpenetration: detect collisions using the closest object vertex to each hand joint
  • temporal smoothing: simple continuity enforced with an L2 loss (a cost sketch follows this list)
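
A minimal sketch of how these four terms could combine into one per-frame cost. All weights are omitted, and every name and detail here (`thresholded_l2`, the margin-based interpenetration penalty) is an assumption rather than the paper's exact formulation:

```python
import numpy as np

def thresholded_l2(residual, tau):
    """L2 loss clipped at tau, to limit the influence of depth sensor noise."""
    sq = residual ** 2
    return np.minimum(sq, tau ** 2).sum()

def annotation_cost(rendered_hand_depth, rendered_obj_depth, observed_depth,
                    rendered_mask, segmented_mask,
                    hand_joints, obj_vertices,
                    pose_t, pose_t_prev, tau=0.02, margin=0.005):
    # Depth term: compare the observed depth map against the closer of the
    # rendered hand and object surfaces at each pixel.
    rendered_depth = np.minimum(rendered_hand_depth, rendered_obj_depth)
    depth_term = thresholded_l2(rendered_depth - observed_depth, tau)

    # Silhouette term: discrepancy between the rendered masks and the
    # hand/object masks predicted by a segmentation network (DeepLabv3).
    silhouette_term = np.abs(rendered_mask - segmented_mask).sum()

    # Interpenetration term: penalize hand joints whose distance to the
    # closest object vertex falls below a margin (assumed formulation).
    dists = np.linalg.norm(
        obj_vertices[None, :, :] - hand_joints[:, None, :], axis=-1).min(axis=1)
    interpenetration_term = np.maximum(margin - dists, 0.0).sum()

    # Temporal smoothness: simple L2 continuity between consecutive poses.
    smooth_term = ((pose_t - pose_t_prev) ** 2).sum()

    return depth_term + silhouette_term + interpenetration_term + smooth_term
```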

Initialization

  • initialize the object pose for 1 frame
  • estimate object poses first over the whole sequence, starting from the annotated frame and propagating in both the forward and backward directions (see the sketch after this list)
  • then optimize one frame at a time, taking all the other frames into account, for hand+object, again in forward and backward directions
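
A sketch of that forward/backward propagation, assuming a hypothetical single-frame tracker `track_object(prev_pose, frame_idx)` standing in for the per-frame optimization:

```python
def propagate_poses(num_frames, annotated_idx, init_pose, track_object):
    """Starting from the single manually annotated frame, track the object
    pose frame-by-frame in both temporal directions."""
    poses = [None] * num_frames
    poses[annotated_idx] = init_pose
    for t in range(annotated_idx + 1, num_frames):  # forward pass
        poses[t] = track_object(poses[t - 1], t)
    for t in range(annotated_idx - 1, -1, -1):      # backward pass
        poses[t] = track_object(poses[t + 1], t)
    return poses
```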

3D reconstruction from a single RGB input

2D network

  • the network predicts the 2D projections of the object's 3D bounding-box corners and the 2D wrist joint location of the hand
  • the 3D object pose is computed with PnP from the 2D projections and the corresponding 3D bounding box (see the sketch below)
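
A runnable sketch of the PnP step with OpenCV's `cv2.solvePnP`. The intrinsics and cube model are illustrative stand-ins, and the 2D corners are synthesized by projecting at a known pose so the example runs end-to-end (in the method they come from the 2D network):

```python
import cv2
import numpy as np

# bbox_3d: the 8 corners of the object's 3D bounding box in the model frame.
s = 0.05  # half-extent of a 10 cm cube, as a stand-in model
bbox_3d = np.array([[x, y, z] for x in (-s, s) for y in (-s, s) for z in (-s, s)],
                   dtype=np.float32)
K = np.array([[615.0, 0.0, 320.0],
              [0.0, 615.0, 240.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)  # assumed intrinsics

# bbox_2d would come from the 2D network; here we synthesize it by
# projecting the corners at a known pose.
rvec_gt = np.array([[0.1], [0.2], [0.0]], dtype=np.float32)
tvec_gt = np.array([[0.0], [0.0], [0.6]], dtype=np.float32)
bbox_2d, _ = cv2.projectPoints(bbox_3d, rvec_gt, tvec_gt, K, None)

# Recover the 6D object pose from the 2D-3D correspondences.
ok, rvec, tvec = cv2.solvePnP(bbox_3d, bbox_2d, K, None)
print(ok, rvec.ravel(), tvec.ravel())  # should match the synthetic pose
```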

Lifting network

  • takes a cropped image centered on the hand, predicts the 3D joint locations relative to the wrist joint and the depth of the wrist joint (combined as sketched below)
  • uses a VGG-style base network
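
A sketch (an assumed reconstruction step, not spelled out in these notes) of combining the root-relative joints and predicted wrist depth with the 2D wrist location and camera intrinsics to recover absolute joints:

```python
import numpy as np

def absolute_joints(joints_rel, wrist_uv, wrist_depth, K):
    """Back-project the 2D wrist to 3D using its predicted depth, then add
    the wrist position to the root-relative joint predictions."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    wrist_xyz = np.array([
        (wrist_uv[0] - cx) * wrist_depth / fx,
        (wrist_uv[1] - cy) * wrist_depth / fy,
        wrist_depth,
    ])
    return joints_rel + wrist_xyz  # (21, 3) absolute camera-space joints
```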

Generate synthetic data

Evaluation

  • manually annotated 2D frames in the orthogonal (side) view

    • 10 randomly chosen frames from each of the 15 sequences (150 frames in total)
    • annotate visible hand joints and the 2D locations of visible object corners
    • reproject the 3D poses onto the second camera
    • compute 2D distances between the manual annotations and the projected semi-automatic annotations
  • depth-based evaluation

    • difference between the ground-truth depth maps and depth maps rendered from the estimated poses (both protocols are sketched below)
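
A minimal sketch of both protocols; the function names, the side-view extrinsics `R, t`, and the use of a mean absolute depth difference are assumptions:

```python
import numpy as np

def reprojection_error_2d(manual_2d, auto_3d, K, R, t):
    """Project the semi-automatic 3D annotations into the side-view camera
    (extrinsics R, t) and measure pixel distance to the manual annotations."""
    cam = (R @ auto_3d.T).T + t        # to side-view camera frame
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]  # perspective divide
    return np.linalg.norm(proj - manual_2d, axis=1).mean()

def depth_error(gt_depth, rendered_depth):
    """Mean absolute difference between the captured depth map and the one
    rendered from the estimated hand+object poses, over valid pixels."""
    valid = (gt_depth > 0) & (rendered_depth > 0)
    return np.abs(gt_depth[valid] - rendered_depth[valid]).mean()
```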