
Arxiv 2019

[arxiv 1907.01481] HO-3D: A Multi-User, Multi-Object Dataset for Joint 3D Hand-Object Pose Estimation [PDF] [code] [notes]

Shreyas Hampali, Markus Oberweger, Mahdi Rad, and Vincent Lepetit

read 02/08/2019

Objective

  • generate a dataset with semi-automatic annotations
  • a method that estimates annotations from an RGB-D sequence, optimized globally over the whole sequence
  • a pose estimation algorithm that takes a single RGB frame as input

Dataset

  • 15 sequences (14 training / 1 testing)
  • 8 different people manipulating 24 objects (8×3 according to the text?)

Setup

  • 1 RGB-D camera
  • 1 registered side-view RGB camera, used mostly for evaluation against manual annotations in this view

The 3D hand pose is predicted relative to the object centroid.
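
A minimal sketch of this convention, with illustrative values (the centroid, joint count, and units are assumptions):

```python
import numpy as np

# Illustrative values: 21 hand joints predicted relative to the object
# centroid; absolute camera-space joints are recovered by adding the
# centroid back.
object_centroid = np.array([0.05, -0.02, 0.60])  # meters, camera frame (assumed)
joints_relative = np.zeros((21, 3))              # centroid-relative prediction
joints_absolute = joints_relative + object_centroid
```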

Annotation

Optimization

  • depth term: thresholded L2 loss (to suppress sensor noise) over the per-pixel minimum of the rendered object and hand depth maps
  • silhouette discrepancy: against hand and object masks obtained from DeepLabv3, trained for the object-mask / hand-mask / depth-mask task
  • non-interpenetration: detect collisions using the closest object vertex to each hand joint
  • temporal smoothing: simple continuity enforced with an L2 loss (a cost sketch follows this list)
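
A minimal sketch of how these four terms could combine into one per-frame cost. All weights are omitted, and every name and detail here (`thresholded_l2`, the margin-based interpenetration penalty) is an assumption rather than the paper's exact formulation:

```python
import numpy as np

def thresholded_l2(residual, tau):
    """L2 loss clipped at tau, to limit the influence of depth sensor noise."""
    sq = residual ** 2
    return np.minimum(sq, tau ** 2).sum()

def annotation_cost(rendered_hand_depth, rendered_obj_depth, observed_depth,
                    rendered_mask, segmented_mask,
                    hand_joints, obj_vertices,
                    pose_t, pose_t_prev, tau=0.02, margin=0.005):
    # Depth term: compare the observed depth map against the closer of the
    # rendered hand and object surfaces at each pixel.
    rendered_depth = np.minimum(rendered_hand_depth, rendered_obj_depth)
    depth_term = thresholded_l2(rendered_depth - observed_depth, tau)

    # Silhouette term: discrepancy between the rendered masks and the
    # hand/object masks predicted by a segmentation network (DeepLabv3).
    silhouette_term = np.abs(rendered_mask - segmented_mask).sum()

    # Interpenetration term: penalize hand joints whose distance to the
    # closest object vertex falls below a margin (assumed formulation).
    dists = np.linalg.norm(
        obj_vertices[None, :, :] - hand_joints[:, None, :], axis=-1).min(axis=1)
    interpenetration_term = np.maximum(margin - dists, 0.0).sum()

    # Temporal smoothness: simple L2 continuity between consecutive poses.
    smooth_term = ((pose_t - pose_t_prev) ** 2).sum()

    return depth_term + silhouette_term + interpenetration_term + smooth_term
```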

Initialization

  • initialize the object pose for 1 frame
  • estimate object poses first over the whole sequence, starting from the annotated frame and propagating in both the forward and backward directions (see the sketch after this list)
  • then optimize one frame at a time, taking all the other frames into account, for hand+object, again in forward and backward directions
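
A sketch of that forward/backward propagation, assuming a hypothetical single-frame tracker `track_object(prev_pose, frame_idx)` standing in for the per-frame optimization:

```python
def propagate_poses(num_frames, annotated_idx, init_pose, track_object):
    """Starting from the single manually annotated frame, track the object
    pose frame-by-frame in both temporal directions."""
    poses = [None] * num_frames
    poses[annotated_idx] = init_pose
    for t in range(annotated_idx + 1, num_frames):  # forward pass
        poses[t] = track_object(poses[t - 1], t)
    for t in range(annotated_idx - 1, -1, -1):      # backward pass
        poses[t] = track_object(poses[t + 1], t)
    return poses
```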

3D reconstruction from a single RGB input

2D network

  • the network predicts the 2D projections of the object's 3D bounding-box corners and the 2D wrist joint location of the hand
  • the 3D object pose is computed with PnP from the 2D projections and the corresponding 3D bounding box (see the sketch below)
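
A runnable sketch of the PnP step with OpenCV's `cv2.solvePnP`. The intrinsics and cube model are illustrative stand-ins, and the 2D corners are synthesized by projecting at a known pose so the example runs end-to-end (in the method they come from the 2D network):

```python
import cv2
import numpy as np

# bbox_3d: the 8 corners of the object's 3D bounding box in the model frame.
s = 0.05  # half-extent of a 10 cm cube, as a stand-in model
bbox_3d = np.array([[x, y, z] for x in (-s, s) for y in (-s, s) for z in (-s, s)],
                   dtype=np.float32)
K = np.array([[615.0, 0.0, 320.0],
              [0.0, 615.0, 240.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)  # assumed intrinsics

# bbox_2d would come from the 2D network; here we synthesize it by
# projecting the corners at a known pose.
rvec_gt = np.array([[0.1], [0.2], [0.0]], dtype=np.float32)
tvec_gt = np.array([[0.0], [0.0], [0.6]], dtype=np.float32)
bbox_2d, _ = cv2.projectPoints(bbox_3d, rvec_gt, tvec_gt, K, None)

# Recover the 6D object pose from the 2D-3D correspondences.
ok, rvec, tvec = cv2.solvePnP(bbox_3d, bbox_2d, K, None)
print(ok, rvec.ravel(), tvec.ravel())  # should match the synthetic pose
```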

Lifting network

  • takes a cropped image centered on the hand, predicts the 3D joint locations relative to the wrist joint and the depth of the wrist joint (combined as sketched below)
  • uses a VGG-style base network
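
A sketch (an assumed reconstruction step, not spelled out in these notes) of combining the root-relative joints and predicted wrist depth with the 2D wrist location and camera intrinsics to recover absolute joints:

```python
import numpy as np

def absolute_joints(joints_rel, wrist_uv, wrist_depth, K):
    """Back-project the 2D wrist to 3D using its predicted depth, then add
    the wrist position to the root-relative joint predictions."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    wrist_xyz = np.array([
        (wrist_uv[0] - cx) * wrist_depth / fx,
        (wrist_uv[1] - cy) * wrist_depth / fy,
        wrist_depth,
    ])
    return joints_rel + wrist_xyz  # (21, 3) absolute camera-space joints
```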

Generate synthetic data

Evaluation

  • manually annotated 2D frames in the orthogonal (side) view

    • 10 randomly chosen frames from each of the 15 sequences (150 frames in total)
    • annotate visible hand joints and the 2D locations of visible object corners
    • reproject the 3D poses onto the second camera
    • compute 2D distances between the manual annotations and the projected semi-automatic annotations
  • depth-based evaluation

    • difference between the ground-truth depth maps and depth maps rendered from the estimated poses (both protocols are sketched below)
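
A minimal sketch of both protocols; the function names, the side-view extrinsics `R, t`, and the use of a mean absolute depth difference are assumptions:

```python
import numpy as np

def reprojection_error_2d(manual_2d, auto_3d, K, R, t):
    """Project the semi-automatic 3D annotations into the side-view camera
    (extrinsics R, t) and measure pixel distance to the manual annotations."""
    cam = (R @ auto_3d.T).T + t        # to side-view camera frame
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]  # perspective divide
    return np.linalg.norm(proj - manual_2d, axis=1).mean()

def depth_error(gt_depth, rendered_depth):
    """Mean absolute difference between the captured depth map and the one
    rendered from the estimated hand+object poses, over valid pixels."""
    valid = (gt_depth > 0) & (rendered_depth > 0)
    return np.abs(gt_depth[valid] - rendered_depth[valid]).mean()
```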