1907.01481 - hassony2/inria-research-wiki GitHub Wiki
Arxiv 2019
[arxiv 1907.01481] HO-3D: A Multi-User, Multi-Object Dataset for Joint 3D Hand-Object Pose Estimation [PDF] [code] [notes]
_Shreyas Hampali, Markus Oberweger, Mahdi Rad, and Vincent Lepetit_
read 02/08/2019
Objective
- create a dataset with semi-automatic annotations
- a method that estimates the annotations from an RGB-D sequence, optimized globally over the whole sequence
- a pose estimation algorithm that takes a single RGB frame as input
Dataset
- 15 sequences (14 training/1 testing)
- 8 different people manipulating 24 objects (8x3 according to text?)
Setup
- 1 RGB-D camera
- 1 registered side-view RGB camera, used mostly for evaluation against manual annotations in this view
3D hand pose is predicted relative to the object centroid
Annotation
Optimization
- depth term: thresholded L2 loss (for robustness to sensor noise) between the observed depth map and the per-pixel minimum of the rendered object and hand depth maps
- silhouette discrepancy: compares rendered silhouettes to masks obtained from DeepLabv3 trained to predict object, hand, and depth masks
- non-interpenetration: detects collisions using the closest object vertex to each hand joint
- temporal smoothing: simple continuity enforced with an L2 loss (a combined sketch of all four terms follows this list)
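A minimal numpy sketch of the four cost terms as I read them; the function names, the threshold `tau`, and the collision `margin` are my own assumptions, not values from the paper:

```python
import numpy as np

def depth_term(rendered_hand_depth, rendered_object_depth, observed_depth, tau=0.02):
    # Scene depth is the per-pixel minimum of the hand and object renderings;
    # residuals are clipped at tau (meters) so sensor noise does not dominate.
    rendered = np.minimum(rendered_hand_depth, rendered_object_depth)
    residual = np.minimum(np.abs(rendered - observed_depth), tau)
    return np.sum(residual ** 2)

def silhouette_term(rendered_mask, predicted_mask):
    # Discrepancy between the rendered silhouette and the DeepLabv3 mask.
    return np.sum((rendered_mask.astype(float) - predicted_mask.astype(float)) ** 2)

def interpenetration_term(hand_joints, object_vertices, margin=0.005):
    # Penalize hand joints closer than `margin` (meters) to the object,
    # using the closest object vertex per joint as a proxy for the surface.
    d = np.linalg.norm(hand_joints[:, None, :] - object_vertices[None, :, :], axis=-1)
    closest = d.min(axis=1)  # (num_joints,)
    return np.sum(np.maximum(margin - closest, 0.0) ** 2)

def smoothness_term(pose_t, pose_prev):
    # Simple L2 continuity between consecutive frame poses.
    return np.sum((pose_t - pose_prev) ** 2)
```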
Initialization
- initialize the object pose for 1 frame
- first estimate object poses over the whole sequence, propagating from the annotated frame in the forward and backward directions
- then optimize one frame at a time for hand+object, taking all the other frames into account, again in forward and backward directions (scheduled as sketched below)
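A rough sketch of how the forward/backward propagation could be scheduled; `refine` stands in for the per-frame pose optimization and is hypothetical:

```python
def propagation_order(num_frames, annotated_idx):
    # Visit frames outward from the manually annotated one, so every frame
    # can be initialized from an already-optimized neighbor.
    forward = list(range(annotated_idx + 1, num_frames))
    backward = list(range(annotated_idx - 1, -1, -1))
    return forward, backward

def track_sequence(frames, annotated_idx, init_pose, refine):
    poses = {annotated_idx: init_pose}
    forward, backward = propagation_order(len(frames), annotated_idx)
    for i in forward:
        poses[i] = refine(frames[i], poses[i - 1])  # previous frame as init
    for i in backward:
        poses[i] = refine(frames[i], poses[i + 1])  # next frame as init
    return poses
```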
3D reconstruction from a single RGB input
2D network
- network predicts the 2D projections of the object's 3D bounding-box corners and the 2D wrist joint location of the hand
- 3D object pose is computed using PnP from the 2D projections and the corresponding 3D bounding-box corners (see the OpenCV sketch below)
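PnP itself is standard; a minimal OpenCV sketch, assuming the 8 bounding-box corners in model space and known camera intrinsics `K`:

```python
import numpy as np
import cv2

def object_pose_from_bbox(bbox_3d, bbox_2d, K, dist_coeffs=None):
    # bbox_3d: (8, 3) corners of the object's 3D bounding box in model space
    # bbox_2d: (8, 2) corresponding 2D projections predicted by the network
    # K:       (3, 3) camera intrinsics
    if dist_coeffs is None:
        dist_coeffs = np.zeros(4)
    ok, rvec, tvec = cv2.solvePnP(
        bbox_3d.astype(np.float64),
        bbox_2d.astype(np.float64),
        K.astype(np.float64),
        dist_coeffs,
    )
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec              # object pose in the camera frame
```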
Lifting network
- takes a cropped image centered on the hand, predicts 3D joint locations relative to the wrist joint and the depth of the wrist joint (combined into absolute joints as sketched below)
- uses a VGG-style base network
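A sketch of how the root-relative joints and the predicted wrist depth could be combined into absolute camera-space joints; the exact parameterization in the paper may differ:

```python
import numpy as np

def absolute_joints(rel_joints, wrist_2d, wrist_depth, K):
    # rel_joints:  (21, 3) joint positions relative to the wrist
    # wrist_2d:    (2,) 2D wrist location from the 2D network
    # wrist_depth: scalar wrist depth predicted by the lifting network
    # K:           (3, 3) camera intrinsics
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project the 2D wrist to a 3D point at the predicted depth
    wrist_3d = np.array([
        (wrist_2d[0] - cx) * wrist_depth / fx,
        (wrist_2d[1] - cy) * wrist_depth / fy,
        wrist_depth,
    ])
    return rel_joints + wrist_3d  # translate relative joints to camera space
```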
Generate synthetic data
Evaluation
- manually annotated 2D frames in the orthogonal (side) view
- 10 randomly chosen frames from each of the 15 sequences (150 frames in total)
- annotate visible hand joints and the 2D locations of visible object corners
- reproject the 3D pose onto the second camera
- compute 2D distances between the manual annotations and the projected semi-automatic annotations (sketched below)
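A small sketch of this reprojection metric, assuming a 3x4 projection matrix `P` for the side-view camera and a visibility mask over the manual annotations:

```python
import numpy as np

def project(points_3d, P):
    # Project 3D points with a 3x4 projection matrix P = K [R|t].
    homog = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    uvw = homog @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_error(points_3d, manual_2d, P, visible):
    # Mean pixel distance between manual side-view annotations and the
    # projected semi-automatic 3D annotations, over visible points only.
    distances = np.linalg.norm(project(points_3d, P) - manual_2d, axis=1)
    return distances[visible].mean()
```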
- depth-based evaluation
- difference between the ground-truth depth maps and depth maps rendered from the estimated poses (sketched below)
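A sketch of this metric; the valid-pixel handling and the `max_depth` cutoff are my assumptions:

```python
import numpy as np

def depth_error(gt_depth, rendered_depth, max_depth=1.0):
    # Mean absolute difference between the captured depth map and the depth
    # map rendered from the estimated poses, over mutually valid pixels.
    valid = (gt_depth > 0) & (gt_depth < max_depth) & (rendered_depth > 0)
    return np.abs(gt_depth[valid] - rendered_depth[valid]).mean()
```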