CVPR 2018

[arxiv 1711.08996] Dense 3D Regression for Hand Pose Estimation [PDF] [notes]

Chengde Wan, Thomas Probst, Luc Van Gool, Angela Yao

read 01/07/2018

Objective

Predict 3D and 2D joint coordinates from depth maps. Match or improve on the state of the art in 3D hand pose estimation from depth on the ICVL, MSRA and NYU datasets.

Synthesis

Neural network

  • Input: point coordinates \in \mathbb{R}^{h \times w \times 3} (3 channels for x, y, z)
  • Outputs:
      - 2D heatmaps (typical for hourglass networks)
      - 3D heatmaps \in \mathbb{R}^{h \times w \times 1} (a simple extension of 2D heatmaps to 3D): for each point of the image, encodes the 3D distance to the joint (so, unlike 2D heatmaps, not necessarily regularly centered on the joint)
      - 3D offset unit vector fields \in \mathbb{R}^{h \times w \times 3}: for points less than some radius \theta away from the joint, the unit vector encoding the direction from the point to the joint
  • Architecture : iterative neural network
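
To make the dense outputs above concrete, here is a minimal sketch of how the 3D heatmap and unit vector targets could be built for a single joint. It assumes the heatmap has the linear form S = 1 - dist / \theta inside the radius (inferred from the reconstruction formula used in post-processing) and 0 outside; the function name `dense_targets` and the flattening of the h x w map into a list of points are illustrative choices, not taken from the paper.

```python
import math

def dense_targets(points, joint, theta):
    """Build per-point targets for one joint (illustrative sketch).

    points: list of (x, y, z) tuples, one per pixel (flattened h*w map).
    joint:  (x, y, z) ground-truth joint position.
    theta:  radius beyond which both targets are zero.
    Returns (heatmap, vectors): a score and a unit vector per point.
    """
    zero = (0.0, 0.0, 0.0)
    heatmap, vectors = [], []
    for p in points:
        dist = math.dist(p, joint)
        if dist > theta:
            # Outside the radius: explicit zero targets.
            heatmap.append(0.0)
            vectors.append(zero)
            continue
        # Assumed linear falloff: 1 at the joint, 0 at distance theta.
        heatmap.append(1.0 - dist / theta)
        if dist == 0.0:
            # Direction is undefined exactly at the joint.
            vectors.append(zero)
        else:
            # Unit vector pointing from the point towards the joint.
            vectors.append(tuple((j - c) / dist for j, c in zip(joint, p)))
    return heatmap, vectors
```

For example, with the joint at (3, 4, 0) and \theta = 10, a point at the origin is 5 units away, so it gets a heatmap value of 0.5 and the unit vector (0.6, 0.8, 0).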

Post processing

Uses mean shift: see the algorithm on page 4 of the paper.

  • Step 1: a joint position estimate is computed at each image position as P = D + \theta (1 - S) \odot V, where P is the estimated joint coordinates, D the 3D coordinates of the point at the current image location, \theta the radius of the heatmap, S the 3D heatmap and V the unit vector field. The 3D heatmap decreases linearly with distance (unlike the heatmaps usually used with hourglass networks), so \theta (1 - S) recovers the distance to the joint, and multiplying by V applies that distance in the right direction.
  • Steps 2 + 3: select the top K (usually 5) positions with the highest 3D heatmap scores and get the corresponding estimated coordinates
  • Steps 4 + 5: project onto 2D and get the corresponding 2D heatmap values
  • Steps 6 + 7 + 8: iterative refinement, presumably the mean shift itself (?)
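
The steps above can be sketched roughly as follows. This is not the paper's algorithm: the 2D heatmap weighting of steps 4 + 5 is left out, the candidates are weighted by their 3D heatmap scores only, and the Gaussian kernel, the `bandwidth` parameter and the fixed iteration count are assumptions made for illustration. All function names are hypothetical.

```python
import math

def vote(points, heatmap, vectors, theta):
    """Step 1: per-point joint estimate P = D + theta * (1 - S) * V."""
    return [
        tuple(d + theta * (1.0 - s) * v for d, v in zip(p, vec))
        for p, s, vec in zip(points, heatmap, vectors)
    ]

def top_k(votes, heatmap, k=5):
    """Steps 2 + 3: keep the k votes with the highest heatmap scores."""
    order = sorted(range(len(votes)), key=lambda i: heatmap[i], reverse=True)[:k]
    return [votes[i] for i in order], [heatmap[i] for i in order]

def mean_shift(candidates, weights, bandwidth, iters=10):
    """Weighted mean shift (assumed Gaussian kernel), seeded at the best candidate."""
    best = max(range(len(candidates)), key=lambda i: weights[i])
    mode = candidates[best]
    for _ in range(iters):
        # Weight each candidate by its score and its closeness to the current mode.
        kernel = [
            w * math.exp(-math.dist(c, mode) ** 2 / (2.0 * bandwidth ** 2))
            for c, w in zip(candidates, weights)
        ]
        total = sum(kernel)
        if total == 0.0:
            break  # no support left, keep the current mode
        mode = tuple(
            sum(k * c[axis] for k, c in zip(kernel, candidates)) / total
            for axis in range(3)
        )
    return mode
```

A typical call chain would be `votes = vote(points, S, V, theta)`, then `cands, scores = top_k(votes, S, k=5)`, then `mean_shift(cands, scores, bandwidth)`. Points with S = 0 vote for their own position, but they rank last and are only selected when fewer than K points fall inside the radius.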

Experiments

Compare with various baselines

Direct regression (input to 3D coordinates) performs worst. Adding 2D heatmaps as additional supervision improves the results only marginally (0.2mm, i.e. negligible).

Using the combined loss, which takes advantage of unit vector fields and 3D heatmaps in addition to 2D heatmaps, improves on the hourglass-based 3D regression and 2D detection baseline by ~2.3mm, i.e. not negligible.

Predicting only z and combining it with 2D heatmaps for the x, y joint coordinates outperforms directly regressing x, y, z with an additional loss on 2D heatmaps (which helps to learn better features but is not used in the final coordinate estimation at test time).
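
As an illustration of the "2D detection plus z" baseline, the sketch below takes the peak of a 2D heatmap and lifts it to 3D with a predicted depth. The pinhole back-projection and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions; the notes do not say how the baseline converts pixel coordinates to 3D, and both function names are hypothetical.

```python
def argmax_2d(heatmap):
    """Return (row, col) of the highest value in a 2D list-of-lists heatmap."""
    best_rc, best_val = (0, 0), float("-inf")
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_rc, best_val = (r, c), val
    return best_rc

def backproject(u, v, z, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth z to camera coordinates (assumed pinhole model)."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# Usage: the heatmap peak gives the pixel, the network's z prediction gives the depth.
row, col = argmax_2d([[0.0, 0.1, 0.0], [0.0, 0.2, 0.9]])
joint_3d = backproject(u=col, v=row, z=500.0, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
```

In this toy case the peak is at row 1, column 2, which back-projects to (5.0, 0.0, 500.0) with the made-up intrinsics above.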

Other remarks

For points too far away from the joint, forcing the network to output 0 instead of just excluding them from the loss seems to improve performance.
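
The two options compared in this remark can be sketched as two loss variants. A mean squared error is assumed here purely for illustration (the notes do not state the loss form), and the names `dense_l2` and `masked_l2` are hypothetical.

```python
def dense_l2(pred, target):
    """Loss over all points: far-away points carry an explicit target of 0."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def masked_l2(pred, target, mask):
    """Loss over points inside the radius only: far-away points are ignored."""
    kept = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

# One point inside the radius (target 0.5) and one far away (target 0).
pred = [0.5, 0.3]
target = [0.5, 0.0]
inside = [True, False]
full = dense_l2(pred, target)             # penalises the spurious 0.3 far away
ignored = masked_l2(pred, target, inside)  # the far-away error goes unnoticed
```

With the dense variant the network is penalised for responding on far-away points; with the masked variant that response is unconstrained, which may be why it performs worse.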

Notes

Future work

Work from RGB input and extend to hands grasping objects.