1704.02224 - hassony2/inria-research-wiki GitHub Wiki

Arxiv 2017

[arxiv 1704.02224] Hand3D: Hand Pose Estimation using 3D Neural Network [PDF] [project page] [notes]

Xiaoming Deng, Shuo Yang, Yinda Zhang, Ping Tan, Liang Chang, Hongan Wang

read 19/05/2017

Objective

Estimate 3D hand pose from a single depth image.

Convert the depth map to a 3D volumetric representation and feed it into a 3D CNN that directly regresses the pose in 3D

Synthesis

Train an FCN (fully convolutional network) to complete the hand shape from a single depth view.

Augment existing datasets : use existing poses to render depth images

Pipeline

  • build a reference coordinate frame centered at the COM (center of mass) of the foreground region

  • convert the input depth to a truncated signed distance function (TSDF) of resolution 60^3

  • TSDF refinement network : removes artifacts caused by noisy and missing depth

    • Hourglass-like structure :
      • (2 × 3D conv, pooling, ReLU) × 3
      • (2 × 3D conv, up-pooling, ReLU) × 3
      • with shortcuts between layers of the same scale
  • the refined TSDF is fed to a 3D pose network to estimate the 3D location of each joint

    • Structure : 2 × 3D conv, pooling, ReLU, then 3 FC (fully connected) layers
    • Output : joint positions as coordinates (vector of size 3 × number of joints)
    • L2 loss
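
The first two pipeline steps (COM-centered reference frame, depth to TSDF) can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the 300 mm cube size, the 30 mm truncation distance, and the projective (along the viewing axis) approximation of the signed distance are all illustrative choices. Only the COM-centered frame and the 60^3 resolution come from the notes above. The sign follows the convention used in the Definitions section (positive inside, i.e. behind the observed surface).

```python
import numpy as np

def depth_to_tsdf(depth, fx, fy, cx, cy, res=60, cube_mm=300.0, trunc_mm=30.0):
    """Projective TSDF of a depth map (mm, 0 = background) on a res^3 grid.

    Assumptions (not from the paper): pinhole intrinsics, cube size,
    truncation distance, and the projective distance approximation.
    """
    # Back-project foreground pixels to 3D camera coordinates.
    v, u = np.nonzero(depth > 0)
    z = depth[v, u].astype(np.float64)
    points = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)

    # Reference frame: origin at the center of mass of the foreground points.
    com = points.mean(axis=0)

    # Voxel centers of a cube of side cube_mm centered on the COM.
    step = cube_mm / res
    half = cube_mm / 2.0
    axis = np.linspace(-half + step / 2, half - step / 2, res)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    vox = np.stack([gx, gy, gz], axis=-1) + com  # shape (res, res, res, 3)

    # Project every voxel center into the image plane.
    zc = np.maximum(vox[..., 2], 1e-6)  # guard against division by zero
    pu = np.round(vox[..., 0] * fx / zc + cx).astype(int)
    pv = np.round(vox[..., 1] * fy / zc + cy).astype(int)
    h, w = depth.shape
    inside = (pu >= 0) & (pu < w) & (pv >= 0) & (pv < h) & (vox[..., 2] > 0)

    # Observed depth along each voxel's ray (0 where nothing was observed).
    observed = np.zeros(vox.shape[:3])
    observed[inside] = depth[pv[inside], pu[inside]]

    # Positive behind the surface (inside), negative in front (outside);
    # voxels without an observation are treated as fully outside.
    sdf = np.where(observed > 0, vox[..., 2] - observed, -trunc_mm)
    tsdf = np.clip(sdf / trunc_mm, -1.0, 1.0)
    return tsdf, com
```

The returned volume is normalized to [-1, 1] and could be passed to a 3D CNN after adding batch and channel dimensions. A real hand pipeline would also need foreground segmentation before the COM is computed, which is omitted here.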

Results

metric : percentage of samples whose predicted joints all lie within a given maximum distance of the ground truth

==> state-of-the-art results : 74% (NYU) and 96% (ICVL) of samples within a 50 mm threshold, and 18% (NYU) and 40% (ICVL) within 20 mm

evaluated on the ICVL and NYU hand pose datasets
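
The metric above can be sketched as follows. This assumes the usual worst-case reading of the criterion (a sample counts only if its largest per-joint error is below the threshold); the function name `fraction_within` and the input layout are hypothetical.

```python
import math

def fraction_within(pred, gt, threshold_mm):
    """Fraction of samples whose worst joint error is <= threshold_mm.

    pred, gt: lists of samples, each a list of (x, y, z) joint positions in mm.
    """
    good = 0
    for sample_pred, sample_gt in zip(pred, gt):
        # Largest Euclidean error over all joints of this sample.
        worst = max(math.dist(p, g) for p, g in zip(sample_pred, sample_gt))
        if worst <= threshold_mm:
            good += 1
    return good / len(pred)
```

Evaluating this function over a range of thresholds gives the curve from which figures such as the 20 mm and 50 mm values are read.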

Definitions

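SDF (signed distance function) : function that assigns to each point its signed distance to the boundary of a set. Signed because the value is positive if the point is inside the set and negative if it is outside (the opposite sign convention is also in use).

Truncated : the distance is clipped once its magnitude exceeds a certain value, so in the usual TSDF formulation it is bounded both above and below.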
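
As a small illustration of these two definitions, here is the signed distance to a sphere with the sign convention used above (positive inside), followed by truncation. The sphere and the helper names are only an example, not something taken from the paper.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere's surface: positive inside, negative outside."""
    return radius - math.dist(point, center)

def truncate(distance, mu):
    """Clamp a signed distance to the range [-mu, mu]."""
    return max(-mu, min(mu, distance))

# Center of a sphere of radius 10: distance 10 to the surface, clipped to 3.
inside_value = truncate(sphere_sdf((0, 0, 0), (0, 0, 0), 10), 3)
# Point 10 units outside the surface: distance -10, clipped to -3.
outside_value = truncate(sphere_sdf((20, 0, 0), (0, 0, 0), 10), 3)
```

Truncation keeps only the values near the surface informative, since everything farther than `mu` from the boundary collapses to the same constant.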