1803.11404

CVPR 2018

[arxiv 1803.11404] Cross-modal Deep Variational Hand Pose Estimation [PDF] [notes] [code]

Adrian Spurr, Jie Song, Seonwook Park, Otmar Hilliges

read 10/05/2018

Objective

Learn a cross-modal statistical hand-model with joint embedding of RGB hand images, 2D and 3D poses

Embeddings can then be decoded to produce pose estimates and also generate images

Motivation

Hand poses are inherently low-dimensional (joint articulations are correlated, according to empirical observations) --> it should be possible to learn a low-dimensional embedding

Synthesis

A cross-modal VAE learns to reconstruct hands in the same or a different modality (e.g., encode an RGB image, decode a 3D pose)
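
The core machinery is standard VAE encoding/decoding shared across modalities. Below is a minimal sketch of that idea in PyTorch, restricted to the 3D-pose modality; the layer sizes, latent dimension, beta weight, and all names are my own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64  # assumed latent size, not taken from the paper

class PoseEncoder(nn.Module):
    """Encodes a flattened 3D pose (21 joints x 3 coords) into a latent mean/log-variance."""
    def __init__(self, in_dim=21 * 3, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class PoseDecoder(nn.Module):
    """Decodes a latent sample back into a flattened 3D pose."""
    def __init__(self, out_dim=21 * 3, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, z):
        return self.net(z)

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) with the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def cross_modal_vae_loss(x_in, x_target, encoder, decoder, beta=1.0):
    """Encode one modality, decode into (possibly another) modality, ELBO-style loss."""
    mu, logvar = encoder(x_in)
    z = reparameterize(mu, logvar)
    recon = decoder(z)
    recon_loss = nn.functional.mse_loss(recon, x_target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl
```

Swapping in an RGB encoder (e.g. a ResNet-18 backbone, as in the Details below) for `PoseEncoder` gives the cross-modal variant: image in, pose out.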

Experiments

Evaluates on the StereoHands benchmark dataset and on Zimmermann's Rendered Hand Dataset (RHD). Scores:

  • Root (palm)-relative: 9.5 mm mean EPE (end-point error); see the sketch below this list
  • Using complete normalization (scale and handedness) gives 8.56 mm
  • Surprisingly, using scale or handedness invariance alone (in addition to translation, i.e. root-relative) slightly decreases performance
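
A rough sketch of how root-relative (and optionally scale-normalized) mean EPE could be computed, referenced from the first bullet above. The joint indices (`ROOT_IDX`, index fingertip) and the exact scale reference are my assumptions, not the paper's evaluation code.

```python
import numpy as np

ROOT_IDX = 0  # assumed index of the palm/root joint

def mean_epe(pred, gt, root_relative=True, scale_normalize=False):
    """Mean end-point error between predicted and ground-truth 3D joints (in mm).

    pred, gt: arrays of shape (num_joints, 3), in millimetres.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    if root_relative:
        # Translation invariance: express all joints relative to the root/palm
        pred = pred - pred[ROOT_IDX]
        gt = gt - gt[ROOT_IDX]
    if scale_normalize:
        # Scale invariance: normalize by a reference length, e.g. root-to-index-fingertip
        scale = np.linalg.norm(gt[8] - gt[ROOT_IDX]) + 1e-8  # joint 8 assumed to be index fingertip
        pred, gt = pred / scale, gt / scale
    return np.linalg.norm(pred - gt, axis=1).mean()
```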

PCK on StereoHands is similar to Zimmermann et al. (0.98 AUC for both between 20 and 50 mm). PCK on RHD is improved, showing the superiority of their method (0.85 vs 0.68 AUC between 20 and 50 mm).
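
For reference, a PCK curve over a threshold range and its normalized AUC could be computed roughly as follows (a sketch, not the official evaluation script):

```python
import numpy as np

def pck_auc(errors_mm, thresholds_mm=np.linspace(20, 50, 31)):
    """PCK curve and its normalized AUC over a range of distance thresholds.

    errors_mm: per-keypoint Euclidean errors (mm), flattened over the dataset.
    """
    errors_mm = np.asarray(errors_mm)
    # Fraction of keypoints within each threshold
    pck = np.array([(errors_mm <= t).mean() for t in thresholds_mm])
    # Normalize the trapezoidal area so a perfect detector gets AUC = 1.0
    auc = np.trapz(pck, thresholds_mm) / (thresholds_mm[-1] - thresholds_mm[0])
    return pck, auc
```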

They argue that StereoHands performance is saturated (a statement with which I agree).

They also use depth as input and produce results comparable to the state of the art.

Given their embedding, they can also walk smoothly through latent space between two poses and decode progressively changing hand poses and images
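
The latent-space walk could look like the following sketch, reusing the assumed encoder/decoder names from the VAE sketch above (linear interpolation between latent means is my assumption about how the walk is done):

```python
import torch

def interpolate_poses(pose_a, pose_b, encoder, decoder, steps=10):
    """Linearly interpolate between the latent codes of two poses and decode each step."""
    with torch.no_grad():
        mu_a, _ = encoder(pose_a)
        mu_b, _ = encoder(pose_b)
        decoded = []
        for alpha in torch.linspace(0.0, 1.0, steps):
            z = (1 - alpha) * mu_a + alpha * mu_b  # straight line in latent space
            decoded.append(decoder(z))
    return decoded
```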

Details

  • RGB encoder uses ResNet-18
  • Experiments either rely on ground-truth handedness (left or right hand) or not; they surpass Zimmermann et al. even without handedness and scale (normalized by index-finger length) invariance
  • To obtain the palm keypoint for the Rendered Hand Dataset, they interpolate between the wrist and index-finger base locations (see the sketch after this list)
  • They use both camera views of StereoHands
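
A minimal sketch of that palm-keypoint interpolation; the interpolation weight (`alpha=0.5`, i.e. the midpoint) is my assumption, as the notes only say they interpolate:

```python
import numpy as np

def palm_keypoint(wrist_xyz, index_base_xyz, alpha=0.5):
    """Approximate the palm location by interpolating between wrist and index-finger base."""
    wrist_xyz = np.asarray(wrist_xyz, float)
    index_base_xyz = np.asarray(index_base_xyz, float)
    return (1.0 - alpha) * wrist_xyz + alpha * index_base_xyz
```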