1803.11404 - hassony2/inria-research-wiki GitHub Wiki
CVPR 2018
[arxiv 1803.11404] Cross-modal Deep Variational Hand Pose Estimation [PDF] [notes] [code]
Adrian Spurr, Jie Song, Seonwook Park, Otmar Hilliges
read 10/05/2018
Objective
Learn a cross-modal statistical hand model with a joint latent embedding of RGB hand images, 2D poses, and 3D poses
The embedding can then be decoded to produce pose estimates and also to generate images
Motivation
Hand poses are effectively low-dimensional (articulated joint movements are correlated, according to empirical observations) --> we may be able to learn a low-dimensional embedding
Synthesis
A VAE learns to reconstruct hands in the same or a different modality (e.g. encode an RGB image and decode a 3D pose)
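A minimal sketch of the cross-modal VAE idea, as I read it: one encoder per input modality maps to a shared latent Gaussian, and any decoder can reconstruct a (possibly different) modality from a latent sample. All module names, layer sizes, and the latent dimension below are illustrative assumptions, not the authors' code.

```python
# Cross-modal VAE sketch (illustrative assumptions, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 64   # assumed latent size
N_JOINTS = 21     # hand keypoints

class PoseEncoder(nn.Module):
    """Encodes a flattened 3D pose (21 x 3) into latent mean / log-variance."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_JOINTS * 3, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * LATENT_DIM))
    def forward(self, pose):
        mu, logvar = self.net(pose).chunk(2, dim=-1)
        return mu, logvar

class PoseDecoder(nn.Module):
    """Decodes a latent sample back to a flattened 3D pose."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, N_JOINTS * 3))
    def forward(self, z):
        return self.net(z)

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) with the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(recon, target, mu, logvar, beta=1.0):
    """Reconstruction term plus beta-weighted KL divergence to the unit Gaussian prior."""
    rec = F.mse_loss(recon, target, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl

# Example cross-modal path: encode a 2D pose, decode a 3D pose.
if __name__ == "__main__":
    enc2d = nn.Sequential(nn.Linear(N_JOINTS * 2, 256), nn.ReLU(),
                          nn.Linear(256, 2 * LATENT_DIM))
    dec3d = PoseDecoder()
    pose2d = torch.randn(8, N_JOINTS * 2)   # dummy batch of 2D keypoints
    mu, logvar = enc2d(pose2d).chunk(2, dim=-1)
    z = reparameterize(mu, logvar)
    pose3d_hat = dec3d(z)                    # predicted 3D pose from 2D input
    print(pose3d_hat.shape)                  # torch.Size([8, 63])
```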
Experiments
Evaluated on the StereoHands benchmark and the rendered hand dataset of Zimmermann et al. Scores:
- Root (palm)-relative: 9.5 mm mean EPE (end-point error); see the metric sketch after this list
- Using complete (scale and handedness) normalization gives 8.56 mm
- Surprisingly, using scale or handedness invariance alone (in addition to translation invariance, i.e. root-relative) slightly decreases performance
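A short sketch of how root-relative mean EPE is typically computed: subtract the root (palm) joint from both prediction and ground truth, then average the per-joint Euclidean distances. This reflects the standard metric, not the authors' evaluation code; the root index is an assumption.

```python
import numpy as np

def mean_epe_root_relative(pred, gt, root_idx=0):
    """Mean end-point error in mm after subtracting the root (palm) joint.

    pred, gt: arrays of shape (N, 21, 3) in millimetres.
    root_idx: index of the palm/root keypoint (assumed 0 here).
    """
    pred_rel = pred - pred[:, root_idx:root_idx + 1, :]
    gt_rel = gt - gt[:, root_idx:root_idx + 1, :]
    return np.linalg.norm(pred_rel - gt_rel, axis=-1).mean()
```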
PCK on StereoHands is similar to Zimmermann et al. (AUC of 0.98 for both between 20 and 50 mm); PCK on the Zimmermann dataset is improved, showing the superiority of their method (0.85 vs 0.68 AUC between 20 and 50 mm)
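Similarly, a sketch of PCK (percentage of correct keypoints) and its normalized AUC over a threshold range; the 20 to 50 mm range matches the numbers above, while the joint layout and sampling of thresholds are assumptions.

```python
import numpy as np

def pck_auc(pred, gt, thresholds_mm=np.linspace(20, 50, 31)):
    """PCK curve and its normalized AUC over a threshold range.

    pred, gt: (N, 21, 3) keypoints in mm (already root-relative).
    Returns (pck_per_threshold, auc in [0, 1]).
    """
    errors = np.linalg.norm(pred - gt, axis=-1)                 # (N, 21) per-joint errors
    pck = np.array([(errors <= t).mean() for t in thresholds_mm])
    auc = np.trapz(pck, thresholds_mm) / (thresholds_mm[-1] - thresholds_mm[0])
    return pck, auc
```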
They argue that StereoHands performance is saturated (a statement with which I agree).
They also use depth as input and produce results comparable to the state of the art.
Given their embedding, they can also walk smoothly through the latent space between two poses and observe progressively changing hand poses and images (see the interpolation sketch below)
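A sketch of such a latent walk: encode two poses, linearly interpolate between their latent means, and decode each intermediate point. It reuses the hypothetical encoder/decoder interface from the VAE sketch above and is only my illustration of the idea.

```python
import torch

def latent_walk(encoder, decoder, pose_a, pose_b, steps=10):
    """Decode evenly spaced points on the line between two latent codes."""
    with torch.no_grad():
        mu_a, _ = encoder(pose_a)
        mu_b, _ = encoder(pose_b)
        outputs = []
        for alpha in torch.linspace(0.0, 1.0, steps):
            z = (1 - alpha) * mu_a + alpha * mu_b   # linear interpolation in latent space
            outputs.append(decoder(z))              # intermediate pose (or image) reconstruction
    return outputs
```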
Details
- RGB encoder uses ResNet-18 (see the encoder sketch after this list)
- They either rely on ground-truth handedness (left or right hand) or not; they surpass Zimmermann et al. even without handedness and scale (normalized by index-finger length) invariance
- To obtain the palm keypoint for the Zimmermann dataset, they interpolate between the wrist and index-base locations
- They use both views in StereoHands
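A hypothetical sketch of the RGB encoder mentioned above: a torchvision ResNet-18 backbone with its classification head replaced by a linear layer producing latent mean and log-variance. Latent size, input resolution, and the absence of pretrained weights are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class RGBEncoder(nn.Module):
    """ResNet-18 backbone mapping an RGB hand crop to latent mean / log-variance."""
    def __init__(self, latent_dim=64):
        super().__init__()
        backbone = resnet18(weights=None)   # pretrained weights optional (assumption)
        backbone.fc = nn.Identity()         # keep the 512-d pooled feature
        self.backbone = backbone
        self.head = nn.Linear(512, 2 * latent_dim)

    def forward(self, img):                 # img: (B, 3, H, W)
        mu, logvar = self.head(self.backbone(img)).chunk(2, dim=-1)
        return mu, logvar

# Example usage with a dummy 128x128 crop
enc = RGBEncoder()
mu, logvar = enc(torch.randn(2, 3, 128, 128))
```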