1704.02201
Arxiv 2017
[arxiv 1704.02201] Real-time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor [PDF] [project page- code should appear here] [dataset]
Franziska Mueller, Dushyant Mehta, Oleksandr Sotnychenko, Srinath Sridhar, Dan Casas, Christian Theobalt
Objective
Estimate hand pose in real time, robustly and accurately, in cluttered environments from a moving egocentric RGB-D camera
Find joint angles of kinematic hand skeleton
Synthesis
Created the photorealistic SynthHands dataset
Pipeline
2 CNN steps, followed by pose optimization (see the sketch after this list):
- Hand localization
- 3D joint position regression
- temporal smoothing using a kinematic skeleton (26 DOF) with bone lengths adapted to each user
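A minimal sketch of how the per-frame stages chain together, based on the notes above; all function names, arguments, and shapes are hypothetical placeholders, not the authors' API:

```python
# Hypothetical per-frame wiring of the two CNNs plus pose optimization.
# halnet, crop_hand, jornet and optimize_pose stand in for the real components.
def track_frame(rgbd_frame, previous_pose, halnet, crop_hand, jornet, optimize_pose):
    root_heatmap = halnet(rgbd_frame)                    # step 1: hand localization heatmap
    crop, root_3d = crop_hand(rgbd_frame, root_heatmap)  # depth-dependent crop + hand root position
    joints_rel, joint_heatmaps = jornet(crop)            # step 2: root-relative 3D joints + 2D heatmaps
    pose = optimize_pose(root_3d + joints_rel,           # step 3: fit 26-DOF skeleton with
                         joint_heatmaps, previous_pose)  #         per-user bone lengths + smoothing
    return pose
```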
Hand localization
Compute a colored depth map by mapping each pixel of the color image plane onto the depth map
The HALNet CNN (Hand Localization Net) estimates the root point of the hand (whether it is visible or not) and outputs a heatmap
If the heatmap maximum is low (< 0.1) or far from the previous maximum, the estimate is marked uncertain and updated accordingly
The heatmap maximum gives the most probable hand location, which is used to make a depth-dependent crop around the hand (sketched below)
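A minimal sketch of the confidence check and depth-dependent crop described above; the 0.1 threshold comes from the notes, while the jump threshold, crop-size constant, and all names are assumptions:

```python
import numpy as np

def crop_hand(color, depth, heatmap, prev_uv=None, max_jump=50, crop_const=300.0):
    # Most probable hand root location = heatmap maximum.
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    confidence = heatmap[v, u]
    # Mark uncertain if the maximum is low (< 0.1) or far from the previous estimate.
    uncertain = confidence < 0.1 or (
        prev_uv is not None and np.hypot(u - prev_uv[0], v - prev_uv[1]) > max_jump)
    if uncertain and prev_uv is not None:
        u, v = prev_uv  # fall back to the previous estimate

    # Depth-dependent crop: the window shrinks as the hand moves away from the camera.
    z = max(float(depth[v, u]), 1e-3)      # root depth in meters (assumed units)
    half = int(crop_const / z) // 2
    v0, v1 = max(v - half, 0), min(v + half, depth.shape[0])
    u0, u1 = max(u - half, 0), min(u + half, depth.shape[1])
    return color[v0:v1, u0:u1], depth[v0:v1, u0:u1], (u, v)
```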
Joint regression
JORNet regresses:
- root-relative 3D joint positions
- 2D position likelihood heatmaps, used to regularize the 3D joint positions
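As a minimal sketch, absolute 3D joint positions can be recovered from the root-relative output by back-projecting the hand root using the depth at its pixel; the intrinsics and names below are assumptions:

```python
import numpy as np

def absolute_joints(joints_rel, root_uv, root_depth, fx, fy, cx, cy):
    # joints_rel: (J, 3) root-relative 3D joint positions from the regressor.
    u, v = root_uv
    root_3d = np.array([(u - cx) * root_depth / fx,   # back-project the root pixel
                        (v - cy) * root_depth / fy,
                        root_depth])
    return joints_rel + root_3d                       # absolute 3D joint positions
```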
Post-processing: Hand Pose Optimization
Enforces bone-length constraints and temporal smoothness on the regressed joint positions
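A toy sketch of the idea (not the paper's actual 26-DOF skeleton fitting): joint positions are refined to stay close to the regressed positions, keep user-specific bone lengths, and change smoothly over time; the skeleton, weights, and values are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

BONES = [(0, 1), (1, 2)]                 # toy 3-joint chain (assumed)
BONE_LEN = np.array([0.09, 0.03])        # per-user bone lengths in meters (assumed)

def energy(x, regressed, previous, w_data=1.0, w_bone=10.0, w_smooth=0.1):
    p = x.reshape(-1, 3)
    e_data = np.sum((p - regressed) ** 2)                            # fit CNN output
    lens = np.array([np.linalg.norm(p[a] - p[b]) for a, b in BONES])
    e_bone = np.sum((lens - BONE_LEN) ** 2)                          # bone-length constraints
    e_smooth = np.sum((p - previous) ** 2)                           # temporal smoothness
    return w_data * e_data + w_bone * e_bone + w_smooth * e_smooth

regressed = np.array([[0.0, 0.00, 0.40],     # joints regressed for the current frame
                      [0.0, 0.09, 0.40],
                      [0.0, 0.12, 0.40]])
previous = regressed + 0.002                 # pose from the previous frame
refined = minimize(energy, previous.ravel(),
                   args=(regressed, previous)).x.reshape(-1, 3)
```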
Datasets
SynthHands
New photorealistic dataset created using merged reality:
a photo-realistic hand model is posed while interacting with virtual objects,
which allows composition with various objects;
hand motion is acquired using real-time hand tracking (taking advantage of the fact that hand tracking works well when there is no object interaction)
allows for variability in hand skin color, shape, pose, texture, background clutter, camera viewpoint...
To increase variability:
- random variation along each dimension of default hand mesh
- 12 skin colors
- hand aspect variability (size, hair, ...)
- virtual cameras for 5 egocentric views
- 7 different objects with random textures
- real RGB-D captures as backgrounds
==> RGB-D training data
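A toy sketch of the merged-reality compositing idea: a synthetic hand/object render is merged onto a real RGB-D background with a per-pixel depth test; array layouts and names are assumptions:

```python
import numpy as np

def composite(fg_rgb, fg_depth, bg_rgb, bg_depth):
    # Keep a foreground pixel where the synthetic render has valid depth
    # that is closer to the camera than the real background (z-buffer test).
    closer = np.isfinite(fg_depth) & (fg_depth > 0) & (fg_depth < bg_depth)
    out_rgb = np.where(closer[..., None], fg_rgb, bg_rgb)
    out_depth = np.where(closer, fg_depth, bg_depth)
    return out_rgb, out_depth
```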
Evaluation
3130 frames captured from moving egocentric viewpoints
Results
The two-step approach outperforms a single CNN trained to perform the final task directly
JORNet
Evaluated using the 3D Euclidean distance between predicted and ground-truth joint positions
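A minimal sketch of that metric, assuming predicted and ground-truth joints as (frames, joints, 3) arrays:

```python
import numpy as np

def mean_3d_error(pred, gt):
    # Average Euclidean distance between predicted and ground-truth 3D joints.
    return np.linalg.norm(pred - gt, axis=-1).mean()
```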
Future work
Train jointly on synthetic and real data using domain adaptation