1704.02201 - hassony2/inria-research-wiki GitHub Wiki

Arxiv 2017

[arxiv 1704.02201] Real-time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor [PDF] [project page- code should appear here] [dataset]

Franziska Mueller, Dushyant Mehta, Oleksandr Sotnychenko, Srinath Sridhar, Dan Casas, Christian Theobalt

Objective

Estimate hand pose in real time, robustly and accurately, in cluttered environments from a moving egocentric RGB-D camera

Find the joint angles of a kinematic hand skeleton

Synthesis

Created the SynthHands photorealistic dataset

Pipeline

Two CNN steps, followed by kinematic pose optimization:

  • Hand localization
  • 3D joint position regression
  • temporal smoothing using a kinematic skeleton (26 DoF) with bone lengths adapted to each user

Hand localization

Compute a colored depth map by mapping each pixel of the color image plane onto the depth map

A CNN, HALNet (Hand Localization Network), estimates the root point of the hand (whether it is visible or not) and outputs a heatmap

If the heatmap maximum is low (< 0.1) or far from the previous maximum, the estimate is marked uncertain and updated accordingly

The heatmap gives the most probable hand location, around which a depth-dependent crop is extracted (see the sketch below)
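A minimal numpy sketch of this localization step, assuming the heatmap and depth map share the same resolution; the 0.1 confidence threshold is from the notes above, while `max_jump_px` and `crop_const` are made-up illustration values, not the paper's parameters:

```python
import numpy as np

def locate_and_crop(colored_depth, heatmap, depth, prev_root=None,
                    conf_thresh=0.1, max_jump_px=50, crop_const=300.0):
    """Pick the hand root from a HALNet-style heatmap and cut a
    depth-dependent crop around it (illustrative sketch only)."""
    # Most probable root location = heatmap maximum.
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    conf = heatmap[v, u]

    # Mark the estimate uncertain if confidence is low or it jumped too far,
    # and fall back to the previous root in that case.
    if conf < conf_thresh or (prev_root is not None and
                              np.hypot(u - prev_root[0], v - prev_root[1]) > max_jump_px):
        if prev_root is not None:
            u, v = prev_root

    # Crop size shrinks with distance: a far hand covers fewer pixels.
    z = max(depth[v, u], 1e-3)          # depth at the root, in meters
    half = int(crop_const / z / 2)      # half crop size in pixels
    h, w = depth.shape
    y0, y1 = max(0, v - half), min(h, v + half)
    x0, x1 = max(0, u - half), min(w, u + half)
    return colored_depth[y0:y1, x0:x1], (u, v)
```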

Joint regression

JORNet regresses:

  • root-relative 3D joint positions (lifted to absolute coordinates as in the sketch after this list)

  • 2D position likelihood heatmaps, used to regularize 3D joint positions
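The root-relative predictions can be lifted to absolute camera-space coordinates by back-projecting the root pixel with its depth; a hedged sketch is below, where the intrinsics (fx, fy, cx, cy) are placeholder values, not the sensor's actual calibration:

```python
import numpy as np

def to_absolute_joints(joints_rel, root_uv, root_depth,
                       fx=475.0, fy=475.0, cx=315.0, cy=245.0):
    """joints_rel: (J, 3) root-relative 3D joint positions in meters.
    root_uv: (u, v) root pixel from the localization stage.
    root_depth: depth at the root pixel in meters.
    Intrinsics here are illustrative placeholders."""
    u, v = root_uv
    # Back-project the root pixel to a 3D point in camera coordinates.
    root_xyz = np.array([(u - cx) * root_depth / fx,
                         (v - cy) * root_depth / fy,
                         root_depth])
    # Absolute joint positions = root position + root-relative offsets.
    return joints_rel + root_xyz
```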

Post-processing: Hand Pose Optimization

Enforces bone length constraints and temporal smoothness
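A simplified stand-in for this optimization, assuming we refine joint positions directly instead of the paper's 26-DoF angle parameterization; the toy 4-joint chain, bone lengths, and weights are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

# Toy kinematic chain: (parent, child) index pairs and per-user bone lengths.
BONES = [(0, 1), (1, 2), (2, 3)]
BONE_LEN = np.array([0.04, 0.035, 0.03])   # meters, illustrative values

def energy(x, pred, prev, w_bone=10.0, w_smooth=1.0):
    joints = x.reshape(-1, 3)
    data = np.sum((joints - pred) ** 2)                            # fit CNN output
    bone = sum((np.linalg.norm(joints[c] - joints[p]) - l) ** 2    # bone lengths
               for (p, c), l in zip(BONES, BONE_LEN))
    smooth = np.sum((joints - prev) ** 2)                          # temporal term
    return data + w_bone * bone + w_smooth * smooth

def refine(pred, prev):
    """pred, prev: (4, 3) predicted and previous-frame joint positions."""
    res = minimize(energy, pred.ravel(), args=(pred, prev), method="L-BFGS-B")
    return res.x.reshape(-1, 3)
```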

Datasets

SynthHands

New photorealistic dataset created using merged reality:

a photo-realistic hand model is posed while interacting with virtual objects, which allows composition with various objects (see the compositing sketch at the end of this section)

hand movement is acquired using real-time hand tracking (takes advantage of the fact that hand tracking works well when there is no object interaction)

allows for variability in hand skin color, shape, pose, texture, background clutter, camera viewpoint...

To increase variability:

  • random variation along each dimension of the default hand mesh
  • 12 skin colors
  • hand aspect variability (size, hair, ...)
  • virtual cameras for 5 egocentric views
  • 7 different objects with random textures
  • real RGB-D captures as backgrounds

==> RGBD training data
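A hedged sketch of the merged-reality compositing idea (z-buffer style: the rendered hand replaces the real background wherever it is closer to the camera); the array layout and the zero-means-invalid depth convention are assumptions for illustration, not details from the paper:

```python
import numpy as np

def composite_rgbd(hand_rgb, hand_depth, bg_rgb, bg_depth):
    """Keep the synthetic hand pixel wherever the rendered hand is closer to
    the camera than the real background capture; otherwise keep the real
    pixel. A zero hand depth means 'no hand rendered here'."""
    hand_valid = hand_depth > 0
    use_hand = hand_valid & (hand_depth < bg_depth)
    rgb = np.where(use_hand[..., None], hand_rgb, bg_rgb)
    depth = np.where(use_hand, hand_depth, bg_depth)
    return rgb, depth
```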

Evaluation

3130 frames captured from moving egocentric viewpoints

Results

The two-step approach outperforms a single CNN regressing the final output directly

JORNet

Evaluated with 3D Euclidean distance error
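Two common ways to summarize this metric, sketched with numpy (not the authors' evaluation code):

```python
import numpy as np

def mean_3d_error(pred, gt):
    """Average per-joint 3D Euclidean distance (same unit as the inputs)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck_3d(pred, gt, thresholds):
    """Fraction of joints whose 3D error falls under each threshold,
    as used for 'percentage of correct keypoints' style curves."""
    err = np.linalg.norm(pred - gt, axis=-1).ravel()
    return [(err < t).mean() for t in thresholds]
```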

Future work

Train jointly on synthetic and real data using domain adaptation