1603.06937 - hassony2/inria-research-wiki GitHub Wiki

ECCV 2016

[arxiv 1603.06937] Stacked Hourglass Networks for Human Pose Estimation [PDF] [notes]

Alejandro Newell, Kaiyu Yang, Jia Deng

Objective

The goal is to reliably perform human pose estimation (detect keypoints from RGB images).

This paper proposes a new network architecture that iteratively refines pose-estimation proposals.

Synthesis

Structure

The Stacked Hourglass is obtained (as its name implies) by stacking Hourglass modules.

Each Hourglass module is itself composed of several basic modules.

Fine to coarse stacked hourglass description

A basic unit module is composed of :

a Residual module with 128 conv filters with kernel 1x1,
then 128 conv filters with kernel 3x3,
then 256 conv filters with kernel 1x1, to whose output the input of the residual module is added (skip connection).

basic_module
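To get a feel for why the bottleneck layout (1x1, 3x3, 1x1) is cheap, here is a rough parameter count. This is only a sketch: it assumes the module receives 256 input channels (so the skip branch is a plain identity) and it ignores batch normalization; `conv_params` is a hypothetical helper, not something from the paper's code.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square 2D convolution."""
    return in_ch * out_ch * kernel * kernel + out_ch


# Assumption: 256 input channels, so the skip connection is an identity.
IN_CH = 256

# (input channels, output channels, kernel size) for the three convolutions.
bottleneck = [
    (IN_CH, 128, 1),  # 1x1, reduces to 128 channels
    (128, 128, 3),    # 3x3, works on the reduced features
    (128, 256, 1),    # 1x1, expands back to 256 channels
]

bottleneck_params = sum(conv_params(i, o, k) for i, o, k in bottleneck)

# For comparison: a single 3x3 convolution going from 256 to 256 channels.
plain_3x3_params = conv_params(IN_CH, 256, 3)

print(bottleneck_params, plain_3x3_params)
```

Under these assumptions the three-layer bottleneck has fewer than half the parameters of a single full-width 3x3 convolution.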

An hourglass module combines several basic modules while pooling down to a low resolution (4x4 pixels) and then upsampling back to the original (64x64) resolution, combining features across multiple resolutions.

hourglass_module
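The recursive shape of an hourglass module can be illustrated with a toy, single-channel version in pure Python. This is a sketch under strong assumptions: the learned residual modules are replaced by an identity placeholder (`features`), pooling is 2x2 max-pooling, upsampling is nearest-neighbor, and the branches are merged by elementwise addition. All function names are made up for the illustration.

```python
def max_pool_2x2(x):
    """Halve a square 2D map by taking the max over each 2x2 block."""
    n = len(x) // 2
    return [
        [max(x[2 * i][2 * j], x[2 * i][2 * j + 1],
             x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1])
         for j in range(n)]
        for i in range(n)
    ]


def upsample_nearest_2x(x):
    """Double a 2D map by repeating each value in a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Elementwise sum of two 2D maps of the same size."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def features(x):
    """Placeholder for the learned residual modules (identity here)."""
    return x


def hourglass(x, lowest=4, visited=None):
    """Process x at its own resolution, recurse at half resolution, merge."""
    if visited is not None:
        visited.append(len(x))  # record the resolutions that are visited
    skip = features(x)  # branch kept at the current resolution
    if len(x) == lowest:
        return skip  # bottom of the hourglass
    low = hourglass(max_pool_2x2(features(x)), lowest, visited)
    return add(skip, upsample_nearest_2x(low))


resolutions = []
output = hourglass([[1.0] * 64 for _ in range(64)], lowest=4, visited=resolutions)
print(resolutions, len(output), len(output[0]))
```

The recorded resolutions go 64, 32, 16, 8, 4 and the output is back at 64x64, which matches the description above.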

Hourglass modules are then stacked to form the hourglass network.

Weights are not shared across hourglass modules.

Input

The input is preprocessed in the following way by the network :

brought down from 256 to 64 pixels through a 7x7 conv layer with stride 2, a residual block and max-pooling at the very beginning of the pipeline.
two additional residual blocks precede the hourglass module
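The 256 to 64 reduction can be checked with the usual output-size formula for convolutions and pooling. A minimal sketch, assuming a padding of 3 for the 7x7 convolution, a 2x2 max-pooling with stride 2, and a residual block that preserves resolution (none of these values are stated above):

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_resolution(input_size=256):
    """Resolution after the initial conv, residual block and max-pooling."""
    size = conv_out_size(input_size, kernel=7, stride=2, padding=3)  # 7x7 conv, stride 2
    # Assumption: the residual block keeps the spatial resolution unchanged.
    size = conv_out_size(size, kernel=2, stride=2)  # 2x2 max-pooling
    return size


print(stem_resolution(256))
```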

Output

The output of an Hourglass module is a set of heatmaps (one per joint) that predict the probability of a joint's presence at each position of the map.

The ground truth heatmaps are 2D Gaussians centered at the ground truth joint position.
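A ground truth heatmap of this kind could be generated as follows. This is a sketch: the function name, the unnormalized Gaussian (peak value 1) and the default standard deviation of 1 pixel are assumptions made for the example.

```python
import math


def gaussian_heatmap(joint_x, joint_y, size=64, sigma=1.0):
    """Square heatmap (list of rows) with a 2D Gaussian centered on the joint.

    joint_x, joint_y are given in heatmap coordinates (column, row).
    """
    return [
        [
            math.exp(-((x - joint_x) ** 2 + (y - joint_y) ** 2) / (2 * sigma ** 2))
            for x in range(size)
        ]
        for y in range(size)
    ]


heatmap = gaussian_heatmap(joint_x=20, joint_y=30)
print(heatmap[30][20])  # value at the joint position
```

One such map is built per joint, and the network is trained to regress the whole stack.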

In order to generate the final predictions for PCK evaluation:

the image and its horizontal flip are both fed to the network
the heatmaps of the two inputs are averaged (after flipping back the heatmaps of the flipped input)
the location of the max activation in the heatmap of a given joint is the final keypoint position prediction
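The three steps above can be sketched in pure Python, with heatmaps as lists of rows. The function names and the `swap_pairs` argument are hypothetical. The swap is an assumption about how flipping is usually handled: in a mirrored image a left joint looks like a right joint, so the left/right channels need to be exchanged after flipping back.

```python
def hflip(heatmap):
    """Mirror a 2D map (list of rows) left to right."""
    return [row[::-1] for row in heatmap]


def average_with_flip(maps, maps_from_flipped, swap_pairs=()):
    """Average heatmaps of the image with those of its mirrored copy.

    maps, maps_from_flipped: one 2D map per joint.
    swap_pairs: (left, right) joint index pairs to exchange after un-flipping.
    """
    restored = [hflip(m) for m in maps_from_flipped]
    for a, b in swap_pairs:
        restored[a], restored[b] = restored[b], restored[a]
    return [
        [[(u + v) / 2 for u, v in zip(row_a, row_b)]
         for row_a, row_b in zip(m_a, m_b)]
        for m_a, m_b in zip(maps, restored)
    ]


def argmax_2d(heatmap):
    """Return (x, y) of the maximum activation of a 2D map."""
    best_x, best_y = 0, 0
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > heatmap[best_y][best_x]:
                best_x, best_y = x, y
    return best_x, best_y
```

The returned positions are in heatmap coordinates (64x64 here) and still have to be mapped back to the original image.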

Loss

During training, MSE on heatmap values is used.

Intermediate supervision is provided: the loss is applied to the predictions of all hourglass modules, using the same ground truth and without weighting.
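As a minimal sketch of this loss (function names are made up, and a single heatmap per hourglass module is used to keep the example short), the MSE of each module's prediction against the same target is simply summed with equal weight:

```python
def mse(pred, target):
    """Mean squared error between two 2D maps of the same size."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def intermediate_supervision_loss(stack_predictions, target):
    """Unweighted sum of the MSE of every hourglass module's prediction.

    stack_predictions: one predicted heatmap per hourglass module.
    target: the ground truth heatmap shared by all modules.
    """
    return sum(mse(pred, target) for pred in stack_predictions)


target = [[0.0, 1.0], [0.0, 0.0]]
predictions = [
    [[0.0, 0.0], [0.0, 0.0]],  # first module misses the joint
    [[0.0, 1.0], [0.0, 0.0]],  # second module matches the target
]
print(intermediate_supervision_loss(predictions, target))
```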

Heatmaps are used at a resolution of 64x64 in order to limit memory usage.

Notes

MPII dataset

As described in this link, the scale of a person is given with respect to a height of 200 pixels. (The scale is used in the code to produce the centered crops.)
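One way to read this: a person with scale s is roughly 200 * s pixels tall, so a square crop of that side centered on the person contains them. The helper below is a hypothetical sketch of that idea only. The actual cropping code may add a margin or shift the center, which is not reproduced here.

```python
def person_crop_box(center_x, center_y, scale, reference_height=200):
    """Square box (left, top, right, bottom) centered on the person.

    scale is the MPII person scale, relative to reference_height pixels.
    """
    side = scale * reference_height  # approximate person height in pixels
    half = side / 2
    return (center_x - half, center_y - half, center_x + half, center_y + half)


print(person_crop_box(center_x=300, center_y=200, scale=1.5))
```

The crop would then be resized to the 256x256 network input.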