1804.06208 - hassony2/inria-research-wiki GitHub Wiki

ECCV 2018

[arxiv 1804.06208] Simple Baselines for Human Pose Estimation and Tracking [PDF] [notes] [code]

Bin Xiao, Haiping Wu, Yichen Wei

read 2018/09/19

Objective

The motivation is to step away from the increasingly complicated methods used for pose estimation. Instead, the authors strip existing architectures down to simple baselines and compare them.

Synthesis

Provides a simple baseline: convolutions with downsampling, followed by 3 upsampling steps using deconvolution layers.

Argues that residual connections might not be necessary, although upsampling to high-resolution feature maps is.

I only cover the pose estimation part of the paper, but there is also a whole analysis of tracking!

Details

  • no residual connections
  • each deconvolution layer is followed by batch-norm and a ReLU non-linearity
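To make the two points above concrete, here is a minimal PyTorch sketch of such a deconvolution head. It is an illustration under assumptions, not the authors' code: the function name `make_deconv_head`, the 2048 input channels (a ResNet-50-style final feature map), the 256 filters per deconvolution, the 17 output joints and the 8x6 input feature map are all choices made for this example.

```python
import torch
from torch import nn


def make_deconv_head(in_channels=2048, num_deconv=3, width=256, num_joints=17):
    """Hypothetical head: (deconv -> batch-norm -> ReLU) x num_deconv,
    then a 1x1 convolution producing one heat map per joint.
    There are no skip or residual connections between the stages."""
    layers = []
    channels = in_channels
    for _ in range(num_deconv):
        layers += [
            # kernel 4, stride 2, padding 1 exactly doubles height and width
            nn.ConvTranspose2d(channels, width, kernel_size=4, stride=2,
                               padding=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
        ]
        channels = width
    # 1x1 convolution mapping features to per-joint heat maps
    layers.append(nn.Conv2d(channels, num_joints, kernel_size=1))
    return nn.Sequential(*layers)


head = make_deconv_head().eval()
features = torch.zeros(1, 2048, 8, 6)  # assumed backbone output size
with torch.no_grad():
    heatmaps = head(features)
print(tuple(heatmaps.shape))  # (1, 17, 64, 48)
```

With three stride-2 deconvolutions, the assumed 8x6 feature map is upsampled to 64x48, which matches the heat map size quoted in the ablations.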

Experiments

Ablation studies

Listed in order of decreasing impact on performance!

Changing input image resolution

  • 70.4 --> 72.2 %AP (+1.8 points) when the input image size is increased by a factor of 1.5
  • 70.4 --> 60.6 %AP (-9.8 points!!) when the input resolution is halved

Removing deconvolution layers

  • Going from 3 to 2 deconvolution layers decreases performance from 70.4 to 67.9 %AP (-2.5 points)
  • This also reduces the heat map resolution from 64x48 to 32x24
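The heat map sizes above follow from simple arithmetic, sketched below. The 256x192 input size and the backbone stride of 32 are assumptions made for this example (they are consistent with the 64x48 figure, but these notes do not state them), and `heatmap_size` is a hypothetical helper.

```python
def heatmap_size(input_hw, num_deconv, backbone_stride=32, upsample=2):
    """Heat map (height, width) after the backbone and the deconvolution head.

    Assumes the backbone divides each side by `backbone_stride` and that
    every deconvolution layer multiplies each side by `upsample`.
    """
    h, w = input_hw
    feat_h, feat_w = h // backbone_stride, w // backbone_stride
    scale = upsample ** num_deconv
    return feat_h * scale, feat_w * scale


print(heatmap_size((256, 192), num_deconv=3))  # (64, 48)
print(heatmap_size((256, 192), num_deconv=2))  # (32, 24)
```

Dropping one deconvolution layer halves each side of the heat map, so joint locations are predicted on a grid four times coarser in area.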

Increasing ResNet depth

  • 70.4 --> 71.4 %AP (+1.0 points) from ResNet-50 to ResNet-101
  • 71.4 --> 72.0 %AP (+0.6 points) from ResNet-101 to ResNet-152

Changing kernel size

  • 70.1 --> 70.3 --> 70.4 %AP when the kernel size goes 2 --> 3 --> 4
  • almost no impact in practice!
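For the comparison across kernel sizes to be fair, each kernel size needs padding chosen so that the layer still upsamples by exactly 2. The sketch below uses the standard transposed-convolution output-size formula (dilation 1). The padding and output-padding values in `SETTINGS` are one choice that works and are an assumption here, since these notes do not state which values were used.

```python
def deconv_out(size, kernel, stride=2, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# kernel size -> (padding, output_padding); assumed values for illustration
SETTINGS = {2: (0, 0), 3: (1, 1), 4: (1, 0)}

for kernel, (padding, output_padding) in SETTINGS.items():
    out = deconv_out(8, kernel, padding=padding, output_padding=output_padding)
    print(f"kernel {kernel}: 8 -> {out}")
```

All three settings map a side of length 8 to 16, so only the receptive field of the upsampling filter changes between the runs, not the heat map resolution.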