ECCV 2018
[arxiv 1804.06208] Simple Baselines for Human Pose Estimation and Tracking [PDF] [notes] [code]
Bin Xiao, Haiping Wu, Yichen Wei
read 2018/09/19
Objective
The motivation is to step away from increasingly complicated pose estimation methods. Instead, the authors trim existing architectures down to simple baselines and compare them.
Synthesis
Provides a simple baseline: a convolutional (ResNet) backbone that downsamples the input, followed by 3 upsampling steps using deconvolution layers.
Argues that skip connections between the downsampling and upsampling paths might not be necessary, although upsampling to high-resolution features is.
I only cover the pose estimation part of the paper, but it also contains a full analysis of tracking!
Details
- no skip connections between the backbone and the upsampling layers (the ResNet backbone keeps its own residual blocks)
- each deconvolution layer is followed by batch-norm and a ReLU non-linearity
Experiments
Ablation studies
Listed in order of decreasing impact on performance.
Changing input image resolution
- 70.4 --> 72.2 %AP (+1.8%) when the input image size is increased by a factor of 1.5 (256x192 to 384x288)
- 70.4 --> 60.6 %AP (-9.8% !!) when the input resolution is halved (256x192 to 128x96)
Removing deconvolution layers
- Going from 3 to 2 deconvolution layers decreases performance from 70.4 to 67.9 %AP (-2.5%)
- This halves the heatmap resolution, from 64x48 to 32x24
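
The resolution drop above follows from simple arithmetic, sketched here under stated assumptions: the backbone has an overall stride of 32 (the usual value for a ResNet), each deconvolution layer doubles the spatial resolution, and the baseline input is 256x192 (the size that gives a 64x48 heatmap at an overall stride of 4). The helper name `heatmap_size` is hypothetical and only serves this illustration.

```python
# Assumption: ResNet-style backbone that reduces spatial size by a factor of 32.
BACKBONE_STRIDE = 32


def heatmap_size(input_hw, num_deconv_layers, backbone_stride=BACKBONE_STRIDE):
    """Return (height, width) of the output heatmap.

    Assumes every deconvolution layer upsamples by exactly 2 (stride 2).
    """
    height, width = input_hw
    scale = 2 ** num_deconv_layers
    return (height // backbone_stride * scale, width // backbone_stride * scale)


if __name__ == "__main__":
    # Assumed baseline input of 256x192 pixels (height x width).
    for num_layers in (3, 2):
        print(num_layers, "deconv layers:", heatmap_size((256, 192), num_layers))
```

Under the same assumptions, changing the input size scales the heatmap proportionally, so the input resolution ablation and this one both change how finely joints can be localized.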
Increasing ResNet depth
- 70.4 --> 71.4% AP (+1.0%) from ResNet-50 to ResNet-101
- 71.4 --> 72.0% AP (+0.6%) from ResNet-101 to ResNet-152
Changing kernel size
- 70.1 --> 70.3 --> 70.4 %AP when the deconvolution kernel size goes 2 --> 3 --> 4
- almost no impact in practice!
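
A small sketch of why the kernel size can vary without changing the heatmap resolution: with suitable padding, a stride-2 transposed convolution doubles the spatial size for kernel sizes 2, 3 and 4 alike. The formula is the standard transposed-convolution output size (dilation 1). The padding and output-padding values per kernel size are assumptions chosen so that the upsampling is exactly 2x, and are not taken from these notes.

```python
def deconv_output_size(size, kernel, stride=2, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Assumed (padding, output_padding) per kernel size, picked to give exact 2x upsampling.
DECONV_CONFIGS = {
    4: {"padding": 1, "output_padding": 0},
    3: {"padding": 1, "output_padding": 1},
    2: {"padding": 0, "output_padding": 0},
}


def upsampled_size(size, kernel):
    """Apply one stride-2 deconvolution with the assumed config for this kernel size."""
    return deconv_output_size(size, kernel, **DECONV_CONFIGS[kernel])


if __name__ == "__main__":
    for kernel in (2, 3, 4):
        # An 8x6 feature map becomes 16x12 regardless of kernel size.
        print(kernel, (upsampled_size(8, kernel), upsampled_size(6, kernel)))
```

Since all three settings produce the same output resolution, the ablation isolates the kernel's receptive field, which is consistent with the very small AP differences reported above.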