CVPR 2015
[1411.4038] Fully Convolutional Networks for Semantic Segmentation [PDF] [notes]
Jonathan Long, Evan Shelhamer, Trevor Darrell
read 7/07/2017
Objective
Use fully convolutional networks (FCNs) to obtain pixel-level semantic segmentation
For this purpose, classification networks are adapted and repurposed to preserve spatial information.
Synthesis
From classification to heatmaps
Classification networks take a fixed-size input and produce a fixed-size output (of size equal to the number of classes)
To adapt them to variable-size inputs and outputs, the fully connected layers are cast as convolutional layers whose kernels cover their entire input region (the previous layer).
Run on an image larger than the one the original classification net was designed for, the converted network outputs a heatmap per class, where each spatial location corresponds to a different input patch. This computation happens in a single pass, sharing the computational burden of overlapping patches (feeding the patches in individually would give the same final output at much higher cost).
These changes make it possible to train the network for semantic segmentation by providing ground-truth segmentations as targets.
Unfortunately, this produces a heavily downsampled segmentation map, because of the several downsampling layers that are used to keep the filters small and the computational requirements reasonable.
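A minimal PyTorch sketch of this conversion (hypothetical sizes, assuming a VGG-16-style head and the 21 PASCAL VOC classes; the paper's models are in Caffe):

```python
import torch
import torch.nn as nn

# fc6 originally acted on a 7x7x512 feature map, so it becomes a 7x7
# convolution; fc7 and the scoring layer become 1x1 convolutions.
head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),   # was fc6: Linear(512 * 7 * 7, 4096)
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),  # was fc7: Linear(4096, 4096)
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 21, kernel_size=1),    # class scores (21 PASCAL VOC classes)
)

# On a feature map from a larger image, the head now outputs a heatmap per
# class in a single pass instead of one prediction per cropped patch.
features = torch.randn(1, 512, 16, 16)
heatmaps = head(features)
print(heatmaps.shape)  # torch.Size([1, 21, 10, 10])
```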
Convolutions
Each layer of data in a convnet has a size h x w x d (height x width x channel dimension)
Convnets are built on translation invariance: their basic components (convolution, pooling and non-linearity, which I'll call a conv block) operate on local regions of the input image.
Stacking several conv blocks computes a nonlinear filter on the original input image.
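The paper states the composition rule explicitly: a block of kernel size k1 and stride s1 followed by one of kernel size k2 and stride s2 behaves like a single block with kernel size k1 + (k2 - 1) * s1 and stride s1 * s2. A quick check:

```python
# Composition rule for two stacked conv blocks, written as a tiny helper:
# (kernel k1, stride s1) then (kernel k2, stride s2)
# composes to (k1 + (k2 - 1) * s1, s1 * s2).
def compose(k1, s1, k2, s2):
    return k1 + (k2 - 1) * s1, s1 * s2

print(compose(3, 1, 3, 1))  # (5, 1): two 3x3 convs see a 5x5 input region
print(compose(3, 1, 2, 2))  # (4, 2): 3x3 conv + 2x2/stride-2 pool sees 4x4
```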
Dilated convolutions
Same as à-trous convolutions: the idea is to spread the filter over a larger region by skipping activations in the feature map with a regular stride. This increases the receptive field of the filter without upsampling
This trick wasn't actually used in the paper's final experiments
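For illustration, dilation is a one-argument change in PyTorch (a sketch, not the paper's setup):

```python
import torch.nn as nn

# A 3x3 kernel with dilation=2 covers a 5x5 region with the same 9 weights:
# larger receptive field, no pooling, no extra parameters.
dense = nn.Conv2d(64, 64, kernel_size=3, padding=1)                # sees 3x3
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # sees 5x5
```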
Upsampling by deconvolution
deconvolution <==> transposed convolution <==> upconvolution
Space the input activations apart with zeros and then apply a convolution, thereby producing an output larger than the input.
(See deconvolution and dilated convolutions for more details and illustrations)
A stack of deconvolution + non linearities can therefore learn a nonlinear upsampling.
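The paper initializes these upsampling layers to bilinear interpolation and lets them learn from there. A sketch of that initialization, assuming a 2x, 21-channel configuration:

```python
import numpy as np
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Per-channel bilinear upsampling weights (a common deconvolution init)."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((channels, channels, kernel_size, kernel_size), dtype=np.float32)
    for c in range(channels):
        weight[c, c] = filt  # each channel is upsampled independently
    return torch.from_numpy(weight)

# kernel 4 / stride 2 / padding 1 doubles the spatial size of the score maps
up = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
up.weight.data.copy_(bilinear_kernel(21, 4))

scores = torch.randn(1, 21, 10, 10)
print(up(scores).shape)  # torch.Size([1, 21, 20, 20])
```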
Skip layers
Add skip layers that fuse upsampled predictions from deep, coarse layers with predictions from shallower layers, preserving local appearance information (the FCN-16s and FCN-8s variants); see the sketch below.
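A hypothetical sketch of this fusion in the style of FCN-16s (names and sizes assumed): the pool4 features are scored with a 1x1 convolution and summed with the 2x-upsampled coarse scores.

```python
import torch
import torch.nn as nn

# Score the shallower pool4 features, sum with the upsampled coarse scores;
# the fused map is then upsampled (16x in the paper) back to input resolution.
score_pool4 = nn.Conv2d(512, 21, kernel_size=1)
up2 = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)

pool4 = torch.randn(1, 512, 20, 20)       # shallow, spatially fine features
coarse = torch.randn(1, 21, 10, 10)       # scores from the convolutionalized head
fused = score_pool4(pool4) + up2(coarse)  # shape (1, 21, 20, 20)
```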
Results
Used metrics
- Pixel accuracy (proportion of correctly classified pixels among the ground-truth pixels)
- IU (intersection over union), averaged over classes (mean IU)
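A sketch of both metrics computed from a confusion matrix (the paper additionally reports mean accuracy and frequency-weighted IU):

```python
import numpy as np

def seg_metrics(pred, gt, n_classes):
    """Pixel accuracy and mean IU from flat integer label arrays."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    mask = (gt >= 0) & (gt < n_classes)
    hist = np.bincount(
        n_classes * gt[mask].astype(int) + pred[mask],
        minlength=n_classes ** 2,
    ).reshape(n_classes, n_classes)
    pixel_acc = np.diag(hist).sum() / hist.sum()
    iu = np.diag(hist) / (hist.sum(1) + hist.sum(0) - np.diag(hist))
    return pixel_acc, np.nanmean(iu)  # mean IU averages over classes
```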