
CVPR 2015

[1411.4038] Fully Convolutional Networks for Semantic Segmentation [PDF] [notes]

Jonathan Long, Evan Shelhamer, Trevor Darrell

read 7/07/2017

Objective

Use fully convolutional networks (FCNs) to obtain pixel-level semantic segmentation

For this purpose, classification networks are adapted and repurposed to preserve spatial information.

Synthesis

From classification to heatmaps

Classification networks take fixed-size inputs and produce fixed-size outputs (a score vector whose length is the number of classes)

To adapt them to variable-size inputs and outputs, the fully connected layers are cast as convolutional layers with kernels that cover their entire input region (the previous layer).

Applied to an image larger than the one used by the original classification net, the network then outputs, for each class, a heatmap whose locations correspond to the various input patches. This computation is done in a single pass, sharing the computational burden of overlapping patches (compared to feeding the patches in individually, which is equivalent in terms of final output).
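
In practice, casting a fully connected layer as a convolution just reshapes its weight matrix into a kernel spanning the whole feature map. A minimal sketch (in PyTorch; the paper's implementation is in Caffe, and the layer sizes here match VGG's fc6 but are otherwise illustrative):

```python
import torch
import torch.nn as nn

# The classification net flattens a 512 x 7 x 7 feature map into a
# 4096-d fully connected layer (as in VGG's fc6):
fc = nn.Linear(512 * 7 * 7, 4096)

# The equivalent convolution has a kernel covering the whole 7 x 7 region:
conv = nn.Conv2d(512, 4096, kernel_size=7)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

# On a 7 x 7 input both compute the same scores; on a larger input the conv
# version slides over the feature map and outputs a spatial grid of scores.
x = torch.randn(1, 512, 7, 7)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5)
```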

These changes allow training this new network on semantic segmentation by providing ground-truth segmentations as targets.

Unfortunately, this produces a heavily downsampled segmentation map, due to the several downsampling layers that are used to keep the filters small and reduce computational requirements.

Convolutions

Each layer of data in a convnet has a size h x w x d (height x width x channel dimension)

Convolutions imply translation invariance, as their basic components (conv, pool and non-linearity functions, which I'll call conv blocks) operate on local regions of the input image.

Stacking several conv blocks computes a nonlinear filter on the original input image.
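
The paper formalizes this composition with a transformation rule: a conv block f with kernel size k and stride s, applied after a block g with kernel size k' and stride s', behaves like a single conv block,

```latex
f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'}
```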

Dilated convolutions

Same as à-trous convolutions: the idea is to spread the filter over a larger region by skipping some activations of the feature layer at a regular stride. This increases the receptive field of the filter without upsampling.
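
For illustration, a minimal PyTorch sketch (channel counts and input size are arbitrary): with dilation 2, a 3x3 kernel samples a 5x5 region.

```python
import torch
import torch.nn as nn

dense = nn.Conv2d(64, 64, kernel_size=3, padding=1)                # 3x3 field
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 5x5 field

# Both preserve the 32 x 32 resolution; only the receptive field of the
# dilated filter grows, with no extra parameters and no upsampling.
x = torch.randn(1, 64, 32, 32)
print(dense(x).shape, dilated(x).shape)
```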

Dilation wasn't actually used in the article's final experiments.

Upsampling by deconvolution

deconvolution <==> transposed convolution <==> upconvolution

Space out the input with zeros and then apply a convolution, thereby producing an output larger than the input.

(See deconvolution and dilated convolutions for more details and illustrations)

A stack of deconvolutions and non-linearities can therefore learn a nonlinear upsampling.
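
A minimal PyTorch sketch of such an upsampling layer, with the kernel initialized to bilinear interpolation as in the paper (the 21-class output and 2x factor here are illustrative):

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Build a (channels, channels, k, k) bilinear upsampling kernel."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size).float()
    filt = 1 - torch.abs(og - center) / factor
    kernel = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel  # each class channel is upsampled independently
    return weight

n_classes = 21  # PASCAL VOC: 20 classes + background, as in the paper
upscore = nn.ConvTranspose2d(n_classes, n_classes, kernel_size=4, stride=2,
                             padding=1, bias=False)
with torch.no_grad():
    upscore.weight.copy_(bilinear_kernel(n_classes, 4))

x = torch.randn(1, n_classes, 16, 16)
print(upscore(x).shape)  # torch.Size([1, 21, 32, 32]): 2x upsampling
```

Starting from bilinear interpolation, the deconvolution weights can then be refined by backpropagation like any other layer.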

Skip layers

Add skip layers that fuse the coarse final-layer predictions with predictions from earlier, finer-resolution layers (pool4 and pool3, giving the FCN-16s and FCN-8s variants), preserving local appearance information; see the sketch below.
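
A minimal sketch (PyTorch, with hypothetical tensor shapes) of this fusion in the FCN-16s style: score the pool4 feature map with a 1x1 convolution, add the 2x upsampled coarse scores, then upsample the fused map back to input resolution.

```python
import torch
import torch.nn as nn

n_classes = 21
score_pool4 = nn.Conv2d(512, n_classes, kernel_size=1)  # fine, shallow scores
up2 = nn.ConvTranspose2d(n_classes, n_classes, 4, stride=2, padding=1)
up16 = nn.ConvTranspose2d(n_classes, n_classes, 32, stride=16, padding=8)

pool4 = torch.randn(1, 512, 32, 32)         # stride-16 features (hypothetical)
coarse = torch.randn(1, n_classes, 16, 16)  # stride-32 final scores

fused = score_pool4(pool4) + up2(coarse)  # deep semantics + local detail
out = up16(fused)                         # back to stride-1 resolution
print(out.shape)  # torch.Size([1, 21, 512, 512])
```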

Results

Used metrics

  • Pixel accuracy (proportion of correctly classified pixels among all labeled pixels)

  • Mean IU (intersection over union, averaged over classes); see the sketch below
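
Both metrics can be computed from a pixel-level confusion matrix; a small NumPy sketch following the paper's definitions, where n[i, j] counts pixels of true class i predicted as class j:

```python
import numpy as np

def segmentation_metrics(n):
    t = n.sum(axis=1)                # pixels of each ground-truth class
    correct = np.diag(n)             # correctly classified pixels per class
    pixel_acc = correct.sum() / n.sum()
    # Per-class IU: intersection / union of prediction and ground truth.
    iu = correct / (t + n.sum(axis=0) - correct)
    return pixel_acc, np.nanmean(iu)  # mean IU averages over classes

# Toy 2-class example: 85/100 pixels correct.
n = np.array([[50, 5],
              [10, 35]])
print(segmentation_metrics(n))
```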