Fully Convolutional Networks for Semantic Segmentation

Resources

Abstract

  • Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
  • We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task.
  • We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.

Fully Convolutional Network

Implication

  • This is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training.
  • This approach does so without pre- and post-processing complications such as superpixels, proposals, or post-hoc refinement.
  • We define a skip architecture to take advantage of this feature spectrum that combines deep, coarse, semantic information and shallow, fine, appearance information.

Converting fully connected layers to convolutional layers



  • Fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions (see the sketch after this list).
  • Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches.
  • An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.
  • The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation.
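
A minimal sketch of this conversion (PyTorch, assuming torchvision's VGG16; the names `conv6`/`conv7` are illustrative, not from a released implementation): fc6 is recast as a 7×7 convolution and fc7 as a 1×1 convolution by reshaping their weights, so the head accepts inputs of any size.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

vgg = vgg16(weights="IMAGENET1K_V1")

# fc6: Linear(512*7*7 -> 4096) viewed as a 7x7 convolution over 512 channels.
fc6 = vgg.classifier[0]
conv6 = nn.Conv2d(512, 4096, kernel_size=7)
conv6.weight.data.copy_(fc6.weight.data.view(4096, 512, 7, 7))
conv6.bias.data.copy_(fc6.bias.data)

# fc7: Linear(4096 -> 4096) viewed as a 1x1 convolution.
fc7 = vgg.classifier[3]
conv7 = nn.Conv2d(4096, 4096, kernel_size=1)
conv7.weight.data.copy_(fc7.weight.data.view(4096, 4096, 1, 1))
conv7.bias.data.copy_(fc7.bias.data)

# On a 224x224 crop this reproduces the original classifier; on a larger
# input it emits a spatial map of scores instead of a single vector.
features = vgg.features(torch.randn(1, 3, 512, 512))  # (1, 512, 16, 16)
out = conv7(torch.relu(conv6(features)))              # (1, 4096, 10, 10)
```

Because the weights are only reshaped, evaluating this head over a large input amortizes the computation of the original classifier over all overlapping patches at once.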

Upsampling is backwards strided convolution (transposed convolution)



  • Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.
  • Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned.
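
A minimal sketch of such a layer (PyTorch; `bilinear_weight` is a hypothetical helper): a transposed convolution whose filter is initialized to bilinear interpolation but remains an ordinary learnable parameter.

```python
import torch
import torch.nn as nn

def bilinear_weight(channels: int, stride: int) -> torch.Tensor:
    """Weight that makes ConvTranspose2d(kernel=2*stride, stride=stride)
    perform bilinear upsampling, each channel upsampled independently."""
    k = 2 * stride
    f = (k + 1) // 2
    center = f - 1 if k % 2 == 1 else f - 0.5
    og = torch.arange(k, dtype=torch.float32)
    filt1d = 1 - (og - center).abs() / f
    kernel = filt1d[:, None] * filt1d[None, :]
    w = torch.zeros(channels, channels, k, k)
    for c in range(channels):
        w[c, c] = kernel  # identity across channels, bilinear in space
    return w

num_classes = 21  # e.g. PASCAL VOC: 20 classes + background
up2 = nn.ConvTranspose2d(num_classes, num_classes,
                         kernel_size=4, stride=2, padding=1, bias=False)
up2.weight.data.copy_(bilinear_weight(num_classes, 2))
# up2.weight is a regular nn.Parameter: backpropagation from the pixelwise
# loss updates it, so the "deconvolution" filter is learned, not fixed.

scores = torch.randn(1, num_classes, 16, 16)
print(up2(scores).shape)  # torch.Size([1, 21, 32, 32])
```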

Skip Connections



  • The 32-pixel stride at the final prediction layer limits the scale of detail in the upsampled output, leaving it dissatisfyingly coarse.
  • We address this by adding skips that combine the final prediction layer with lower layers with finer strides.
  • Combining fine layers and coarse layers lets the model make local predictions that respect global structure.
  • We add a 1×1 convolution layer on top of pool4 to produce additional class predictions. We fuse this output with the predictions computed on top of conv7 (the convolutionalized fc7) at stride 32 by adding a 2× upsampling layer and summing both predictions. We call this net FCN-16s (a sketch of this fusion follows the list).
  • We continue in this fashion by fusing predictions from pool3 with a 2× upsampling of the predictions fused from pool4 and conv7, building the net FCN-8s.
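
A minimal sketch of the FCN-16s fusion step (PyTorch; `pool4` and `score32` are hypothetical stand-in tensors for the pool4 features and the stride-32 conv7 predictions):

```python
import torch
import torch.nn as nn

num_classes = 21
pool4 = torch.randn(1, 512, 32, 32)            # stride-16 feature map
score32 = torch.randn(1, num_classes, 16, 16)  # stride-32 class predictions

# Extra 1x1 scoring layer on pool4, plus a learnable 2x upsampling of the
# stride-32 predictions; summing the two yields stride-16 predictions.
score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)
upscore2 = nn.ConvTranspose2d(num_classes, num_classes,
                              kernel_size=4, stride=2, padding=1, bias=False)
fused = score_pool4(pool4) + upscore2(score32)  # (1, 21, 32, 32)

# A final learned 16x upsampling returns the fused predictions to input size.
upscore16 = nn.ConvTranspose2d(num_classes, num_classes,
                               kernel_size=32, stride=16, padding=8, bias=False)
output = upscore16(fused)                       # (1, 21, 512, 512)
```

FCN-8s repeats the same pattern one level shallower: score pool3 with a 1×1 convolution, 2×-upsample `fused`, sum, then upsample 8×.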

Result

