Dilated convolutions vs transposed convolutions

Dilated convolutions

Called "à trous" convolutions in the DeepLab semantic segmentation paper, but mostly referred to as dilated convolutions.

The goal of this layer is to increase the size of the receptive field (the set of input activations used to compute a given output) without downsampling (in order to preserve local information). A larger receptive field lets the network use more context (information that is spatially further away).

The output is therefore denser than when convolutions and downsampling are used. The feature maps actually have the same size as the input feature maps (except for border cropping when no padding is used).
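As a quick sanity check, here is a minimal sketch (assuming PyTorch's `nn.Conv2d`, which exposes the dilation rate through its `dilation` argument): a 3x3 kernel with dilation 2 has an effective size of 5, so a padding of 2 keeps the output the same spatial size as the input.

```python
# Minimal sketch, assuming PyTorch: a dilated convolution with matching
# padding preserves the spatial size of the feature map.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                      # (batch, channels, H, W)
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)
print(conv(x).shape)                               # torch.Size([1, 1, 32, 32])
```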

The idea is to spread the sampling positions of the kernel over the input: the convolution reads input values that are spaced apart by the dilation rate and skips the pixels in between.

(Figure: dilated convolution, dilated input sampling)

This is also equivalent to inserting zeros in the filter at all positions where the input values should be ignored (but seeing it as skipping inputs is closer to the computationally efficient way of implementing it).
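A minimal 1D sketch (plain NumPy, with a hypothetical toy input and kernel) of these two equivalent views: reading the input with gaps equal to the dilation rate, versus sliding a kernel that has zeros inserted between its weights.

```python
# Two equivalent views of a 1D dilated convolution (sketch, NumPy only).
import numpy as np

x = np.arange(10, dtype=float)        # toy 1D input
w = np.array([1.0, 2.0, 3.0])         # 3-tap kernel
d = 2                                 # dilation rate

# View 1: skip inputs -- each output reads input values spaced d apart.
out_skip = np.array([
    sum(w[k] * x[i + k * d] for k in range(len(w)))
    for i in range(len(x) - d * (len(w) - 1))
])

# View 2: dilate the kernel by inserting d-1 zeros between its weights,
# then run a plain sliding-window product over the input.
w_dilated = np.zeros(d * (len(w) - 1) + 1)
w_dilated[::d] = w
out_dilated = np.array([
    np.dot(w_dilated, x[i:i + len(w_dilated)])
    for i in range(len(x) - len(w_dilated) + 1)
])

assert np.allclose(out_skip, out_dilated)   # both views give the same output
```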

Deconvolutions

Better name: transposed convolutions

Here the idea is to upsample an initial layer: we spread the input pixels apart, fill the gaps with zeros, and then apply a standard convolution (with stride 1) to the resulting input representation.
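A minimal 1D sketch (NumPy) of this zero-insertion view: spread the input by inserting stride minus one zeros between its samples, then run a stride-1 convolution over the border-padded result. Exact border-padding and kernel-flip conventions vary between frameworks; `np.convolve` in `'full'` mode is used here as one convenient instance of that stride-1 convolution.

```python
# Zero-insertion view of a 1D transposed convolution (sketch, NumPy only).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # toy 1D input
w = np.array([1.0, 2.0, 3.0])         # 3-tap kernel
s = 2                                 # stride of the transposed convolution

# Spread the input: insert s-1 zeros between consecutive samples.
x_up = np.zeros(s * (len(x) - 1) + 1)
x_up[::s] = x

# Stride-1 convolution over the zero-padded, spread-out input.
# 'full' mode pads with len(w)-1 zeros on each border.
out = np.convolve(x_up, w, mode='full')

print(len(x), "->", len(out))         # 4 -> (4 - 1) * 2 + 3 = 9
```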

In the original paper that introduced them, transposed convolutions were transposed versions of the corresponding convolutions from the encoder part of the network (the transposed filters belonging to the decoder part).

As we can see in the illustration, if the stride is more than one, upsampling occurs.

(Figure: transposed convolutions with different strides)

Unlike dilated convolutions, which keep the output the same size as the input (if the input borders are properly padded), "deconvolution" layers actually perform upsampling (the output is larger than the input).
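As a shape check (again a sketch assuming PyTorch), a `nn.ConvTranspose2d` with stride 2, a 4x4 kernel and padding 1 is a common configuration that doubles the spatial size, in contrast to the size-preserving dilated convolution shown earlier.

```python
# Sketch, assuming PyTorch: a stride-2 transposed convolution upsamples,
# here doubling the spatial size of the feature map.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)
up = nn.ConvTranspose2d(1, 1, kernel_size=4, stride=2, padding=1)
print(up(x).shape)                    # torch.Size([1, 1, 32, 32])
```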