Style Transfer

Style Representation

The style of an image is represented in terms of local image structures captured at different scales. Deep convolutional neural networks capture local image structures at different scales because the receptive field size grows with increasing depth in the network hierarchy. With increasing depth, the network can represent larger and more complex local image structures. The style representation is simply the set of correlations between the responses of different filters in the network.

Style Transfer

The principle of style transfer rests on the separability of content and style representations in CNNs. Using this property, the style of one image can be transferred onto another image while the content of the destination image is preserved. There are two components in style transfer: one preserves the content of the original image, while the other makes the style match that of the artwork (the style source). The two components are optimised simultaneously through a loss function that ensures the original content is retained and the style matches that of the artwork.
Deep convolutional neural networks that perform well on tasks like object recognition have this ability to separate content from style in natural images: when a network is trained to recognise an object in an image, it has to become invariant to all image variations that preserve object identity.

Capturing Style & Content

The content of an image is defined by the feature maps produced at a layer l of the deep neural network. If a network is used to capture content, the loss function is defined as the squared-error loss between the feature maps produced by the lth layer for the original image (say Pl) and the feature maps produced for the generated image (Fl). When this loss is minimised, the generated image is said to capture the content of the original image. A white-noise image is used as the initial image, and the generated image is updated from it at every iteration.
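As a rough illustration, the content loss can be written as a small PyTorch function. This is a minimal sketch: PyTorch, the tensor shapes, and the variable names P_l and F_l are assumptions made here for illustration, not details from these notes.

```python
import torch

def content_loss(P_l: torch.Tensor, F_l: torch.Tensor) -> torch.Tensor:
    # Squared-error loss between the layer-l feature maps of the
    # original image (P_l) and the generated image (F_l).
    return 0.5 * torch.sum((F_l - P_l) ** 2)
```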

The style of an artwork is defined by the correlations between feature responses, captured as the Gram matrix Gl at layer l, where the entry Gl(i, j) is the inner product between the vectorised feature maps i and j at layer l. To capture the style of an artwork image, the mean-squared distance between the Gram matrices of the artwork and the generated image is minimised. The loss for each layer is weighted by a factor wl, and the total style loss is the weighted sum of the losses over all layers.
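A possible sketch of the Gram matrix and the layer-weighted style loss is shown below, again assuming PyTorch. The 1/(4 N^2 M^2) normalisation follows the common Gatys et al. formulation and is an assumption here, since the notes only state that a mean-squared distance between Gram matrices is used.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat has shape (channels, height, width); each channel is
    # vectorised, and G[i, j] is the inner product between the
    # vectorised feature maps i and j.
    c, h, w = feat.shape
    f = feat.view(c, h * w)
    return f @ f.t()

def style_loss(gen_feats, art_feats, weights):
    # Weighted sum over layers of the mean-squared distance between
    # the Gram matrices of the generated image and the artwork.
    loss = 0.0
    for w_l, g, a in zip(weights, gen_feats, art_feats):
        c, h, w = g.shape
        G, A = gram_matrix(g), gram_matrix(a)
        loss = loss + w_l * torch.sum((G - A) ** 2) / (4 * c ** 2 * (h * w) ** 2)
    return loss
```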

The joint loss function used to minimise the content and style losses is defined as a weighted sum of the two: Total Loss = alpha x Content Loss + beta x Style Loss.
Here the content loss is computed relative to the original image and the style loss is computed relative to the artwork.
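The combination itself is straightforward; a minimal sketch follows (the alpha and beta values are illustrative only, and in practice the pixel values of a white-noise image are optimised by gradient descent on this quantity):

```python
def total_loss(loss_content, loss_style, alpha=1.0, beta=1e3):
    # Weighted sum of the content and style losses; the ratio
    # alpha / beta controls the trade-off between content and style.
    return alpha * loss_content + beta * loss_style
```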

Neural network model

Layers of the VGG network are used to produce the content and style representations. The fully connected layers are not used, and the max-pooling layers are replaced with average pooling, which improves the visual appeal of the results.
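A minimal sketch of such a feature extractor is given below, assuming PyTorch and torchvision; the pretrained-weight argument and the layer indices used for feature extraction are illustrative assumptions, not choices taken from these notes.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load the convolutional part of VGG-19 (the fully connected layers are dropped).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

# Replace every max-pooling layer with average pooling, as described above.
vgg = nn.Sequential(*[
    nn.AvgPool2d(kernel_size=2, stride=2) if isinstance(m, nn.MaxPool2d) else m
    for m in vgg
])

def extract_features(img: torch.Tensor, layer_ids=(3, 8, 17, 26)):
    # Run img (shape (1, 3, H, W)) through the network and collect the
    # feature maps produced at the requested layer indices.
    feats, x = [], img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_ids:
            feats.append(x)
    return feats
```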