Exploring the Latent Space

What is Latent Space?

Latent space simply means a representation of compressed data. If you have trained a model to classify digits, then you have also trained it to learn the ‘structural similarities’ between images. In fact, this is how the model is able to classify digits in the first place: by learning the features of each digit. If this process seems ‘hidden’ from you, that’s because it is. Latent, by definition, means “hidden.”

The concept of “latent space” is important because its utility is at the core of ‘deep learning’: learning the features of data and simplifying data representations for the purpose of finding patterns.

Explore the GAN Latent Space

GANs

The generative model in the GAN architecture learns to map points in the latent space to generated images. The latent space has no meaning other than the meaning applied to it via the generative model. Yet the latent space has structure that can be explored, such as by interpolating between points or performing vector arithmetic between them; these operations have meaningful and targeted effects on the generated images.

Through training, the generator learns to map points in the latent space to specific output images, and this mapping will be different each time the model is trained. The latent space has structure when interpreted by the generator model, and this structure can be queried and navigated for a given model. A series of points can be created on a linear path between two points in the latent space, such as the points for two generated images. These points can be used to generate a series of images that show a transition between the two generated images (see the sketch below). Finally, points in the latent space can be kept and used in simple vector arithmetic to create new points that, in turn, can be used to generate images. This is an interesting idea, as it allows for the intuitive and targeted generation of images.
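As a concrete illustration, linear interpolation between two latent points is just a weighted average. The sketch below assumes a 512-dimensional latent space and a hypothetical `generator` object with a Keras-style `predict` method; both are placeholders, not part of this repository’s API.

```python
import numpy as np

def interpolate_points(p1, p2, n_steps=10):
    """Return n_steps points on the straight line from p1 to p2."""
    ratios = np.linspace(0.0, 1.0, num=n_steps)
    return np.array([(1.0 - r) * p1 + r * p2 for r in ratios])

z1 = np.random.randn(512)        # latent code of the first image
z2 = np.random.randn(512)        # latent code of the second image
zs = interpolate_points(z1, z2)  # shape: (10, 512)
# images = generator.predict(zs) # one frame per step of the morph
```

Generating an image for each row of `zs` yields a smooth morph between the two faces.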

In this latent space, faces and face features (e.g., maleness) can be represented as linear combinations of each other, and different concepts (e.g., male, smile) can be manipulated using simple linear operations.
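For example, the classic “smiling woman − neutral woman + neutral man ≈ smiling man” arithmetic can be sketched as below. In practice each attribute vector would be the average of several latent codes whose generated faces share the attribute, which makes the direction more reliable than a single sample; all names here are hypothetical stand-ins.

```python
import numpy as np

# Stand-ins for averaged latent codes: each would really be the mean of
# the codes of several generated faces sharing the named attribute.
smiling_woman = np.random.randn(512)
neutral_woman = np.random.randn(512)
neutral_man = np.random.randn(512)

# Latent vector arithmetic: add the "smile" direction to a male face.
smiling_man = smiling_woman - neutral_woman + neutral_man
# image = generator.predict(smiling_man[np.newaxis, :])
```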

(Figure: FaceLatentSpace)

StyleGAN2 Architecture and Latent Space

The distinguishing feature of StyleGAN is its unconventional generator architecture. Instead of feeding the input latent code z ∈ Z only to the beginning of the network, the mapping network f first transforms it to an intermediate latent code w ∈ W. Affine transforms then produce styles that control the layers of the synthesis network g via adaptive instance normalization (AdaIN). Additionally, stochastic variation is facilitated by providing additional random noise maps to the synthesis network. It has been demonstrated that this design allows the intermediate latent space W to be much less entangled than the input latent space Z. Therefore, W is the relevant latent space from the synthesis network’s point of view.
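A minimal sketch of the mapping network, assuming the usual 512-dimensional Z and W and an 8-layer MLP (the official implementation adds details such as an equalized learning rate that are omitted here):

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Sketch of StyleGAN's mapping network f: Z -> W (8-layer MLP)."""

    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Pixel-norm: project z onto (roughly) the unit hypersphere first.
        z = z * torch.rsqrt(z.pow(2).mean(dim=1, keepdim=True) + 1e-8)
        return self.net(z)  # intermediate latent code w

f = MappingNetwork()
w = f(torch.randn(4, 512))  # w has shape (4, 512)
```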

One of the main strengths of StyleGAN is the ability to control the generated images via style mixing, i.e., by feeding a different latent w to different layers at inference time. In practice, style modulation may amplify certain feature maps by an order of magnitude or more.
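Style mixing can be sketched as follows, assuming a hypothetical synthesis network that accepts one w per layer (18 layers matches a 1024×1024 generator); the crossover point decides which layers take their style from which source:

```python
import torch

def style_mix(f, g_synthesis, n_layers=18, crossover=8, dim=512):
    """Feed one latent's styles to the coarse layers and another's to
    the fine layers. `g_synthesis` is assumed (hypothetically) to take
    a list of per-layer latent codes."""
    w1 = f(torch.randn(1, dim))  # coarse styles: pose, face shape
    w2 = f(torch.randn(1, dim))  # fine styles: color, microstructure
    ws = [w1 if i < crossover else w2 for i in range(n_layers)]
    return g_synthesis(ws)
```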

(Figure: StyleGAN2’s redesign of the StyleGAN synthesis network. (a) The original StyleGAN, where A denotes a learned affine transform from W that produces a style and B is a noise broadcast operation. (b) The same diagram in full detail: AdaIN is broken into explicit normalization followed by modulation, both operating on the mean and standard deviation per feature map; the learned weights (w), biases (b), and constant input (c) are annotated, and the gray boxes are redrawn so that one style is active per box. (c) Several changes to the original architecture, justified in the StyleGAN2 paper: redundant operations at the beginning are removed, the addition of b and B is moved outside the active area of a style, and only the standard deviation is adjusted per feature map. (d) The revised architecture enables instance normalization to be replaced with a “demodulation” operation, applied to the weights associated with each convolution layer.)
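The demodulation step itself is a small computation on the convolution weights. A sketch following the modulation and demodulation equations of the StyleGAN2 paper: modulation scales each input feature map’s weights by its style, and demodulation rescales so each output feature map has unit expected standard deviation (assuming unit-variance input activations).

```python
import torch

def modulate_demodulate(weight, style, eps=1e-8):
    """Weight (de)modulation, StyleGAN2's replacement for AdaIN.

    weight: conv kernel of shape (out_ch, in_ch, kh, kw)
    style:  per-input-channel scales s_i of shape (in_ch,)
    """
    # Modulate: w'_ijk = s_i * w_ijk
    w = weight * style.reshape(1, -1, 1, 1)
    # Demodulate: w''_ijk = w'_ijk / sqrt(sum_{i,k} w'_ijk^2 + eps)
    demod = torch.rsqrt(w.pow(2).sum(dim=(1, 2, 3), keepdim=True) + eps)
    return w * demod

w2 = modulate_demodulate(torch.randn(64, 32, 3, 3), torch.rand(32) + 0.5)
```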

Ideally, a fixed-size step in W should result in a non-zero, fixed-magnitude change in the image. We can measure the deviation from this ideal empirically by stepping in random directions in image space and observing the corresponding w gradients. These gradients should have close to equal length regardless of w or the image-space direction, indicating that the mapping from the latent space to image space is well-conditioned.
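This measurement can be sketched with automatic differentiation: pick a random unit direction in image space and compute the length of the corresponding gradient with respect to w. This is a simplification of the Jacobian-based path length regularizer; `g` stands in for any differentiable generator.

```python
import torch

def w_gradient_length(g, w):
    """Length of the w-gradient along a random unit image-space direction.
    If these lengths are nearly constant across w's and directions, the
    mapping from W to image space is well-conditioned."""
    w = w.detach().clone().requires_grad_(True)
    y = g(w)                      # generated image
    u = torch.randn_like(y)
    u = u / u.norm()              # random unit direction in image space
    (grad,) = torch.autograd.grad((y * u).sum(), w)
    return grad.norm().item()
```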

Projection of images to latent space

The synthesis network implements a mapping g: W → Y that takes a latent code w ∈ W to an image y = g(w) ∈ Y.

Inverting the synthesis network g is an interesting problem with many applications. Manipulating a given image in the latent feature space requires first finding a matching latent code w for it. Previous research suggests that results improve if, instead of finding a common latent code w, a separate w is chosen for each layer of the generator; the same approach was used in an early encoder implementation. While extending the latent space in this fashion finds a closer match to a given image, it also enables projecting arbitrary images that should have no latent representation. Instead, we concentrate on finding latent codes in the original, unextended latent space, as these correspond to images that the generator could have produced.

Given a target image x, we seek the corresponding w ∈ W and per-layer noise maps n_i ∈ ℝ^{r_i×r_i}, where i is the layer index and r_i denotes the resolution of the i-th noise map. The baseline StyleGAN generator at 1024×1024 resolution has 18 noise inputs, i.e., two for each resolution from 4×4 to 1024×1024 pixels. Our improved architecture has one fewer noise input because we do not add noise to the learned 4×4 constant. Before optimization, we compute µ_w = E_z[f(z)] by running 10,000 random latent codes z through the mapping network f. We also approximate the scale of W by computing σ²_w = E_z ||f(z) − µ_w||², i.e., the average squared Euclidean distance to the center. At the beginning of optimization, we initialize w = µ_w and draw each n_i from a unit Gaussian, n_i ∼ N(0, I).
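A sketch of this initialization, assuming a mapping network `f` like the one sketched earlier:

```python
import torch

def latent_stats(f, n_samples=10_000, dim=512):
    """Estimate the center and scale of W by sampling the mapping network."""
    with torch.no_grad():
        w = f(torch.randn(n_samples, dim))
    mu_w = w.mean(dim=0)                              # mu_w = E_z[f(z)]
    sigma_sq_w = (w - mu_w).pow(2).sum(dim=1).mean()  # E_z ||f(z) - mu_w||^2
    return mu_w, sigma_sq_w

# Initialization before projection: start at the center of W, and draw
# each noise map from a unit Gaussian at its layer's resolution.
# w = mu_w.clone()
# noise = [torch.randn(r, r) for r in resolutions]  # one entry per noise input
```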

The trainable parameters are the components of w as well as all components of all noise maps n_i. The optimization runs for 1000 iterations using the Adam optimizer with default parameters. The maximum learning rate is λ_max = 0.1; it is ramped up from zero linearly during the first 50 iterations and ramped down to zero with a cosine schedule during the last 250 iterations. For the first three quarters of the optimization, we add Gaussian noise to w when evaluating the loss function. This adds stochasticity to the optimization and stabilizes the search for the global optimum.
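The learning-rate schedule and the noise added to w can be sketched as follows. The ramp lengths come from the text above; the noise magnitude (a factor of 0.05 times the square root of σ²_w, decaying to zero over the first 750 iterations) is an assumption modeled on the official projector and should be treated as such.

```python
import math

def lr_schedule(step, total=1000, lr_max=0.1, rampup=50, rampdown=250):
    """Linear ramp-up over the first 50 steps, cosine ramp-down over the
    last 250, constant lr_max in between."""
    if step < rampup:
        return lr_max * step / rampup
    if step >= total - rampdown:
        t = (step - (total - rampdown)) / rampdown
        return lr_max * 0.5 * (1.0 + math.cos(math.pi * t))
    return lr_max

def w_noise_scale(step, sigma_w, total=1000, ramp=0.75, factor=0.05):
    """Scale of the Gaussian noise added to w; reaches zero after the
    first three quarters of optimization (factor 0.05 is an assumption)."""
    t = step / total
    return factor * sigma_w * max(0.0, 1.0 - t / ramp) ** 2
```

In the projection loop itself, each Adam parameter group’s learning rate would be set to `lr_schedule(step)`, and the loss evaluated at `w + torch.randn_like(w) * w_noise_scale(step, sigma_w)`.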