Deep Learning and GANs - jimenalozano/face-generator GitHub Wiki

Artificial Intelligence

“The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images” - Ian Goodfellow

Artificial intelligence is a fast-growing field with many practical applications and active research topics. Why? Because we want "smart" software that automates routine work, such as understanding speech or images or supporting medical diagnoses.

In the early days of artificial intelligence, the field quickly tackled and solved problems that are intellectually difficult for humans but relatively straightforward for computers, problems that can be described by a list of mathematical rules. The real challenge for artificial intelligence turned out to be solving tasks that are easy for people to do but difficult to describe in mathematical rules, problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in pictures.

The daily life of a person requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive and therefore difficult to articulate in a formal way. Computers need to capture this same knowledge in order to behave intelligently. One of the key challenges in artificial intelligence is how to bring this "informal" knowledge to a computer.

This project is about more than one particular solution: it is about a general approach to these more intuitive problems. This approach is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined through its relationship to simpler concepts. By gathering knowledge from experience, this approach avoids the need for humans to formally specify all the knowledge the computer needs. The concept hierarchy allows the computer to learn complicated concepts by building on simpler ones. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to artificial intelligence deep learning.

Machine learning and Autoencoders

Artificial intelligence systems need the ability to acquire their own knowledge, and they do this by extracting patterns from data. This ability is known as machine learning. The performance of machine learning algorithms depends to a great extent on the representation of the data they are given. Many AI tasks can be solved by designing the right set of features to extract for that task, then providing these features to a machine learning algorithm.

For many tasks, however, it is difficult to know which features to extract. For example, it is very hard to describe exactly what a face looks like in terms of pixel values. One solution to this problem is to use machine learning to discover not only the mapping from representation to output, but also the representation itself. This approach is known as representation learning. A representation learning algorithm can discover a good set of features for a simple task in minutes, or for a complex task in hours. A very popular example of a representation learning algorithm is the autoencoder.

An autoencoder is the combination of an encoder, which converts the input data into a different representation, and a decoder, which converts the new representation back to the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but they are also trained to give the new representation various properties that make it easier for us to experiment with.
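The encoder/decoder round trip can be illustrated with a deliberately tiny example. The sketch below is not this project's model: it is a hypothetical linear autoencoder that compresses 2-D points into a single number and expands them back, trained with plain gradient descent. The function names, the data, the starting weights and the hyperparameters are all assumptions chosen for illustration.

```python
# Toy linear autoencoder: 2-D input -> 1-D code -> 2-D reconstruction.
# Names, data and hyperparameters are illustrative assumptions.

def encode(x, enc):
    """Encoder: compress a 2-D point into a single number (the code)."""
    return enc[0] * x[0] + enc[1] * x[1]

def decode(z, dec):
    """Decoder: expand the 1-D code back into a 2-D point."""
    return [dec[0] * z, dec[1] * z]

def reconstruction_loss(data, enc, dec):
    """Mean squared distance between each input and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, enc), dec)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train(data, enc, dec, lr=0.05, steps=2000):
    """Full-batch gradient descent on the reconstruction loss."""
    enc, dec = list(enc), list(dec)
    n = len(data)
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(x, enc)
            x_hat = decode(z, dec)
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            # Error propagated back through the decoder to the code z.
            back = err[0] * dec[0] + err[1] * dec[1]
            for i in range(2):
                g_dec[i] += 2.0 * err[i] * z / n
                g_enc[i] += 2.0 * back * x[i] / n
        for i in range(2):
            enc[i] -= lr * g_enc[i]
            dec[i] -= lr * g_dec[i]
    return enc, dec

# Points on the line y = 2x: two coordinates, but one factor of variation.
DATA = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
INIT_ENC, INIT_DEC = [0.5, 0.1], [0.1, 0.5]

trained_enc, trained_dec = train(DATA, INIT_ENC, INIT_DEC)
print("loss before:", reconstruction_loss(DATA, INIT_ENC, INIT_DEC))
print("loss after: ", reconstruction_loss(DATA, trained_enc, trained_dec))
```

Because the data has only one underlying factor of variation, a single number is enough to reconstruct each point, so the loss should shrink towards zero as training proceeds. Real autoencoders for images use nonlinear layers and far larger codes.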

When designing features or algorithms for learning features, our goal is usually to separate the factors of variation that explain the observed data. These factors are often not directly observed. Instead, they can exist as unobserved objects or unobserved forces in the physical world that affect observable quantities. They can also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability of the data. For example, with faces, the outline of the face is a factor of variation that strongly influences how we recognize a face in an image.

In machine learning, data is compressed to extract the important information about its characteristics. As the model "learns", it learns features at each layer (edges, angles, etc.) and associates combinations of features with a specific output. Each time a data point passes through the model, the dimensionality of the image is first reduced and then increased again. The compressed representation in the middle lives in what is called the latent space. When we plot or think of points in latent space, we can imagine them as coordinates where points that are "similar" lie closer together. But what makes two faces "more similar"? A face has distinguishable characteristics (e.g., skin color, eye shape, space between eyebrows). All of these can be learned by our models from patterns in edges, angles, etc. Thus, as dimensionality is reduced, "extraneous" information that is specific to each image (e.g., the image background) is "removed" from the latent representation, since only the most important characteristics of each image are stored there.
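The idea that similar faces sit closer together in latent space can be made concrete with a distance measure. The sketch below uses made-up 3-dimensional latent codes and Euclidean distance. The face names and vector values are hypothetical, and Euclidean distance is only one possible choice of similarity measure.

```python
import math

# Hypothetical 3-D latent codes; real models use many more dimensions.
LATENT = {
    "face_a": [0.9, 0.1, 0.3],
    "face_b": [0.8, 0.2, 0.3],    # intended to resemble face_a
    "face_c": [-0.7, 0.9, -0.5],  # intended to look very different
}

def latent_distance(name_1, name_2, codes=LATENT):
    """Euclidean distance between two latent codes (smaller = more similar)."""
    return math.dist(codes[name_1], codes[name_2])

def most_similar(name, codes=LATENT):
    """Return the other face whose latent code is closest to `name`."""
    others = [n for n in codes if n != name]
    return min(others, key=lambda n: latent_distance(name, n, codes))

print(most_similar("face_a"))
```

With these illustrative values, `face_b` is the nearest neighbour of `face_a` because their codes differ only slightly in two coordinates.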

GANs

Generative Adversarial Networks (GANs) are a relatively new concept in machine learning, first introduced in 2014 by Ian Goodfellow and his colleagues. Their goal is to synthesize artificial samples, such as images, that are indistinguishable from authentic ones. GANs are very effective at generating high-resolution, high-quality images, which is our goal for this project, using face images as the input dataset.

A GAN consists of two interacting networks: a Generator network and a Discriminator network.

As its name implies, the Generator network is in charge of generating samples. It takes noise as input, for example a vector drawn from a simple distribution (such as Gaussian or uniform), and transforms that noise into a sample from the distribution we are looking for. For facial images, the noise is transformed into an image of a face, because we are looking for the "true" distribution of faces.
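To make the "noise in, sample out" idea tangible, here is a minimal sketch. It does not use a neural network: `toy_generator` is a hypothetical stand-in that maps standard Gaussian noise onto a one-dimensional target distribution with mean 5 and standard deviation 2. The function names, the target distribution and the noise ranges are assumptions for illustration only.

```python
import random

def sample_noise(rng, dim, kind="gaussian"):
    """Draw a noise vector from a simple distribution."""
    if kind == "gaussian":
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]
    if kind == "uniform":
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    raise ValueError("kind must be 'gaussian' or 'uniform'")

def toy_generator(z, shift=5.0, scale=2.0):
    """Stand-in for a trained generator: a fixed map from noise to samples.

    A real generator is a deep network whose parameters are learned;
    here the 'parameters' shift and scale are simply set by hand.
    """
    return [shift + scale * value for value in z]

rng = random.Random(0)  # seeded so the run is repeatable
samples = toy_generator(sample_noise(rng, 10000))
print("sample mean:", sum(samples) / len(samples))
```

The sample mean should land close to 5, showing that a simple transformation of noise can already reproduce a chosen target distribution. A face generator does the same thing in principle, with a far more complex transformation and target.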

The Generator network is therefore a transformation from one distribution to another, where the target distribution is very complex. But how does the Generator network learn? The other network helps us with that.

The second network, the Discriminator, receives images as input, and its main task is to discriminate between what is real and what is fake. The Discriminator starts out being quite bad at its job, but once trained it will understand what is real and what is fake (or try to, because the Generator will try hard to confuse it). The Discriminator's classification of the Generator's output serves as feedback that the Generator uses to improve its generations. The output of the Discriminator represents the probability that the input is real, so it is a number between 0 and 1. A value of 0 indicates that the input is fake, 1 that it is real, and 0.5 that the Discriminator cannot tell the difference, which is the same probability as flipping a coin.

The Generator will want to generate faces that are as realistic as possible, so that the Discriminator's confidence about what is real and what is not decreases. Ideally, the Discriminator should answer "I don't know" both when it is shown real data and when it is shown generated data.
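The feedback loop between the two networks is usually expressed through loss functions. The text does not say which losses this project uses, so the sketch below assumes the binary cross-entropy formulation from the original 2014 GAN paper, with the commonly used "non-saturating" generator loss. `p_real` and `p_fake` are hypothetical names for the Discriminator's output probabilities on a real image and on a generated image.

```python
import math

def discriminator_loss(p_real, p_fake):
    """Low when the discriminator says 'real' for real data (p_real near 1)
    and 'fake' for generated data (p_fake near 0).
    Both probabilities must lie strictly between 0 and 1."""
    return -(math.log(p_real) + math.log(1.0 - p_fake))

def generator_loss(p_fake):
    """Low when the discriminator is fooled, i.e. p_fake is near 1."""
    return -math.log(p_fake)

# A confident, correct discriminator has a small loss.
print(discriminator_loss(0.9, 0.1))
# The "I don't know" case: 0.5 for both real and generated inputs.
print(discriminator_loss(0.5, 0.5))
# The generator's loss drops as its samples fool the discriminator.
print(generator_loss(0.1), generator_loss(0.9))
```

In the "I don't know" case the Discriminator's loss equals 2 ln 2, which is the value it settles at when generated samples can no longer be told apart from real ones.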

ProGANs

Although the GAN design is elegant, the first implementations all had the same problem: it took too long to train both networks to obtain realistic results. ProGAN, one of the many early GAN variants, introduced a key innovation to make this easier: progressive training. Training starts with very low-resolution images (4×4), and a higher-resolution layer is added at each stage until the final 1024×1024 layer is reached. The network first learns a simple problem before progressing to more complex ones. This means less training time at the lower resolutions, and what is learned at the lower resolutions is reused when training at the higher ones. Overall, training time decreased significantly and high-resolution results were achieved.
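The progressive schedule can be written down in a few lines. The sketch below assumes that the resolution doubles at each stage, as in the ProGAN paper. The function name is hypothetical, and the code only lists the stages rather than training anything.

```python
def progressive_resolutions(start=4, final=1024):
    """List the image resolutions visited by progressive training,
    assuming the resolution doubles each time a new layer is added."""
    resolutions = []
    size = start
    while size <= final:
        resolutions.append(size)
        size *= 2
    return resolutions

for stage, size in enumerate(progressive_resolutions(), start=1):
    print(f"stage {stage}: train at {size}x{size}")
```

Under this assumption, going from 4×4 to 1024×1024 takes nine stages, and most of them operate on images far smaller than the final output, which is where the training-time savings come from.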

Although progressive training meant deconstructing the GAN architecture and training it in multiple stages, layer by layer, this architecture was only used to speed up training. The authors of StyleGAN (A Style-Based Generator Architecture for Generative Adversarial Networks) saw this innovation and used it further to explore the latent space at each layer. As the StyleGAN authors put it:

"Yet the generators continue to operate as black boxes, and despite recent efforts, the understanding of various aspects of the image synthesis process, e.g., the origin of stochastic features, is still lacking. The properties of the latent space are also poorly understood, and the commonly demonstrated latent space interpolations provide no quantitative way to compare different generators against each other."