Intro

From Machine Learning Concepts to Deep Convolutional Neural Network Concepts

This page aims to give you an intuitive but deeper understanding of how deep learning works, and of the specifics of the convolutional neural network architecture used in DeepLearnNLP.

Basics

The motivation for machine learning is to find the function that maps an input of interest to an output of interest, with a very high degree of accuracy. The objectives are to find, (1) as quickly as possible, the (2) most accurate function from the (3) broadest possible set of candidate functions. That's pretty abstract, so it might help to think of examples of 'inputs of interest' and 'outputs of interest':

  • finding the function that maps 'a facebook photo' to 'the facebook users present in it'
  • finding the function that maps 'facebook inbox message text' to 'goods or services sought by the user (if any)'
  • finding the function that maps 'a tourist smartphone photo' to 'the monument present in it' (then link to its wikipedia page)
  • finding the function that maps 'stock market data' to 'future stock prices'
  • finding the function that maps 'google car video footage' to 'road signs and human obstacles present'
  • finding the function that maps 'a medical patient's symptoms & data' to 'illness'
    The list goes on... if you can think of any original ones, add them here!

Techniques

  • Supervised & Unsupervised
  • Classification & Regression
  • Reinforcement

Supervised Classification Example

Let's say we need to detect a pattern from a set of known patterns, using a 2-tone image of 10x10 pixels as input. The known patterns are our target set, and the total possible input set size is 2^100. By creating examples for each pattern in our target set, we can build a training set. Each element in the training set is labeled with its matching element in the target set.

The training / learning phase now uses the training data as it begins to formulate the desired function. Any result can be tested against a test set, and the resulting accuracy used to guide further training.

It's supervised because the data is labeled, and it's classification because there can only be one correct answer from a known set.
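Here's a minimal sketch of this setup in Python (plain numpy; the pattern count, noise level and all names are invented for illustration), using nearest-neighbour matching as one simple way to realise the learned function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target set: 5 known 10x10 two-tone patterns, flattened to 100 bits.
target_set = rng.integers(0, 2, size=(5, 100))

# Training set: 20 noisy copies of each pattern, labeled with the pattern's index.
train_x = np.array([p ^ (rng.random(100) < 0.05) for p in target_set for _ in range(20)])
train_y = np.repeat(np.arange(5), 20)

def classify(image):
    # Label a query with the label of its nearest training example (Hamming distance).
    return train_y[np.argmin(np.abs(train_x - image).sum(axis=1))]

query = target_set[3] ^ (rng.random(100) < 0.05)  # a noisy copy of pattern 3
print(classify(query))                            # expected: 3
```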

Unsupervised learning with regression

Training on unlabeled data lets the system find previously unknown ways of grouping the input and discover relationships (clustering). If the outcome is a continuous value (the output of a formula, for example) rather than a discrete label, it's called regression.

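A minimal sketch of one unsupervised technique, k-means clustering (plain numpy; the blob data and all names are made up for illustration):

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    # Group unlabeled points into k clusters by alternating assignment and re-centering.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        centers = np.array([points[assign == j].mean(axis=0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return assign, centers

# Two unlabeled blobs; k-means recovers the grouping without ever seeing a label.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
assign, centers = kmeans(blobs, k=2)
```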

Reinforcement Learning

"...finding suitable actions to take in a given situation in order to maximize reward" (Bishop, Pattern Matching and Machine Learning, pg 3)
Finding the result is done by a process of trial and elimination. Using a mix of exploration (trying out new actions) and exploitation (using actions that previously led to a positive outcome).

Neural Nets

A neural net is a graph where nodes represent neurons, and edges represent synapses (or data buses). Data travels from one end of the network to the other via the edges. Each neuron's output is the result of some operation on the data it receives. The graph is made up of several layers. The top layer is the output layer, where each neuron represents a class known by the classifier. The output value of neuron k is the probability that the candidate being fed into the network is a member of class k. As for the other layers, it depends. See below.

Perceptrons

A feed-forward neural net is called a perceptron if there are no layers between the input and output layers. In that case, the bottom layer needs to be a vector of features from the image (eg how many right-angle corners are in the image, what percentage of the pixels are green, etc). These features are defined by a computer vision programmer. The machine learning consists in using the labeled data to figure out the relative importance of the features. The relative importance is expressed as weights, which can then be used to classify better. But as you can see, the difficulty lies in specifying features correctly. This is much easier to do with facial recognition than with recognising hundreds of plant species in cluttered backgrounds.
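To see the learning of relative feature importances in action, here's a minimal sketch of the classic perceptron learning rule (plain numpy; the features and labels are made-up placeholders):

```python
import numpy as np

def train_perceptron(features, labels, epochs=50, lr=0.1):
    # Learn the relative importance (weights) of hand-crafted features.
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(features, labels):
            y = float(b + x @ w >= 0)    # binary threshold output
            w += lr * (t - y) * x        # only changes anything on a mistake
            b += lr * (t - y)
    return w, b

# Hypothetical features: [fraction of green pixels, number of right-angle corners].
X = np.array([[0.9, 1.0], [0.8, 0.0], [0.1, 1.0], [0.2, 0.0]])
t = np.array([1.0, 1.0, 0.0, 0.0])       # eg 1 = plant, 0 = not plant
w, b = train_perceptron(X, t)
```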

Deep Neural Nets

The neural net is "deep" if there is at least one intermediate layer. These layers are called hidden layers, because they are not entirely under the programmer's control: the programmer doesn't hard-code what they do, but rather leaves it under the learning algorithm's control. What often ends up happening is that the hidden layers become feature detectors. Moreover, a higher hidden layer ends up detecting the presence of features which combine and build upon the simpler features detected in the layer below. The top layer can be viewed as doing the same, where a class specification is a specific combination of very complex features.

With deep neural nets, the bottom layer is the input layer, where each neuron takes as input a component of the data vector. So the number of input neurons is the dimensionality of the data. Our input vector will contain at least as many dimensions as there are pixels in the image to recognise. It might also have neurons for GPS data, or for multiple images if the user is multi-querying. The dimensionality is going to be high!

Deep neural nets are a good strategy for learning features because they can learn non-linear patterns. Features are typically complex patterns, and complex patterns are typically non-linear. It was mathematically proven in 1969 (by Minsky and Papert) that a perceptron cannot learn a non-linear pattern.

Macro Architectures

In the nets discussed above, for any non-output layer k, data in layer k flows to layer k+1. This flow is imposed, and defines a feed-forward macro architecture. (Macro because it's a key aspect of the architecture, but lots of other aspects exist as well.) But there are actually other types of macro architecture (that we probably won't need to worry about in this project).

Feed-Forward neural networks

   -> the ones discussed above

Recurrent neural networks

-> hidden layers contain directed cycle(s)
	-> creates complex dynamics, making them difficult to train (convergence issues?)
	-> lots of interest at the moment in finding efficient ways of training recurrent neural nets
		-> because very powerful if they can be trained
		-> also because more biologically realistic

-> a very natural way to model sequential data
	-> connections between hidden nodes?
	-> hidden units form a subnetwork that is very deep in time
	-> at each time step, the values of the hidden nodes determine the values of the hidden nodes at the next time step
	-> weights are constant through time
	-> all hidden nodes receive input and emit output at every time step
		-> combined with cycles, this STORES MEMORY (temporarily, but for a potentially long time)
			-> but very difficult to train them to use this ability
			-> however, recent algorithms have succeeded! read Ilya Sutskever (2011)

Symmetrically connected networks

-> for every pair of adjacent nodes u and v, w(uv) = w(vu)
-> symmetric networks are much easier to analyse than recurrent networks
-> but more restricted in what they can do, because they obey an energy function
	-> cannot model cycles

Models of Neurons


Linear Neuron:

-> y = b + sum(i=1:n) {x[i]*w[i]}
-> y: output
-> b: bias
-> x: inputs
-> w: weights
    -> n: number of input neurons ie neurons sending their output to the given neuron

Binary Threshold Neuron:

-> y = 1 if M <= b + sum(i=1:n) {x[i]*w[i]}
       0 otherwise
-> M is the threshold
-> y takes a hard decision, just like biological neurons
	-> so y produces spikes
-> x[i] is the truth value of a relevant proposition
-> a way of combining information from many (possibly unreliable) sources of information
    -> n: number of input neurons ie neurons sending their output to the given neuron

Rectified Linear Neuron:

-> y = max{ 0 ; b+sum(i=1:n){x[i]*w[i]} }
            -> b can be negative!
-> semi soft, semi hard decision making
	-> below a certain threshold I jump to certainty (output exactly zero), but above that my confidence is incremental
	-> doesn't produce spikes like a biological neuron
	-> may produce zero flow, and only ever increases flow gradually

Logistic aka Sigmoid Neuron:

-> y = 1 / (1 + exp(-z))
   z = b + sum(i=1:n) {x[i]*w[i]}
-> uses logistic function, logistic(z)
-> most commonly used
	-> fully differentiable
	-> bijection between ]-infinity; +infinity[ and ]0;1[
	-> y(0) = 0.5
	-> gets harder and harder to convince me further
     -> computationally expensive
            -> cuda-convnet uses rectified linear neurons instead for that reason

Stochastic Binary Neuron:

-> P(y=1) = 1 / (1 + exp(-z))
     P(y=0) = 1 - P(y=1)
-> y is a Bernoulli random variable whose prob mass function is given by the logistic value
-> the number of spikes produced per time frame follows a Poisson distribution, the parameter of which is deterministic
	-> parameter is a function of logistic value
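The neuron models above all share the same weighted input z = b + sum(i=1:n) {x[i]*w[i]} and differ only in the function applied to it. A minimal numpy sketch of the five models:

```python
import numpy as np

rng = np.random.default_rng(0)

def z(x, w, b):
    # The weighted input common to all models: z = b + sum(i=1:n) {x[i]*w[i]}
    return b + x @ w

def linear(zv):
    return zv

def binary_threshold(zv, M=0.0):
    return (zv >= M).astype(float)

def rectified_linear(zv):
    return np.maximum(0.0, zv)

def logistic(zv):
    return 1.0 / (1.0 + np.exp(-zv))

def stochastic_binary(zv):
    # Spikes with probability given by the logistic value.
    return (rng.random(np.shape(zv)) < logistic(zv)).astype(float)
```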

Softmax Neuron

Equation

-> y[j] = exp(z[j]) / sum(k=1:n) {exp(z[k])}
-> this enforces sum(j=1:n) {y[j]} = 1
-> conveniently, d_y[j]/d_z[j] = y[j]*(1 - y[j])

Intuition

Suppose that in a layer, we are trying to assign probabilities to mutually exclusive classes (often the case for the output layer). We know that the sum of the layer's outputs should be 1, but we are depriving the network of this knowledge! Eg "prob class 1 is 3/4 and prob class 2 is 3/4" is a crazy answer; we ought to tell our network this somehow. We can communicate this with the softmax neuron.
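A minimal sketch of a softmax layer in numpy (the max-subtraction trick is a standard numerical-stability measure, not something required by the maths above):

```python
import numpy as np

def softmax(z):
    # Turn a layer's weighted inputs z into probabilities that sum to 1.
    e = np.exp(z - z.max())   # subtracting the max avoids overflow; doesn't change the result
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # -> approx [0.66, 0.24, 0.10], sums to 1
```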

Limitations of squared error for training a softmax neuron

Suppose the task is classification, so for a given training case/object, the value of output neuron[k] is the probability that the object is of class k. Consider the absolute difference between the output value and the target value (ie the correct output value); call this the 'diff'. Now suppose the true label is k but output neuron[k] outputs the value 0.1. That's a pretty big mistake, so the network should be punished, ie the error should be high. But with squared error, error = (1-0.1)^2 = 0.81. That's tiny relative to the span of values an error function can take! Weights will be under-updated, and learning will be slow. We need an error function which assigns high values when diff is near 1 and small values when diff is near 0. Note however that squared error is good for training linear neurons: in general, squared error is good for neurons with output domains such that diff can go from 0 to infinity. With softmax neurons, however, diff only goes from 0 to 1.

Optimal error function: cross-entropy

CE = -sum(j=1:m) {t[j]*log(y[j])} = H(q) + D(q||p)

where m is the number of neurons in the layer, p is the discrete probability distribution (ie a pmf, probability mass function) defined by the y[j], ie the output of the layer, and q is the discrete probability distribution defined by the t[j], ie the target output of the layer. H(q) is the entropy of q (a measure of the uncertainty in q as a pmf), and D(q||p) is the Kullback-Leibler divergence of q with respect to p (a bit like a measure of the 'distance' between q and p). Notice the very high gradient when the target value = 1 and the output ~ 0.
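A minimal sketch of the cross-entropy error in numpy, showing how a confident mistake is punished much more heavily than a mild one (the eps guard against log(0) is an implementation detail, not part of the formula):

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    # CE = -sum(j=1:m) {t[j]*log(y[j])}; eps guards against log(0)
    return -np.sum(t * np.log(y + eps))

t = np.array([0.0, 1.0, 0.0])                       # true class is class 2
print(cross_entropy(t, np.array([0.2, 0.7, 0.1])))  # mild error: approx 0.36
print(cross_entropy(t, np.array([0.7, 0.1, 0.2])))  # confident mistake: approx 2.30
```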

The Backpropagation Algorithm

The Wikipedia article on it is great. http://en.wikipedia.org/wiki/Backpropagation But if you want something more superficial, read on.

This supervised learning algorithm is used to train deep feed-forward neural networks. Its aim is to find values for all the weights in the network that minimise the network's prediction error (you need labeled data to compute prediction error, which is why it only works with supervised learning). As a 1st step, it efficiently estimates the partial derivative of the error with respect to every weight (ie "how much would the error increase by if I were to slightly increase the value of this weight?"). As a 2nd step, it uses these partial derivative estimates to update, ie learn, the weight values.

Compute Error-Weight Partial Derivatives

This might seem straightforward: tweak the value of one weight, see how the error changes, and you've got your partial derivative with respect to that weight. That would be called randomly perturbing weights. But computing the error means feeding a training case through the entire network. You also need a statistically significant estimate, so you'd have to measure the error with many (typically 100) cases. And you'd have to do all of that one weight at a time (otherwise you don't know which of the weights you've simultaneously tweaked is responsible for what proportion of the error change). With cuda-convnet, there are 60 million weights.
Geoffrey Hinton and others realised in the 80s that instead of randomly perturbing weights, you can perturb the output of neurons (one by one), and use the error-neuronOutput partial derivatives to compute all of the error-weight partial derivatives (using the chain rule from calculus). In a 2-layer network of a+b neurons, for example, you have a*b weights, so the efficiency gain is significant. This is the algorithm's achievement.
That's backpropagation stage 1.

Update Weight Values (With Gradient Descent)

At every learning iteration, backpropagation computes these slopes, and uses the values to update the weights. It stops updating weights once all slopes are zero (ie no small change in the value of any weight will reduce the error). That's gradient descent.
To get the intuition for learning in general, and gradient descent specifically: given K weights and an error function, picture the K+1 dimensional space (or rather picture it for K=2!) where the vertical dimension is where the error is plotted, and all of the other dimensions are where weight values are plotted. Now picture the surface that is obtained as the weight vector spans all of its possible values. (for K=2, picture the surface as plain ground with bumps and craters here and there).
Learning is simply about finding the lowest point on that surface, ie finding the weights that minimise the error.
Graphically, backpropagation stops once the bottom of a crater has been reached. But we have no idea whether this crater is the deepest one. Worse still, by default (we'll have to check whether this is also the case with cuda-convnet) the first crater to be met is the final one. Backprop doesn't explore elsewhere to see if a deeper crater can be found.
If we're lucky, the error function has only one crater. Either that, or the initial weights - the ones the network began training with - are somewhere on the slope of the deepest crater. So the initial weights are crucial.
(And gradient descent seems pretty naive).
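To tie the two stages together, here's a minimal sketch of backpropagation plus gradient descent on a tiny 2-layer network of logistic neurons, trained on XOR with squared error (the layer sizes, learning rate and iteration count are arbitrary choices for illustration; cuda-convnet's actual training loop is far more elaborate):

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training data: not linearly separable, so a perceptron couldn't learn it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # 4 hidden logistic neurons
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # 1 logistic output neuron

lr = 1.0
for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # stage 1: error derivatives, chained backwards from outputs to weights
    dy = (y - T) * y * (1 - y)       # squared-error slope at the output neurons
    dh = (dy @ W2.T) * h * (1 - h)   # propagated back to the hidden neurons
    # stage 2: gradient descent, a small step down each weight's error slope
    W2 -= lr * (h.T @ dy); b2 -= lr * dy.sum(axis=0)
    W1 -= lr * (X.T @ dh); b1 -= lr * dh.sum(axis=0)

print(np.round(y.ravel(), 2))  # should approach [0, 1, 1, 0] (may need more iterations)
```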

Deep Convolutional Neural Nets

Deep CNNs (often simply referred to as CNNs) are a type of deep neural net. They were used for the first time at ImageNet in 2012, where they thrashed competing classifiers built by world-class computer vision teams, and they were used by the winning teams at both ImageNet 2012 and ImageNet 2013.

Dimension Hopping

CNNs seem to be the architecture of choice for image recognition with supervised learning, because they are good at tackling dimension hopping in the data. With images, each pixel defines a dimension, and since a leaf (or any feature) can appear anywhere on the photo, a feature can appear in any group of dimensions. This is very difficult to deal with. For example, consider the classification task of diagnosing illnesses for medical patients based on symptom data. Imagine if the heartbeat data were sometimes written under the "height" section! It would be a total mess.

Replicated Feature Detection

CNN architecture is a way of replicating feature detection across the whole image. This is done by modifying the backpropagation algorithm to force certain edge weights to be the same (weight sharing). Since any one feature has to be potentially picked up anywhere on the image, and there are lots of features to detect, hidden layers (especially the lower ones) have lots of neurons. But because weights are replicated, the number of parameters to optimise isn't as high, so the required computation remains feasible.
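A minimal sketch of the replicated-feature idea: one small set of shared weights (a kernel) is slid over the whole image, so the same feature detector fires wherever the feature appears (plain numpy; the kernel is a hypothetical vertical-edge detector):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide ONE shared set of weights (the kernel) over the whole image:
    # the same feature detector is replicated at every position.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.zeros((10, 10))
image[2:5, 6] = 1.0                          # a short vertical edge, anywhere on the image
kernel = np.array([[-1.0, 1.0, -1.0]] * 3)   # hypothetical vertical-edge detector
response = conv2d_valid(image, kernel)       # peaks wherever the feature appears
```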

Training data vs Validation data vs Test data

Momentum

Data type: real
Domain: [0, 1]
Typical value: 0.9
Meaning: momentum simply adds a fraction m of the previous weight update to the current one. The momentum parameter is used to help the system escape local minima and saddle points. A high momentum parameter can also help to increase the speed of convergence. However, setting it too high creates a risk of overshooting the minimum, which can make the system unstable. A momentum coefficient that is too low cannot reliably avoid local minima, and can also slow down training.
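A minimal sketch of the momentum update described above (hypothetical names; the velocity should start at zeros and carry over between iterations):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, m=0.9):
    # Add a fraction m of the previous weight update to the current one.
    velocity = m * velocity - lr * grad
    return w + velocity, velocity

# Usage sketch: the velocity accumulates across successive gradient steps.
w, velocity = np.ones(3), np.zeros(3)
grad = np.array([0.5, -0.2, 0.1])            # hypothetical error gradient
w, velocity = momentum_step(w, grad, velocity)
```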