
Course 2: Setting up your ML application

Note: This course introduces common hyper-parameters in deep learning; setting them up properly is essential for successfully running algorithms like CNNs, LSTMs, etc.

2.1.1 Train/dev/test sets

After training, use the dev set to see which of many different models is best. After one final model is selected, test it on the test set. Not having a test set might be okay.

When the whole available data set is very large, say a million examples, empirically the ratio between the training split and the dev split should be about 99% vs 1%, rather than the 70% vs 30% split used for smaller data sets.

Make sure the examples in both the training and dev sets are drawn from the same statistical distribution.

2.1.2 Bias/Variance

When the training set error is acceptable (comparable to human-level error) but the dev set error is relatively large, the model may suffer from a high-variance problem. ---> overfitting

If the training set error is far from human-level error, but the gap between the training error and the dev set error is small, the model may suffer from a high-bias problem. ---> underfitting

In other words, the most desirable model is not always the one that fits the training set almost perfectly but performs poorly on the dev set; it might instead allow a reasonable amount of error on the training set while also being evaluated decently on the dev set.

If the training error is far from ideal and the dev error is even much worse, the model may suffer from both high bias and high variance.

Human-level error can also be regarded as an approximation of the Bayes error, i.e., the optimal error.

Should we use cross validation in deep learning?

What can cross validation actually be used for?

2.1.3 Basic recipe for machine learning

When high bias happens, try a bigger network (more hidden layers or hidden units), train longer, or try more advanced optimization algorithms.

After solving the high-bias problem, check whether the model also suffers from high variance. If so, use more training data when possible, or try regularization.

For traditional supervised learning algorithms you need to pay close attention to the bias-variance trade-off, but the trade-off is less severe in deep learning as long as the network is both large enough and well regularized.

2.1.4 Regularizing your neural network

When the model is overfitting (high variance), you should regularize it, especially when more training data is not easy to get.

L2 regularization is the most common form. Only the weight parameters are regularized, not the bias terms, because the weights are of much higher dimensionality whereas each bias is just a low-dimensional vector.

The regularization parameter λ is a hyper-parameter and should be fine-tuned: empirically try out different λ values and see which one does best on the dev set.
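
A minimal sketch of how L2 regularization changes the cost and the weight gradients, assuming numpy and a simple dictionary of weight matrices (names like `W1`, `W2` are illustrative):

```python
import numpy as np

def l2_cost_and_grad_penalty(cross_entropy_cost, weights, lambd, m):
    """Add the L2 penalty (lambda / 2m) * sum ||W||^2 to the cost.

    weights: dict of weight matrices, e.g. {"W1": ..., "W2": ...}
    lambd:   regularization hyper-parameter lambda
    m:       number of training examples
    """
    l2_penalty = (lambd / (2 * m)) * sum(
        np.sum(np.square(W)) for W in weights.values()
    )
    cost = cross_entropy_cost + l2_penalty
    # During backprop, each dW gains an extra (lambda / m) * W term:
    grad_penalties = {name: (lambd / m) * W for name, W in weights.items()}
    return cost, grad_penalties
```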

2.1.6 Dropout regularization

Dropout is a popular deep learning regularization technique. During each training iteration, we knock out neurons in each hidden layer with a certain probability. For example, if the drop probability is set to 0.5, each neuron is dropped with probability 0.5, so on average half of the neurons do not participate in that iteration. Each iteration therefore trains a much smaller neural net, which helps prevent overfitting.

The dropout probability is also a hyper-parameter that needs to be fine-tuned.
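
A minimal sketch of the inverted-dropout implementation taught in the course, applied to one layer's activations `a`; note that it is parameterized by `keep_prob`, the probability of keeping a unit:

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8):
    """Randomly zero out units and rescale so the expected activation stays the same."""
    mask = np.random.rand(*a.shape) < keep_prob  # True with probability keep_prob
    a = a * mask                                 # drop the other units
    a = a / keep_prob                            # inverted scaling keeps E[a] unchanged
    return a, mask  # mask is reused in backprop; at test time, no dropout is applied
```

The division by `keep_prob` is what makes the dropout "inverted": it means no extra scaling is needed at test time.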

2.1.9 Normalizing inputs

When features are on very different scales, it's necessary to normalize them to make training faster.

First subtract the mean of the training inputs from each input, then divide by the standard deviation (so each feature has unit variance). With normalized inputs, gradient descent will hopefully converge faster.
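
A minimal sketch, assuming features are stored along rows of X as in the course; the key point is to reuse the training-set statistics for the test set:

```python
import numpy as np

def normalize_inputs(X_train, X_test):
    """Zero-mean, unit-variance normalization; X has shape (n_features, m_examples)."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    # Use the SAME mu and sigma for train and test, so both come from one distribution.
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```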

2.1.10 Vanishing/exploding gradients

Assume the neural net is very deep. If every weight matrix is slightly larger than the identity matrix, the activations (and hence the gradients) grow exponentially with depth and explode; if every weight matrix is slightly smaller than the identity, they shrink exponentially and vanish.
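
A toy numeric illustration of that exponential effect, using diagonal weight matrices (the depth of 50 and the factors 1.5/0.5 are arbitrary choices, and activations are omitted for simplicity):

```python
import numpy as np

x = np.ones(4)
W_big = 1.5 * np.eye(4)    # weights "bigger than" the identity
W_small = 0.5 * np.eye(4)  # weights "smaller than" the identity

for W, label in [(W_big, "explodes"), (W_small, "vanishes")]:
    a = x.copy()
    for _ in range(50):    # 50 linear layers
        a = W @ a
    print(label, a[0])     # ~1.5**50 (huge) vs ~0.5**50 (near zero)
```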

2.1.11 Weight initialization for deep nets

A partial solution to the vanishing/exploding gradients problem is to choose the weight initialization more carefully.
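
For example, the course recommends scaling the initial variance by the number of inputs to each layer; a minimal sketch (He initialization for ReLU layers, Xavier for tanh):

```python
import numpy as np

def init_layer(n_prev, n_curr, activation="relu"):
    """He initialization (variance 2/n_prev) for ReLU, Xavier (1/n_prev) for tanh."""
    scale = np.sqrt(2.0 / n_prev) if activation == "relu" else np.sqrt(1.0 / n_prev)
    W = np.random.randn(n_curr, n_prev) * scale
    b = np.zeros((n_curr, 1))  # biases can simply start at zero
    return W, b
```

Keeping each layer's output variance near 1 this way stops the exponential growth or decay described above from setting in immediately.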

2.1.14 Gradient checking implementation notes

Don't run gradient checking during training; use it only to debug.

Andrew's previous tutorials explain gradient checking in more detail.
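
For reference, a minimal sketch of the two-sided numerical check covered there, comparing the analytic gradient against (J(θ+ε) − J(θ−ε)) / (2ε); the helper names here are illustrative:

```python
import numpy as np

def gradient_check(cost_fn, theta, analytic_grad, epsilon=1e-7):
    """Compare analytic gradients against two-sided numerical estimates."""
    numeric_grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus.flat[i] += epsilon
        theta_minus.flat[i] -= epsilon
        numeric_grad.flat[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    diff = np.linalg.norm(numeric_grad - analytic_grad) / (
        np.linalg.norm(numeric_grad) + np.linalg.norm(analytic_grad)
    )
    return diff  # ~1e-7 is great; ~1e-3 or worse suggests a backprop bug
```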

2.2.2 Understanding mini-batch gradient descent

When the mini-batch size equals the size of the whole training set, it's called batch gradient descent.

When the mini-batch size equals 1, it's stochastic gradient descent.

The weights are updated after processing each mini-batch of training example(s).

In practice, when the training set is very large, the mini-batch size should be neither too large nor too small (typical values: 64, 128, 256); it's a hyper-parameter. In contrast, with a small training set, batch gradient descent works just fine.

Usually, finding the exact minimum of the cost function is either too time-consuming (batch gradient descent) or never quite achieved because the updates keep oscillating around it (SGD or mini-batch gradient descent), so the goal is just to find a reasonably good optimum.
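
A minimal sketch of shuffling a training set and cutting it into mini-batches, assuming examples are stored as columns of X as in the course exercises:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the training set and cut it into mini-batches of batch_size columns."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [
        (X[:, k:k + batch_size], Y[:, k:k + batch_size])
        for k in range(0, m, batch_size)  # the last batch may be smaller
    ]
```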

2.2.5 Bias correction in exponentially weighted average

Bias correction helps reduce the bias at the start of an exponentially weighted average (EWA) calculation. The technique matters little if you don't care much about the initial phase, since the bias fades by itself after the first iterations.
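
A minimal sketch of the update v = βv + (1 − β)θ_t with the bias-corrected estimate v / (1 − β^t):

```python
def ewa_with_bias_correction(values, beta=0.9):
    """Exponentially weighted average with bias correction for the initial phase."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))  # correction matters most for small t
    return corrected
```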

2.2.6 Gradient descent with momentum

When the learning rate is too large, gradient descent oscillates too violently, which might even lead to diverging from the optimum. Such oscillations also slow down the learning process.

Momentum is a technique to reduce oscillations. At each training iteration, instead of using the raw partial derivatives of W and b, we use exponentially weighted averages (EWAs) of them. Essentially we are averaging out the gradients to smooth the learning path towards the optimum, so training converges faster.

β in the momentum EWA formula is a hyper-parameter; 0.9 is a commonly used value in practice.
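
A minimal sketch of one momentum update step (variable names are illustrative):

```python
def momentum_update(W, b, dW, db, vW, vb, beta=0.9, lr=0.01):
    """One momentum step: update with the EWA of the gradients, not the raw gradients."""
    vW = beta * vW + (1 - beta) * dW
    vb = beta * vb + (1 - beta) * db
    W = W - lr * vW
    b = b - lr * vb
    return W, b, vW, vb
```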

Adam optimization algorithm

An overview of gradient descent optimization algorithms

The Adam algorithm combines RMSprop with momentum.

Default values are β_1 = 0.9 (the momentum term) and β_2 = 0.999 (the RMSprop term). The default value for ε is 10^-8.

Adam works well in practice and compares favorably to other adaptive learning-method algorithms.
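
A minimal sketch of one Adam update step for a single parameter, combining both moving averages with bias correction:

```python
import numpy as np

def adam_update(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (v) plus RMSprop (s), both bias-corrected at step t."""
    v = beta1 * v + (1 - beta1) * dw            # momentum term
    s = beta2 * s + (1 - beta2) * dw ** 2       # RMSprop term
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```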

2.2.9 Learning rate decay

This means allowing the learning rate to decrease automatically and gradually. As we approach the optimum, the steps become smaller and oscillate less, so a decent optimum can be reached faster.
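
One common schedule from the course is α = α₀ / (1 + decay_rate × epoch_num); in code:

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num). Other schedules (exponential,
    staircase) follow the same idea of shrinking alpha as training progresses."""
    return alpha0 / (1 + decay_rate * epoch_num)
```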

2.3.1 Tuning process

The most important hyper-parameter to tune is the learning rate α. Second come the momentum β, the number of hidden units, and the mini-batch size; third, the number of layers and the learning rate decay. We usually keep the default values of β_1, β_2, and ε.

Normalizing activations in a network

https://www.quora.com/Why-does-batch-normalization-help

Batch normalization potentially helps in two ways: faster learning and higher overall accuracy. It also allows you to use a higher learning rate, potentially providing another boost in speed.

Why does this work? Well, we know that normalization (shifting inputs to zero-mean and unit variance) is often used as a pre-processing step to make the data comparable across features. As the data flows through a deep network, the weights and parameters adjust those values, sometimes making the data too big or too small again - a problem the authors refer to as "internal covariate shift". By normalizing the data in each mini-batch, this problem is largely avoided.

Basically, rather than performing normalization just once at the beginning, you're doing it all over the place. Of course, this is a drastically simplified view of the matter (for one thing, it completely ignores the post-processing updates applied to the entire network), but hopefully it gives a good high-level overview.

Update: For a more detailed breakdown of gradient calculations, check out: Understanding the backward pass through Batch Normalization Layer

2.3.5 Fitting Batch Norm into a neural network

In practice, batch norm is usually applied with mini-batches.
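
A minimal sketch of the batch-norm forward pass over one mini-batch of pre-activations Z, with learned scale γ and shift β (named `gamma` and `beta` here):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit over the mini-batch, then scale and shift.

    Z has shape (n_units, batch_size); gamma and beta have shape (n_units, 1)
    and are learned alongside W and b.
    """
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta
```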

2.3.6 Why does Batch Norm work?

An intuitive example of the so-called covariate shift problem: you train a model to judge whether a picture is a cat or not, using data that follows a certain distribution. Then another set of cat/non-cat data arrives with a different distribution; in such a case, covariate shift occurs.

Batch norm counters this problem inside the network and hence speeds up training; it also has a slight regularization (anti-overfitting) effect.

2.3.8 Softmax regression

Softmax regression can be used for multi-class classification tasks. Given an input, softmax assigns a normalized conditional probability to each class label in the output, where "normalized" means all these probabilities sum to 1. With only 2 possible class labels, softmax reduces to logistic regression.

How to calculate softmax probabilities?

At the output layer, exponentiate each unit's linear output z_i; the probability for class i is e^{z_i} divided by the sum of the exponentials of all the output units.
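
A minimal numpy sketch, with the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    """Softmax over the output layer's linear values z, shape (n_classes, m)."""
    t = np.exp(z - np.max(z, axis=0, keepdims=True))  # subtract max for stability
    return t / np.sum(t, axis=0, keepdims=True)       # each column sums to 1
```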