Practice - HoldenCaulfieldRye/DeepLearnNLP GitHub Wiki

This page guides you through playing with DeepLearnNLP to increase its performance and to see what it has learned.

Step 1: Get DeepLearnNLP up and running
Step 2: Play with DeepLearnNLP's parameters to increase performance
Step 3: See what DeepLearnNLP has learned (you might just want to skip to this)

Step 1: Get DeepLearnNLP up and running

Note: You need to have Matlab installed. Matlab isn't open source, which isn't cool, so I will try to get a Python equivalent working.

  • pull the DeepLearnNLP repo
  • double click on raw_sentences.txt to check out the raw data: the sentences are chopped up into 4-word phrases, and the network is trained to predict the 4th word of each phrase from the 3 words before it
  • in shell: matlab data.m
  • typing load data.mat into your Matlab/Octave terminal loads the organised data into your workspace
  • data.vocab prints the (admittedly limited!) vocabulary that the network is to be trained on
  • [train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100) pre-processes the data (note: Ctrl-Y pastes in Matlab on Linux!): it loads the data, separates it into inputs and targets, and makes mini-batches of size 100 for the training set
  • double click on train.m and fprop.m to take a look at the code for training (including backpropagation) and forward propagation respectively
  • you can now start training the network! model = train(k) will train the network for k epochs and assign the result to model (see the example session below)
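
Putting the commands above together, a minimal Matlab/Octave session might look like this. It is only a sketch based on the descriptions above; the epoch count is arbitrary.

load data.mat                       % load the organised data shipped with the repo
data.vocab                          % inspect the (small) vocabulary
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);  % mini-batches of 100
model = train(10);                  % train for 10 epochs and keep the resulting model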

Step 2: Play with DeepLearnNLP's parameters to increase performance

Note 1: You can tweak each of the parameters mentioned yourself by modifying train.m.
Note 2: Unless stated otherwise, each tweak mentioned below is (to the best of my knowledge) the optimal one, found by trial and error.
Note 3: Apart from the last one, each tweak is evaluated by training for one epoch.
Note 4: The comments below assume knowledge equivalent to the contents of the [wiki intro](https://github.com/HoldenCaulfieldRye/DeepLearnNLP/wiki/Intro).

default parameter values

With

batchsize = 100;  
learning_rate = 0.1;   
momentum = 0.9;  
numhid1 = 50;         % number of neurons in upper hidden layer  
numhid2 = 200;        % number of neurons in lower hidden layer  
init_wt = 0.01;       % standard deviation of Gaussian pdf from which to sample initial weights

the error, measured by cross entropy on the test set, is 3.6.
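
The cross entropy here is the average negative log-probability that the network assigns to the correct 4th word. Assuming natural logarithms, it can be read as follows (a sketch, not code from the repo):

CE = -mean(log(p_correct));   % p_correct: probability assigned to the correct word, one entry per test case
                              % CE = 3.6 means the correct word gets probability exp(-3.6) ≈ 0.03 on average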

Initial weights

Because gradient descent only converges to a local minimum of the error with respect to the weights, the choice of initial weights matters. In this network, the initial weights are sampled from a Gaussian distribution with zero mean and a programmer-defined standard deviation, denoted init_wt.
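
Concretely, each weight matrix is filled with samples like this (the matrix name and dimensions are illustrative, not the repo's actual variables):

W = init_wt * randn(fan_in, fan_out);   % randn draws from a zero-mean, unit-variance Gaussian,
                                        % so scaling by init_wt gives standard deviation init_wt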

With

batchsize = 100;
learning_rate = 0.1;
momentum = 0.9;
numhid1 = 50;
numhid2 = 200;
init_wt = 0.3;

the cross entropy on the test set is 2.83.

The optimal init_wt was 0.3, 30 times the default value. It seems there are some deeper bowls (or, to use the wiki's colloquial term, 'craters') in the error function a little further from the origin than a 0.01 standard deviation can reach.

With init_wt = 1, the final CE was poor, so it seems there are simply lots of bowls of varying depth, which possibly start getting shallower past a certain distance from the origin. It is also interesting that the higher init_wt, the higher the initial CE, so extreme weight values yield bad results on average.

Learning rate

The learning rate scales how far the weights are moved in the direction given by the partial derivatives estimated on the current mini-batch. A high value can speed up convergence by taking bigger strides towards the bottom of the bowl, but towards the end of training it risks stepping past the minimum and oscillating around it.
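
In its simplest form, the update applied after each mini-batch looks something like this (a schematic sketch; variable names are illustrative rather than copied from train.m):

grad = dCE_dweights / batchsize;            % partial derivatives averaged over the mini-batch
weights = weights - learning_rate * grad;   % step downhill, scaled by the learning rate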

With

batchsize = 100;
learning_rate = 0.6;
momentum = 0.9;
numhid1 = 50;   
numhid2 = 200;
init_wt = 0.4;  

[...]

if this_chunk_CE < 4.5
    learning_rate = 0.2;
end
if this_chunk_CE < 2.9
    learning_rate = 0.1;
end
if this_chunk_CE < 2.8
    learning_rate = 0.05;
end

the cross entropy on the test set is 2.746.

The learning rate is reduced as the cross entropy falls. This allows rapid progress at the beginning, when the network is far from trained and big strides carry little risk; as the weights converge, the smaller learning rate prevents overstepping the minimum.

Momentum

Momentum adds a fraction m of the previous weight update to the current one. This helps the system avoid converging to poor local minima. High momentum also helps speed up convergence at the beginning, but creates a risk of overshooting the minimum. A momentum coefficient that is too low cannot reliably avoid local minima, and can also slow down training.
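
Schematically, the update sketched above becomes (again with illustrative variable names):

delta = momentum * delta + grad;             % grad averaged over the mini-batch, as before
weights = weights - learning_rate * delta;   % part of the previous update carries over into this one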

momentum=0.9 was found to be the optimal configuration in this case.

Batchsize

The batch size is the number of training cases combined to estimate the partial derivatives. A large batch size gives more accurate estimates, but results in fewer weight updates per epoch, i.e. less optimisation. The idea below is to increase the batch size as the weights converge and precision becomes more important.
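
For reference, cutting the training set into mini-batches amounts to something like the reshape below (a sketch; load_data in the repo may differ in detail):

% all_train_x (hypothetical name): 3-by-numcases matrix of input word indices
numbatches = floor(size(all_train_x, 2) / batchsize);   % leftover cases that don't fill a batch are dropped
train_x = reshape(all_train_x(:, 1:numbatches*batchsize), 3, batchsize, numbatches);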

With

batchsize = 50;
learning_rate = 0.6;
momentum = 0.9;
numhid1 = 50; 
numhid2 = 200;
init_wt = 0.4;

[...]

if this_chunk_CE < 4.5
    learning_rate = 0.2;
    batchsize = 100;
end
if this_chunk_CE < 3.2 
    learning_rate = 0.1;
    batchsize = 500;
end
if this_chunk_CE < 2.8
    learning_rate = 0.05;
    batchsize = 1000;
end  

The cross-entropy on the test set increased to 3.764. However, the training cross-entropy (i.e. the measure of error on the training set), which until now had always been close to the test cross-entropy, reached 0.354. This is a case of overfitting.

To understand why, consider the statistical properties of estimating partial derivatives from a training batch. The smaller the batch, the higher the standard error of the partial-derivative estimate, and therefore the greater the imprecision. Ironically, this imprecision has a virtue: by nudging weight updates in slightly random directions, it increases the robustness of the network and helps prevent overfitting on the training data.
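
The scaling is the usual one for a sample mean: the standard error of the averaged gradient shrinks like 1/sqrt(batchsize). A quick toy illustration (not code from the repo):

g = randn(1, 100000);                          % pretend per-case gradients for a single weight
se_50   = std(mean(reshape(g, 50,   2000)));   % spread of batch-mean gradients, batchsize = 50
se_1000 = std(mean(reshape(g, 1000, 100)));    % spread of batch-mean gradients, batchsize = 1000
                                               % se_50 comes out roughly sqrt(1000/50) ≈ 4.5 times larger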

This and other parameter tweaks turned out to be suboptimal, so the default value was kept.

Number of Neurons

Increasing the number of neurons per layer raises the number of features the network can learn, but requires greater computational resources. Allowing the model to pick up too many features can lead it to learn patterns that are present in the training set by chance but not in the population, which causes overfitting. Not having enough neurons prevents the model from learning features that are representative of the population, which causes underfitting.
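
To get a rough sense of the computational cost, the number of weights in any full connection between two layers is the product of their sizes (generic arithmetic, not the repo's exact layer wiring):

num_weights = n_in * n_out;   % e.g. 50*200 = 10,000 weights, but 400*400 = 160,000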

With

min_validation_CE = 100;

[...]

batchsize = 100; 
learning_rate = 0.6;
momentum = 0.9;
numhid1 = 400; 
numhid2 = 400;
init_wt = 0.3;

[...]

if this_chunk_CE < 4.5
    learning_rate = 0.2;
end
if this_chunk_CE < 2.9
    learning_rate = 0.1;
end
if this_chunk_CE < 2.8
    learning_rate = 0.05;
end

[...]

if CE - min_validation_CE > 0.05
    force_end = true;
    break;
end
if CE < min_validation_CE
    min_validation_CE = CE;
end

[...]

the test cross entropy is 2.651.

Because the number of neurons per layer is much larger, code was added to stop training as soon as the validation cross entropy increased significantly (i.e. as soon as there were clear signs of overfitting). Indeed, training stopped after 2 epochs (whereas it was programmed to go up to 10).

Being able to improve the fit with such large layers on such a task is a sign that the network might perform better with another hidden layer. To understand why, think of each neuron as enlarging the space of functions that the network searches to find the one that best performs its task. It can be shown mathematically that adding another layer of k neurons enlarges this space far more than adding k neurons to existing layers, so it is a more efficient way of increasing the power of the network. I plan to try this and update the code and this page soon.