Lecture 6
Other notes
Activation function
- The function of an artificial neuron: calculate a weighted sum of its inputs, add a bias, and then decide whether it should "fire" or not: $ Y = W * x + b $
- The value of $Y$ could be anywhere from $-\infty$ to $+\infty$; on its own, the neuron has no bound telling it when to fire
- The activation function is introduced for this purpose
- Multiple types of activation functions: step, linear, sigmoid, tanh, and ReLU ($ \max(0, x) $, in other words thresholded at zero)
- Sigmoid - saturated neurons kill the gradient; exp() is expensive to compute; output is not zero-centered
- ReLU - neurons may die! People tend to initialize with slightly positive biases; output is not zero-centered
- Leaky ReLU - does not saturate and does not die: $ \max(\alpha x, x) $ with a small $\alpha$ (e.g. 0.01)
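A minimal numpy sketch of the main activations above (the leaky-ReLU slope `alpha` is an assumed small constant):

```python
import numpy as np

def sigmoid(x):
    # Saturates for large |x| (killing the gradient); output in (0, 1), not zero-centered.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Thresholded at zero: cheap, non-saturating for x > 0, but units can die.
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope for x < 0 keeps some gradient flowing, so units do not die.
    return np.maximum(alpha * x, x)
```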
Learning rate
- reference: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10
- Hyperparameter that controls how much we adjust the weights with respect to the loss gradient
- new_weight = existing_weight - learning_rate * gradient
- A lower value means slower travel along the downward slope of the loss
- A better way to determine the learning rate: start with a rather low rate and increase it linearly or exponentially at each iteration, then pick the region where the loss drops fastest
- In Keras, such a schedule can be implemented with the `LearningRateScheduler` callback, as sketched below
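A minimal sketch, assuming a compiled tf.keras model; the growth factor 1.5 is an arbitrary illustration of the "increase at each iteration" idea:

```python
import tensorflow as tf

def increase_lr(epoch, lr):
    # Grow the learning rate exponentially each epoch, as in the range test above.
    return lr * 1.5

lr_callback = tf.keras.callbacks.LearningRateScheduler(increase_lr)
# model.fit(x_train, y_train, callbacks=[lr_callback])  # then plot loss vs. learning rate
```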
Notes 1
Single neuron as linear classifier
- Binary Softmax classifier: interpret $\sigma(W x + b)$ as the probability of one class and train with cross-entropy loss (i.e. logistic regression)
- Binary SVM: attach a max-margin hinge loss to the neuron's output
- Regularization interpretation: the L2 penalty shrinks the weights a little on every update (in the biological view, gradual forgetting)
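A rough numpy sketch of the first case - a single neuron trained as a binary softmax (logistic) classifier; the toy data, step size, and regularization strength are made up for illustration:

```python
import numpy as np

# Toy data: 4 examples, 3 features; binary labels. All values are made up.
X = np.array([[ 1.0, -0.5,  0.2],
              [ 0.3,  0.8, -1.0],
              [-0.7,  0.1,  0.9],
              [ 0.5, -0.2, -0.4]])
y = np.array([1, 0, 1, 0])

W = 0.01 * np.random.randn(3)                  # small random weights
b = 0.0
lr, reg = 0.1, 1e-3                            # step size and L2 strength (arbitrary)

for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # sigmoid output = P(y = 1 | x)
    grad = (p - y) / len(y)                    # gradient of the cross-entropy loss
    W -= lr * (X.T @ grad + reg * W)           # the L2 term shrinks W a little each step
    b -= lr * grad.sum()
```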
Lecture 6
Cross validation
Data preprocessing
- Step 1: preprocess the data - zero-center and normalize; optionally PCA and whitening (see the sketch below)
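A minimal numpy sketch of these steps on placeholder data (in practice, for images, usually only zero-centering is applied):

```python
import numpy as np

X = np.random.randn(100, 50)            # placeholder data: 100 examples, 50 features

X -= X.mean(axis=0)                     # zero-center each feature
X /= X.std(axis=0) + 1e-8               # normalize each feature to unit variance

cov = X.T @ X / X.shape[0]              # covariance of the zero-centered data
U, S, _ = np.linalg.svd(cov)
Xrot = X @ U                            # PCA: rotate into the decorrelated basis
Xwhite = Xrot / np.sqrt(S + 1e-5)       # whitening: equalize the scale of every dimension
```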
Weight initialization
- Small (random) weights may work fine for small networks, but as the network gets larger or deeper, the activations (and with them the gradients) shrink until they are all zero
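A sketch of that failure mode, along the lines of the lecture's layer-by-layer experiment; the layer sizes and tanh nonlinearity are illustrative:

```python
import numpy as np

x = np.random.randn(1000, 500)                     # a batch of unit-Gaussian inputs
for layer in range(10):
    W = 0.01 * np.random.randn(500, 500)           # small random init, as above
    # W = np.random.randn(500, 500) / np.sqrt(500) # Xavier-style scaling keeps the std stable
    x = np.tanh(x @ W)
    print(layer, x.std())                          # std collapses toward 0 layer by layer
```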
Batch normalization
Make the activations unit Gaussian:
- compute empirical mean and variance for each dimension
- normalize
- insert after fully connected/convolutional layer, before nonlinearity
- the network can then squash or shift the range with learned parameters: $ y = \gamma \hat{x} + \beta $, where $\hat{x}$ is the normalized activation
- benefits: allows higher learning rates; improves gradient flow; reduces dependence on initialization; acts as regularization
- in practice, at test time the mean and std are not computed from the batch; fixed empirical values (running averages from training) are used instead
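A training-time forward-pass sketch in numpy; the names `gamma`, `beta`, and `eps` are assumptions for illustration:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) activations from a fully connected layer, before the nonlinearity
    mu = x.mean(axis=0)                     # empirical mean per dimension
    var = x.var(axis=0)                     # empirical variance per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to (approximately) unit Gaussian
    return gamma * x_hat + beta             # learned scale and shift: y = gamma * x_hat + beta

# At test time, mu and var are fixed empirical values (running averages from training),
# not batch statistics.
```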
Babysitting learning process
- Preprocess data
- zero-centered; normalized
- Choose architecture
- start with one hidden layer
- disable regularization and check the initial loss; around 2-3 is reasonable for 10 classes (softmax with random small weights gives $-\ln(1/10) \approx 2.3$)
- enable regularization; the loss should go up slightly - that is the expected sanity check
- start training with regularization disabled and simple vanilla SGD, on a small subset of the data first, to make sure you can overfit it
- then enable regularization and find the learning rate that makes the loss go down
- conclusion: if the loss barely goes down, the learning rate is too low; if the loss explodes, the learning rate is too high
- a rough learning-rate range to search is 1e-5 to 1e-3 (a starting range for cross-validation, not a universal rule)
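The initial-loss sanity check as a quick calculation: with 10 classes and no regularization, a softmax classifier with small random weights should start near $-\ln(1/10)$:

```python
import numpy as np

num_classes = 10
expected_initial_loss = -np.log(1.0 / num_classes)
print(expected_initial_loss)   # ~2.30, so an initial loss of about 2-3 is reasonable
```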
Hyperparameter searching
cross-validation strategy
- First stage: only a few epochs, to get a rough idea of which hyperparameters work
- Second stage: longer running time, finer search; prefer random search over grid search (random search samples each hyperparameter dimension more densely)
- Monitor and visualize the loss curve and the train/val accuracy; a big gap between training and validation accuracy means overfitting
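A sketch of the coarse stage as random search, sampling in log space over the rough range above; `train_and_eval` is a hypothetical training helper, not a real API:

```python
import numpy as np

results = []
for _ in range(100):                         # coarse stage: many cheap runs, few epochs each
    lr = 10 ** np.random.uniform(-5, -3)     # sample the learning rate in log space
    reg = 10 ** np.random.uniform(-4, 0)     # sample the regularization strength in log space
    # val_acc = train_and_eval(lr=lr, reg=reg, num_epochs=2)  # hypothetical helper
    # results.append((val_acc, lr, reg))
# For the finer stage, narrow both ranges around the best results and train longer.
```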