Lecture 6
Other notes
Activation function
- The function of an artificial neuron: calculate a weighted sum of its inputs, add a bias, and then decide whether it should "fire" or not: $ Y = W * x + b $
- The value of $Y$ could be anywhere from $-\infty$ to $+\infty$; on its own, the neuron has no bound telling it when to fire
- The activation function is introduced for this purpose
- Multiple types of activation functions: step, linear, sigmoid, tanh, and ReLU ($ \max(0, x) $, in other words thresholded at zero)
- Sigmoid - saturated neurons kill the gradient; exp() is expensive to compute; output is not zero-centered
- ReLU - neurons may die! People tend to initialize with slightly positive biases; output is not zero-centered
- Leaky ReLU - does not saturate and does not die: $ \max(\alpha x, x) $ with a small $\alpha$ (e.g. 0.01)
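A minimal numpy sketch of the main activations above (the leaky-ReLU slope `alpha` is an assumed small constant):

```python
import numpy as np

def sigmoid(x):
    # Saturates for large |x| (killing the gradient); output in (0, 1), not zero-centered.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Thresholded at zero: cheap, non-saturating for x > 0, but units can die.
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope for x < 0 keeps some gradient flowing, so units do not die.
    return np.maximum(alpha * x, x)
```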
Learning rate
- reference: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10
- Hyperparameter that controls how much we adjust the weights with respect to the loss gradient
- new_weight = existing_weight - learning_rate * gradient
- A lower value means slower travel along the downward slope of the loss
- A better way to determine the learning rate: start with a rather low rate and increase it linearly or exponentially at each iteration, then pick the region where the loss drops fastest
- In Keras, such a schedule can be implemented with the `LearningRateScheduler` callback, as sketched below
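A minimal sketch, assuming a compiled tf.keras model; the growth factor 1.5 is an arbitrary illustration of the "increase at each iteration" idea:

```python
import tensorflow as tf

def increase_lr(epoch, lr):
    # Grow the learning rate exponentially each epoch, as in the range test above.
    return lr * 1.5

lr_callback = tf.keras.callbacks.LearningRateScheduler(increase_lr)
# model.fit(x_train, y_train, callbacks=[lr_callback])  # then plot loss vs. learning rate
```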
Notes 1
Single neuron as linear classifier
- Binary Softmax classifier: interpret $\sigma(W x + b)$ as the probability of one class and train with cross-entropy loss (i.e. logistic regression)
- Binary SVM: attach a max-margin hinge loss to the neuron's output
- Regularization interpretation: the L2 penalty shrinks the weights a little on every update (in the biological view, gradual forgetting)
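A rough numpy sketch of the first case - a single neuron trained as a binary softmax (logistic) classifier; the toy data, step size, and regularization strength are made up for illustration:

```python
import numpy as np

# Toy data: 4 examples, 3 features; binary labels. All values are made up.
X = np.array([[ 1.0, -0.5,  0.2],
              [ 0.3,  0.8, -1.0],
              [-0.7,  0.1,  0.9],
              [ 0.5, -0.2, -0.4]])
y = np.array([1, 0, 1, 0])

W = 0.01 * np.random.randn(3)                  # small random weights
b = 0.0
lr, reg = 0.1, 1e-3                            # step size and L2 strength (arbitrary)

for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # sigmoid output = P(y = 1 | x)
    grad = (p - y) / len(y)                    # gradient of the cross-entropy loss
    W -= lr * (X.T @ grad + reg * W)           # the L2 term shrinks W a little each step
    b -= lr * grad.sum()
```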
Lecture 6
Cross validation
Data preprocessing
- Step 1: preprocess the data - zero-center and normalize; optionally PCA and whitening (see the sketch below)
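A minimal numpy sketch of these steps on placeholder data (in practice, for images, usually only zero-centering is applied):

```python
import numpy as np

X = np.random.randn(100, 50)            # placeholder data: 100 examples, 50 features

X -= X.mean(axis=0)                     # zero-center each feature
X /= X.std(axis=0) + 1e-8               # normalize each feature to unit variance

cov = X.T @ X / X.shape[0]              # covariance of the zero-centered data
U, S, _ = np.linalg.svd(cov)
Xrot = X @ U                            # PCA: rotate into the decorrelated basis
Xwhite = Xrot / np.sqrt(S + 1e-5)       # whitening: equalize the scale of every dimension
```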
Weight initialization
- Small (random) weights may work fine for small networks, but as the network gets larger or deeper, the activations (and with them the gradients) shrink until they are all zero
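A sketch of that failure mode, along the lines of the lecture's layer-by-layer experiment; the layer sizes and tanh nonlinearity are illustrative:

```python
import numpy as np

x = np.random.randn(1000, 500)                     # a batch of unit-Gaussian inputs
for layer in range(10):
    W = 0.01 * np.random.randn(500, 500)           # small random init, as above
    # W = np.random.randn(500, 500) / np.sqrt(500) # Xavier-style scaling keeps the std stable
    x = np.tanh(x @ W)
    print(layer, x.std())                          # std collapses toward 0 layer by layer
```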
Batch normalization
Make the activations unit Gaussian:
- compute empirical mean and variance for each dimension
- normalize
- insert after fully connected/convolutional layer, before nonlinearity
- the network can then squash or shift the range with learned parameters: $ y = \gamma \hat{x} + \beta $, where $\hat{x}$ is the normalized activation
- benefits: allows higher learning rates; improves gradient flow; reduces dependence on initialization; acts as regularization
- in practice, at test time the mean and std are not computed from the batch; fixed empirical values (running averages from training) are used instead
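A training-time forward-pass sketch in numpy; the names `gamma`, `beta`, and `eps` are assumptions for illustration:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) activations from a fully connected layer, before the nonlinearity
    mu = x.mean(axis=0)                     # empirical mean per dimension
    var = x.var(axis=0)                     # empirical variance per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to (approximately) unit Gaussian
    return gamma * x_hat + beta             # learned scale and shift: y = gamma * x_hat + beta

# At test time, mu and var are fixed empirical values (running averages from training),
# not batch statistics.
```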
Babysitting learning process
- Preprocess data
- zero-centered; normalized
- Choose architecture
- start with one hidden layer
- disable regularization and check the initial loss; around 2-3 is reasonable for 10 classes (softmax with random small weights gives $-\ln(1/10) \approx 2.3$)
- enable regularization; the loss should go up slightly - that is the expected sanity check
- start training with regularization disabled and simple vanilla SGD, on a small subset of the data first, to make sure you can overfit it
- then enable regularization and find the learning rate that makes the loss go down
- conclusion: if the loss barely goes down, the learning rate is too low; if the loss explodes, the learning rate is too high
- a rough learning-rate range to search is 1e-5 to 1e-3 (a starting range for cross-validation, not a universal rule)
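The initial-loss sanity check as a quick calculation: with 10 classes and no regularization, a softmax classifier with small random weights should start near $-\ln(1/10)$:

```python
import numpy as np

num_classes = 10
expected_initial_loss = -np.log(1.0 / num_classes)
print(expected_initial_loss)   # ~2.30, so an initial loss of about 2-3 is reasonable
```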
Hyperparameter searching
cross-validation strategy
- First stage: only a few epochs, to get a rough idea of which hyperparameters work
- Second stage: longer running time, finer search; prefer random search over grid search (random search samples each hyperparameter dimension more densely)
- Monitor and visualize the loss curve and the train/val accuracy; a big gap between training and validation accuracy means overfitting
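A sketch of the coarse stage as random search, sampling in log space over the rough range above; `train_and_eval` is a hypothetical training helper, not a real API:

```python
import numpy as np

results = []
for _ in range(100):                         # coarse stage: many cheap runs, few epochs each
    lr = 10 ** np.random.uniform(-5, -3)     # sample the learning rate in log space
    reg = 10 ** np.random.uniform(-4, 0)     # sample the regularization strength in log space
    # val_acc = train_and_eval(lr=lr, reg=reg, num_epochs=2)  # hypothetical helper
    # results.append((val_acc, lr, reg))
# For the finer stage, narrow both ranges around the best results and train longer.
```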