# Lecture 6

## Other notes

### Activation function

• The function of an artificial neuron: calculate a weighted sum of its inputs, add a bias, and then decide whether it should "fire", $Y = Wx + b$
• The value of $Y$ can be anywhere from $-\infty$ to $+\infty$, so the neuron has no bound for deciding when to fire
• The activation function is introduced for this purpose
• There are multiple types of activation functions: step, sigmoid, linear, tanh, and ReLU ($\max(0, x)$ - in other words, thresholded at zero)
• Sigmoid - saturated neurons kill the gradients; exp() is expensive to compute; output is not zero-centered
• ReLU - units may die (zero gradient once the input is always negative)! People tend to initialize with slightly positive biases; not zero-centered
• Leaky ReLU - does not saturate and does not die: $\max(\alpha x, x)$ (these are sketched in code after this list)
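
A minimal numpy sketch of the activations above; the Leaky ReLU slope `alpha=0.01` is an assumed, commonly used default:

```python
import numpy as np

def sigmoid(x):
    # Squashes input into (0, 1); saturates for large |x|, killing gradients.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Thresholded at zero: max(0, x).
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for x < 0 keeps the unit from dying.
    return np.maximum(alpha * x, x)

x = np.linspace(-3, 3, 7)
print(sigmoid(x))      # in (0, 1), not zero-centered
print(np.tanh(x))      # in (-1, 1), zero-centered but still saturates
print(relu(x))         # zeros for x < 0
print(leaky_relu(x))   # small negative values instead of zeros
```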

## Notes 1

### Single neuron as linear classifier

• Binary Softmax classifier - interpret $\sigma(w^\top x + b)$ as $P(y = 1 \mid x)$ and train with the cross-entropy loss, i.e., logistic regression (sketched in code below)
• Binary SVM - attach a max-margin hinge loss to the neuron's output instead
• Regularization interpretation - the regularization loss pulls every weight toward zero after each update, which can be viewed as gradual forgetting
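
A sketch of the binary Softmax case: a single sigmoid neuron trained with the cross-entropy loss is exactly logistic regression. The toy data and hyperparameters below are made up for illustration:

```python
import numpy as np

np.random.seed(0)
N, D = 100, 3
X = np.random.randn(N, D)                        # hypothetical inputs
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # hypothetical binary labels

w, b, lr = np.zeros(D), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # neuron output = P(y=1 | x)
    dz = p - y                              # gradient of cross-entropy w.r.t. the score
    w -= lr * X.T @ dz / N
    b -= lr * dz.mean()

print("train accuracy:", ((p > 0.5) == y).mean())
```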

## Lecture 6

### Data preprocessing

• step 1: preprocess the data - zero-center or normalize it; PCA and whitening are further options (see the sketch below)
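
A minimal numpy sketch of these steps, assuming a hypothetical data matrix `X` of shape [N, D] and small epsilon fudge factors:

```python
import numpy as np

X = np.random.randn(50, 10) * 3.0 + 7.0   # hypothetical data, [N, D]

X = X - X.mean(axis=0)                    # zero-center each dimension
X = X / (X.std(axis=0) + 1e-8)            # normalize each dimension

cov = X.T @ X / X.shape[0]                # covariance matrix, [D, D]
U, S, _ = np.linalg.svd(cov)
Xrot = X @ U                              # PCA: decorrelate the data
Xwhite = Xrot / np.sqrt(S + 1e-5)         # whitening: unit variance everywhere
```

For images, usually only zero-centering is applied in practice; PCA and whitening are uncommon.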

### Weight initialization

• Small random weights may work fine for small neural networks, but as the network gets larger or deeper, the activations shrink toward zero layer by layer - and the gradients with them; the experiment below illustrates this
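
A small experiment in the spirit of the lecture's demo, with assumed layer sizes, showing the collapse:

```python
import numpy as np

np.random.seed(1)
x = np.random.randn(1000, 500)            # hypothetical input batch
for layer in range(10):
    W = np.random.randn(500, 500) * 0.01  # small random weights
    x = np.tanh(x @ W)
    print(f"layer {layer}: activation std = {x.std():.6f}")
# The std shrinks toward zero with depth, and since backprop multiplies by
# these activations, the gradients vanish too. Scaling instead by
# np.sqrt(1.0 / 500), i.e. 1/sqrt(fan_in) (Xavier initialization),
# keeps the std roughly stable across layers.
```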

### Batch normalization

Make the activations unit Gaussian.

• compute empirical mean and variance for each dimension
• normalize
• insert after fully connected/convolutional layer, before nonlinearity
• the network can then scale and shift the normalized values, $y = \gamma \hat{x} + \beta$, and can even recover the original range if that is optimal
• benefits: allows higher learning rates; improves gradient flow; reduces dependence on initialization; acts as regularization
• in practice, at test time the mean and std are not computed from the batch; fixed empirical values (e.g., running averages from training) are used instead (see the sketch after this list)
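
A training-time forward pass sketch; `eps` is an assumed numerical fudge factor, and at test time `mu`/`var` would be replaced by the fixed running averages mentioned above:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: [N, D] outputs of a fully connected layer, before the nonlinearity
    mu = x.mean(axis=0)                     # empirical mean per dimension
    var = x.var(axis=0)                     # empirical variance per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to roughly unit Gaussian
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.randn(32, 100) * 4.0 + 2.0    # hypothetical pre-activation batch
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())                    # approximately 0 and 1
```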

### Babysitting the learning process

1. Preprocess the data
• zero-center; normalize
2. Choose the architecture
• disable regularization and check the initial loss; for 10 classes, a softmax loss around $-\ln(1/10) \approx 2.3$ is okay (see the check after this list)
• enable regularization; the loss should go up slightly, since the regularization term is added - that is expected
• start training on a small portion of the data with regularization disabled and simple vanilla SGD; make sure you can overfit it
• then enable a small regularization and find the learning rate that makes the loss go down
• conclusion: if the loss does not go down, the learning rate is too low; if the loss explodes, the learning rate is too high
• a rough range to search for the learning rate is 1e-5 to 1e-3 (a common starting range, not universal for every problem)
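
The arithmetic behind the "2.3 is okay for 10 classes" check: with random weights, the softmax assigns roughly uniform probability to each class, so the expected initial loss is:

```python
import numpy as np

num_classes = 10
expected_loss = -np.log(1.0 / num_classes)  # cross-entropy of a uniform softmax
print(expected_loss)                        # ~2.30 for 10 classes
```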

### Hyperparameter searching

cross-validation strategy

• first, run only a few epochs to get a rough idea of which hyperparameters work
• second, run longer with a finer search; prefer random search over grid search, since random sampling covers more distinct values of each hyperparameter (see the sketch after this list)
• monitor and visualize the loss curve; a big gap between training and validation accuracy means overfitting
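
A sketch of the coarse random-search stage, sampling the learning rate and regularization strength on a log scale; the ranges below are assumed for illustration, not prescriptive:

```python
import numpy as np

for trial in range(10):
    lr = 10 ** np.random.uniform(-6, -3)    # learning rate, log-uniform
    reg = 10 ** np.random.uniform(-5, 5)    # regularization strength, log-uniform
    print(f"trial {trial}: lr={lr:.2e}, reg={reg:.2e}")
    # ... train for a few epochs with (lr, reg), record validation accuracy ...
```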