Lecture 7
SGD problems
 Local minima or saddle points: the gradient is (near) zero, so plain gradient descent gets stuck
 Gradients come from mini-batches, so they are noisy
 Solution: SGD + Momentum, which adds a velocity term with friction (rho)
 Refinement: Nesterov momentum, which evaluates the gradient at the look-ahead point x + rho*v instead of at the current point (sketch below)
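A minimal numpy-style sketch of both updates, in the spirit of the lecture pseudocode; compute_gradient, x, rho, and learning_rate are placeholders assumed to be defined:

```python
# SGD + Momentum: maintain a velocity as a running mean of gradients;
# rho acts like friction (typically 0.9 or 0.99).
vx = 0
while True:
    dx = compute_gradient(x)   # placeholder: mini-batch gradient at x
    vx = rho * vx + dx
    x -= learning_rate * vx

# Nesterov momentum: evaluate the gradient at the look-ahead point
# x + rho * vx rather than at x (common rearranged form).
vx = 0
while True:
    dx = compute_gradient(x)
    old_v = vx
    vx = rho * vx - learning_rate * dx
    x += -rho * old_v + (1 + rho) * vx
```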
AdaGrad
 Adds element-wise scaling of the gradient by the accumulated history of squared gradients
 $ grad_squared += dx * dx $
RMSprop
 $ grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx $  (a leaky AdaGrad; sketches of both below)
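Sketches of both updates, with the same placeholder convention as above (the 1e-7 just avoids division by zero):

```python
import numpy as np

# AdaGrad: accumulate all squared gradients; steps along high-gradient
# directions shrink over time (and never recover).
grad_squared = 0
while True:
    dx = compute_gradient(x)   # placeholder
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

# RMSProp: same idea, but with a leaky accumulator so the effective
# step size does not decay all the way to zero.
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```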
Adam: a good default choice
 Kind of like RMSProp with momentum: keeps a first-moment (momentum-like) and a second-moment (RMSProp-like) estimate of the gradient
 With bias correction, since both moment estimates start at zero
 Empirical starting point: beta1 = 0.9, beta2 = 0.999, learning_rate = 1e-3 or 5e-4 (full update sketched below)
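The full Adam update as sketched in the lecture; bias correction compensates for the zero initialization of both moments (num_iterations and the other names are placeholders):

```python
import numpy as np

first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)                                      # placeholder
    first_moment = beta1 * first_moment + (1 - beta1) * dx        # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx # RMSProp-like
    first_unbias = first_moment / (1 - beta1 ** t)                # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```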

Learning rate is a hyperparameter for all of these optimizers
 Good practice: decay the learning rate over time (step, exponential, or 1/t decay; see the sketch below)
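The common decay schedules as one-liners (alpha0 is the initial rate, k a decay constant, t the epoch; the specific numbers are illustrative choices):

```python
import numpy as np

lr = alpha0 * (0.5 ** (t // 10))   # step decay: e.g. halve every 10 epochs
lr = alpha0 * np.exp(-k * t)       # exponential decay
lr = alpha0 / (1 + k * t)          # 1/t decay
```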
First-order optimization
 Use the gradient to form a linear approximation, then step to minimize it
Second-order optimization
 Use gradient and Hessian to form a quadratic approximation: a second-order Taylor expansion (written out after this list)
 The Newton update then needs no hyperparameters and no learning rate, but it is impractical for deep learning: the Hessian has O(N^2) elements and inverting it costs O(N^3)
 Quasi-Newton methods (e.g. BFGS) approximate the inverse Hessian instead
 L-BFGS (limited-memory BFGS) usually works well in full-batch, deterministic settings, but does not transfer well to mini-batch settings
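Written out, the second-order Taylor expansion around the current point and the resulting Newton update (stepping directly to the minimum of the quadratic, hence no learning rate):

$ J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^T H (\theta - \theta_0) $

$ \theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0) $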
Model ensembles
 Instead of training independent models, can use multiple snapshots of a single model taken during training, then average their predictions at test time
Improve single-model performance: regularization
 Common choices: L2, L1, elastic net (L1 + L2)
 Add the penalty as an extra term to the loss (sketch below)
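A minimal sketch of the extra loss term; data_loss, the weights W, and the strengths reg/reg1/reg2 are assumed to be defined:

```python
import numpy as np

loss = data_loss + reg * np.sum(W * W)                              # L2
loss = data_loss + reg * np.sum(np.abs(W))                          # L1
loss = data_loss + reg1 * np.sum(np.abs(W)) + reg2 * np.sum(W * W)  # elastic net
```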
Dropout
 In each forward pass, randomly set some neurons to zero; the drop probability is a hyperparameter, commonly 0.5
 Forces the network to have redundant representations; prevents co-adaptation of features
 Makes the output random, so at test time we want to average out the randomness; in practice this is approximated by scaling (inverted dropout scales at train time instead, as sketched below)
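An inverted-dropout sketch on a toy two-layer ReLU network (W1 and W2 are assumed weights); scaling by 1/p at train time keeps the test-time forward pass unchanged:

```python
import numpy as np

p = 0.5  # probability of keeping a unit (so drop probability is 1 - p)

def train_step(X, W1, W2):
    H1 = np.maximum(0, X.dot(W1))               # ReLU hidden layer
    U1 = (np.random.rand(*H1.shape) < p) / p    # dropout mask, scaled by 1/p
    H1 *= U1                                    # drop (and rescale) units
    return H1.dot(W2)

def predict(X, W1, W2):
    H1 = np.maximum(0, X.dot(W1))               # no randomness at test time;
    return H1.dot(W2)                           # expectation already matches
```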
Regularization: common pattern
 Training: add some randomness; testing: marginalize over the noise (dropout and data augmentation both fit this pattern)
Data augmentation
 Random crops and scales
 Horizontal flips
 Color jitter: randomize contrast and brightness; more complex color transforms are also used (sketch below)
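A minimal sketch on an HxWxC image array; the crop size and jitter ranges are arbitrary choices, not from the lecture:

```python
import numpy as np

def augment(img, crop=224):
    # random horizontal flip
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]
    # random crop: take a crop x crop patch from a larger image
    h, w = img.shape[:2]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop, :]
    # color jitter: random contrast scale and brightness shift
    img = img * np.random.uniform(0.8, 1.2) + np.random.uniform(-10, 10)
    return img
```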
Transfer learning
 Works without a lot of data
 Small dataset: freeze most layers, reinitialize and train only the last layer(s) (PyTorch sketch below)
 Bigger dataset: fine-tune more layers
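A hedged PyTorch sketch of the small-dataset case; the lecture itself is framework-agnostic, and resnet18 with a 10-class head is an arbitrary example:

```python
import torch.nn as nn
import torchvision.models as models

# ImageNet-pretrained backbone (newer torchvision prefers the weights= API)
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False                 # freeze all pretrained layers
model.fc = nn.Linear(model.fc.in_features, 10)  # reinitialize final layer only
# Small dataset: train just model.fc; bigger dataset: unfreeze and
# fine-tune more of the later layers, usually with a lower learning rate.
```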