Lecture 7
SGD problems
- Local minima or saddle points - plain gradient descent gets stuck where the gradient is (near) zero; in high dimensions saddle points are the more common issue
- Gradients come from mini-batches, so they are noisy
- Solution - SGD + Momentum: keep a running velocity and apply friction (rho ~ 0.9); see the sketch below
- Variant - Nesterov momentum: evaluate the gradient at the "look-ahead" point reached after the velocity step, rather than at the current point
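A numpy-style sketch of the SGD + Momentum update, in the spirit of the lecture's pseudocode; `x`, `compute_gradient`, and `num_steps` are placeholders, not defined here.

```python
import numpy as np

learning_rate, rho = 1e-3, 0.9     # rho acts as "friction" on the velocity
vx = np.zeros_like(x)              # velocity, same shape as the parameters

for t in range(num_steps):
    dx = compute_gradient(x)       # gradient from the current mini-batch
    # SGD + Momentum: build up a velocity as a decaying running sum of gradients
    vx = rho * vx - learning_rate * dx
    x += vx
    # Nesterov variant: dx is evaluated at the look-ahead point x + rho * vx;
    # a common rearrangement of that update is
    #   old_vx = vx
    #   vx = rho * vx - learning_rate * dx
    #   x += -rho * old_vx + (1 + rho) * vx
```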
AdaGrad
Adds element-wise scaling of the gradient based on the running sum of its squares (sketch below)
- `grad_squared += dx * dx`, then divide the step by `sqrt(grad_squared)`
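A numpy-style AdaGrad sketch; as above, `x`, `compute_gradient`, and `num_steps` are placeholders.

```python
import numpy as np

grad_squared = np.zeros_like(x)
for t in range(num_steps):
    dx = compute_gradient(x)
    grad_squared += dx * dx        # accumulates forever, so the effective step keeps shrinking
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # 1e-7 avoids division by zero
```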
RMSprop
- leaky version of AdaGrad: `grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx` (sketch below)
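A numpy-style RMSProp sketch, same placeholder conventions; `decay_rate` is typically around 0.9-0.99.

```python
import numpy as np

grad_squared = np.zeros_like(x)
for t in range(num_steps):
    dx = compute_gradient(x)
    # leaky accumulation: old squared gradients decay away instead of piling up
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```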
Adam - good default choice (sketch below)
- kind of like RMSProp with momentum
- with bias correction, since the moment estimates start at zero
- empirical defaults: beta1 = 0.9, beta2 = 0.999, lr = 1e-3 or 5e-4
- the learning rate is still a hyperparameter, whichever optimizer you use
- common to decay the learning rate over time (e.g. step or exponential decay)
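A numpy-style sketch of the full Adam update with bias correction; placeholders as in the earlier sketches.

```python
import numpy as np

beta1, beta2, learning_rate = 0.9, 0.999, 1e-3
first_moment  = np.zeros_like(x)
second_moment = np.zeros_like(x)

for t in range(1, num_steps + 1):          # t starts at 1 for the bias correction
    dx = compute_gradient(x)
    first_moment  = beta1 * first_moment  + (1 - beta1) * dx        # momentum-like term
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp-like term
    # bias correction: both moments are initialized at zero, so rescale early steps
    first_unbias  = first_moment  / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```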
First-order optimization
use the gradient to form a linear approximation of the loss, then step to minimize it
Second-order optimization
use the gradient and Hessian to form a quadratic approximation of the loss
- second-order Taylor expansion; step to the minimum of the approximation (see below)
- no learning rate in the vanilla Newton step, but impractical for deep learning: the Hessian has O(N^2) elements and inverting it costs O(N^3)
- Quasi-Newton methods (e.g. BFGS) approximate the inverse Hessian instead
- L-BFGS (limited-memory BFGS) usually works very well in full-batch, deterministic settings, but does not transfer well to the mini-batch setting
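For reference, the quadratic (second-order Taylor) approximation around the current parameters $\theta_0$ and the resulting Newton update, which has no learning rate:

$$ J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \tfrac{1}{2}(\theta - \theta_0)^\top H (\theta - \theta_0) $$

$$ \theta^{*} = \theta_0 - H^{-1} \nabla_\theta J(\theta_0) $$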
Model ensembles
- train multiple independent models and average their predictions at test time
- cheaper trick: use multiple snapshots of a single model taken during training, instead of training independent models
Improve single-model performance - Regularization
- common choices - L2, L1, elastic net (L1 + L2)
- add a regularization term to the loss (see the sketch below)
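A one-line sketch of the L2 case, assuming `data_loss` and a weight matrix `W` have already been computed; `reg` is the regularization strength.

```python
import numpy as np

reg = 1e-4                                    # regularization strength (hyperparameter)
loss = data_loss + reg * np.sum(W * W)        # L2 penalty; L1 would use reg * np.sum(np.abs(W))
```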
Dropout
In each forward pass, randomly set some neurons to zero; the drop probability is a hyperparameter, commonly 0.5
- forces the network to have a redundant representation; prevents co-adaptation of features
- makes the output random, so at test time we want to average out the randomness by scaling; with "inverted" dropout the scaling is done at training time instead (sketch below)
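A numpy-style sketch of inverted dropout for a small two-layer net; `W1`, `b1`, `W2`, `b2` are assumed to exist.

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(x):
    # forward pass with inverted dropout: drop units and scale by 1/p at train time
    H1 = np.maximum(0, np.dot(W1, x) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # random mask, pre-scaled by 1/p
    H1 *= U1
    return np.dot(W2, H1) + b2

def predict(x):
    # test time: no randomness and no extra scaling needed, thanks to the 1/p above
    H1 = np.maximum(0, np.dot(W1, x) + b1)
    return np.dot(W2, H1) + b2
```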
Regularization - common pattern
- training: add some randomness
- testing: marginalize over the noise, exactly or approximately
Data augmentation
- random crops and scales
- horizontal flips
- color jitter - randomize contrast and brightness; fancier color effects also work (see the sketch below)
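A minimal numpy sketch of these train-time augmentations; the 224x224 crop size and the jitter ranges are illustrative assumptions, and `img` is assumed to be an HxWx3 float array in [0, 1] larger than the crop.

```python
import numpy as np

def augment(img):
    H, W, _ = img.shape
    # random horizontal flip
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]
    # random crop (assumed crop size, e.g. 224 for ImageNet-style nets)
    ch, cw = 224, 224
    y = np.random.randint(0, H - ch + 1)
    x = np.random.randint(0, W - cw + 1)
    img = img[y:y + ch, x:x + cw, :]
    # color jitter: random contrast-like scaling and brightness shift
    img = img * np.random.uniform(0.8, 1.2)
    img = img + np.random.uniform(-0.1, 0.1)
    return np.clip(img, 0.0, 1.0)
```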
Transfer learning
you don't need a lot of data if you start from a model pretrained on a large dataset (e.g. ImageNet)
- with a small dataset, freeze most of the pretrained layers, reinitialize and train only the last layer(s); see the sketch below
- with a bigger dataset, fine-tune more layers, usually with a lower learning rate
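A minimal sketch of the small-dataset case, assuming PyTorch/torchvision (not part of the lecture's own pseudocode); `num_classes` is a placeholder for the new dataset's number of classes.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)        # ImageNet-pretrained backbone

# small dataset: freeze all pretrained weights...
for param in model.parameters():
    param.requires_grad = False

# ...and reinitialize / train only the final classifier layer
model.fc = nn.Linear(model.fc.in_features, num_classes)   # num_classes: placeholder

# bigger dataset: unfreeze (fine-tune) more of the later layers instead,
# typically with a smaller learning rate for the pretrained parts.
```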