Lecture 7
SGD problems
- Local minima or saddle points - plain gradient descent gets stuck where the gradient is (near) zero; in high dimensions saddle points are the more common issue
- Gradients come from mini-batches, so they are noisy
- Solution - SGD + Momentum: keep a running velocity and apply friction (rho ~ 0.9); see the sketch below
- Variant - Nesterov momentum: evaluate the gradient at the "look-ahead" point reached after the velocity step, rather than at the current point
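A numpy-style sketch of the SGD + Momentum update, in the spirit of the lecture's pseudocode; `x`, `compute_gradient`, and `num_steps` are placeholders, not defined here.

```python
import numpy as np

learning_rate, rho = 1e-3, 0.9     # rho acts as "friction" on the velocity
vx = np.zeros_like(x)              # velocity, same shape as the parameters

for t in range(num_steps):
    dx = compute_gradient(x)       # gradient from the current mini-batch
    # SGD + Momentum: build up a velocity as a decaying running sum of gradients
    vx = rho * vx - learning_rate * dx
    x += vx
    # Nesterov variant: dx is evaluated at the look-ahead point x + rho * vx;
    # a common rearrangement of that update is
    #   old_vx = vx
    #   vx = rho * vx - learning_rate * dx
    #   x += -rho * old_vx + (1 + rho) * vx
```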
AdaGrad
Adds element-wise scaling of the gradient based on the running sum of its squares (sketch below)
- `grad_squared += dx * dx`, then divide the step by `sqrt(grad_squared)`
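A numpy-style AdaGrad sketch; as above, `x`, `compute_gradient`, and `num_steps` are placeholders.

```python
import numpy as np

grad_squared = np.zeros_like(x)
for t in range(num_steps):
    dx = compute_gradient(x)
    grad_squared += dx * dx        # accumulates forever, so the effective step keeps shrinking
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # 1e-7 avoids division by zero
```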
RMSprop
- leaky version of AdaGrad: `grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx` (sketch below)
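A numpy-style RMSProp sketch, same placeholder conventions; `decay_rate` is typically around 0.9-0.99.

```python
import numpy as np

grad_squared = np.zeros_like(x)
for t in range(num_steps):
    dx = compute_gradient(x)
    # leaky accumulation: old squared gradients decay away instead of piling up
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```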
Adam - good default choice (sketch below)
- kind of like RMSProp with momentum
- with bias correction, since the moment estimates start at zero
- empirical defaults: beta1 = 0.9, beta2 = 0.999, lr = 1e-3 or 5e-4
- the learning rate is still a hyperparameter, whichever optimizer you use
- common to decay the learning rate over time (e.g. step or exponential decay)
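A numpy-style sketch of the full Adam update with bias correction; placeholders as in the earlier sketches.

```python
import numpy as np

beta1, beta2, learning_rate = 0.9, 0.999, 1e-3
first_moment  = np.zeros_like(x)
second_moment = np.zeros_like(x)

for t in range(1, num_steps + 1):          # t starts at 1 for the bias correction
    dx = compute_gradient(x)
    first_moment  = beta1 * first_moment  + (1 - beta1) * dx        # momentum-like term
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp-like term
    # bias correction: both moments are initialized at zero, so rescale early steps
    first_unbias  = first_moment  / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```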
First-order optimization
use the gradient to form a linear approximation of the loss, then step to minimize it
Second-order optimization
use the gradient and Hessian to form a quadratic approximation of the loss
- second-order Taylor expansion; step to the minimum of the approximation (see below)
- no learning rate in the vanilla Newton step, but impractical for deep learning: the Hessian has O(N^2) elements and inverting it costs O(N^3)
- Quasi-Newton methods (e.g. BFGS) approximate the inverse Hessian instead
- L-BFGS (limited-memory BFGS) usually works very well in full-batch, deterministic settings, but does not transfer well to the mini-batch setting
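For reference, the quadratic (second-order Taylor) approximation around the current parameters $\theta_0$ and the resulting Newton update, which has no learning rate:

$$ J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \tfrac{1}{2}(\theta - \theta_0)^\top H (\theta - \theta_0) $$

$$ \theta^{*} = \theta_0 - H^{-1} \nabla_\theta J(\theta_0) $$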
Model ensembles
- train multiple independent models and average their predictions at test time
- cheaper trick: use multiple snapshots of a single model taken during training, instead of training independent models
Improve single-model performance - Regularization
- common choices - L2, L1, elastic net (L1 + L2)
- add a regularization term to the loss (see the sketch below)
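A one-line sketch of the L2 case, assuming `data_loss` and a weight matrix `W` have already been computed; `reg` is the regularization strength.

```python
import numpy as np

reg = 1e-4                                    # regularization strength (hyperparameter)
loss = data_loss + reg * np.sum(W * W)        # L2 penalty; L1 would use reg * np.sum(np.abs(W))
```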
Dropout
In each forward pass, randomly set some neurons to zero; the drop probability is a hyperparameter, commonly 0.5
- forces the network to have a redundant representation; prevents co-adaptation of features
- makes the output random, so at test time we want to average out the randomness by scaling; with "inverted" dropout the scaling is done at training time instead (sketch below)
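A numpy-style sketch of inverted dropout for a small two-layer net; `W1`, `b1`, `W2`, `b2` are assumed to exist.

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(x):
    # forward pass with inverted dropout: drop units and scale by 1/p at train time
    H1 = np.maximum(0, np.dot(W1, x) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # random mask, pre-scaled by 1/p
    H1 *= U1
    return np.dot(W2, H1) + b2

def predict(x):
    # test time: no randomness and no extra scaling needed, thanks to the 1/p above
    H1 = np.maximum(0, np.dot(W1, x) + b1)
    return np.dot(W2, H1) + b2
```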
Regularization - common pattern
- training: add some randomness
- testing: marginalize over the noise, exactly or approximately
Data augmentation
- random crops and scales
- horizontal flips
- color jitter - randomize contrast and brightness; fancier color effects also work (see the sketch below)
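A minimal numpy sketch of these train-time augmentations; the 224x224 crop size and the jitter ranges are illustrative assumptions, and `img` is assumed to be an HxWx3 float array in [0, 1] larger than the crop.

```python
import numpy as np

def augment(img):
    H, W, _ = img.shape
    # random horizontal flip
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]
    # random crop (assumed crop size, e.g. 224 for ImageNet-style nets)
    ch, cw = 224, 224
    y = np.random.randint(0, H - ch + 1)
    x = np.random.randint(0, W - cw + 1)
    img = img[y:y + ch, x:x + cw, :]
    # color jitter: random contrast-like scaling and brightness shift
    img = img * np.random.uniform(0.8, 1.2)
    img = img + np.random.uniform(-0.1, 0.1)
    return np.clip(img, 0.0, 1.0)
```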
Transfer learning
you don't need a lot of data if you start from a model pretrained on a large dataset (e.g. ImageNet)
- with a small dataset, freeze most of the pretrained layers, reinitialize and train only the last layer(s); see the sketch below
- with a bigger dataset, fine-tune more layers, usually with a lower learning rate
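A minimal sketch of the small-dataset case, assuming PyTorch/torchvision (not part of the lecture's own pseudocode); `num_classes` is a placeholder for the new dataset's number of classes.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)        # ImageNet-pretrained backbone

# small dataset: freeze all pretrained weights...
for param in model.parameters():
    param.requires_grad = False

# ...and reinitialize / train only the final classifier layer
model.fc = nn.Linear(model.fc.in_features, num_classes)   # num_classes: placeholder

# bigger dataset: unfreeze (fine-tune) more of the later layers instead,
# typically with a smaller learning rate for the pretrained parts.
```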