Lecture 7


SGD problem

  • Local minima or saddle points - gradient descent gets stuck; in high dimensions, saddle points are the more common problem
  • Gradients come from mini-batches, so they are noisy
  • Solution - SGD + Momentum: keep a running velocity with friction (rho, typically 0.9 or 0.99) and step along the velocity
  • Variant - Nesterov momentum: evaluate the gradient at the look-ahead point (current position plus velocity) instead of the current position; see the sketch below
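
A minimal NumPy-style sketch of one common formulation of the momentum and (rearranged) Nesterov updates; the names `w`, `dw`, `v`, `rho`, `learning_rate` are illustrative, following the lecture's pseudocode style:

```python
import numpy as np

def sgd_momentum(w, dw, v, learning_rate=1e-2, rho=0.9):
    # v is a running "velocity" of gradients; rho acts as friction
    v = rho * v - learning_rate * dw
    w = w + v
    return w, v

def nesterov_momentum(w, dw, v, learning_rate=1e-2, rho=0.9):
    # gradient is (conceptually) taken at the look-ahead point w + rho * v;
    # this is the common rearranged form of the update
    v_prev = v
    v = rho * v - learning_rate * dw
    w = w - rho * v_prev + (1 + rho) * v
    return w, v
```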

AdaGrad

Adds element-wise scaling of the gradient based on the per-parameter sum of squared gradients

  • `grad_squared += dx * dx`; the step is divided by `sqrt(grad_squared)`, so frequently-updated parameters take smaller steps (full update sketched below)
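
A sketch of the full AdaGrad step as a NumPy function (the 1e-7 term just avoids division by zero; variable names are illustrative):

```python
import numpy as np

def adagrad_update(x, dx, grad_squared, learning_rate=1e-2):
    # accumulate per-parameter sum of squared gradients (never decays)
    grad_squared = grad_squared + dx * dx
    # parameters with large accumulated gradients get smaller effective steps
    x = x - learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
    return x, grad_squared
```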

RMSprop

  • `grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx` - a leaky (decaying) version of AdaGrad's accumulator, so old gradients fade away (sketch below)
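
A NumPy sketch of the RMSProp step, assuming the same illustrative names as above:

```python
import numpy as np

def rmsprop_update(x, dx, grad_squared, learning_rate=1e-3, decay_rate=0.99):
    # leaky accumulation: old squared gradients decay instead of piling up forever
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x = x - learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
    return x, grad_squared
```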

Adam - good default choice

  • kind of like RMSProp with momentum

  • with bias correction

  • empirically good defaults: beta1 = 0.9, beta2 = 0.999, learning rate = 1e-3 or 5e-4 (full update sketched below)

  • learning rate is still a hyperparameter for all of these optimizers

  • learning rate decay over time, e.g. step decay or exponential decay (more common with SGD + Momentum than with Adam)
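
A NumPy sketch of the Adam update with bias correction; `m` and `v` start at zero and `t` is the timestep starting at 1 (names are illustrative):

```python
import numpy as np

def adam_update(x, dx, m, v, t, learning_rate=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dx          # momentum-like first moment
    v = beta2 * v + (1 - beta2) * dx * dx     # RMSProp-like second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction: matters early on,
    v_hat = v / (1 - beta2 ** t)              # when m and v are still near zero
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```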

First order optimization

use the gradient to form a linear approximation of the loss, then step to minimize the approximation

Second order optimization

use the gradient and Hessian to form a quadratic approximation of the loss, then step to the minimum of the approximation

  • second-order Taylor expansion; the Newton step jumps directly to the minimum of the quadratic
  • no hyperparameters and no learning rate in the vanilla Newton step - but impractical for deep learning, since the Hessian has O(N^2) elements and inverting it costs O(N^3)
  • Quasi-Newton methods, e.g. BFGS, build up an approximation to the inverse Hessian instead
  • L-BFGS (limited-memory BFGS) usually works well in full-batch, deterministic settings, but does not transfer well to the mini-batch setting; a toy Newton step is sketched below
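
A toy illustration of a full Newton step on a small quadratic, assuming we can afford to build and solve with the Hessian (exactly what breaks down for deep networks):

```python
import numpy as np

def newton_step(x, grad_fn, hess_fn):
    # x_new = x - H^{-1} g, from the second-order Taylor expansion;
    # no learning rate, but needs O(N^2) memory and O(N^3) time to solve
    g = grad_fn(x)
    H = hess_fn(x)
    return x - np.linalg.solve(H, g)

# example: minimize f(x) = 0.5 * x^T A x - b^T x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = np.zeros(2)
x = newton_step(x, lambda x: A @ x - b, lambda x: A)
# for a quadratic, a single Newton step lands exactly on the minimizer A^{-1} b
```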

Model ensemble

  • instead of training independent models, use multiple snapshots of a single model taken during training and average their predictions at test time (see sketch below)
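
A rough sketch of averaging test-time predictions across saved snapshots; `predict_proba` and the `snapshots` list of parameter sets are hypothetical placeholders for whatever model you trained:

```python
import numpy as np

def ensemble_predict(snapshots, X, predict_proba):
    # average class probabilities over snapshots saved during training,
    # then take the argmax of the averaged distribution
    probs = np.mean([predict_proba(params, X) for params in snapshots], axis=0)
    return np.argmax(probs, axis=1)
```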

Improve single model performance - Regularization

  • Common choices - L2, L1, elastic net (L1 + L2)

add a regularization term to the loss: total loss = data loss + lambda * R(W), where lambda controls the regularization strength
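
For example, with L2 regularization the penalty is the sum of squared weights; a small NumPy sketch with toy values for `W` and `data_loss`:

```python
import numpy as np

W = np.random.randn(10, 3073) * 0.01   # toy weight matrix
data_loss = 2.3                        # stand-in for the average per-example loss
reg = 1e-4                             # regularization strength lambda (hyperparameter)

loss = data_loss + reg * np.sum(W * W)   # L2 penalty
# L1 would use reg * np.sum(np.abs(W)); elastic net combines both terms
```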

dropout

In each forward pass, randomly set some neurons to zero; the drop probability is a hyperparameter, commonly 0.5

  • forces the network to learn a redundant representation and prevents co-adaptation of features
  • makes the output random, so at test time we want to average out the randomness; in practice this is approximated by scaling activations, either at test time or during training (inverted dropout - see sketch below)
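
A sketch of inverted dropout in NumPy: drop units and rescale during training so the test-time forward pass needs no extra scaling (here `p` is the keep probability):

```python
import numpy as np

def dropout_forward_train(x, p=0.5):
    # zero out each unit with probability (1 - p), then divide by p so the
    # expected activation matches the test-time forward pass
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask

def dropout_forward_test(x):
    return x  # test time: use all units, no scaling needed with inverted dropout
```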

Regularization

common pattern

  • training: add some randomness; testing: average out (marginalize over) the noise - dropout, batch normalization, and data augmentation all fit this pattern

data augmentation

  • random crop and scale;
  • flip;
  • color jitter - randomize contrast and brightness; more complex color transforms also work (simple augmentations are sketched below)
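
A minimal NumPy sketch of random crop, horizontal flip, and simple brightness/contrast jitter; it assumes an H x W x 3 image at least `crop_size` pixels on each side, and the jitter ranges are illustrative:

```python
import numpy as np

def augment(image, crop_size=224):
    # random crop
    H, W, _ = image.shape
    top = np.random.randint(0, H - crop_size + 1)
    left = np.random.randint(0, W - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    # random horizontal flip
    if np.random.rand() < 0.5:
        crop = crop[:, ::-1]
    # simple color jitter: random contrast (scale) and brightness (shift)
    crop = crop * np.random.uniform(0.8, 1.2) + np.random.uniform(-10, 10)
    return np.clip(crop, 0, 255)
```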

Transfer learning

you don't need a lot of data if you start from a model pretrained on a large dataset (e.g. ImageNet)

  • with a small dataset - freeze most pretrained layers, reinitialize and train only the last layer(s); see the sketch below
  • with a bigger dataset - fine-tune more layers, typically with a lower learning rate
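
A rough PyTorch-style sketch of the small-dataset case (freeze the pretrained backbone, retrain only the final layer); the ResNet-18 choice and class count are just examples, and newer torchvision versions use `weights=...` instead of `pretrained=True`:

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)   # pretrained on ImageNet

# freeze all pretrained layers
for param in model.parameters():
    param.requires_grad = False

# reinitialize the last layer for the new task (e.g. 10 classes) and train only it
model.fc = nn.Linear(model.fc.in_features, 10)
# optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```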