Lecture 7 - AsyDynamics/CS231n Wiki


SGD problems

Slow progress when the loss is poorly conditioned (steep in some directions, shallow in others), getting stuck at local minima or saddle points, and noisy gradients from mini-batches.

AdaGrad

Adds element-wise scaling of the gradient based on the accumulated sum of squared gradients in each dimension; see the sketch below.
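A minimal numpy sketch of the AdaGrad update on a toy poorly conditioned quadratic loss; the loss, its constants, and the step count are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Toy quadratic loss f(w) = 0.5 * w^T A w; its gradient is A @ w (illustrative assumption).
A = np.diag([10.0, 1.0])        # steep in one direction, shallow in the other
w = np.array([1.0, 1.0])
learning_rate = 0.1

grad_squared = np.zeros_like(w)
for step in range(100):
    dw = A @ w                                   # gradient of the toy loss
    grad_squared += dw * dw                      # per-element sum of squared gradients
    w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)  # element-wise scaled step
```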

RMSprop
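RMSprop replaces AdaGrad's ever-growing sum with a decaying running average of squared gradients, so the effective step size does not shrink to zero over training. A sketch on the same toy loss; names and constants are illustrative assumptions.

```python
import numpy as np

A = np.diag([10.0, 1.0])        # same toy quadratic loss as above
w = np.array([1.0, 1.0])
learning_rate = 0.01
decay_rate = 0.99               # common default for the running average

grad_squared = np.zeros_like(w)
for step in range(100):
    dw = A @ w
    # Decaying average instead of AdaGrad's growing sum keeps the step size from vanishing.
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
```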

Adam - roughly RMSprop with momentum plus bias correction; a good default choice.
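A sketch of the Adam update on the same toy loss: a first moment (momentum-like), a second moment (RMSprop-like), and bias correction for their zero initialization. Hyperparameter values shown are common defaults; the toy loss is an illustrative assumption.

```python
import numpy as np

A = np.diag([10.0, 1.0])
w = np.array([1.0, 1.0])
learning_rate = 1e-3
beta1, beta2, eps = 0.9, 0.999, 1e-8

m = np.zeros_like(w)            # first moment: running mean of gradients
v = np.zeros_like(w)            # second moment: running mean of squared gradients
for t in range(1, 101):
    dw = A @ w
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw * dw
    m_hat = m / (1 - beta1 ** t)        # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
```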

First order optimization

Use the gradient to form a linear approximation of the loss, then step to minimize that approximation.

Second order optimization

Use the gradient and Hessian to form a quadratic approximation of the loss, then step to its minimum (Newton's method); see the comparison below.
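For comparison, the first-order step and the second-order (Newton) step that follows from the quadratic Taylor approximation of the loss L around the current weights:

```latex
% First-order: step along the negative gradient of the linear approximation
w_{t+1} = w_t - \alpha \, \nabla_w L(w_t)

% Second-order: quadratic Taylor approximation around w_t (H is the Hessian)
L(w) \approx L(w_t) + \nabla_w L(w_t)^\top (w - w_t)
            + \tfrac{1}{2}\,(w - w_t)^\top H \,(w - w_t)

% Setting its gradient to zero gives the Newton update (no learning rate in the pure form)
w_{t+1} = w_t - H^{-1}\,\nabla_w L(w_t)
```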

Model ensemble

Train several independent models and average their predictions at test time.

Improve single model performance - Regularization

Add a regularization term to the loss, e.g., L2 weight decay; see the example below.
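A minimal L2 (weight decay) example; `data_loss`, `W`, and the strength `reg` are illustrative placeholder names, not from the notes.

```python
import numpy as np

W = np.random.randn(10, 5) * 0.01   # illustrative weight matrix
data_loss = 1.0                      # placeholder value for the data term of the loss
reg = 1e-4                           # regularization strength (hyperparameter)

# L2 regularization: penalize large weights by adding their squared norm to the loss.
loss = data_loss + reg * np.sum(W * W)
```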

Dropout

In each forward pass, randomly set some neurons to zero; the drop probability is a hyperparameter, commonly 0.5. At test time the randomness is removed (with inverted dropout, activations are rescaled during training instead); see the sketch below.
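A minimal inverted-dropout sketch in numpy; the activation array and keep probability `p` are illustrative assumptions.

```python
import numpy as np

p = 0.5                              # probability of keeping a neuron (0.5 is common)
x = np.random.randn(4, 100)          # illustrative activations for a batch of 4

# Training: randomly zero neurons and scale by 1/p so the expected activation is unchanged.
mask = (np.random.rand(*x.shape) < p) / p
x_train = x * mask

# Test: no randomness; activations are used as-is (inverted dropout needs no test-time rescaling).
x_test = x
```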

Regularization

Common pattern: add randomness during training, then average over (approximately marginalize out) the randomness at test time.

Data augmentation

Randomly transform training images in label-preserving ways (horizontal flips, random crops and scales, color jitter); see the sketch below.
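A sketch of two common augmentations (random horizontal flip and random crop with padding) applied to a single image array; the image size and padding are illustrative assumptions.

```python
import numpy as np

image = np.random.rand(32, 32, 3)    # illustrative 32x32 RGB image (H, W, C)

# Random horizontal flip with probability 0.5
if np.random.rand() < 0.5:
    image = image[:, ::-1, :]

# Random crop: pad by 4 pixels on each side, then take a random 32x32 window
padded = np.pad(image, ((4, 4), (4, 4), (0, 0)), mode="reflect")
top = np.random.randint(0, 9)
left = np.random.randint(0, 9)
image = padded[top:top + 32, left:left + 32, :]
```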

Transfer learning

Does not need a lot of data for the new task: start from a model pretrained on a large dataset (e.g., ImageNet), replace the last layer, and fine-tune only as much as the new dataset allows; see the sketch below.
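A hedged sketch of the usual recipe using torchvision (the notes do not name a framework, so this choice, the 10-class output, and the optimizer settings are assumptions); dataset and training-loop details are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet and freeze its parameters.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with one sized for the new task (e.g., 10 classes).
model.fc = nn.Linear(model.fc.in_features, 10)

# Optimize only the new layer's parameters; unfreeze more layers if more data is available.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```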