# Lecture 6

## Other notes

### Activation function

• The function of an artificial neuron: calculate a weighted sum of its inputs, add a bias, and then decide whether it should "fire", $Y = Wx + b$
• The value of $Y$ can be anywhere from $-\infty$ to $+\infty$, so the neuron has no bound for deciding when to fire
• The activation function is introduced for this purpose
• There are multiple types of activation functions: step, sigmoid, linear, tanh, and ReLU ($\max(0, x)$ - in other words, thresholded at zero)
• Sigmoid - saturated neurons kill the gradients; exp() is expensive to compute; output is not zero-centered
• ReLU - units may die (zero gradient once the input is always negative)! People tend to initialize with slightly positive biases; not zero-centered
• Leaky ReLU - does not saturate and does not die: $\max(\alpha x, x)$ (these are sketched in code after this list)
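
A minimal numpy sketch of the activations above; the Leaky ReLU slope `alpha=0.01` is an assumed, commonly used default:

```python
import numpy as np

def sigmoid(x):
    # Squashes input into (0, 1); saturates for large |x|, killing gradients.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Thresholded at zero: max(0, x).
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for x < 0 keeps the unit from dying.
    return np.maximum(alpha * x, x)

x = np.linspace(-3, 3, 7)
print(sigmoid(x))      # in (0, 1), not zero-centered
print(np.tanh(x))      # in (-1, 1), zero-centered but still saturates
print(relu(x))         # zeros for x < 0
print(leaky_relu(x))   # small negative values instead of zeros
```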

## Notes 1

### Single neuron as linear classifier

• Binary Softmax classifier - interpret $\sigma(w^\top x + b)$ as $P(y = 1 \mid x)$ and train with the cross-entropy loss, i.e., logistic regression (sketched in code below)
• Binary SVM - attach a max-margin hinge loss to the neuron's output instead
• Regularization interpretation - the regularization loss pulls every weight toward zero after each update, which can be viewed as gradual forgetting
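
A sketch of the binary Softmax case: a single sigmoid neuron trained with the cross-entropy loss is exactly logistic regression. The toy data and hyperparameters below are made up for illustration:

```python
import numpy as np

np.random.seed(0)
N, D = 100, 3
X = np.random.randn(N, D)                        # hypothetical inputs
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # hypothetical binary labels

w, b, lr = np.zeros(D), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # neuron output = P(y=1 | x)
    dz = p - y                              # gradient of cross-entropy w.r.t. the score
    w -= lr * X.T @ dz / N
    b -= lr * dz.mean()

print("train accuracy:", ((p > 0.5) == y).mean())
```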

## Lecture 6

### Data preprocessing

• step 1: preprocess the data - zero-center or normalize it; PCA and whitening are further options (see the sketch below)
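
A minimal numpy sketch of these steps, assuming a hypothetical data matrix `X` of shape [N, D] and small epsilon fudge factors:

```python
import numpy as np

X = np.random.randn(50, 10) * 3.0 + 7.0   # hypothetical data, [N, D]

X = X - X.mean(axis=0)                    # zero-center each dimension
X = X / (X.std(axis=0) + 1e-8)            # normalize each dimension

cov = X.T @ X / X.shape[0]                # covariance matrix, [D, D]
U, S, _ = np.linalg.svd(cov)
Xrot = X @ U                              # PCA: decorrelate the data
Xwhite = Xrot / np.sqrt(S + 1e-5)         # whitening: unit variance everywhere
```

For images, usually only zero-centering is applied in practice; PCA and whitening are uncommon.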

### Weight initialization

• Small random weights may work fine for small neural networks, but as the network gets larger or deeper, the activations shrink toward zero layer by layer - and the gradients with them; the experiment below illustrates this
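
A small experiment in the spirit of the lecture's demo, with assumed layer sizes, showing the collapse:

```python
import numpy as np

np.random.seed(1)
x = np.random.randn(1000, 500)            # hypothetical input batch
for layer in range(10):
    W = np.random.randn(500, 500) * 0.01  # small random weights
    x = np.tanh(x @ W)
    print(f"layer {layer}: activation std = {x.std():.6f}")
# The std shrinks toward zero with depth, and since backprop multiplies by
# these activations, the gradients vanish too. Scaling instead by
# np.sqrt(1.0 / 500), i.e. 1/sqrt(fan_in) (Xavier initialization),
# keeps the std roughly stable across layers.
```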

### Batch normalization

Make the activations unit Gaussian.

• compute empirical mean and variance for each dimension
• normalize
• insert after fully connected/convolutional layer, before nonlinearity
• the network can then scale and shift the normalized values, $y = \gamma \hat{x} + \beta$, and can even recover the original range if that is optimal
• benefits: allows higher learning rates; improves gradient flow; reduces dependence on initialization; acts as regularization
• in practice, at test time the mean and std are not computed from the batch; fixed empirical values (e.g., running averages from training) are used instead (see the sketch after this list)
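
A training-time forward pass sketch; `eps` is an assumed numerical fudge factor, and at test time `mu`/`var` would be replaced by the fixed running averages mentioned above:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: [N, D] outputs of a fully connected layer, before the nonlinearity
    mu = x.mean(axis=0)                     # empirical mean per dimension
    var = x.var(axis=0)                     # empirical variance per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to roughly unit Gaussian
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.randn(32, 100) * 4.0 + 2.0    # hypothetical pre-activation batch
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())                    # approximately 0 and 1
```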

### Babysitting the learning process

1. Preprocess the data
• zero-center; normalize
2. Choose the architecture
• disable regularization and check the initial loss; for 10 classes, a softmax loss around $-\ln(1/10) \approx 2.3$ is okay (see the check after this list)
• enable regularization; the loss should go up slightly, since the regularization term is added - that is expected
• start training on a small portion of the data with regularization disabled and simple vanilla SGD; make sure you can overfit it
• then enable a small regularization and find the learning rate that makes the loss go down
• conclusion: if the loss does not go down, the learning rate is too low; if the loss explodes, the learning rate is too high
• a rough range to search for the learning rate is 1e-5 to 1e-3 (a common starting range, not universal for every problem)
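
The arithmetic behind the "2.3 is okay for 10 classes" check: with random weights, the softmax assigns roughly uniform probability to each class, so the expected initial loss is:

```python
import numpy as np

num_classes = 10
expected_loss = -np.log(1.0 / num_classes)  # cross-entropy of a uniform softmax
print(expected_loss)                        # ~2.30 for 10 classes
```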

### Hyperparameter searching

cross-validation strategy

• first, run only a few epochs to get a rough idea of which hyperparameters work
• second, run longer with a finer search; prefer random search over grid search, since random sampling covers more distinct values of each hyperparameter (see the sketch after this list)
• monitor and visualize the loss curve; a big gap between training and validation accuracy means overfitting
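
A sketch of the coarse random-search stage, sampling the learning rate and regularization strength on a log scale; the ranges below are assumed for illustration, not prescriptive:

```python
import numpy as np

for trial in range(10):
    lr = 10 ** np.random.uniform(-6, -3)    # learning rate, log-uniform
    reg = 10 ** np.random.uniform(-5, 5)    # regularization strength, log-uniform
    print(f"trial {trial}: lr={lr:.2e}, reg={reg:.2e}")
    # ... train for a few epochs with (lr, reg), record validation accuracy ...
```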