Deep_Learning
========================================================================
- Vanishing/Exploding Gradients: in very deep networks, gradients can grow or shrink exponentially, especially when weights are initialized poorly or the learning rate is too high. Common remedies (see the sketch after this list):
  - Initialization Alternatives: Xavier and He
    - Xavier initialization: for sigmoid or tanh activations
    - He initialization: preserves a variance of $2/n$ in its weights, which is empirically good for ReLU
  - Gradient Clipping
  - Normalization Layers: use batch normalization or layer normalization
    - Layer normalization: normalizes across the features within each sample
    - Batch normalization: normalizes each channel across the batch dimension
  - Source: Vanishing Gradient Is More Visible In Shallow Layers
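A minimal PyTorch sketch of these remedies (He/Xavier initialization, batch/layer normalization, and gradient clipping); `SmallNet` and all of its shapes are made up only for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical toy module, only to show where each remedy plugs in.
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(16)        # batch norm: per-channel stats over the batch
        self.fc = nn.Linear(16 * 8 * 8, 10)
        self.ln = nn.LayerNorm(10)          # layer norm: stats over each sample's features
        # He (Kaiming) init: keeps a variance of ~2/n, good before ReLU
        nn.init.kaiming_normal_(self.conv.weight, nonlinearity="relu")
        # Xavier init: suited to sigmoid/tanh-style layers
        nn.init.xavier_uniform_(self.fc.weight)

    def forward(self, x):
        x = torch.relu(self.bn(self.conv(x)))
        return self.ln(self.fc(x.flatten(1)))

model = SmallNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(4, 3, 8, 8)).sum()
loss.backward()
# Gradient clipping: cap the global gradient norm before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```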
- A really helpful way to debug a network is gradient checking. ONLY TO CHECK, never during training (see the sketch after this list)!!
  - Do a forward and backward pass. Get the analytic gradient $d = \frac{\partial J}{\partial w}$ and the total cost $J$.
  - Perturb one parameter $W^L_i$ by $\epsilon$, do another forward pass, and get the cost $J'$.
  - Calculate the numerical gradient $d' = (J' - J)/\epsilon$ and compare:
    $$ \frac{|d - d'|}{|d| + |d'|} $$
  - If this ratio is $< 10^{-7}$, great! If it is larger than $10^{-3}$, something is wrong.
  - Note:
    - Run the check again after some training, because a random initialization might happen to pass it.
    - Include the regularization terms in the cost $J$ when checking.
    - Doesn't work with dropout (turn dropout off while checking).
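A minimal numpy sketch of this recipe on a made-up least-squares cost; it uses the one-sided difference from the notes (a centered difference $(J(W+\epsilon) - J(W-\epsilon))/2\epsilon$ is even more accurate):

```python
import numpy as np

# Gradient check on a toy least-squares cost; data and shapes are made up.
np.random.seed(0)
X = np.random.randn(5, 3)
y = np.random.randn(5)
W = np.random.randn(3)

def cost(W):
    return 0.5 * np.mean((X @ W - y) ** 2)

d = X.T @ (X @ W - y) / len(y)   # analytic gradient dJ/dW

eps = 1e-6
for i in range(len(W)):
    W_pert = W.copy()
    W_pert[i] += eps                          # perturb one parameter by epsilon
    d_num = (cost(W_pert) - cost(W)) / eps    # numerical gradient d'
    ratio = abs(d[i] - d_num) / (abs(d[i]) + abs(d_num))
    print(f"w[{i}]: |d - d'| / (|d| + |d'|) = {ratio:.2e}")  # should be very small
```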
- Normalize data: shift to zero mean, and optionally scale to unit variance. If your input features come from sources with very different scales, min-max scaling can also help (see the sketch below).
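A short numpy sketch of both options on made-up data; the statistics should be computed on the training split and reused for validation/test:

```python
import numpy as np

# X is made-up training data whose features live on very different scales.
np.random.seed(0)
X = np.random.rand(100, 3) * np.array([1.0, 50.0, 1000.0])

# Standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

# Min-max scaling: rescale each feature to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-8)
```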
Impl
========================================================================
- Convolution Layer
- ReLU (sketch below):
  - Introduces non-linearity, so the network can learn non-linear features
  - Gating: negative values do not pass through (they become 0), which adds sparsity and reduces downstream multiplications
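A minimal numpy sketch of a ReLU layer (forward and backward), using a simple layer-object convention made up for illustration:

```python
import numpy as np

# Negative activations are zeroed out, and no gradient flows through them.
class ReLU:
    def forward(self, x):
        self.mask = x > 0            # remember where the input was positive
        return x * self.mask         # negatives become 0 (sparsity)

    def backward(self, grad_out):
        return grad_out * self.mask  # gradient passes only where input was > 0

relu = ReLU()
x = np.array([[-1.0, 2.0], [3.0, -4.0]])
print(relu.forward(x))                  # [[0. 2.] [3. 0.]]
print(relu.backward(np.ones_like(x)))   # [[0. 1.] [1. 0.]]
```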