Vanishing / Exploding Gradient
Vanishing Gradient Problem
The vanishing gradient problem is a difficulty encountered when training certain artificial neural networks (ANNs) with gradient-based methods. Gradient-based methods learn a parameter's value by measuring how a small change in that parameter affects the network's output. If a change in the parameter's value produces only a very small change in the network's output, the network cannot learn that parameter effectively. This is exactly what happens in the vanishing gradient problem: the gradients of the network's output with respect to the parameters in the early layers become extremely small.

Whether the problem occurs depends largely on the choice of activation function. Many common activation functions (e.g. sigmoid or tanh) 'squash' their input into a very small output range in a very non-linear fashion. The effect compounds when multiple layers of such non-linearities are stacked on top of each other: the first layer maps a large input region to a smaller output region, the second layer maps that to an even smaller region, the third layer shrinks it further, and so on. As a result, even a large change in the parameters of the first layer barely changes the output.

The problem can be avoided by using activation functions that do not 'squash' the input space into a small region. A popular choice is the Rectified Linear Unit (ReLU), which maps x to max(0, x).

(Source: https://www.quora.com/What-is-the-vanishing-gradient-problem)
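As a rough illustration of the idea above, here is a minimal NumPy sketch (not from the original source; the function names such as `first_layer_gradient` and the choice of 20 layers and scalar weights are just illustrative assumptions). It chains the local derivatives of a deep stack of scalar layers and compares the resulting gradient for sigmoid versus ReLU activations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return float(x > 0)           # 1 for positive inputs, 0 otherwise

def first_layer_gradient(activation, d_activation, n_layers=20, w=1.0, x=0.5):
    """Forward a scalar input through n_layers of z = w*h -> activation,
    then multiply the local derivatives (chain rule) to get d(output)/d(input)."""
    pre_activations = []
    h = x
    for _ in range(n_layers):
        z = w * h
        pre_activations.append(z)
        h = activation(z)
    grad = 1.0
    for z in reversed(pre_activations):
        grad *= d_activation(z) * w
    return grad

print("sigmoid gradient reaching layer 1:",
      first_layer_gradient(sigmoid, d_sigmoid))   # on the order of 1e-13
print("ReLU gradient reaching layer 1:   ",
      first_layer_gradient(relu, d_relu))         # stays at 1.0
```

With sigmoid, every factor in the chain is at most 0.25, so the product collapses toward zero after a few layers; with ReLU the local derivative is 1 on the active region, so the gradient is not squashed on its way back to the first layer.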