11. Gradient Descent

  • Problem:
    when the dataset is very large, computing the gradient over all examples in every iteration, as in the normal gradient descent used so far, becomes too expensive, so several more scalable variants are introduced below

Variants of gradient descent

Batch gradient descent (the original one)

  • uses all m training examples to compute the gradient in each iteration (see the sketch below)
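A minimal sketch of this update for linear regression; the NumPy implementation, the function name batch_gradient_descent, and the default values of alpha and num_iters are illustrative assumptions, not from the wiki:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Every update uses all m training examples (batch gradient descent)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # gradient of the squared-error cost, averaged over the whole training set
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta
```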

Stochastic gradient descent

  • uses one randomly chosen example per update; the path toward the optimum point/region is circuitous, but each update is much cheaper than in batch gradient descent (see the sketch after this list)
    Tip: the number of passes of the outer loop over the (shuffled) dataset depends on the size of the dataset and is typically about 1-10
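A corresponding sketch, again assuming a linear-regression cost; the shuffling via np.random.permutation, the function name, and the hyperparameter defaults are illustrative:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
    """Each update uses a single example, taken in random order."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):           # outer loop: roughly 1-10 passes
        for i in np.random.permutation(m):  # shuffle the dataset first
            # gradient from one example only -> cheap but noisy update
            gradient = (X[i] @ theta - y[i]) * X[i]
            theta -= alpha * gradient
    return theta
```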

Mini-Batch gradient descent

  • uses b (the mini-batch size) examples in each iteration (see the sketch after this list)
  • contrast with Stochastic gradient descent:
    1. Vectorization: mini-batch gradient descent is likely to outperform stochastic gradient descent only with a good vectorized implementation, because the gradient computation over the b examples can then be parallelized
    2. extra time is needed to choose the additional parameter b (the mini-batch size)
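A sketch of the same idea with a vectorized gradient over each mini-batch; b = 10 and the other defaults are illustrative assumptions:

```python
import numpy as np

def mini_batch_gradient_descent(X, y, b=10, alpha=0.01, num_epochs=10):
    """Each update uses b examples, with the gradient computed in a vectorized way."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        order = np.random.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]
            Xb, yb = X[idx], y[idx]
            # vectorized gradient over the b examples in the current mini-batch
            gradient = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= alpha * gradient
    return theta
```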

Check for convergence

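The wiki's convergence plots are not reproduced here. As an assumed illustration of the usual approach for stochastic gradient descent, the sketch below records the cost on each example just before its update and averages it over a sliding window, so the resulting curve can be plotted against the number of iterations; the window size of 1000 and all names are illustrative:

```python
import numpy as np

def sgd_with_convergence_check(X, y, alpha=0.01, num_epochs=5, window=1000):
    """Track the per-example cost, averaged over the last `window` examples."""
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs, averaged_costs = [], []
    for _ in range(num_epochs):
        for i in np.random.permutation(m):
            error = X[i] @ theta - y[i]
            recent_costs.append(0.5 * error ** 2)  # cost on this example, before updating theta
            theta -= alpha * error * X[i]
            if len(recent_costs) == window:
                # a slowly decreasing averaged cost suggests convergence;
                # a rising curve suggests the learning rate alpha is too large
                averaged_costs.append(np.mean(recent_costs))
                recent_costs = []
    return theta, averaged_costs
```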

Map-reduce and data parallelism

  • Example:
    split the m training examples across several machines; each machine computes the partial sum of the gradient over its own subset in parallel, and a central server adds the partial sums together to complete one gradient descent update (a sketch follows below)
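A rough sketch of the map-reduce idea on a single machine, using Python's multiprocessing module as a stand-in for separate computers; the function names, chunking via np.array_split, and the worker count are illustrative assumptions:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: one worker computes the gradient sum over its slice of the data."""
    X_part, y_part, theta = args
    return X_part.T @ (X_part @ theta - y_part)

def map_reduce_gradient_step(X, y, theta, alpha=0.01, num_workers=4):
    """Reduce step: add the partial sums and perform one batch gradient descent update."""
    X_chunks = np.array_split(X, num_workers)
    y_chunks = np.array_split(y, num_workers)
    with Pool(num_workers) as pool:
        partials = pool.map(partial_gradient,
                            [(Xc, yc, theta) for Xc, yc in zip(X_chunks, y_chunks)])
    gradient = sum(partials) / len(y)
    return theta - alpha * gradient
```

On a real cluster each chunk would live on a different machine, and only the partial sums would be sent back to the central server that combines them.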