11. Gradient Descent

  • Problem:
    when the dataset is very large, computing the gradient over all examples in every iteration, as in the normal gradient descent used so far, becomes too expensive, so several more scalable variants are introduced below

Variants of gradient descent

Batch gradient descent (the original one)

  • uses all m training examples to compute the gradient in each iteration (see the sketch below)
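A minimal sketch of this update for linear regression; the NumPy implementation, the function name batch_gradient_descent, and the default values of alpha and num_iters are illustrative assumptions, not from the wiki:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Every update uses all m training examples (batch gradient descent)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # gradient of the squared-error cost, averaged over the whole training set
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta
```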

Stochastic gradient descent

  • uses one randomly chosen example per update; the path toward the optimum point/region is circuitous, but each update is much cheaper than in batch gradient descent (see the sketch after this list)
    Tip: the number of passes of the outer loop over the (shuffled) dataset depends on the size of the dataset and is typically about 1-10
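A corresponding sketch, again assuming a linear-regression cost; the shuffling via np.random.permutation, the function name, and the hyperparameter defaults are illustrative:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
    """Each update uses a single example, taken in random order."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):           # outer loop: roughly 1-10 passes
        for i in np.random.permutation(m):  # shuffle the dataset first
            # gradient from one example only -> cheap but noisy update
            gradient = (X[i] @ theta - y[i]) * X[i]
            theta -= alpha * gradient
    return theta
```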

Mini-Batch gradient descent

  • uses b (the mini-batch size) examples in each iteration (see the sketch after this list)
  • contrast with Stochastic gradient descent:
    1. Vectorization: mini-batch gradient descent is likely to outperform stochastic gradient descent only with a good vectorized implementation, because the gradient computation over the b examples can then be parallelized
    2. extra time is needed to choose the additional parameter b (the mini-batch size)
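A sketch of the same idea with a vectorized gradient over each mini-batch; b = 10 and the other defaults are illustrative assumptions:

```python
import numpy as np

def mini_batch_gradient_descent(X, y, b=10, alpha=0.01, num_epochs=10):
    """Each update uses b examples, with the gradient computed in a vectorized way."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        order = np.random.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]
            Xb, yb = X[idx], y[idx]
            # vectorized gradient over the b examples in the current mini-batch
            gradient = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= alpha * gradient
    return theta
```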

Check for convergence

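The wiki's convergence plots are not reproduced here. As an assumed illustration of the usual approach for stochastic gradient descent, the sketch below records the cost on each example just before its update and averages it over a sliding window, so the resulting curve can be plotted against the number of iterations; the window size of 1000 and all names are illustrative:

```python
import numpy as np

def sgd_with_convergence_check(X, y, alpha=0.01, num_epochs=5, window=1000):
    """Track the per-example cost, averaged over the last `window` examples."""
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs, averaged_costs = [], []
    for _ in range(num_epochs):
        for i in np.random.permutation(m):
            error = X[i] @ theta - y[i]
            recent_costs.append(0.5 * error ** 2)  # cost on this example, before updating theta
            theta -= alpha * error * X[i]
            if len(recent_costs) == window:
                # a slowly decreasing averaged cost suggests convergence;
                # a rising curve suggests the learning rate alpha is too large
                averaged_costs.append(np.mean(recent_costs))
                recent_costs = []
    return theta, averaged_costs
```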

Map-reduce and data parallelism

  • Example:
    split the m training examples across several machines; each machine computes the partial sum of the gradient over its own subset in parallel, and a central server adds the partial sums together to complete one gradient descent update (a sketch follows below)
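A rough sketch of the map-reduce idea on a single machine, using Python's multiprocessing module as a stand-in for separate computers; the function names, chunking via np.array_split, and the worker count are illustrative assumptions:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: one worker computes the gradient sum over its slice of the data."""
    X_part, y_part, theta = args
    return X_part.T @ (X_part @ theta - y_part)

def map_reduce_gradient_step(X, y, theta, alpha=0.01, num_workers=4):
    """Reduce step: add the partial sums and perform one batch gradient descent update."""
    X_chunks = np.array_split(X, num_workers)
    y_chunks = np.array_split(y, num_workers)
    with Pool(num_workers) as pool:
        partials = pool.map(partial_gradient,
                            [(Xc, yc, theta) for Xc, yc in zip(X_chunks, y_chunks)])
    gradient = sum(partials) / len(y)
    return theta - alpha * gradient
```

On a real cluster each chunk would live on a different machine, and only the partial sums would be sent back to the central server that combines them.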