Week 10 - premgane/coursera-machine-learning GitHub Wiki
Gradient Descent with Large Datasets
Learning with large datasets
In the last decade or so, the growing amount of training data has driven a lot of ML progress: a low-bias learning algorithm plus a large amount of data tends to give good results.
But large datasets bring their own computational problems.
E.g., with 100 million data points, each step of linear regression gradient descent means summing over all 100 million points. One good sanity check: randomly sample 1000 examples, train on those, and use learning curves (explained earlier in the course) to see whether training on more data would actually help.
Stochastic gradient descent
Let's use linear regression as the example, but this should work with all algorithms that use gradient descent to get to a global minimum of a cost function.
Using batch gradient descent, we need to look at all training examples at each step and calculate the partial derivative and make a decision about where to step to. (This is what we've been doing in this course).
For stochastic gradient descent, we'll redefine the cost at a single point as half the squared error on that example: cost(theta, (x^(i), y^(i))) = (1/2)(h_theta(x^(i)) - y^(i))^2. Then J_train(theta) is the average of this cost over all m points.
Stochastic gradient descent algorithm:
- Randomly shuffle dataset
- Repeat (maybe up to 10 times if it's a small dataset): for each training example, update the parameters to fit this training example better.
Rather than making a pass through all training examples before taking a step, we take a step for each training example.
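The algorithm above can be sketched in a few lines of numpy. This is an illustrative sketch, not the course's Octave code; the function name, alpha, and epoch count are my own choices:

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, epochs=10, seed=0):
    """Stochastic gradient descent for linear regression:
    one parameter update per training example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)          # randomly shuffle the dataset
        for i in order:
            error = X[i] @ theta - y[i]     # h_theta(x_i) - y_i
            theta -= alpha * error * X[i]   # step toward fitting this one example
    return theta
```

The shuffle matters: if examples arrive in a sorted or otherwise structured order, the updates can drift systematically instead of averaging out.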
Batch gradient descent takes a relatively direct path toward the global minimum (if the cost function is convex). Stochastic gradient descent takes a circuitous path toward the region close to the global minimum; it won't truly converge, but wandering near the minimum is good enough, and it's much faster.
If it's a large dataset (e.g., 100 million examples), just one pass through the training data might be enough to train the whole model. In batch gradient descent, you need one pass per descent step.
Mini-batch gradient descent
Batch gradient descent uses all m examples in each iteration. Stochastic gradient descent uses 1 example in each iteration.
Mini-batch gradient descent uses b examples in each iteration, where b is typically somewhere in [2, 100]. So you average the gradient over b examples to find the direction of each descent step, which is a faster way of working through the whole dataset.
Mini-batch has an advantage over stochastic gradient descent when you can parallelize the summation over the b examples using a vectorized implementation.
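A sketch of the mini-batch variant under the same assumptions as above (names and hyperparameters are illustrative). The vectorized `Xb.T @ (...)` line is where the per-step parallelism comes from:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, b=10, epochs=50, seed=0):
    """Mini-batch gradient descent: average the gradient over b examples
    per step, using one vectorized matrix product per batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]
            Xb, yb = X[idx], y[idx]
            # vectorized average gradient over the mini-batch
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= alpha * grad
    return theta
```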
Stochastic gradient descent convergence
How do you make sure stochastic gradient descent is converging? Also, how do you pick step size alpha?
During learning, every 1000 iterations, plot the average of the per-example cost (half the squared error, computed just before each update) over the last 1000 examples. This curve should drop off and then plateau; once it plateaus, the algorithm has probably converged.
If you average over a larger number of examples, you get a data point less often (e.g., once every 5k examples), but a smoother line. With a small averaging window, big oscillations in the curve can mask the fact that the cost isn't actually decreasing.
If you really want to converge to the global minimum, you can slowly decrease learning rate alpha over time, e.g., alpha = const1 / (iterationNumber + const2). The two constants also need to be fiddled with, so it's more work for you. This is usually not worth it, though.
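Both ideas, windowed cost averaging for the convergence plot and a slowly decaying alpha, can be folded into the SGD loop. A sketch with illustrative constants (`c1`, `c2`, and the window size are assumptions, not lecture values):

```python
import numpy as np

def sgd_with_monitoring(X, y, c1=1.0, c2=100.0, window=1000, epochs=5, seed=0):
    """SGD that records the average per-example cost over each window
    (for the convergence plot) and decays alpha as c1 / (iteration + c2)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    costs, avg_costs, it = [], [], 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            err = X[i] @ theta - y[i]
            costs.append(0.5 * err ** 2)   # cost measured BEFORE the update
            alpha = c1 / (it + c2)         # slowly decreasing learning rate
            theta -= alpha * err * X[i]
            it += 1
            if len(costs) == window:       # one plot point per window
                avg_costs.append(np.mean(costs))
                costs = []
    return theta, avg_costs
```

Plotting `avg_costs` gives exactly the dropoff-then-plateau curve described above.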
Advanced Topics
Online learning
Allows us to model problems where we have a continuous stream of data coming in.
E.g. we're doing logistic regression based on incoming customer data.
For each example that comes in, learn from it to update theta, then throw away the data.
Online learning can adapt to changing user preferences.
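For the logistic regression example, the per-example update looks just like one step of stochastic gradient descent; the difference is that each example is seen once and then discarded. A minimal sketch (function names and alpha are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(theta, x, y, alpha=0.1):
    """Learn from one incoming example (logistic regression),
    then the caller can throw the example away."""
    error = sigmoid(x @ theta) - y        # prediction error on this example
    return theta - alpha * error * x      # single gradient step
```

Because theta keeps moving with the stream, the model tracks whatever the current data looks like, which is how it adapts to changing user preferences.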
Map-reduce and data parallelism
We can take advantage of multiple machines to do ML, e.g., with map-reduce, which is arguably as important as stochastic gradient descent for large-scale problems.
Map-reduce
If we have 4 machines, each machine does batch gradient descent's summation on 25% of the training data. When all four finish (in parallel), add up the four partial sums to get the full gradient and take a descent step to update theta. This gives a theoretical 4x speedup (ignoring network latency).
Map-reduce is important whenever you need to do a sum over the whole training set. If that's the bottleneck of a training algorithm, then use map-reduce.
It's not necessary to use multiple machines; multi-core machines work too. If you have a parallelizable numerical linear algebra library and a vectorized implementation of your learning algorithm, you may not even need map-reduce on a multi-core machine.
In neural networks, each map-reduce machine runs forward-prop and back-prop on its 1/n share of the data, and you add up the partial results to get the gradient.
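For the linear regression case, the map and reduce phases are easy to sketch. Because the batch gradient is a plain sum over examples, summing the shard gradients reproduces the single-machine step exactly (the splitting into 4 shards mirrors the 4-machine example above; function names are mine):

```python
import numpy as np

def partial_gradient(theta, X_part, y_part):
    """'Map' phase: each machine computes the gradient sum over its shard."""
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_step(theta, shards, alpha, m):
    """'Reduce' phase: add up the partial sums, then take one descent step.
    shards is a list of (X_part, y_part) pairs covering all m examples."""
    total = sum(partial_gradient(theta, Xp, yp) for Xp, yp in shards)
    return theta - alpha * total / m
```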
Photo OCR
We'll use a real-world ML application to demonstrate some concepts.
Problem description and pipeline
Problem: detect whether and where there's text in the given photo (not a document). Then, read the text.
- Text detection: draw rectangle around text region
- Character segmentation
- Character classification: what is each character?
This is called an ML pipeline: breaking down a problem into smaller ML sub-problems.
Sliding windows
Let's talk about a simpler problem: pedestrian detection. This is easier because the ratio of the height to width of the region of a photo containing a pedestrian is usually constant. For text, this is not constant: string length, font size, etc.
Supervised learning: given a rectangular photo, predict whether it contains a pedestrian.
We can feed a sliding window of pixels to the classifier (stepping the window over by a stride of a few pixels each time) and flag any regions containing pedestrians. Then, make the window bigger (a larger patch of pixels, rescaled down to the classifier's input size) and classify all the patches at that window size too.
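The window-sweeping part is mechanical and can be sketched as a generator (a minimal version for a grayscale image; the stride and window size would be tuned per problem):

```python
import numpy as np

def sliding_windows(image, win, stride):
    """Yield (row, col, patch) for every win x win patch of the image,
    stepping by `stride` pixels in each direction."""
    H, W = image.shape
    for r in range(0, H - win + 1, stride):
        for c in range(0, W - win + 1, stride):
            yield r, c, image[r:r + win, c:c + win]
```

Each yielded patch would be fed to the classifier; repeating the sweep with a larger `win` (and rescaling each patch to the classifier's input size) covers pedestrians or text at different scales.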
Back to the OCR case. Train a classifier that takes a patch of pixels and determines whether it contains text. Feed sliding windows from the given photo into the classifier. Generate a heat map of probabilities of each patch containing text.
Then, apply an expansion to make the size of the hot patches bigger. You'll get bigger hot patches in some places, and you can eliminate regions where the height is taller than the width (probably not text). Then, the regions that are left probably contain text.
Within each region, do character segmentation with a classifier trained to recognize the split between two adjacent characters. Use sliding windows again across the region (no need to move the window vertically) and insert a split wherever the classifier fires.
For each character, use a multi-class classifier to determine what character is in it.
Getting lots of data and artificial data
Low-bias learning algorithm + large amounts of data == good results. To get large datasets, we can use artificial data synthesis.
For photo OCR, for more training examples for character recognition, pick a random font and a random background image (plus some blurring, affine transformations, etc) and generate a photo of a character. Create a ton of data easily like this, but you need to make sure the synthesized data is good.
You can also take real training data and apply some distortions to each photo to get more photos. For audio, you could take a clean recording training example and add background sounds to simulate noisy environments.
Make sure the visual/audio noise you use is representative of what you expect to see in the test set. Random, meaningless noise won't help.
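A sketch of the distort-real-data approach for images (the shift range and noise level are illustrative assumptions; in practice you'd pick distortions representative of your test set, per the caveat above):

```python
import numpy as np

def distort(image, rng):
    """Generate one extra training example from a real image via small
    distortions: a random translation plus mild pixel noise.
    Assumes pixel values in [0, 1]."""
    dr, dc = rng.integers(-2, 3, size=2)              # small random shift
    shifted = np.roll(np.roll(image, dr, axis=0), dc, axis=1)
    noise = rng.normal(0.0, 0.05, image.shape)        # mild additive noise
    return np.clip(shifted + noise, 0.0, 1.0)
```

Calling this several times per real example multiplies the dataset; purely random noise like this helps less than distortions that mimic real-world variation (blur, affine warps), so treat it as a starting point.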
Make sure you know whether more training examples would help by using learning curves. E.g., keep increasing the number of features / number of hidden units in the neural network until you have a low-bias classifier.
Generating synthetic data sometimes may be more difficult than getting real data. Make sure to think hard before generating synthetic data; maybe you (or crowdsourcing) can collect / label enough data in a shorter amount of time than it takes to create a data synthesizer.
Ceiling analysis: what part of the pipeline to work on next
In a pipeline, for each module, simulate what would happen if you had 100% accuracy (or some other 1-number summary of the system). How much boost do you get over the current performance of the whole system?
Example:
In a pipeline of A -> B -> C:
- If A were perfect, overall accuracy goes up by 10%.
- If A and B were both perfect, accuracy goes up by only another 0.5%.
- If A, B, and C were all perfect, accuracy goes up by another 35% (reaching, by construction, 100% overall).
So focus on improving C first, because perfecting it promises the biggest boost.
Conclusion
Main topics:
- Supervised learning: lin reg, log reg, nn, SVMs
- Unsupervised learning: k-means, PCA, anomaly detection
- Special topics: recommendations, large scale ML
- Advice on building an ML system: bias/variance, regularization, and deciding what to work on next (evaluation of learning algorithms, learning curves, error analysis, ceiling analysis)