Week 6 - premgane/coursera-machine-learning GitHub Wiki

Evaluating a Learning Algorithm

Deciding what to try next

If you are developing an ML system, you need a principled way to pick which approach to try.

If you use regularized linear regression but you're getting huge errors on the test set, what should you try next?

  • Get more training data
  • Smaller set of features
  • Try getting additional features that could be more predictive
  • Add polynomial features
  • Decrease or increase lambda

Unfortunately, people often choose one of these approaches at random.

You should be using ML diagnostics to pick an approach. You should also be able to evaluate a hypothesis that your algorithm has learned.

Evaluating a Hypothesis

To figure out whether you're overfitting (or otherwise generalizing poorly), hold out part of your data as a test set.

A typical split is 70% training / 30% test. Train the parameters on the training portion, then compute the cost function J using just the test examples; call that J_test. This test-set error estimates how well your model will do on unseen examples. Be careful, though: if you also use this same held-out set to choose among models, its error stops being a fair estimate (see the next section).
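A minimal sketch of the split and the test-set cost (hypothetical helper names, squared-error cost; the course itself works in Octave):

```python
import random

def train_test_split(data, test_frac=0.3, seed=0):
    """Shuffle the examples and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def j_cost(hypothesis, examples):
    """Squared-error cost J over a set of (x, y) examples."""
    return sum((hypothesis(x) - y) ** 2 for x, y in examples) / (2 * len(examples))

# Toy data with a known linear relationship y = 2x
data = [(float(x), 2.0 * x) for x in range(10)]
train_set, test_set = train_test_split(data)
j_test = j_cost(lambda x: 2.0 * x, test_set)  # the true hypothesis scores 0.0
```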

Model Selection and Train/Validation/Test Sets

Let's say you're trying to figure out which degree of polynomial to use for your linear regression. You can train 10 models, each with a different polynomial degree; call the degree d. Then, split your data into 3 parts:

  • Training set (60%)
  • Cross-validation set (20%)
  • Test set (20%)

When picking which model to use, look at the error rate of each model on the cross-validation set. Then, pick the model with the lowest error rate. Then, you can report the error rate on the test set. We can think of this as fitting the parameter d (the degree of polynomial).

However, expect the error rate on the cross-validation set to be lower than the error on the test set. This is because we're fitting d to the cross-validation set.
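A toy sketch of the selection procedure, with hand-picked candidate hypotheses standing in for trained models (all names hypothetical):

```python
def mse(h, examples):
    """Squared-error cost of hypothesis h over (x, y) examples."""
    return sum((h(x) - y) ** 2 for x, y in examples) / (2 * len(examples))

# One candidate hypothesis per polynomial degree d (true relationship: y = x^2)
candidates = {
    1: lambda x: 1.5 * x,                 # underfits
    2: lambda x: x ** 2,                  # the true relationship
    3: lambda x: x ** 2 + 0.5 * x ** 3,   # overfits
}

cv_set = [(x, x ** 2) for x in (1.0, 2.0, 3.0)]
test_set = [(x, x ** 2) for x in (1.5, 2.5)]

# Fit the extra parameter d on the cross-validation set: lowest J_cv wins
best_d = min(candidates, key=lambda d: mse(candidates[d], cv_set))
# Report generalization error on the untouched test set
reported_error = mse(candidates[best_d], test_set)
```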

Bias vs. Variance

Diagnosing Bias vs. Variance

  • High bias: underfitting
  • High variance: overfitting

J_train will keep decreasing as you increase the degree of the polynomial of your model.

J_crossvalidation will decrease for a while, then increase as you increase d. So there's an optimal spot.

So, when you're underfitting (high bias), your d is too low, so your J_train and J_cv are both high. When you're overfitting (high variance), your d is too high, so your J_cv will be high but your J_train will be low.
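This diagnosis can be written down directly; `target_error` is a hypothetical acceptable-error threshold:

```python
def diagnose(j_train, j_cv, target_error):
    """Read off the regime from training and cross-validation error."""
    if j_train > target_error and j_cv > target_error:
        return "high bias (underfitting): both errors high"
    if j_train <= target_error < j_cv:
        return "high variance (overfitting): J_cv high, J_train low"
    return "looks fine"
```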

Regularization and Bias/Variance

When lambda is too large, the regularization penalty pushes the parameters toward zero, so the hypothesis is almost a flat horizontal line: high bias, and you'll underfit.

When lambda is too small, the penalty does almost nothing, so the hypothesis curves wildly and tries to fit every training example: high variance, and you'll overfit.

Start with a range of values for lambda. E.g., 0, 0.1, 0.2, ..., 10. Minimize your cost function for each of these lambda values; you'll end up with one theta vector per lambda. Then, pick the lambda whose theta gives the lowest error on the cross-validation set (J_cv). Only after you've picked lambda do you estimate generalization error, by computing the cost on the test set.
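A sketch of that selection loop; `cv_error` is a hypothetical stand-in for "train theta with this lambda, then measure J_cv" (here a toy U-shaped curve with its minimum near lambda = 0.8):

```python
def cv_error(lmbda):
    """Hypothetical: pretend this trains with regularization strength
    lmbda and returns the resulting cross-validation error J_cv."""
    return (lmbda - 0.8) ** 2 + 0.2

# A small grid of candidate lambdas: 0, then doubling from 0.1 up to 6.4
lambdas = [0.0] + [0.1 * 2 ** i for i in range(7)]
best_lambda = min(lambdas, key=cv_error)  # lowest J_cv wins
```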

Large lambda (underfit): high J_train, high J_cv

Small lambda (overfit): low J_train, high J_cv

Learning Curves

Deliberately train on progressively larger subsets of the training set, and compute J_train (on the subset) and J_cv (on the full cross-validation set) for each training set size.

When the training set size is small, training error will be small. E.g., with just 2 data points, of course you'll find a hypothesis that fits both points perfectly. But as you add more points, the hypothesis can only fit the data well on average, so the training error grows.

However, J_cv will decrease as you increase the number of training examples, because the hypothesis generalizes better.

When you have high bias (underfitting), J_cv will decrease a little as you add more training examples, then flatten out at a high value. J_train quickly rises to nearly that same high value, because the hypothesis isn't flexible enough to fit even the training set. Insight: when you're underfitting (high bias), a bigger training set won't help!

When you have high variance (overfitting), J_train stays low and increases only slowly as you add training data. J_cv starts high and keeps decreasing without really flattening out, leaving a large gap between the two curves. Insight: your hypothesis has the capacity to mimic the data, so more training data is likely to help -- it keeps driving J_cv and J_test down.
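A sketch of generating a learning curve, using a deliberately weak toy learner (predict the mean y) so the curve shapes show up; all names are hypothetical:

```python
def mse(h, examples):
    return sum((h(x) - y) ** 2 for x, y in examples) / (2 * len(examples))

def fit_mean(subset):
    """Toy 'learner': always predicts the mean y of its training subset."""
    mean_y = sum(y for _, y in subset) / len(subset)
    return lambda x: mean_y

def learning_curve(train_set, cv_set, sizes):
    """For each size m, train on the first m examples and record
    (m, J_train on those m examples, J_cv on the full CV set)."""
    curve = []
    for m in sizes:
        h = fit_mean(train_set[:m])
        curve.append((m, mse(h, train_set[:m]), mse(h, cv_set)))
    return curve

train_set = [(float(i), float(i)) for i in range(1, 9)]
cv_set = [(2.0, 2.0), (5.0, 5.0), (8.0, 8.0)]
curve = learning_curve(train_set, cv_set, [1, 2, 4, 8])
# J_train rises with m, J_cv falls with m, as described above
```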

Deciding What to Do Next: Revisited

  • Getting more training examples: helps with the high-variance (overfitting) issue.
  • Trying a smaller set of features: helps with high-variance (overfitting).
  • Trying to add more features: helps with high bias (underfit).
  • Trying to add polynomial features: helps with high bias (underfit).
  • Decrease lambda: helps with high bias (underfit).
  • Increase lambda: helps with high variance (overfit).

Neural networks

A small neural network architecture is computationally cheaper, but with fewer parameters it is more prone to underfitting.

A large NN is more expensive, and with more parameters it is more prone to overfitting. But you can use regularization (a larger lambda) to address the overfitting.

To pick the number of hidden layers, train with each candidate architecture, compute J_cv for each, and pick the one with the lowest J_cv.

Building a Spam Classifier

Prioritizing What to Work On

When building a spam classifier, one approach is to take the most commonly occurring words in your training data as features. Your feature vector is then 0s and 1s, each element indicating whether that word appears in the email.
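A sketch of that feature construction (hypothetical names, tiny vocabulary):

```python
import re

def feature_vector(email_text, vocabulary):
    """Binary features: 1 if the vocabulary word occurs in the email."""
    words = set(re.findall(r"[a-z]+", email_text.lower()))
    return [1 if w in words else 0 for w in vocabulary]

# In practice the vocabulary would be the most common words in the corpus
vocab = ["buy", "deal", "now", "meeting"]
x = feature_vector("Buy now!! Great DEAL inside", vocab)  # [1, 1, 1, 0]
```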

How to best spend your time for this?

  • Collect a lot of data
  • Develop sophisticated features based on the email headers
  • Develop sophisticated features based on the message body (basically NLP)
  • Develop sophisticated algorithms to detect deliberate misspellings, which spam is more likely to contain

You need to be methodical about brainstorming these kinds of approaches and picking one. Don't just use your instinct.

You can use error analysis to pick an approach.

Error Analysis

Start with a simple algorithm that you can implement within a day. Implement and test on your cross-validation data.

Plot learning curves to figure out whether more data, more features, or something else would be helpful. This prevents premature optimization.

Error analysis: manually look at the cross-validation examples your algorithm got wrong, and see what features would have let it classify them correctly.

For example, look at the mis-classified emails. 2 things you can do from here:

  1. Categorize them by what type of spam each one is.
  2. Brainstorm features that would have helped with classification.

An important thing to do is to use numerical evaluation. This will allow you to move away from using your intuition. Look at accuracy, precision, or recall, or something else. Example: run your algorithm with and without stemming, and pick which approach to use.

Handling Skewed Data

Error Metrics for Skewed Classes

When you have a lot more examples from one class than from another class, you have skewed classes. Accuracy isn't enough to tell you which model is the best. This is because just classifying y = 0 for all inputs may give you a tiny error rate -- there may be just a few y = 1 examples in the training and test set.

You can use precision and recall to overcome the fact that accuracy becomes a poor measure in the skewed classes case.

Precision (of the examples we predicted positive, how many really are positive): True positives / Predicted positives = TP / (TP + FP)

Recall (of all the actual positives, how many we caught): True positives / Actual positives = TP / (TP + FN)
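Both metrics computed from parallel 0/1 label lists (hypothetical helper name):

```python
def precision_recall(predicted, actual):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Predicting y = 1 for everything: recall is perfect, precision is terrible
p, r = precision_recall([1] * 10, [1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```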

Trading Off Precision and Recall

If you only predict y = 1 when very confident (a higher threshold), you get higher precision but lower recall.

If you lower the threshold to avoid missing positives, you get higher recall but lower precision.

How do we compare different pairs of precision/recall numbers? Average of the two is a poor indicator. Use F_1 score (F-score).

F-Score = 2 * (P*R)/(P+R)

This gives you a single number between 0 and 1.

For the F-score to be large, both P and R need to be large.
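Why the harmonic mean beats the plain average (hypothetical helper name): for a degenerate "always predict spam" classifier on skewed data with P = 0.02 and R = 1.0, the average looks respectable but F_1 exposes the problem:

```python
def f1_score(p, r):
    """Harmonic-mean F_1 score; 0 if both precision and recall are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

naive_average = (0.02 + 1.0) / 2   # 0.51 -- looks fine, but isn't
f1 = f1_score(0.02, 1.0)           # ~0.039 -- correctly flags the classifier
```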

Using Large Data Sets

Data for Machine Learning

Conjecture: regardless of the algorithm, having more data will almost always be better.

Large data rationale

Assume a feature vector x has sufficient information to predict y accurately. (Useful test: given the input x, can a human expert confidently predict y?)

Use a learning algorithm with many parameters, so it's unlikely to underfit (low bias). Use a large training set, so those parameters are unlikely to overfit (low variance).

This way (given a good set of features), you can expect good results with a large training set. Of course, if your features are uninformative, training set size won't save you.