Week 2 - premgane/coursera-machine-learning GitHub Wiki
Multiple Features
Linear regression can also have multiple variables. Each input vector would have n dimensions.
x_n refers to the nth feature of the input feature vector. Also, remember that x^(i) refers to the i-th input vector in the data.
x^(3)_2 refers to the 2nd feature of the 3rd example input vector.
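As a concrete sketch of this indexing in Python/numpy (made-up numbers, not from the course; note numpy is 0-indexed while the math notation here is 1-indexed):

```python
import numpy as np

# Hypothetical training set: 3 examples (rows), 4 features (columns).
data = np.array([[10.0, 11.0, 12.0, 13.0],
                 [20.0, 21.0, 22.0, 23.0],
                 [30.0, 31.0, 32.0, 33.0]])

x_3 = data[2]       # x^(3): the 3rd example (numpy row index 2)
x_3_2 = data[2, 1]  # x^(3)_2: its 2nd feature
# x_3_2 == 31.0
```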
Hypothesis for multiple features
h_theta(x) = theta_0 + (theta_1 * x_1) + ... + (theta_n * x_n)
To simplify,
x_0 = 1, which means x^(i)_0 = 1. This allows us to have 0-indexed input vectors, as well as a 0-indexed parameter vector theta.
h_theta(x) = (theta_0 * x_0) + ... + (theta_n * x_n)
x = [x_0, ..., x_n] <-- (column vector, (n+1) by 1)
theta_Transpose = [theta_0, ..., theta_n] <-- (row vector, 1 by (n+1))
h_theta(x) = theta_Transpose * x => scalar
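A minimal numpy sketch of the vectorized hypothesis (hypothetical numbers; the course itself uses Octave):

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])  # theta_0, theta_1, theta_2
x = np.array([1.0, 4.0, 5.0])      # x_0 = 1 prepended, then x_1, x_2

h = theta @ x  # theta_Transpose * x, a scalar: 1*1 + 2*4 + 3*5 = 24
```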
Gradient Descent for Multiple Variables
The hypothesis is h_theta(x). The parameters are the vector theta, which is (n+1)-dimensional. We'll write the cost function as just J(theta) instead of J(theta_0, ..., theta_n).
Gradient descent looks like this:
Simultaneously update for every j = 0, ..., n.
theta_j = theta_j - alpha * (partial derivative of J(theta) with respect to theta_j)
where alpha is the learning rate. You get:
theta_j = theta_j - alpha * (1/m) * sum from i=1 to m of (h_theta(x^(i)) - y^(i)) * x^(i)_j
where m is the number of training examples. See Week 1 for explanation of the single-variable case.
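One simultaneous update of every theta_j can be sketched in numpy like this (a vectorized version of the sum above; the function name is my own, not from the course):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j.

    X is m by (n+1) with a leading column of ones (x_0 = 1);
    y is an m-vector; alpha is the learning rate."""
    m = len(y)
    errors = X @ theta - y         # h_theta(x^(i)) - y^(i) for every i
    gradient = (X.T @ errors) / m  # (1/m) * sum over i of error_i * x^(i)_j
    return theta - alpha * gradient
```

Because the whole new vector is computed from the old `theta` in one expression, every component is updated simultaneously, as the algorithm requires.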
Practical Trick 1: Feature Scaling
Make sure the features are on a similar scale.
E.g., let's say x_1 is the size of a house in square feet. Let's say x_2 is the number of bedrooms. The contour diagram could be very skinny, and gradient descent will meander back and forth and take a long time to converge. We prefer to have a contour diagram that's much closer to a circle.
Try to get all features to be between -1 and 1, or something close to that, within an order of magnitude. It's just an approximate way of making gradient descent work better.
Mean normalization: Replace x_i with (x_i - mu_i)/s_i to make features have approx 0 mean and a range of around 1. Mu is the mean value of the feature, and s is either the range or standard deviation of the feature, depending on what's easily available to us.
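A sketch of mean normalization for one feature column (hypothetical house sizes; here s is the standard deviation):

```python
import numpy as np

x = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])  # hypothetical sizes

mu = x.mean()
s = x.std()               # or use the range: x.max() - x.min()
x_scaled = (x - mu) / s   # now has mean ~0 and standard deviation 1
```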
Practical Trick 2: Learning Rate
This trick is about the learning rate, alpha. How do we choose it?
First, you need to be able to debug gradient descent. Plot the value of J(theta) against the number of iterations. If gradient descent is working properly, J(theta) should decrease after every iteration. After enough iterations, J(theta) will level off and stop decreasing much.
You could declare convergence automatically once J(theta) decreases by less than some small threshold in an iteration, but that threshold is hard to choose, so just plot J(theta) after each iteration and inspect the curve.
If J(theta) increases, or fluctuates, then your alpha is probably too large.
If the convergence is very slow, alpha might be too small. Slow convergence can also happen when alpha is somewhat too large, though.
Try running gradient descent with a bunch of different values, roughly two per order of magnitude (about 3x apart):
0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, etc.
Once you've found a value that's too small and a value that's too large, pick the largest alpha that still makes J(theta) decrease reliably, or something slightly smaller.
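The alpha sweep could be sketched like this (toy data I made up; in practice you'd plot each `history` curve rather than just print the final cost):

```python
import numpy as np

def cost(theta, X, y):
    errors = X @ theta - y
    return (errors @ errors) / (2 * len(y))

# Toy data set: y = 2 * x, with the x_0 = 1 column prepended.
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])
y = 2.0 * np.arange(1.0, 6.0)

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1]:
    theta = np.zeros(2)
    history = []
    for _ in range(100):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(theta, X, y))
    # With a workable alpha, J(theta) decreases on every iteration.
    decreasing = all(a >= b for a, b in zip(history, history[1:]))
    print(alpha, history[-1], decreasing)
```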
Features and Polynomial Regression
When given a list of features, you can create new features from them yourself. E.g., if you're given the width of a house and its length, you can multiply them to get the area, which might be a better predictor of the price. Be careful about the shape of the fitted curve, though: an x^2 feature gives a parabola, and we don't expect the price of a house to drop once its size passes the parabola's maximum. A cubic function is more plausible, so we'd want an x^3 feature.
You can do this with a simple modification to linear regression.
In this case, you add a new feature, which is determined by x^2. Then add another feature, which is x^3. Instead of x^3, you could even do sqrt(x) if you think that's a better predictor.
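Building those polynomial features is just stacking transformed columns and then running the same linear regression (a sketch with made-up sizes):

```python
import numpy as np

size = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical house sizes

X = np.column_stack([
    np.ones_like(size),  # x_0 = 1
    size,                # x_1 = size
    size ** 2,           # x_2 = size^2
    size ** 3,           # x_3 = size^3 (could use np.sqrt(size) instead)
])
# Ordinary linear regression on X now fits a cubic in `size` --
# which is why feature scaling matters: the columns range from 1 to 64 here.
```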
When you do this, feature scaling and mean normalization become much more important. Your transformations of the data will cause different features to have wildly different means and ranges.
Later in the course, we will encounter some algorithms that can help us choose what features to use.
Normal Equation
theta = (X-transpose * X)^-1 * X-transpose * y
Gradient descent is iterative. Normal equation is a method to solve for theta analytically.
If theta were a scalar, you'd minimize J(theta) by taking its derivative, setting it to 0, and solving for theta. Since theta is a vector, take the partial derivative of the cost function J(theta) with respect to each element of theta, set them all to 0, and solve.
The matrix X of all the examples is m by (n+1), and y is an m-dimensional column vector. The (n+1) is because we set x^(i)_0 = 1 for every example. X is also called the "design matrix": take each training example's column vector x^(i), transpose it, and make it a row of X.
Breakdown of the Normal Equation
For reference, this is the Normal Equation:
theta = (X-transpose * X)^-1 * X-transpose * y
In Octave: `pinv(X'*X) * X' * y`
(X-transpose * X)^-1 is the inverse of matrix X-transpose * X.
Feature scaling
When using the Normal Equation, feature scaling is not necessary.
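A numpy sketch of the Normal Equation on tiny made-up data (using the pseudoinverse, mirroring the Octave `pinv` line above):

```python
import numpy as np

# 4 examples generated exactly from y = 1 + 2*x, with x_0 = 1 prepended.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y  # recovers theta = [1, 2]
```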
Advantages & Disadvantages
Gradient descent: need to choose learning rate alpha. Need many iterations. Works even when n is large. Works on more sophisticated things than linear regression.
Normal equation: no need to choose alpha or run many iterations, but it's slow if n (the number of features) is very large, because computing (X-transpose * X)^-1 takes O(n^3) time. In practice, n around 10,000 is where you'd start to consider switching to gradient descent.
Normal Equation Non-Invertibility
What if (X-transpose * X) is non-invertible? (btw Octave will do the right thing if you use pinv, which is pseudoinverse.)
This could happen in 2 cases:
- Redundant features (one feature is linearly dependent on another). E.g., one feature is size in feet, and another is size in meters. Delete one of these features to solve the issue.
- Too many features (m <= n). Delete some features or use "regularization" (discussed later).
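The redundant-feature case can be sketched directly: when one feature is a constant multiple of another, X-transpose * X is singular, but the pseudoinverse still produces a usable theta (toy numbers of my own):

```python
import numpy as np

feet = np.array([10.0, 20.0, 30.0, 40.0])
meters = feet * 0.3048            # linearly dependent on `feet`
X = np.column_stack([np.ones(4), feet, meters])
y = 2.0 * feet + 1.0

# X.T @ X is singular here, so np.linalg.inv would be numerically unusable;
# the pseudoinverse still returns a (minimum-norm) solution whose
# predictions match y.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```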