Week 1 - premgane/coursera-machine-learning GitHub Wiki
Introduction
Machine learning has lots of applications and implications across many fields.
Supervised Learning
We give the algorithm the "right answers" along with the dataset. The task of the algorithm is to learn to produce the right answers for new, unseen inputs.
Regression problem
Trying to predict a continuous value output based on input. E.g., predict price of house based on square footage.
Classification problem
Trying to predict a discrete value (2 or more possible values) output based on input. E.g., predict type of cancer based on tumor size.
You can either plot the output values as 0, 1, etc. on the y-axis, or squish the plot down to the x-axis and mark X vs. O for malignant vs. benign tumors. This trick lets us draw, on paper, a 3-dimensional problem (i.e., 2 input variables like tumor size and age of patient) in just 2 dimensions.
How do you deal with or store infinite number of features? Support vector machines use a neat mathematical trick to represent an infinitely long list of features.
Unsupervised Learning
Make predictions without knowing the "right answer" to train the algorithm. Given a set of unlabeled data, can you find some sort of structure in it? E.g., clustering algorithms.
Another example is the Cocktail Party Problem. Given audio from 2 microphones, isolate what each individual in a recording is saying. Cool thing about this problem is the learning algorithm can be written in one line of Octave.
Model and Cost Function
Model representation
m = number of training examples
x = input variable / features
y = output variable / target variable
x^(i) = i-th training example
Training set gets fed into learning algo. Learning algo outputs a hypothesis, h. h is a function that maps from x's to y's.
Linear regression with one variable, aka univariate linear regression:
h_theta(x) = theta_0 + theta_1 * x (often shortened to h(x))
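The hypothesis is just a straight line. A minimal Python sketch (the course itself uses Octave; names follow the lecture notation):

```python
def h(theta_0, theta_1, x):
    """Univariate hypothesis: h_theta(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# Hypothetical housing numbers: intercept 50, slope 0.1 per square foot.
prediction = h(50.0, 0.1, 2000)
print(prediction)  # 50 + 0.1 * 2000 = 250.0
```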
Cost function
theta_0 and theta_1 are parameters. We have to choose them such that h_theta(x) is close to y for our training examples (x,y).
Find the values of theta_0 and theta_1 such that the following is minimized:
J(theta_0, theta_1) = (1/(2m)) * sum from i=1 to m of (h(x^(i)) - y^(i))^2
This is called a cost function, specifically squared error cost function. Squared error is used in most linear regression problems.
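The squared error cost can be computed directly from its definition. A Python sketch (a translation of what the course would write in Octave):

```python
def compute_cost(theta_0, theta_1, xs, ys):
    """Squared error cost J(theta_0, theta_1) over a training set."""
    m = len(xs)
    total = sum((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

xs = [1, 2, 3]
ys = [2, 4, 6]          # the line y = 2x fits this data exactly
print(compute_cost(0, 2, xs, ys))  # perfect fit: cost is 0.0
print(compute_cost(0, 1, xs, ys))  # errors 1, 2, 3 -> (1 + 4 + 9) / 6
```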
FAQs about the cost function
Q: Why is the cost function about the sum of squares, rather than the sum of cubes (or just the (h(x)−y) or abs(h(x)−y) ) ?
A: It might be easier to think of this as measuring the distance between two points. In this case, we are measuring the distance between two multi-dimensional values (i.e. the observed outputs y^(i) and the estimated outputs ŷ^(i)). We all know how to measure the distance between two points (x1, y1) and (x2, y2); for two n-dimensional points x and y, the distance is the positive square root of the sum from i=1 to n of (x_i − y_i)^2. That's where the sum of squares comes from (see also Euclidean distance).

The sum of squares isn't the only possible cost function, but it has many nice properties. Squaring the error means that an overestimate is "punished" just the same as an underestimate: an error of −1 is treated just like +1, and two equal but opposite errors can't cancel each other. If we cube the error (or just use the raw difference h(x) − y), we lose this property, since the sign of the error is preserved. Cubing also distorts the penalty on large errors: an error of 2 becomes 8.

The squaring function is smooth (differentiable everywhere) and yields linear forms after differentiation, which is nice for optimization. It also has the property of being convex. A convex cost function guarantees there will be a global minimum, so our algorithms will converge. If you throw in absolute value instead, you get a function that is non-differentiable at 0. If you try to take the derivative of abs(x) and set it equal to zero to find the minimum, you won't get any answers, since the derivative is undefined at 0.
Q: Why can’t I use 4th powers in the cost function? Don’t they have the nice properties of squares?
A: Imagine that you are throwing darts at a dartboard, or firing arrows at a target. If you use the sum of squares as the error (where the center of the bulls-eye is the origin of the coordinate system), the error is the distance from the center. Now rotate the coordinates by 30 degrees, or 45 degrees, or anything. The distance, and hence the error, remains unchanged. 4th powers lack this property, which is known as "rotational invariance".
Q: Why does 1/(2 * m) make the math easier?
A: When we differentiate the cost to calculate the gradient, we get a factor of 2 in the numerator, due to the exponent inside the sum. This 2 in the numerator cancels out with the 2 in the denominator, saving us one math operation in the formula.
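Worked out with the chain rule for a generic hypothesis h_theta, the cancellation looks like this:

```latex
\frac{\partial}{\partial \theta_j} J
  = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
  = \frac{1}{2m} \sum_{i=1}^{m} 2 \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
```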
Cost Function: Intuition 1
The cost function looks at the hypothesis and measures the errors (difference between hypothesis and each training data point). Bigger errors are penalized more than smaller errors, hence the squares.
Let's ignore theta_0 for now. J(theta_1) is an upward-opening parabola, and it is always non-negative. We want to choose the value of theta_1 that sits at the minimum of that parabola.
Cost Function: Intuition 2
Contour plots/figures are a representation of a 3D surface in 2D space.
We can use a contour plot to represent the error J given theta_0 and theta_1. J(theta_0, theta_1) is a bowl-shaped 3D surface (a paraboloid). Minimizing the cost function means finding the bottom of this bowl.
Our goal is to come up with an algorithm that efficiently finds the bottom of the bowl.
Parameter Learning
Parameter learning means finding the optimal parameters theta_0, theta_1, etc., such that J(theta_0, ..., theta_n) is minimized.
Gradient descent for theta_0 and theta_1
Initialize theta_0 and theta_1 to some values. We start at some point on the surface of the J function. Then, iteratively take small steps in the direction that would take you the most downhill.
Algorithm
For j = 0 and j = 1:
theta_j := theta_j - alpha * [partial derivative w.r.t. theta_j of J(theta_0, theta_1)]
alpha is the learning rate. Controls how big a step we take.
We want to simultaneously update theta_0 and theta_1. This means: we use the old value of theta_0 and theta_1 to find the new values of theta_0 and theta_1. Then, we use these new values to do our next iteration. Don't, for example, use the new theta_0 and the old theta_1 to find the new theta_1.
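The simultaneous update can be sketched in Python. Note that both gradients are evaluated with the old thetas before either is assigned; `grad_0` and `grad_1` are placeholder functions standing in for the partial derivatives of J:

```python
def gd_step(theta_0, theta_1, alpha, grad_0, grad_1):
    """One gradient descent step with a simultaneous update."""
    d0 = grad_0(theta_0, theta_1)  # both gradients use the OLD values...
    d1 = grad_1(theta_0, theta_1)
    return theta_0 - alpha * d0, theta_1 - alpha * d1  # ...then update together

# Toy example: minimize J(t0, t1) = t0^2 + t1^2, whose gradients are 2*t0 and 2*t1.
t0, t1 = 4.0, -2.0
for _ in range(100):
    t0, t1 = gd_step(t0, t1, 0.1, lambda a, b: 2 * a, lambda a, b: 2 * b)
# t0 and t1 both approach 0, the minimum.
```

In Python the tuple on the right-hand side of `t0, t1 = gd_step(...)` is built before either variable is reassigned, which is exactly the simultaneous update the lecture warns about.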
Gradient Descent Intuition
The derivative term looks at the slope of the tangent that touches J at point theta_j. If the slope is positive, we move to the left (alpha has a minus sign in front of it). This pulls us down the curve of J to get closer to the minimum.
The same applies for a negative tangent slope. We move to the right because we're in the left half of the parabola of J.
Alpha determines the size of the step we take toward the minimum. If alpha is too small, we need to take a lot of steps. If alpha is too large, we'll take a huge step if we're close to the minimum, and we'll overshoot the minimum. We might not converge at the minimum, and we may even diverge and move away from the minimum.
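A tiny one-parameter demo of this behavior, using J(theta) = theta^2 (gradient 2*theta), which is not from the lecture but illustrates the point:

```python
def descend(theta, alpha, steps):
    """Run gradient descent on J(theta) = theta^2, whose gradient is 2*theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

# Small alpha: each step multiplies theta by (1 - 2*alpha) = 0.8, so it shrinks.
print(descend(1.0, 0.1, 50))
# Too-large alpha: each step multiplies theta by -1.2, so it diverges.
print(descend(1.0, 1.1, 50))
```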
If the tangent's slope (the derivative) is 0, then we're already at a local minimum. In that case, theta_j will not update to a new value at this iteration.
Gradient descent can converge to a local minimum even with alpha fixed across all iterations. This is because J is curved, and its tangent's slope will become shallower and shallower. The derivative term will keep getting smaller, and we'll take smaller and smaller steps anyway.
Gradient Descent for Linear Regression
We need to find the partial derivative term for linear regression's cost function J(theta_0, theta_1). We need to do this for both the theta_0 case and theta_1 case. For this class, no need to do the actual derivative -- we're just given the derivative.
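For the record, the two derivative terms (obtained by applying the chain rule to J, using the fact that the derivative of h with respect to theta_0 is 1 and with respect to theta_1 is x) are:

```latex
\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)
\qquad
\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}
```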
The cost function for linear regression will always be bowl-shaped, i.e. a convex function. This means there is only one minimum (the global one), with no other local optima to get stuck in.
Batch gradient descent means that at each step, we look at all of the training examples. This is what we are doing here. It is opposed to stochastic (sometimes called online) gradient descent, where each update is based on a single training example (or a small subset) rather than the entire training set.
Normal Equation Method (from linear algebra) can simply solve for the minimum of the cost function rather than iteratively converging on it. However, it does not scale as well as iteratively descending the gradient.
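The closed-form solution is theta = (X^T X)^(-1) X^T y. A NumPy sketch, assuming a design matrix X with a leading column of ones for the intercept:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 5.0 + 2.0 * xs                      # data on the exact line y = 5 + 2x

# Design matrix: a column of ones (for theta_0) next to the inputs.
X = np.column_stack([np.ones_like(xs), xs])

# Solve the normal equation (X^T X) theta = X^T y directly,
# rather than forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ ys)
print(theta)  # recovers [5., 2.]
```

Using `np.linalg.solve` instead of computing the matrix inverse is the standard numerically safer choice; either way, the cost of solving grows with the number of features, which is why gradient descent scales better.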