1. Linear Regression - ZYL-Harry/Machine_Learning_study GitHub Wiki

Linear regression with one variable

Process of a supervised learning algorithm

  1. given a training set
  2. feed the training set to the learning algorithm
  3. from the learning algorithm, get an output function h (the hypothesis)
  4. with the hypothesis and an input, get the output (the estimated value)

image

Start with fitting linear functions, and then build on this to more complex models and more complex learning algorithms

How to represent the hypothesis

Hypothesis: used to make predictions

image

How to choose θ

Method: find the values of the θs so that the cost function is minimized
image
Cost function: the squared error function:
image
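Written out (this is the same quantity computed by Cost_Function in the exercise below), the squared error cost over m training examples is:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2, \qquad h_\theta(x) = \theta_0 + \theta_1 x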

Cost function

To better visualize the cost function, we do this:
image

Discuss the simplified hypothesis and cost function:
Each value of θ corresponds to a different hypothesis and a different value of the cost function
image


image


image


image


Then it is easy to find the value of θ that minimizes the cost function, and with it the proper hypothesis


If there are two θs, then the cost function becomes a 3-dimensional surface plot (a convex function):

image

To better visualize it, the contour figure is often used:

image

Gradient descent

  • Purpose: minimize the cost function
  • Procedure:
  1. Find an initial point to start from
  2. Take a step in the direction of steepest descent
  3. Repeat step 2 until converging to a local optimum
  • Mathematical expression:

image

  • α is called the learning rate; it controls how big a step we take downhill with gradient descent (i.e., how much the parameter θj is updated), and it is always a positive number image
    Q: If the parameter θ is already at a local minimum, what will one step of gradient descent do?
    A: The parameter θ won't change; gradient descent keeps the solution at the local optimum.
    image image
  • Gradient descent => a simultaneous update of all the parameters θj
  • The derivative term (the full update rule is written out below):
    image
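For reference, the update that is repeated until convergence (with both parameters updated simultaneously) is:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1), \qquad j = 0, 1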

Gradient descent for linear regression

Then, put gradient descent together with the cost function to get an algorithm for linear regression, i.e. for fitting a straight line to our data
image

  • Key: the derivative term:
    image
    image
    Then, update the parameters θ0 and θ1 simultaneously (the resulting update rules are written out below)
    The process in the contour figure is shown:
    image
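Evaluating the derivative of the squared error cost gives the update rules that Update_theta implements in the exercise below:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big) \, x^{(i)}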

Tip:

  • "Batch" Gradient Descent---Each step of gradient descent uses all the training examples
  • Gradient Descent is better when solving problems with large data set
  • There are also some other methods to solve the minimum of the cost function, like normal equations methods

Exercise in Python

1. Create a 5×5 identity matrix

import numpy as np

A = np.eye(5)
print(A)

Output:

[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]

2. Linear regression with one variable

2.1 Plotting the data

import pandas as pd
import matplotlib.pyplot as plt

# path: location of the exercise's data file (e.g. ex1data1.txt)
data = pd.read_csv(path, header=None, names=['Population', 'Profit'])
data.plot(kind='scatter', x='Population', y='Profit', color='red', marker='x', figsize=(10, 10),
          xlabel='Population of City in 10,000s', ylabel='Profit in $10,000s')
plt.show()

Output:
image

2.2 main function

if __name__ == '__main__':
    # add an additional first column to X and set it to all ones
    data.insert(0, 'Ones', 1)

    # take the value of x[ones,population] and y[profit]
    x = data.loc[:, ['Ones', 'Population']]
    x = np.matrix(x.values)
    y = data.loc[:, ['Profit']]
    y = np.matrix(y.values)
    # initialize theta
    theta = np.matrix(np.array([0, 0]))
    # compute the cost function
    J = Cost_Function(x, y, theta)

    # gradient descent
    alpha = 0.01
    iterations = 1500
    j = np.zeros((iterations, 1))
    theta_all = theta
    for i in range(iterations):
        # compute the cost function
        j[i] = Cost_Function(x, y, theta)
        # update theta
        theta = Update_theta(x, y, theta, alpha)
        theta_all = np.row_stack((theta_all, theta))
    theta_iterations = np.delete(theta_all, 0, 0)

2.3 Cost Function
image

def Cost_Function(x, y, theta):
    inner = np.power(((theta * x.T) - y.T), 2)
    return np.sum(inner) / (2 * len(x))

2.4 Update θs

def Update_theta(x, y, theta, alpha):
    error = (theta * x.T) - y.T                    # 1×m row of prediction errors h(x) - y
    error_sum0 = np.sum(error) / len(x)            # (1/m)·Σ error_i (since x0 = 1)
    error_sum1 = np.sum(error * x[:, 1]) / len(x)  # (1/m)·Σ error_i·x_i via a (1×m)(m×1) product
    # use a temporary variable so that θ0 and θ1 are updated simultaneously
    temp = np.matrix(np.zeros(theta.shape))
    temp[0, 0] = theta[0, 0] - alpha * error_sum0
    temp[0, 1] = theta[0, 1] - alpha * error_sum1
    theta = temp
    return theta

Output (the final θ0 and θ1 after 1500 iterations):

-3.63029144 1.16636235

2.5 plot the regression figure

xx = np.linspace(data.Population.min(), data.Population.max(), 100)
f = theta[0, 0] + theta[0, 1] * xx    # the fitted line h(x) = θ0 + θ1·x
ax = data.plot(kind='scatter', x='Population', y='Profit', color='red', marker='x', figsize=(10, 10),
               xlabel='Population of City in 10,000s', ylabel='Profit in $10,000s')
ax.plot(xx, f, color='blue')
plt.show()

Output:
image

2.6 visualize the cost function

fig = plt.figure()
ax = plt.axes(projection='3d')
# build a (θ0, θ1) grid from the parameter values visited during gradient descent
theta0, theta1 = np.meshgrid(theta_iterations[:, 0].ravel().T, theta_iterations[:, 1].ravel().T)
j_plot = np.zeros((iterations, iterations))
for i in range(iterations):
    for k in range(iterations):    # use k so the cost history array j is not overwritten
        theta_temp = [theta0[i, k], theta1[i, k]]
        j_plot[i, k] = Cost_Function(x, y, theta_temp)
ax.plot_surface(theta0, theta1, j_plot, cmap='rainbow')
ax.set_xlabel('θ0')
ax.set_ylabel('θ1')
ax.set_zlabel('J')
plt.show()

Output:
image
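Because meshing over the 1500 iterates means 1500 × 1500 cost evaluations on an irregular grid, a lighter alternative sketch (assuming example ranges of roughly -10…10 for θ0 and -1…4 for θ1, which bracket the learned values) evaluates J on a regular grid instead:

fig = plt.figure()
ax = plt.axes(projection='3d')
theta0_vals = np.linspace(-10, 10, 100)   # example range for θ0
theta1_vals = np.linspace(-1, 4, 100)     # example range for θ1
T0, T1 = np.meshgrid(theta0_vals, theta1_vals)
J_grid = np.zeros(T0.shape)
for i in range(T0.shape[0]):
    for k in range(T0.shape[1]):
        # evaluate the cost at each grid point
        J_grid[i, k] = Cost_Function(x, y, np.matrix([T0[i, k], T1[i, k]]))
ax.plot_surface(T0, T1, J_grid, cmap='rainbow')
ax.set_xlabel('θ0')
ax.set_ylabel('θ1')
ax.set_zlabel('J')
plt.show()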

2.7 plot the contour figure

plt.figure()
plt.contour(theta0, theta1, j_plot)
plt.plot(theta[0, 0], theta[0, 1], color='red', marker='x')
plt.xlabel('θ0')
plt.ylabel('θ1')
plt.show()

Output:
image

2.8 predictions

information1 = np.matrix(np.array([1, 3.5]))
prediction1 = theta * information1.T
print(prediction1)
information2 = np.matrix(np.array([1, 7]))
prediction2 = theta * information2.T
print(prediction2)

Output (predicted profit, in units of $10,000, for populations of 35,000 and 70,000):

0.45197679
4.53424501

Linear regression with multiple variables

New type of hypothesis

image
For convenience of notation, we have the following process:
image
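Written out, with the convention x_0 = 1 (which is why a column of ones is added to X in the exercises), the hypothesis becomes:

h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^{T} x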

Gradient descent for linear regression with multiple variables

  • Cost Function:
    image

  • The gradient descent is like
    image

  • To sum up:
    image

  • Feature Scaling
    Idea: make sure the features are on a similar scale, so that gradient descent can converge more quickly

  • If the scale difference between the features is large, the corresponding contours of the cost function are tall, skinny ovals, and gradient descent may oscillate back and forth and take a long time to converge to an optimum
  • If feature scaling is applied, the contours of the cost function look more like circles, and gradient descent can take a much more direct path to an optimum
    image

Common method:

  1. dividing by the maximum value to get every feature into approximately a -1 ≤ x ≤ 1 range
    image
  2. mean normalization: subtract the mean value first, then divide by the range of the feature values, so that the features have approximately zero mean (a code sketch of both methods follows after the tips below)
    image
  • Tips:
    1. Deciding when convergence has happened: observe the cost function, or run an automatic convergence test that declares convergence if J(θ) decreases by less than 10^{-3} in one iteration
    2. Making sure gradient descent is working correctly: observe the cost function
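A minimal sketch of the two scaling methods above, assuming a pandas DataFrame df that holds the raw feature columns (df, scale_by_max, and mean_normalize are placeholder names):

import pandas as pd

def scale_by_max(df):
    # method 1: divide every feature by its maximum absolute value,
    # so each feature falls roughly into the -1 <= x <= 1 range
    return df / df.abs().max()

def mean_normalize(df):
    # method 2: subtract the mean, then divide by the range (max - min),
    # so each feature has approximately zero mean
    return (df - df.mean()) / (df.max() - df.min())

The exercise below divides by the standard deviation instead of the range, which serves the same purpose.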

If the cost function is increasing, the most likely cause is that the learning rate is too large

  • For sufficiently small α, J(θ) should decrease on every iteration
  • If α is too small, gradient descent can be slow to converge
  • If α is too large, J(θ) may not decrease on every iteration, and gradient descent may not converge

Therefore, plot J(θ) for several values of α and pick the one that makes J(θ) decrease most rapidly (a sketch of this follows below)
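A minimal sketch of that procedure, reusing x, y, Cost_Function, and Update_theta from the single-variable exercise above (run_gd is a hypothetical helper; the α values and iteration count are just examples):

import numpy as np
import matplotlib.pyplot as plt

def run_gd(x, y, alpha, iterations=400):
    # run gradient descent from θ = 0 and record J(θ) after every update
    theta = np.matrix(np.zeros((1, x.shape[1])))
    costs = []
    for _ in range(iterations):
        theta = Update_theta(x, y, theta, alpha)
        costs.append(Cost_Function(x, y, theta))
    return costs

# compare the J(θ) curves for several candidate learning rates
for alpha in [0.001, 0.003, 0.01, 0.03]:
    plt.plot(run_gd(x, y, alpha), label='alpha = ' + str(alpha))
plt.xlabel('Number of iterations')
plt.ylabel('Cost Function J')
plt.legend()
plt.show()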

Create new features

Sometimes we can create new features (for example, by combining existing features into one key quantity) to get a better model
image

Polynomial regression

Sometimes a straight line can't fit the data set well; in that case we can use a polynomial function as the hypothesis to get a better model (see the example below)
image
image
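For example, one common choice uses powers of a single feature (such as the house size) as new features, in which case feature scaling becomes especially important because the powers live on very different scales:

h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3, \qquad x_1 = x, \; x_2 = x^2, \; x_3 = x^3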

Normal equation

  • Purpose: a method to solve for θ analytically
  • Example:
    image

For m examples (x^{(1)}, y^{(1)}), …, (x^{(m)}, y^{(m)}), each with n features, build the design matrix X by stacking the training examples as rows:
image

  • find the minimum of J(θ)
    Introduce the least-squares method, which is often used to solve systems of equations with no exact solution, using the idea of projection (the resulting formula is written out below)
    image
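The resulting closed-form solution (the same formula implemented by Normal_Equation in the exercise below) is:

\theta = (X^{T} X)^{-1} X^{T} y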

Gradient descent VS Normal equation

image

Normal equation and non-invertibility

Issue: X^{T}X may be non-invertible
Possible causes:

  • redundant features (linearly dependent features): delete the repeated ones
  • too many features (e.g. m ≤ n): delete some features, or use regularization

Exercise in Python

1. Gradient descent

1.1 Plot the data set

path2 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_1/ex1data2.txt'
data2 = pd.read_csv(path2, header=None, names=['Size', 'Bedrooms', 'Price'])
plt.figure()
ax2 = plt.axes(projection='3d')
ax2.scatter(data2.loc[:, 'Size'], data2.loc[:, 'Bedrooms'], data2.loc[:, 'Price'], color='red', marker='x')
ax2.set_xlabel('the size of the house (in square feet)')
ax2.set_ylabel(' the number of bedrooms')
ax2.set_zlabel('the price of the house')
plt.show()

Output:
image

1.2 Feature Normalization

# standardize every column (including Price) to zero mean and unit standard deviation
data2_normalized = (data2 - data2.mean()) / data2.std()

Output:

Ones Size Bedrooms Price
0 1 0.130010 -0.223675 0.475747
1 1 -0.504190 -0.223675 -0.084074
2 1 0.502476 -0.223675 0.228626
3 1 -0.735723 -1.537767 -0.867025
4 1 1.257476 1.090417 1.595389
5 1 -0.019732 1.090417 -0.323998
6 1 -0.587240 -0.223675 -0.204036
7 1 -0.721881 -0.223675 -1.130948
8 1 -0.781023 -0.223675 -1.026973
9 1 -0.637573 -0.223675 -0.783051
10 1 -0.076357 1.090417 -0.803053
11 1 -0.000857 -0.223675 0.052682
12 1 -0.139273 -0.223675 -0.083283
13 1 3.117292 2.404508 2.874981
14 1 -0.921956 -0.223675 -0.643896
15 1 0.376643 1.090417 0.875619
16 1 -0.856523 -1.537767 -0.323998
17 1 -0.962223 -0.223675 -1.123743
18 1 0.765468 1.090417 1.276275
19 1 1.296484 1.090417 2.068039
20 1 -0.294048 -0.223675 -0.699878
21 1 -0.141790 -1.537767 -0.683083
22 1 -0.499157 -0.223675 -0.779852
23 1 -0.048673 1.090417 -0.643896
24 1 2.377392 -0.223675 1.867303
25 1 -1.133356 -0.223675 -0.723870
26 1 -0.682873 -0.223675 0.992382
27 1 0.661026 -0.223675 1.028370
28 1 0.250810 -0.223675 1.076355
29 1 0.800701 -0.223675 -0.323998
30 1 -0.203448 -1.537767 0.075875
31 1 -1.259189 -2.851859 -1.363666
32 1 0.049477 1.090417 -0.204036
33 1 1.429868 -0.223675 1.915287
34 1 -0.238682 1.090417 -0.435962
35 1 -0.709298 -0.223675 -0.723870
36 1 -0.958448 -0.223675 -0.883819
37 1 0.165243 1.090417 0.036687
38 1 2.786350 1.090417 1.668166
39 1 0.202993 1.090417 -0.427165
40 1 -0.423657 -1.537767 0.224627
41 1 0.298626 -0.223675 -0.084074
42 1 0.712618 1.090417 -0.211234
43 1 -1.007523 -0.223675 -0.331196
44 1 -1.445423 -1.537767 -1.283692
45 1 -0.187090 1.090417 -0.323998
46 1 -1.003748 -0.223675 -0.807044

1.3 Gradient Descent

def Cost_Function2(x, y, theta):
    inner = np.power(((theta * x.T) - y.T), 2)
    return np.sum(inner) / (2 * len(x))

def Update_theta2(x, y, theta, alpha):
    error = (theta * x.T) - y.T
    error_sum0 = np.sum(error) / len(x)
    error_sum1 = np.sum(error * x[:, 1]) / len(x)
    error_sum2 = np.sum(error * x[:, 2]) / len(x)
    # use a temporary variable so that all the θs are updated simultaneously
    temp = np.matrix(np.zeros(theta.shape))
    temp[0, 0] = theta[0, 0] - alpha * error_sum0
    temp[0, 1] = theta[0, 1] - alpha * error_sum1
    temp[0, 2] = theta[0, 2] - alpha * error_sum2
    theta = temp
    return theta

if __name__ == '__main__':
    # add an additional first column to X and set it to all ones
    data2_normalized.insert(0, 'Ones', 1)
    # take the values of x [Ones, Size, Bedrooms] and y [Price]
    x2 = data2_normalized.loc[:, ['Ones', 'Size', 'Bedrooms']]
    x2 = np.matrix(x2.values)
    y2 = data2_normalized.loc[:, ['Price']]
    y2 = np.matrix(y2.values)
    # initialize theta
    theta2 = np.matrix(np.array([0, 0, 0]))
    # compute the cost function
    J2 = Cost_Function2(x2, y2, theta2)

    # gradient descent
    alpha2 = 0.01
    iterations2 = 1500
    j2 = np.zeros((iterations2, 1))
    theta_all2 = theta2
    for i in range(iterations2):
        # compute the cost function
        j2[i] = Cost_Function2(x2, y2, theta2)
        # update theta
        theta2 = Update_theta2(x2, y2, theta2, alpha2)
        theta_all2 = np.row_stack((theta_all2, theta2))
    theta_iterations = np.delete(theta_all2, 0, 0)

Output (the final θ0, θ1, θ2 learned on the normalized data):

-1.10815612e-16 8.84042349e-01 -5.24551809e-02

1.4 prediction

# normalize the query features with the training mean/std, then un-normalize the predicted price
information3 = np.matrix(np.array([1, (1650-data2.loc[:, 'Size'].mean())/data2.loc[:, 'Size'].std(), (3-data2.loc[:, 'Bedrooms'].mean())/data2.loc[:, 'Bedrooms'].std()]))
prediction3_normalized = theta2 * information3.T
prediction3 = prediction3_normalized * data2.loc[:, 'Price'].std() + data2.loc[:, 'Price'].mean()
print('a price prediction for a 1650-square-foot house with 3 bedrooms is ', prediction3)

Output:

293101.15341756

1.5 plot cost function with iterations

plt.figure()
ii = range(iterations2)
plt.plot(ii, j2, color='green')
plt.xlabel('Number of iterations')
plt.ylabel('Cost Function')
plt.show()

Output:
image

2. Normal equation

2.1 Normal equation function

def Normal_Equation(X, y):
    # closed-form solution: θ = (XᵀX)⁻¹ Xᵀ y
    theta = np.linalg.inv(X.T * X) * X.T * y
    return theta

2.2 main function

if __name__ == '__main__':
    data2.insert(0, 'Ones', 1)
    x3 = data2.loc[:, ['Ones', 'Size', 'Bedrooms']]
    x3 = np.matrix(x3.values)
    y3 = data2.loc[:, ['Price']]
    y3 = np.matrix(y3.values)
    theta3 = Normal_Equation(x3, y3)

Output:

[[89597.9095428 ]
[ 139.21067402]
[-8738.01911233]]

2.3 prediction

information4 = np.matrix(np.array([1, 1650, 3]))
prediction4 = theta3.T * information4.T
print('a price prediction for a 1650-square-foot house with 3 bedrooms is ', prediction4)

Output:

293081.4643349

This is close to the prediction obtained with gradient descent in section 1.4 (about 293,101), as expected.