1. Linear Regression - ZYL-Harry/Machine_Learning_study GitHub Wiki

Linear regression with one variable

Process of a supervised learning algorithm

  1. given a training set
  2. feed the training set to the learning algorithm
  3. from the learning algorithm, get an output function h (the hypothesis)
  4. with the hypothesis and an input, get the output (the estimated value)

image

Start with fitting linear functions, and then build on this to more complex models and more complex learning algorithms

How to represent the hypothesis

Hypothesis: used to make predictions

image

How to choose θ

Method: find the values of the θs so that the cost function is minimized
image
Cost function: the squared error function:
image
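Written out (this is the same quantity computed by Cost_Function in the exercise below), the squared error cost over m training examples is:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2, \qquad h_\theta(x) = \theta_0 + \theta_1 x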

Cost function

To better visualize the cost function, we do this:
image

Discuss the simplified hypothesis and cost function:
Each value of θ corresponds to a different hypothesis and a different value of the cost function
image


image


image


image


Then it is easy to find the value of θ that minimizes the cost function, and with it the proper hypothesis


If there are two θs, then the cost function becomes a 3-dimensional surface plot (a convex function):

image

To better visualize it, the contour figure is often used:

image

Gradient descent

  • Purpose: minimize the cost function
  • Procedure:
  1. Find an initial point to start from
  2. Take a step in the direction of steepest descent
  3. Repeat step 2 until converging to a local optimum
  • Mathematical expression:

image

  • α is called the learning rate; it controls how big a step we take downhill with gradient descent (i.e., how much the parameter θj is updated), and it is always a positive number image
    Q: If the parameter θ is already at a local minimum, what will one step of gradient descent do?
    A: The parameter θ won't change; gradient descent keeps the solution at the local optimum.
    image image
  • Gradient descent => a simultaneous update of all the parameters θj
  • The derivative term (the full update rule is written out below):
    image
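For reference, the update that is repeated until convergence (with both parameters updated simultaneously) is:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1), \qquad j = 0, 1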

Gradient descent for linear regression

Then, put gradient descent together with the cost function to get an algorithm for linear regression, i.e. for fitting a straight line to our data
image

  • Key: the derivative term:
    image
    image
    Then, update the parameters θ0 and θ1 simultaneously (the resulting update rules are written out below)
    The process in the contour figure is shown:
    image
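Evaluating the derivative of the squared error cost gives the update rules that Update_theta implements in the exercise below:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big) \, x^{(i)}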

Tip:

  • "Batch" Gradient Descent---Each step of gradient descent uses all the training examples
  • Gradient Descent is better when solving problems with large data set
  • There are also some other methods to solve the minimum of the cost function, like normal equations methods

Exercise in Python

1. Create a 5×5 identity matrix

import numpy as np

A = np.eye(5)
print(A)

Output:

[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]

2. Linear regression with one variable

2.1 Plotting the data

import pandas as pd
import matplotlib.pyplot as plt

# path: location of the exercise's data file (e.g. ex1data1.txt)
data = pd.read_csv(path, header=None, names=['Population', 'Profit'])
data.plot(kind='scatter', x='Population', y='Profit', color='red', marker='x', figsize=(10, 10),
          xlabel='Population of City in 10,000s', ylabel='Profit in $10,000s')
plt.show()

Output:
image

2.2 main function

if __name__ == '__main__':
    # add an additional first column to X and set it to all ones
    data.insert(0, 'Ones', 1)

    # take the value of x[ones,population] and y[profit]
    x = data.loc[:, ['Ones', 'Population']]
    x = np.matrix(x.values)
    y = data.loc[:, ['Profit']]
    y = np.matrix(y.values)
    # initialize theta
    theta = np.matrix(np.array([0, 0]))
    # compute the cost function
    J = Cost_Function(x, y, theta)

    # gradient descent
    alpha = 0.01
    iterations = 1500
    j = np.zeros((iterations, 1))
    theta_all = theta
    for i in range(iterations):
        # compute the cost function
        j[i] = Cost_Function(x, y, theta)
        # update theta
        theta = Update_theta(x, y, theta, alpha)
        theta_all = np.row_stack((theta_all, theta))
    theta_iterations = np.delete(theta_all, 0, 0)

2.3 Cost Function
image

def Cost_Function(x, y, theta):
    inner = np.power(((theta * x.T) - y.T), 2)
    return np.sum(inner) / (2 * len(x))

2.4 Update θs

def Update_theta(x, y, theta, alpha):
    error = (theta * x.T) - y.T                    # 1×m row of prediction errors h(x) - y
    error_sum0 = np.sum(error) / len(x)            # (1/m)·Σ error_i (since x0 = 1)
    error_sum1 = np.sum(error * x[:, 1]) / len(x)  # (1/m)·Σ error_i·x_i via a (1×m)(m×1) product
    # use a temporary variable so that θ0 and θ1 are updated simultaneously
    temp = np.matrix(np.zeros(theta.shape))
    temp[0, 0] = theta[0, 0] - alpha * error_sum0
    temp[0, 1] = theta[0, 1] - alpha * error_sum1
    theta = temp
    return theta

Output (the final θ0 and θ1 after 1500 iterations):

-3.63029144 1.16636235

2.5 plot the regression figure

xx = np.linspace(data.Population.min(), data.Population.max(), 100)
f = theta[0, 0] + theta[0, 1] * xx    # the fitted line h(x) = θ0 + θ1·x
ax = data.plot(kind='scatter', x='Population', y='Profit', color='red', marker='x', figsize=(10, 10),
               xlabel='Population of City in 10,000s', ylabel='Profit in $10,000s')
ax.plot(xx, f, color='blue')
plt.show()

Output:
image

2.6 visualize the cost function

fig = plt.figure()
ax = plt.axes(projection='3d')
# build a (θ0, θ1) grid from the parameter values visited during gradient descent
theta0, theta1 = np.meshgrid(theta_iterations[:, 0].ravel().T, theta_iterations[:, 1].ravel().T)
j_plot = np.zeros((iterations, iterations))
for i in range(iterations):
    for k in range(iterations):    # use k so the cost history array j is not overwritten
        theta_temp = [theta0[i, k], theta1[i, k]]
        j_plot[i, k] = Cost_Function(x, y, theta_temp)
ax.plot_surface(theta0, theta1, j_plot, cmap='rainbow')
ax.set_xlabel('θ0')
ax.set_ylabel('θ1')
ax.set_zlabel('J')
plt.show()

Output:
image
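Because meshing over the 1500 iterates means 1500 × 1500 cost evaluations on an irregular grid, a lighter alternative sketch (assuming example ranges of roughly -10…10 for θ0 and -1…4 for θ1, which bracket the learned values) evaluates J on a regular grid instead:

fig = plt.figure()
ax = plt.axes(projection='3d')
theta0_vals = np.linspace(-10, 10, 100)   # example range for θ0
theta1_vals = np.linspace(-1, 4, 100)     # example range for θ1
T0, T1 = np.meshgrid(theta0_vals, theta1_vals)
J_grid = np.zeros(T0.shape)
for i in range(T0.shape[0]):
    for k in range(T0.shape[1]):
        # evaluate the cost at each grid point
        J_grid[i, k] = Cost_Function(x, y, np.matrix([T0[i, k], T1[i, k]]))
ax.plot_surface(T0, T1, J_grid, cmap='rainbow')
ax.set_xlabel('θ0')
ax.set_ylabel('θ1')
ax.set_zlabel('J')
plt.show()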

2.7 plot the contour figure

plt.figure()
plt.contour(theta0, theta1, j_plot)
plt.plot(theta[0, 0], theta[0, 1], color='red', marker='x')
plt.xlabel('θ0')
plt.ylabel('θ1')
plt.show()

Output:
image

2.8 predictions

information1 = np.matrix(np.array([1, 3.5]))
prediction1 = theta * information1.T
print(prediction1)
information2 = np.matrix(np.array([1, 7]))
prediction2 = theta * information2.T
print(prediction2)

Output (predicted profit, in units of $10,000, for populations of 35,000 and 70,000):

0.45197679
4.53424501

Linear regression with multiple variables

New type of hypothesis

image
For convenience of notation, we have the following process:
image
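Written out, with the convention x_0 = 1 (which is why a column of ones is added to X in the exercises), the hypothesis becomes:

h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^{T} x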

Gradient descent for linear regression with multiple variables

  • Cost Function:
    image

  • The gradient descent is like
    image

  • To sum up:
    image

  • Feature Scaling
    Idea: make sure the features are on a similar scale, so that gradient descent can converge more quickly

  • If the scale difference between the features is large, the corresponding contours of the cost function are tall, skinny ovals, and gradient descent may oscillate back and forth and take a long time to converge to an optimum
  • If feature scaling is applied, the contours of the cost function look more like circles, and gradient descent can take a much more direct path to an optimum
    image

Common method:

  1. dividing by the maximum value to get every feature into approximately a -1 ≤ x ≤ 1 range
    image
  2. mean normalization: subtract the mean value first, then divide by the range of the feature values, so that the features have approximately zero mean (a code sketch of both methods follows after the tips below)
    image
  • Tips:
    1. Deciding when convergence has happened: observe the cost function, or run an automatic convergence test that declares convergence if J(θ) decreases by less than 10^{-3} in one iteration
    2. Making sure gradient descent is working correctly: observe the cost function
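A minimal sketch of the two scaling methods above, assuming a pandas DataFrame df that holds the raw feature columns (df, scale_by_max, and mean_normalize are placeholder names):

import pandas as pd

def scale_by_max(df):
    # method 1: divide every feature by its maximum absolute value,
    # so each feature falls roughly into the -1 <= x <= 1 range
    return df / df.abs().max()

def mean_normalize(df):
    # method 2: subtract the mean, then divide by the range (max - min),
    # so each feature has approximately zero mean
    return (df - df.mean()) / (df.max() - df.min())

The exercise below divides by the standard deviation instead of the range, which serves the same purpose.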

If the cost function is increasing, the most likely cause is that the learning rate is too large

  • For sufficiently small α, J(θ) should decrease on every iteration
  • If α is too small, gradient descent can be slow to converge
  • If α is too large, J(θ) may not decrease on every iteration, and gradient descent may not converge

Therefore, plot J(θ) for several values of α and pick the one that makes J(θ) decrease most rapidly (a sketch of this follows below)
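A minimal sketch of that procedure, reusing x, y, Cost_Function, and Update_theta from the single-variable exercise above (run_gd is a hypothetical helper; the α values and iteration count are just examples):

import numpy as np
import matplotlib.pyplot as plt

def run_gd(x, y, alpha, iterations=400):
    # run gradient descent from θ = 0 and record J(θ) after every update
    theta = np.matrix(np.zeros((1, x.shape[1])))
    costs = []
    for _ in range(iterations):
        theta = Update_theta(x, y, theta, alpha)
        costs.append(Cost_Function(x, y, theta))
    return costs

# compare the J(θ) curves for several candidate learning rates
for alpha in [0.001, 0.003, 0.01, 0.03]:
    plt.plot(run_gd(x, y, alpha), label='alpha = ' + str(alpha))
plt.xlabel('Number of iterations')
plt.ylabel('Cost Function J')
plt.legend()
plt.show()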

Create new features

Sometimes we can create new features (for example, by combining existing features into one key quantity) to get a better model
image

Polynomial regression

Sometimes a straight line can't fit the data set well; in that case we can use a polynomial function as the hypothesis to get a better model (see the example below)
image
image
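For example, one common choice uses powers of a single feature (such as the house size) as new features, in which case feature scaling becomes especially important because the powers live on very different scales:

h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3, \qquad x_1 = x, \; x_2 = x^2, \; x_3 = x^3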

Normal equation

  • Purpose: a method to solve for θ analytically
  • Example:
    image

For m examples (x^{(1)}, y^{(1)}), …, (x^{(m)}, y^{(m)}), each with n features, build the design matrix X by stacking the training examples as rows:
image

  • find the minimum of J(θ)
    Introduce the least-squares method, which is often used to solve systems of equations with no exact solution, using the idea of projection (the resulting formula is written out below)
    image
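The resulting closed-form solution (the same formula implemented by Normal_Equation in the exercise below) is:

\theta = (X^{T} X)^{-1} X^{T} y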

Gradient descent VS Normal equation

image

Normal equation and non-invertibility

Issue: X^{T}X may be non-invertible
Possible causes:

  • redundant features (linearly dependent features): delete the repeated ones
  • too many features (e.g. m ≤ n): delete some features, or use regularization

Exercise in Python

1. Gradient descent

1.1 Plot the data set

path2 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_1/ex1data2.txt'
data2 = pd.read_csv(path2, header=None, names=['Size', 'Bedrooms', 'Price'])
plt.figure()
ax2 = plt.axes(projection='3d')
ax2.scatter(data2.loc[:, 'Size'], data2.loc[:, 'Bedrooms'], data2.loc[:, 'Price'], color='red', marker='x')
ax2.set_xlabel('the size of the house (in square feet)')
ax2.set_ylabel(' the number of bedrooms')
ax2.set_zlabel('the price of the house')
plt.show()

Output:
image

1.2 Feature Normalization

# standardize every column (including Price) to zero mean and unit standard deviation
data2_normalized = (data2 - data2.mean()) / data2.std()

Output:

Ones Size Bedrooms Price
0 1 0.130010 -0.223675 0.475747
1 1 -0.504190 -0.223675 -0.084074
2 1 0.502476 -0.223675 0.228626
3 1 -0.735723 -1.537767 -0.867025
4 1 1.257476 1.090417 1.595389
5 1 -0.019732 1.090417 -0.323998
6 1 -0.587240 -0.223675 -0.204036
7 1 -0.721881 -0.223675 -1.130948
8 1 -0.781023 -0.223675 -1.026973
9 1 -0.637573 -0.223675 -0.783051
10 1 -0.076357 1.090417 -0.803053
11 1 -0.000857 -0.223675 0.052682
12 1 -0.139273 -0.223675 -0.083283
13 1 3.117292 2.404508 2.874981
14 1 -0.921956 -0.223675 -0.643896
15 1 0.376643 1.090417 0.875619
16 1 -0.856523 -1.537767 -0.323998
17 1 -0.962223 -0.223675 -1.123743
18 1 0.765468 1.090417 1.276275
19 1 1.296484 1.090417 2.068039
20 1 -0.294048 -0.223675 -0.699878
21 1 -0.141790 -1.537767 -0.683083
22 1 -0.499157 -0.223675 -0.779852
23 1 -0.048673 1.090417 -0.643896
24 1 2.377392 -0.223675 1.867303
25 1 -1.133356 -0.223675 -0.723870
26 1 -0.682873 -0.223675 0.992382
27 1 0.661026 -0.223675 1.028370
28 1 0.250810 -0.223675 1.076355
29 1 0.800701 -0.223675 -0.323998
30 1 -0.203448 -1.537767 0.075875
31 1 -1.259189 -2.851859 -1.363666
32 1 0.049477 1.090417 -0.204036
33 1 1.429868 -0.223675 1.915287
34 1 -0.238682 1.090417 -0.435962
35 1 -0.709298 -0.223675 -0.723870
36 1 -0.958448 -0.223675 -0.883819
37 1 0.165243 1.090417 0.036687
38 1 2.786350 1.090417 1.668166
39 1 0.202993 1.090417 -0.427165
40 1 -0.423657 -1.537767 0.224627
41 1 0.298626 -0.223675 -0.084074
42 1 0.712618 1.090417 -0.211234
43 1 -1.007523 -0.223675 -0.331196
44 1 -1.445423 -1.537767 -1.283692
45 1 -0.187090 1.090417 -0.323998
46 1 -1.003748 -0.223675 -0.807044

1.3 Gradient Descent

def Cost_Function2(x, y, theta):
    inner = np.power(((theta * x.T) - y.T), 2)
    return np.sum(inner) / (2 * len(x))

def Update_theta2(x, y, theta, alpha):
    error = (theta * x.T) - y.T
    error_sum0 = np.sum(error) / len(x)
    error_sum1 = np.sum(error * x[:, 1]) / len(x)
    error_sum2 = np.sum(error * x[:, 2]) / len(x)
    # use a temporary variable so that all the θs are updated simultaneously
    temp = np.matrix(np.zeros(theta.shape))
    temp[0, 0] = theta[0, 0] - alpha * error_sum0
    temp[0, 1] = theta[0, 1] - alpha * error_sum1
    temp[0, 2] = theta[0, 2] - alpha * error_sum2
    theta = temp
    return theta

if __name__ == '__main__':
    # add an additional first column to X and set it to all ones
    data2_normalized.insert(0, 'Ones', 1)
    # take the values of x [Ones, Size, Bedrooms] and y [Price]
    x2 = data2_normalized.loc[:, ['Ones', 'Size', 'Bedrooms']]
    x2 = np.matrix(x2.values)
    y2 = data2_normalized.loc[:, ['Price']]
    y2 = np.matrix(y2.values)
    # initialize theta
    theta2 = np.matrix(np.array([0, 0, 0]))
    # compute the cost function
    J2 = Cost_Function2(x2, y2, theta2)

    # gradient descent
    alpha2 = 0.01
    iterations2 = 1500
    j2 = np.zeros((iterations2, 1))
    theta_all2 = theta2
    for i in range(iterations2):
        # compute the cost function
        j2[i] = Cost_Function2(x2, y2, theta2)
        # update theta
        theta2 = Update_theta2(x2, y2, theta2, alpha2)
        theta_all2 = np.row_stack((theta_all2, theta2))
    theta_iterations = np.delete(theta_all2, 0, 0)

Output (the final θ0, θ1, θ2 learned on the normalized data):

-1.10815612e-16 8.84042349e-01 -5.24551809e-02

1.4 prediction

# normalize the query features with the training mean/std, then un-normalize the predicted price
information3 = np.matrix(np.array([1, (1650-data2.loc[:, 'Size'].mean())/data2.loc[:, 'Size'].std(), (3-data2.loc[:, 'Bedrooms'].mean())/data2.loc[:, 'Bedrooms'].std()]))
prediction3_normalized = theta2 * information3.T
prediction3 = prediction3_normalized * data2.loc[:, 'Price'].std() + data2.loc[:, 'Price'].mean()
print('a price prediction for a 1650-square-foot house with 3 bedrooms is ', prediction3)

Output:

293101.15341756

1.5 plot cost function with iterations

plt.figure()
ii = range(iterations2)
plt.plot(ii, j2, color='green')
plt.xlabel('Number of iterations')
plt.ylabel('Cost Function')
plt.show()

Output:
image

2. Normal equation

2.1 Normal equation function

def Normal_Equation(X, y):
    # closed-form solution: θ = (XᵀX)⁻¹ Xᵀ y
    theta = np.linalg.inv(X.T * X) * X.T * y
    return theta

2.2 main function

if __name__ == '__main__':
    data2.insert(0, 'Ones', 1)
    x3 = data2.loc[:, ['Ones', 'Size', 'Bedrooms']]
    x3 = np.matrix(x3.values)
    y3 = data2.loc[:, ['Price']]
    y3 = np.matrix(y3.values)
    theta3 = Normal_Equation(x3, y3)

Output:

[[89597.9095428 ]
[ 139.21067402]
[-8738.01911233]]

2.3 prediction

information4 = np.matrix(np.array([1, 1650, 3]))
prediction4 = theta3.T * information4.T
print('a price prediction for a 1650-square-foot house with 3 bedrooms is ', prediction4)

Output:

293081.4643349

This is close to the prediction obtained with gradient descent in section 1.4 (about 293,101), as expected.