1. Linear Regression - ZYL-Harry/Machine_Learning_study GitHub Wiki
Linear regression with one variable
Process of a supervised learning algorithm
- given a training set
- feed the training set to the learning algorithm
- from the learning algorithm, get an output function h (the hypothesis)
- with the hypothesis and an input, get an output (the estimated value)
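The process above can be sketched in a few lines. This is only an illustration: the function name `learn` and the toy training set are made up here, and `np.polyfit` stands in for the learning algorithm (the notes themselves use gradient descent later on).

```python
import numpy as np

def learn(x_train, y_train):
    """Hypothetical learning algorithm: least-squares straight-line fit."""
    # np.polyfit returns coefficients from the highest degree down.
    slope, intercept = np.polyfit(x_train, y_train, deg=1)

    def h(x):
        # The hypothesis maps an input to an estimated output.
        return intercept + slope * x

    return h

# Toy training set that lies exactly on y = 1 + 2x.
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

h = learn(x_train, y_train)   # training set -> learning algorithm -> h
estimate = h(5.0)             # hypothesis + input -> estimated value
print(estimate)
```

The returned `h` is an ordinary function, which mirrors the idea that the learning algorithm outputs a hypothesis that is then applied to new inputs.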
Start with fitting linear functions, and then build on this to more complex models and more complex learning algorithms
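How to represent the hypothesis
Hypothesis: h_θ(x) = θ_0 + θ_1·x, used to make predictions
How to choose θ
Method: find the values of θ_0 and θ_1 that minimize the cost function
Cost function (squared error function): J(θ_0, θ_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)})^2, where m is the number of training examples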
Cost function
To better visualize the cost function, we do this:
Consider a simplified hypothesis with a single parameter θ, together with its cost function:
Each value of θ corresponds to a different hypothesis and a different value of the cost function
It is then easy to find the value of θ that minimizes the cost function, which gives the best-fitting hypothesis
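To make the single-parameter picture concrete, the sketch below evaluates the squared error cost for a range of θ values and picks the smallest. It assumes the simplified hypothesis has the form h(x) = θ·x and uses a made-up three-point data set; neither is taken from the original figures.

```python
import numpy as np

# Toy data lying exactly on y = x, so the best theta should be 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(theta, x, y):
    """Squared error cost J(theta) for the assumed hypothesis h(x) = theta * x."""
    m = len(x)
    return np.sum((theta * x - y) ** 2) / (2 * m)

# Each candidate theta gives a different hypothesis and a different cost.
thetas = np.linspace(-1.0, 3.0, 41)
costs = np.array([cost(t, x, y) for t in thetas])

best_theta = thetas[np.argmin(costs)]
print(best_theta, costs.min())
```

Plotting `costs` against `thetas` would give the bowl-shaped curve that the notes describe.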
With two parameters θ, the cost function becomes a 3-dimensional surface plot of a convex (bowl-shaped) function:
To better visualize it, the contour figure is often used:
Gradient descent
- Function: minimizing the cost function
- Procedure:
- Find an initial point to start
- Take a step in the direction of steepest descent
- Repeat step 2 until it converges to a local optimum
- Mathematical expression: repeat until convergence: θ_j := θ_j - α · ∂J(θ_0, θ_1)/∂θ_j (for j = 0 and j = 1)
- α is called the learning rate. It controls how big a step we take downhill when updating the parameter θ_j, and it is always a positive number
Q: If the parameter θ is already at a local minimum, what will one step of gradient descent do?
A: The derivative is zero at a local minimum, so the update leaves θ unchanged and the solution stays at the local optimum.
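- Gradient descent => simultaneous update: compute the new values of all θ_j from the current values, then assign them together
- The derivative term gives the slope of the cost function at the current θ, so subtracting α times it moves θ downhill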
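The simultaneous update can be illustrated on a small convex function. The function J(θ_0, θ_1) = (θ_0 - 3)^2 + (θ_1 + 1)^2 and the names `grad_J` and `gradient_step` are assumptions chosen for this sketch, not part of the original notes.

```python
def grad_J(theta0, theta1):
    """Partial derivatives of the illustrative J = (theta0 - 3)^2 + (theta1 + 1)^2."""
    return 2.0 * (theta0 - 3.0), 2.0 * (theta1 + 1.0)

def gradient_step(theta0, theta1, alpha):
    # Both derivatives are evaluated at the OLD parameter values...
    d0, d1 = grad_J(theta0, theta1)
    temp0 = theta0 - alpha * d0
    temp1 = theta1 - alpha * d1
    # ...and only then are both parameters replaced together.
    return temp0, temp1

theta0, theta1 = 0.0, 0.0   # initial point
alpha = 0.1                 # learning rate (positive)
for _ in range(200):
    theta0, theta1 = gradient_step(theta0, theta1, alpha)

# At the minimum the derivative is zero, so one more step changes nothing.
at_minimum = gradient_step(3.0, -1.0, alpha)
print(theta0, theta1, at_minimum)
```

Updating `theta0` first and then using the new value inside the derivative for `theta1` would be a different algorithm, which is why the temporary variables are used.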
Gradient descent for linear regression
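Next, combine gradient descent with the squared error cost function to get an algorithm for linear regression that fits a straight line to the data
- Key: the derivative terms: ∂J/∂θ_0 = \frac{1}{m} \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) and ∂J/∂θ_1 = \frac{1}{m} \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) · x^{(i)}
Then, update the parameters θ_0 and θ_1 simultaneously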
The process in the contour figure is shown:
Tips:
- "Batch" gradient descent: each step of gradient descent uses all the training examples
- Gradient descent is the better choice for problems with a large data set
- There are other methods for finding the minimum of the cost function, such as the normal equation
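A compact sketch of batch gradient descent for a straight-line fit follows. It uses plain NumPy arrays and a made-up data set on the line y = 1 + 2x; the learning rate and iteration count are illustrative choices, not values from the exercise.

```python
import numpy as np

# Toy data lying exactly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
m = len(x)

theta0, theta1 = 0.0, 0.0
alpha = 0.1

for _ in range(5000):
    # "Batch": the error is computed over ALL training examples each step.
    error = theta0 + theta1 * x - y
    grad0 = error.sum() / m          # derivative with respect to theta0
    grad1 = (error * x).sum() / m    # derivative with respect to theta1
    # Tuple assignment keeps the update simultaneous.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)
```

On this data the parameters approach the intercept 1 and slope 2 that generated the points.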
Exercise in Python
1. Create a 5×5 identity matrix
import numpy as np
A = np.eye(5)
print(A)
Output:
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
2. Linear regression with one variable
2.1 Plotting the data
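import pandas as pd
import matplotlib.pyplot as plt
# path is the location of the exercise data file
data = pd.read_csv(path, header=None, names=['Population', 'Profit'])
# DataFrame.plot creates its own figure, so a separate plt.figure() call is not needed
data.plot(kind='scatter', x='Population', y='Profit', color='red', marker='x', figsize=(10,10), xlabel='Population of City in 10,000s', ylabel='Profit in $10,000s')
plt.show()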
Output:
2.2 main function
if __name__ == '__main__':
    # add an additional first column to X and set it to all ones
    data.insert(0, 'Ones', 1)
    # take the value of x[ones,population] and y[profit]
    x = data.loc[:, ['Ones', 'Population']]
    x = np.matrix(x.values)
    y = data.loc[:, ['Profit']]
    y = np.matrix(y.values)
    # initialize theta as a 1x2 row of floating-point zeros
    theta = np.matrix(np.zeros((1, 2)))
    # compute the cost function at the initial theta
    J = Cost_Function(x, y, theta)
    # gradient descent
    alpha = 0.01
    iterations = 1500
    j = np.zeros((iterations, 1))
    theta_all = theta
    for i in range(iterations):
        # record the cost before each update
        j[i] = Cost_Function(x, y, theta)
        # update theta
        theta = Update_theta(x, y, theta, alpha)
        # np.vstack replaces np.row_stack, which is deprecated in recent NumPy releases
        theta_all = np.vstack((theta_all, theta))
    # drop the initial theta so that one row remains per iteration
    theta_iterations = np.delete(theta_all, 0, 0)
2.3 Cost Function
def Cost_Function(x, y, theta):
    # theta is a 1x2 np.matrix, x is m x 2, y is m x 1
    inner = np.power(((theta * x.T) - y.T), 2)
    return np.sum(inner) / (2 * len(x))
2.4 Update θs
def Update_theta(x, y, theta, alpha):
    # 1 x m row of prediction errors
    error = (theta * x.T) - y.T
    error_sum0 = np.sum(error) / len(x)
    # matrix product of (1 x m) and (m x 1) sums error * x over all examples
    error_sum1 = np.sum(error * x[:, 1]) / len(x)
    # use a temporary matrix so that both parameters are updated simultaneously
    temp = np.matrix(np.zeros(theta.shape))
    temp[0, 0] = theta[0, 0] - alpha * error_sum0
    temp[0, 1] = theta[0, 1] - alpha * error_sum1
    theta = temp
    return theta
Output:
2.5 plot the regression figure
xx = np.linspace(data.Population.min(), data.Population.max(), 100)
f = theta[0, 0] + theta[0, 1] * xx
# DataFrame.plot returns the Axes it draws on, so the fitted line can be added to the same figure
ax = data.plot(kind='scatter', x='Population', y='Profit', color='red', marker='x', figsize=(10, 10), xlabel='Population of City in 10,000s', ylabel='Profit in $10,000s')
ax.plot(xx, f, color='blue')
plt.show()
Output:
2.6 visualize the cost function
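fig = plt.figure()
ax = plt.axes(projection='3d')
# build the grid from the θ values visited by gradient descent
# (iterations x iterations points, so the double loop below is slow)
t0 = np.asarray(theta_iterations[:, 0]).ravel()
t1 = np.asarray(theta_iterations[:, 1]).ravel()
theta0, theta1 = np.meshgrid(t0, t1)
j_plot = np.zeros((iterations, iterations))
# r and c are used as loop indices so the cost history j is not overwritten
for r in range(iterations):
    for c in range(iterations):
        theta_temp = np.matrix([theta0[r, c], theta1[r, c]])
        j_plot[r, c] = Cost_Function(x, y, theta_temp)
ax.plot_surface(theta0, theta1, j_plot, cmap='rainbow')
ax.set_xlabel('θ0')
ax.set_ylabel('θ1')
ax.set_zlabel('J')
plt.show()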
Output:
2.7 plot the contour figure
plt.figure()
plt.contour(theta0, theta1, j_plot)
plt.plot(theta[0, 0], theta[0, 1], color='red', marker='x')
plt.xlabel('θ0')
plt.ylabel('θ1')
plt.show()
Output:
2.8 predictions
information1 = np.matrix(np.array([1, 3.5]))
prediction1 = theta * information1.T
print(prediction1)
information2 = np.matrix(np.array([1, 7]))
prediction2 = theta * information2.T
print(prediction2)
Output:
Linear regression with multiple variables
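New type of hypothesis
h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n
For convenience of notation, define x_0 = 1, so that x = [x_0, x_1, …, x_n]^{T}, θ = [θ_0, θ_1, …, θ_n]^{T}, and h_θ(x) = θ^{T}x
Gradient descent for linear regression with multiple variables
- Cost function: J(θ) = \frac{1}{2m} \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)})^2
- The gradient descent update is: θ_j := θ_j - α · \frac{1}{m} \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) · x_j^{(i)}
- To sum up: repeat this update until convergence, updating every θ_j (j = 0, 1, …, n) simultaneously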
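With any number of features, the update for all θ_j can be written as one matrix expression, θ := θ - (α/m)·X^{T}(Xθ - y). The sketch below is one possible vectorized version; the function name `gradient_descent`, the toy data, and the hyperparameters are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression.

    X is the m x (n+1) design matrix whose first column is all ones,
    y is a length-m vector. Returns theta and the cost at every iteration.
    """
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(iterations):
        error = X @ theta - y                      # h(x) - y for every example
        history.append(error @ error / (2 * m))    # J(theta) before the update
        theta = theta - (alpha / m) * (X.T @ error)  # all theta_j updated at once
    return theta, np.array(history)

# Toy data generated from y = 1 + 2*x1 - 3*x2 (no noise).
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [1, 2, 1]], dtype=float)
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta

theta, history = gradient_descent(X, y, alpha=0.1, iterations=5000)
print(theta, history[0], history[-1])
```

Because the whole parameter vector is replaced in a single assignment, the simultaneous update happens automatically in this form.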
Feature Scaling
Idea: make sure the features are on a similar scale, so that gradient descent converges more quickly
- If the features differ greatly in scale, the contours of the cost function are tall, skinny ovals; gradient descent may then oscillate back and forth and take a long time to converge to the optimum
- With feature scaling applied, the contours look more like circles, and gradient descent can take a much more direct path to the optimum
Common methods:
- dividing by the maximum value, to bring every feature into approximately the range -1 ≤ x ≤ 1
- mean normalization: subtract the mean first, then divide by the range of the feature values (maximum minus minimum), so that the features have approximately zero mean
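Both scaling methods are a single line each in NumPy. The feature matrix below is made up for illustration (column 0 plays the role of a house size, column 1 a bedroom count).

```python
import numpy as np

# Hypothetical features: column 0 = size, column 1 = number of bedrooms.
features = np.array([[2000.0, 3.0],
                     [1000.0, 2.0],
                     [3000.0, 4.0],
                     [2000.0, 3.0]])

# Method 1: divide each column by its largest absolute value.
max_scaled = features / np.abs(features).max(axis=0)

# Method 2: mean normalization, (x - mean) / (max - min) per column.
value_range = features.max(axis=0) - features.min(axis=0)
mean_normalized = (features - features.mean(axis=0)) / value_range

print(max_scaled)
print(mean_normalized)
```

The exercise code later in these notes divides by the standard deviation instead of the range; both choices bring the features onto a comparable scale.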
Tips:
1. Detecting convergence: observe the cost function, or use an automatic convergence test that declares convergence if J(θ) decreases by less than 10^{-3} in one iteration
2. Making sure gradient descent is working correctly: observe the cost function
If the cost function is increasing, the most likely cause is that the learning rate is too large
- For sufficiently small α, J(θ) should decrease on every iteration
- If α is too small, gradient descent can be slow to converge
- If α is too large, J(θ) may not decrease on every iteration and may not converge
Therefore, plot J(θ) for several values of α, and pick the α that makes J(θ) decrease most rapidly
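The effect of the learning rate can be checked numerically before plotting anything. The sketch below records J(θ) for a few values of α on a made-up data set; the helper name `cost_history` and the candidate α values are assumptions for this example.

```python
import numpy as np

def cost_history(X, y, alpha, iterations):
    """Run batch gradient descent and return J(theta) at every iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = np.empty(iterations)
    for i in range(iterations):
        error = X @ theta - y
        history[i] = error @ error / (2 * m)
        theta -= (alpha / m) * (X.T @ error)
    return history

# Toy data on the line y = 1 + 2x, with a leading column of ones.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

histories = {alpha: cost_history(X, y, alpha, 50)
             for alpha in (0.001, 0.01, 0.1, 1.0)}

for alpha, history in histories.items():
    # Compare the first and last cost for each learning rate.
    print(alpha, history[0], history[-1])
```

On this data the smallest α barely reduces the cost in 50 iterations, while the largest makes the cost grow, which matches the three cases listed above. Passing each history to `plt.plot` gives the comparison plot the notes recommend.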
Create new features
Sometimes we can combine existing features into a new one that captures the key quantity and gives a better model
Polynomial regression
Sometimes a straight line can't fit the data set well; we can then use a polynomial function as the hypothesis to get a better model
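Polynomial regression is still linear regression once the powers of x are treated as new features. The sketch below fits a line and a quadratic to made-up quadratic data; it uses `np.linalg.lstsq` as the solver for brevity, which is a substitution for the gradient descent used elsewhere in these notes.

```python
import numpy as np

# Toy data generated from y = 1 - 2x + 0.5x^2 (no noise).
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 2.0 * x + 0.5 * x ** 2

# Design matrices: straight line vs. line plus the new feature x^2.
X_line = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x ** 2])

theta_line, *_ = np.linalg.lstsq(X_line, y, rcond=None)
theta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

def cost(X, y, theta):
    """Squared error cost J(theta)."""
    error = X @ theta - y
    return error @ error / (2 * len(y))

cost_line = cost(X_line, y, theta_line)
cost_quad = cost(X_quad, y, theta_quad)
print(cost_line, cost_quad)
```

Since x and x^2 can differ a lot in magnitude, feature scaling matters more when polynomial features are combined with gradient descent.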
Normal equation
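- Function: a method to solve for θ analytically
- Example:
for m examples (x1,y1), ……, (xm,ym) with n features, build the m × (n+1) design matrix X (one row per example, starting with a 1) and the vector y of target values
- find the minimum of J(θ): θ = (X^{T}X)^{-1}X^{T}y
This result comes from the method of least squares, which is often used for systems of equations that have no exact solution, and it can be derived using projection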
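A minimal sketch of the normal equation on made-up data follows. The function name `normal_equation` is hypothetical, and `np.linalg.solve` is used on (X^{T}X)θ = X^{T}y instead of forming the inverse explicitly, which gives the same θ when X^{T}X is invertible.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Design matrix with a leading column of ones; data lie on y = 1 + 2x.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = normal_equation(X, y)
print(theta)
```

No learning rate or iteration count is needed, and feature scaling is not required for this method.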
Gradient descent VS Normal equation
Normal equation and non-invertibility
Issue: X^{T}X may be non-invertible
Situations:
- redundant features (linearly dependent): delete the repeated ones
- too many features (e.g. m ≤ n): delete some features, or use regularization
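The redundant-feature case can be detected with a rank check. In the sketch below the third column is exactly twice the second, a made-up example of linear dependence; `np.linalg.matrix_rank` reports the loss of rank, and deleting the repeated column makes the normal equation solvable again.

```python
import numpy as np

# Column 2 is exactly 2 * column 1, so the features are linearly dependent.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])   # y = 1 + 2 * x1

rank = np.linalg.matrix_rank(X)
has_redundant_features = rank < X.shape[1]

# Delete the repeated feature, then X^T X is invertible again.
X_reduced = X[:, :2]
theta_reduced = np.linalg.solve(X_reduced.T @ X_reduced, X_reduced.T @ y)

print(rank, has_redundant_features, theta_reduced)
```

When the rank is lower than the number of columns, X^{T}X has no inverse and a plain `np.linalg.inv` call is not reliable.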
Exercise in Python
- gradient descent
1.1 plot the data set
path2 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_1/ex1data2.txt'
data2 = pd.read_csv(path2, header=None, names=['Size', 'Bedrooms', 'Price'])
plt.figure()
ax2 = plt.axes(projection='3d')
ax2.scatter(data2.loc[:, 'Size'], data2.loc[:, 'Bedrooms'], data2.loc[:, 'Price'], color='red', marker='x')
ax2.set_xlabel('the size of the house (in square feet)')
ax2.set_ylabel('the number of bedrooms')
ax2.set_zlabel('the price of the house')
plt.show()
Output:
1.2 Feature Normalization
data2_normalized = (data2 - data2.mean()) / data2.std()
Output:
Ones Size Bedrooms Price
0 1 0.130010 -0.223675 0.475747
1 1 -0.504190 -0.223675 -0.084074
2 1 0.502476 -0.223675 0.228626
3 1 -0.735723 -1.537767 -0.867025
4 1 1.257476 1.090417 1.595389
5 1 -0.019732 1.090417 -0.323998
6 1 -0.587240 -0.223675 -0.204036
7 1 -0.721881 -0.223675 -1.130948
8 1 -0.781023 -0.223675 -1.026973
9 1 -0.637573 -0.223675 -0.783051
10 1 -0.076357 1.090417 -0.803053
11 1 -0.000857 -0.223675 0.052682
12 1 -0.139273 -0.223675 -0.083283
13 1 3.117292 2.404508 2.874981
14 1 -0.921956 -0.223675 -0.643896
15 1 0.376643 1.090417 0.875619
16 1 -0.856523 -1.537767 -0.323998
17 1 -0.962223 -0.223675 -1.123743
18 1 0.765468 1.090417 1.276275
19 1 1.296484 1.090417 2.068039
20 1 -0.294048 -0.223675 -0.699878
21 1 -0.141790 -1.537767 -0.683083
22 1 -0.499157 -0.223675 -0.779852
23 1 -0.048673 1.090417 -0.643896
24 1 2.377392 -0.223675 1.867303
25 1 -1.133356 -0.223675 -0.723870
26 1 -0.682873 -0.223675 0.992382
27 1 0.661026 -0.223675 1.028370
28 1 0.250810 -0.223675 1.076355
29 1 0.800701 -0.223675 -0.323998
30 1 -0.203448 -1.537767 0.075875
31 1 -1.259189 -2.851859 -1.363666
32 1 0.049477 1.090417 -0.204036
33 1 1.429868 -0.223675 1.915287
34 1 -0.238682 1.090417 -0.435962
35 1 -0.709298 -0.223675 -0.723870
36 1 -0.958448 -0.223675 -0.883819
37 1 0.165243 1.090417 0.036687
38 1 2.786350 1.090417 1.668166
39 1 0.202993 1.090417 -0.427165
40 1 -0.423657 -1.537767 0.224627
41 1 0.298626 -0.223675 -0.084074
42 1 0.712618 1.090417 -0.211234
43 1 -1.007523 -0.223675 -0.331196
44 1 -1.445423 -1.537767 -1.283692
45 1 -0.187090 1.090417 -0.323998
46 1 -1.003748 -0.223675 -0.807044
1.3 Gradient Descent
def Cost_Function2(x, y, theta):
    # theta is a 1x3 np.matrix, x is m x 3, y is m x 1
    inner = np.power(((theta * x.T) - y.T), 2)
    return np.sum(inner) / (2 * len(x))

def Update_theta2(x, y, theta, alpha):
    # 1 x m row of prediction errors
    error = (theta * x.T) - y.T
    error_sum0 = np.sum(error) / len(x)
    error_sum1 = np.sum(error * x[:, 1]) / len(x)
    error_sum2 = np.sum(error * x[:, 2]) / len(x)
    # use a temporary matrix so that all parameters are updated simultaneously
    temp = np.matrix(np.zeros(theta.shape))
    temp[0, 0] = theta[0, 0] - alpha * error_sum0
    temp[0, 1] = theta[0, 1] - alpha * error_sum1
    temp[0, 2] = theta[0, 2] - alpha * error_sum2
    theta = temp
    return theta
if __name__ == '__main__':
    # add an additional first column to X and set it to all ones
    data2_normalized.insert(0, 'Ones', 1)
    # take the value of x[ones, size, bedrooms] and y[price]
    x2 = data2_normalized.loc[:, ['Ones', 'Size', 'Bedrooms']]
    x2 = np.matrix(x2.values)
    y2 = data2_normalized.loc[:, ['Price']]
    y2 = np.matrix(y2.values)
    # initialize theta as a 1x3 row of floating-point zeros
    theta2 = np.matrix(np.zeros((1, 3)))
    # compute the cost function at the initial theta
    J2 = Cost_Function2(x2, y2, theta2)
    # gradient descent
    alpha2 = 0.01
    iterations2 = 1500
    j2 = np.zeros((iterations2, 1))
    theta_all2 = theta2
    for i in range(iterations2):
        # record the cost before each update
        j2[i] = Cost_Function2(x2, y2, theta2)
        # update theta
        theta2 = Update_theta2(x2, y2, theta2, alpha2)
        # np.vstack replaces np.row_stack, which is deprecated in recent NumPy releases
        theta_all2 = np.vstack((theta_all2, theta2))
    # drop the initial theta so that one row remains per iteration
    theta_iterations = np.delete(theta_all2, 0, 0)
Output:
1.4 prediction
information3 = np.matrix(np.array([1, (1650-data2.loc[:, 'Size'].mean())/data2.loc[:, 'Size'].std(), (3-data2.loc[:, 'Bedrooms'].mean())/data2.loc[:, 'Bedrooms'].std()]))
prediction3_normalized = theta2 * information3.T
prediction3 = prediction3_normalized * data2.loc[:, 'Price'].std() +data2.loc[:, 'Price'].mean()
print('a price prediction for a 1650-square-foot house with 3 bedrooms is ', prediction3)
Output:
1.5 plot cost function with iterations
plt.figure()
ii = range(iterations2)
plt.plot(ii, j2, color='green')
plt.xlabel('Number of iterations')
plt.ylabel('Cost Function')
plt.show()
Output:
- normal equation
2.1 normal equation function
def Normal_Equation(X, y):
    # X and y are np.matrix objects, so * is matrix multiplication
    theta = np.linalg.inv(X.T * X) * X.T * y
    return theta
2.2 main function
if __name__ == '__main__':
    # the normal equation uses the raw (unnormalized) features
    data2.insert(0, 'Ones', 1)
    x3 = data2.loc[:, ['Ones', 'Size', 'Bedrooms']]
    x3 = np.matrix(x3.values)
    y3 = data2.loc[:, ['Price']]
    y3 = np.matrix(y3.values)
    theta3 = Normal_Equation(x3, y3)
Output:
[[89597.9095428 ]
[ 139.21067402]
[-8738.01911233]]
2.3 prediction
information4 = np.matrix(np.array([1, 1650, 3]))
prediction4 = theta3.T * information4.T
print('a price prediction for a 1650-square-foot house with 3 bedrooms is ', prediction4)
Output: