2. Logistic Regression
Classification
- Representation:
y ∈ {0, 1}, where "0" refers to the "Negative Class" and "1" refers to the "Positive Class"
Start with classification problems that have just two classes, 0 and 1; multi-class problems are discussed later
- Problem:
Linear regression is rarely used for classification: its hypothesis can output values greater than 1 or less than 0 even though every label in the training set is 0 or 1, which is strange. Logistic regression, whose hypothesis always lies between 0 and 1, is therefore the algorithm usually applied, as the small sketch below illustrates
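A minimal sketch (not part of the original exercise) of this problem: fitting a least-squares line to 0/1 labels produces predictions outside [0, 1]; the data here are made up purely for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])  # toy feature values
y = np.array([0, 0, 0, 1, 1, 1])               # binary labels
X = np.c_[np.ones_like(x), x]                  # add the intercept column
theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least-squares fit
print(X @ theta)  # the first prediction is below 0 and the last is above 1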
Logistic Function
To keep the output of the hypothesis between 0 and 1, i.e. 0 ≤ h_θ(x) ≤ 1, we introduce a new function, the Sigmoid Function (Logistic Function):
g(z) = 1 / (1 + e^(-z))
Then the new hypothesis function is
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
Given a training set, we then pick a value for θ and use this hypothesis to make predictions
- Interpretation of Hypothesis Output
In classification problems, the output of the hypothesis is the estimated probability that y = 1 for the input x
Mathematical representation: h_θ(x) = P(y = 1 | x; θ); since y must be 0 or 1, P(y = 0 | x; θ) = 1 - h_θ(x) (a small sketch follows)
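A small sketch of this interpretation with made-up numbers (the helper name sigmoid is just for illustration):
import numpy as np

def sigmoid(z):
    # the sigmoid squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

theta = np.array([-3.0, 1.0])  # hypothetical parameters
x = np.array([1.0, 4.0])       # x_0 = 1 (intercept term), x_1 = 4
h = sigmoid(theta @ x)         # h_θ(x) = g(θ^T x)
print(h)                       # ≈ 0.73, read as P(y = 1 | x; θ) ≈ 73%, so P(y = 0 | x; θ) ≈ 27%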
Decision boundary (KEY)
- Question: what is the logistic regression hypothesis function actually computing?
- Concept: the decision boundary is a property of the hypothesis and its parameters θ; we use the training set to fit θ through the cost function, but once θ is fixed the boundary is determined by the hypothesis itself, not by the training set
- Function: it forms a line (straight for linear features; higher-order features give curved boundaries, as in the second exercise below) that separates the region where the hypothesis predicts "y = 1" from the region where it predicts "y = 0"
- Method:
Predict "y = 1" if h_θ(x) ≥ 0.5, which happens exactly when θ^T x ≥ 0; predict "y = 0" if h_θ(x) < 0.5, which happens when θ^T x < 0
- Example: a minimal sketch with a made-up θ follows
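A minimal sketch of the threshold rule with a hypothetical θ (values chosen only to illustrate the boundary θ^T x = 0):
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])      # hypothetical parameters
points = np.array([[1.0, 1.0, 1.0],     # θ^T x = -1  -> predict y = 0
                   [1.0, 2.0, 2.0]])    # θ^T x = +1  -> predict y = 1
print((points @ theta >= 0).astype(int))  # [0 1]; the boundary is the line x_1 + x_2 = 3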
Cost function
- m training examples: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}
- n features: x = [x_0, x_1, ..., x_n]^T with x_0 = 1, and y ∈ {0, 1}
- Hypothesis: h_θ(x) = 1 / (1 + e^(-θ^T x))
- Cost function: J(θ) = (1/m) Σ_{i=1..m} Cost(h_θ(x^(i)), y^(i))
To simplify the analysis, we extract the main part, the cost of a single example Cost(h_θ(x), y). If we keep the squared-error cost from linear regression, Cost(h_θ(x), y) = (1/2)(h_θ(x) - y)^2, then with the sigmoid hypothesis J(θ) is a non-convex function with many local optima.
Therefore, we need to come up with a new kind of cost function
- Logistic regression cost function:
The main part of the logistic regression cost function is the following, which comes from the principle of maximum likelihood estimation:
Cost(h_θ(x), y) = -log(h_θ(x)) if y = 1, and Cost(h_θ(x), y) = -log(1 - h_θ(x)) if y = 0
- Simplified cost function
The main part of the cost function can be compressed into a single expression:
Cost(h_θ(x), y) = -y log(h_θ(x)) - (1 - y) log(1 - h_θ(x))
Then the logistic regression cost function becomes
J(θ) = -(1/m) Σ_{i=1..m} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]
(a quick numeric check of the compressed form follows)
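A quick numeric check (illustrative only) that the compressed expression reproduces the two-case cost:
import numpy as np

h = 0.8  # some hypothesis output
for y in (1, 0):
    compressed = -y * np.log(h) - (1 - y) * np.log(1 - h)
    two_case = -np.log(h) if y == 1 else -np.log(1 - h)
    print(y, np.isclose(compressed, two_case))  # True for both cases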
How to choose parameters θ?
To fit the parameters θ we still look for the minimum of the cost function J(θ), and the method is still gradient descent, as in linear regression:
θ_j := θ_j - α (1/m) Σ_{i=1..m} (h_θ(x^(i)) - y^(i)) x_j^(i)   (simultaneously for all j)
This formula looks identical to the one for linear regression; the only difference is that the hypothesis h_θ(x) is now the sigmoid of θ^T x (a minimal sketch of the loop appears after the tips)
- Tips:
1. We can still plot the cost function against the number of iterations to check that gradient descent works correctly (i.e. converges)
2. We can still use feature scaling to make gradient descent converge more quickly
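A minimal sketch of the batch gradient-descent loop described above (the names are my own; the exercise code below uses a library optimizer instead):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    # X: (m, n+1) with a leading column of ones; y: (m,) vector of 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)          # hypothesis for all m examples
        gradient = (X.T @ (h - y)) / m  # (1/m) * Σ (h_θ(x^(i)) - y^(i)) x_j^(i)
        theta -= alpha * gradient       # simultaneous update of every θ_j
    return theta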
- Other optimization algorithms: conjugate gradient, BFGS, and L-BFGS; they usually converge faster than plain gradient descent and need no manual choice of the learning rate α, but they are more complex. In practice we call library implementations, e.g. the routines in scipy.optimize used in the exercises below
Multi-class classification: One-vs-all
- Method:
- Train a logistic regression classifier for each class i to predict the probability that y=i
- To make a prediction on a new input x, run all the classifiers and pick the class i whose classifier outputs the highest probability, i.e. the most confident (or most "enthusiastic") one; see the sketch below
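A tiny sketch of the prediction step with made-up per-class probabilities:
import numpy as np

probabilities = np.array([0.10, 0.85, 0.40])    # h_i(x) for classes 1, 2, 3
predicted_class = np.argmax(probabilities) + 1  # classes are numbered from 1
print(predicted_class)                          # 2, the most confident classifier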
Exercise by python
- Visualizing the data
# required imports for the exercises on this page
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt

# 1.1.1 get the data
path = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_2/ex2data1.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admission'])
# 1.1.2 classify the data
positive = data[data.Admission.isin([1])]
negative = data[data.Admission.isin([0])]
# 1.1.3 plot the data
plt.figure()
pos = plt.scatter(x=positive['Exam 1'], y=positive['Exam 2'], color='k', marker='+')
neg = plt.scatter(x=negative['Exam 1'], y=negative['Exam 2'], color='y', marker='o')
plt.legend([pos, neg], ['Admitted', 'Unadmitted'])
plt.xlabel('Exam 1')
plt.ylabel('Exam 2')
plt.show()
Output:
- Implementation
2.1 sigmoid function
def Sigmoid_Function(z):
    g = 1 / (1 + np.exp(-z))
    return g
2.2 cost function
def Cost_Function(theta, x, y):
    first = - y.T * np.log(Sigmoid_Function(theta * x.T)).T
    second = - (1 - y).T * np.log(1 - Sigmoid_Function(theta * x.T)).T
    J = (1 / len(x)) * np.sum(first + second)
    return J
2.3 gradient descent
- Only compute the partial part
def Gradient_Descent2(theta, x, y):
    temp = np.matrix(np.zeros(theta.shape))
    error = Sigmoid_Function(theta * x.T) - y.T
    for i in range(temp.shape[1]):
        partial_part = error * x[:, i]
        temp[0, i] = (1 / len(x)) * np.sum(partial_part)
    theta_partial = temp
    return theta_partial
2.4 main function
if __name__ == '__main__':
    data.insert(0, 'Ones', 1)
    x = data.loc[:, ['Ones', 'Exam 1', 'Exam 2']]
    x = np.matrix(x.values)
    y = data.loc[:, ['Admission']]
    y = np.matrix(y.values)
    theta = np.matrix(np.zeros(3))
    J0 = Cost_Function(theta, x, y)
    # do gradient descent with established function
    result = opt.fmin_tnc(func=Cost_Function, x0=theta, fprime=Gradient_Descent2, args=(x, y))
    theta2 = result[0]
    print('θ_fmin_tnc = ', theta2)
Output:
θ_fmin_tnc = [-25.16131862 0.20623159 0.20147149]
opt.fmin_tnc lives in the scipy.optimize library, which provides optimization algorithms; its usage is documented at https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_tnc.html
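For reference, the same fit can also be written with scipy.optimize.minimize (which the multi-class exercise below uses). This is only a sketch reusing Cost_Function and Gradient_Descent2 from above, with the gradient flattened to the 1-D shape minimize expects; its result should be close to the fmin_tnc one:
result_min = opt.minimize(fun=Cost_Function, x0=np.zeros(3), args=(x, y), method='TNC',
                          jac=lambda t, x_, y_: np.asarray(Gradient_Descent2(t, x_, y_)).ravel())
print(result_min.x)  # expected to be close to θ_fmin_tnc above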
2.5 prediction
information2 = np.matrix(np.array([1, 45, 85]))
prediction2 = Sigmoid_Function(theta2 * information2.T)
print('new prediction to [45, 85] is ', prediction2)
Output:
new prediction to [45, 85] is 0.77629062
2.6 plot the regression figure
xxx = np.linspace(30, 100, 100)
yyy = (theta2[0] + theta2[1] * xxx) / (-theta2[2])
plt.figure()
pos = plt.scatter(x=positive['Exam 1'], y=positive['Exam 2'], color='k', marker='+')
neg = plt.scatter(x=negative['Exam 1'], y=negative['Exam 2'], color='y', marker='o')
plt.legend([pos, neg], ['Admitted', 'Unadmitted'])
plt.plot(xxx, yyy, color='b')
plt.xlabel('Exam 1')
plt.ylabel('Exam 2')
plt.show()
Output:
2.7 check the accuracy
predictions = Sigmoid_Function(theta2 * x.T)
predictions_classification = np.matrix(np.zeros(predictions.shape))
for i in range(predictions.shape[1]):
    if predictions[0, i] >= 0.5:
        predictions_classification[0, i] = 1
    else:
        predictions_classification[0, i] = 0
correct = np.matrix(np.zeros(predictions_classification.shape))
j = 0
for i in range(predictions.shape[1]):
    if ((predictions_classification[0, i] == 1 and y[i, 0] == 1) or (predictions_classification[0, i] == 0 and y[i, 0] == 0)):
        correct[0, j] = 1
    else:
        correct[0, j] = 0
    j = j + 1
accuracy = (np.sum(correct, axis=1) / correct.shape[1]) * 100
print('accuracy = {0}%'.format(accuracy[0, 0]))
Output:
accuracy = 89.0%
Overfitting Problem
- Underfitting: the hypothesis doesn't fit the training set well (high bias)
- Overfitting: the hypothesis may pass through the training set very well, but it is a very wiggly curve that goes up and down all over the place (high variance)
- Cause: if we fit a high-order polynomial, the hypothesis can fit almost any function, but this space of possible hypotheses is too large and we don't have enough data to constrain it to give us a good hypothesis (a function of θ and x)
- Addressing: 1) reduce the number of features (select them manually or with a model-selection algorithm); 2) regularization (keep all the features but reduce the magnitude of the parameters θ)
Regularization
- Intuition:
Idea: small values for the parameters θ_0, θ_1, ..., θ_n give a simpler hypothesis that is less prone to overfitting
- Cost Function:
J(θ) = (1/(2m)) [ Σ_{i=1..m} (h_θ(x^(i)) - y^(i))^2 + λ Σ_{j=1..n} θ_j^2 ]
λ is the regularization parameter; it keeps the balance between fitting the training set well and keeping the parameters θ small, and therefore keeps the hypothesis relatively simple so as to avoid overfitting
If λ is set to an extremely large value, the penalty drives θ_1, ..., θ_n towards 0 and the hypothesis reduces to h_θ(x) = θ_0, which underfits
Regularized linear regression
Gradient descent
- The change to the cost function and gradient descent:
The cost function gains the penalty term (λ/(2m)) Σ_{j=1..n} θ_j^2 shown above, and the gradient-descent update becomes
θ_0 := θ_0 - α (1/m) Σ_{i=1..m} (h_θ(x^(i)) - y^(i)) x_0^(i)
θ_j := θ_j (1 - α λ/m) - α (1/m) Σ_{i=1..m} (h_θ(x^(i)) - y^(i)) x_j^(i)   (j = 1, ..., n)
so θ_0 is not regularized and every other θ_j is shrunk slightly on each step (a small sketch of one update step follows)
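A small sketch of one regularized update step (my own helper, not from the exercise); note that θ_0 is skipped by the penalty:
import numpy as np

def regularized_linear_step(theta, X, y, alpha, lam):
    # X: (m, n+1) with a leading column of ones; y: (m,); theta: (n+1,)
    m = len(y)
    error = X @ theta - y                  # h_θ(x^(i)) - y^(i) for every example
    gradient = (X.T @ error) / m           # unregularized gradient
    gradient[1:] += (lam / m) * theta[1:]  # add (λ/m) θ_j for j >= 1 only
    return theta - alpha * gradient        # simultaneous update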
Normal equation
- The change to the equation for θ:
θ = (X^T X + λL)^(-1) X^T y, where L is the (n+1)×(n+1) identity matrix with its top-left entry set to 0 so that θ_0 is not penalized
If λ > 0, the matrix X^T X + λL is invertible even with a large number of features and only a few training examples, a case where X^T X itself would be non-invertible (a numpy sketch follows)
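A numpy sketch of that equation (names are mine; X is assumed to already contain the column of ones):
import numpy as np

def normal_equation_regularized(X, y, lam):
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0  # do not penalize θ_0
    # solve (X^T X + λL) θ = X^T y instead of forming the inverse explicitly
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)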
Regularized logistic regression
Gradient descent
- The change to the cost function and gradient descent:
J(θ) = -(1/m) Σ_{i=1..m} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ] + (λ/(2m)) Σ_{j=1..n} θ_j^2
The update rule has the same form as in regularized linear regression (θ_0 is again not regularized); the only difference is that h_θ(x) is the sigmoid of θ^T x
Advanced optimization
- The regularized cost function and its gradient can also be handed to an advanced optimizer, as is done with scipy.optimize in the exercise below
Exercise by python
- Visualizing the data
path2 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_2/ex2data2.txt'
data2 = pd.read_csv(path2, header=None, names=['Microchip Test 1', 'Microchip Test 2', 'Decision'])
accepted = data2[data2['Decision'].isin([1])]
rejected = data2[data2['Decision'].isin([0])]
plt.figure()
accepted_plot = plt.scatter(x=accepted['Microchip Test 1'], y=accepted['Microchip Test 2'], color='k', marker='+')
rejected_plot = plt.scatter(x=rejected['Microchip Test 1'], y=rejected['Microchip Test 2'], color='y', marker='o')
plt.legend([accepted_plot, rejected_plot], ['Accepted', 'Rejected'])
plt.xlabel('Microchip Test 1')
plt.ylabel('Microchip Test 2')
plt.show()
Output:
- Feature mapping
degree = 6
features = np.ones((data2['Microchip Test 1'].shape))
for i in range(0, degree + 1):
    for j in range(0, degree + 1 - i):
        # 0:0-6, 1:0-5, 2:0-4, 3:0-3, 4:0-2, 5:0-1, 6:0
        features = np.c_[features, (np.power(data2['Microchip Test 1'], i) * np.power(data2['Microchip Test 2'], j))]
features = np.delete(features, 0, 1)
print(features.shape)
Output:
(118, 28) # 28 features
- Cost function and gradient
def Cost_Function_regularized(theta, x, y, lamda):
    theta = np.matrix(theta)
    x = np.matrix(x)
    y = np.matrix(y)
    first = - y.T * np.log(Sigmoid_Function(theta * x.T)).T
    second = - (1 - y).T * np.log(1 - Sigmoid_Function(theta * x.T)).T
    # note: this penalty also includes theta_0; by convention theta_0 is usually left unregularized, as in the gradient below
    J = np.sum(first + second) / len(x) + (lamda / (2 * len(x))) * np.sum(np.power(theta, 2))
    return J
def Gradient_Descent_regularized(theta, x, y, lamda):
    theta = np.matrix(theta)
    x = np.matrix(x)
    y = np.matrix(y)
    temp = np.matrix(np.zeros((theta.shape)))
    error = Sigmoid_Function(theta * x.T) - y.T
    for i in range(theta.shape[1]):
        if i == 0:
            temp[0, i] = (error * x[:, i]) / len(x)
        else:
            temp[0, i] = (error * x[:, i]) / len(x) + (lamda / len(x)) * theta[0, i]
    theta_partial = temp
    return theta_partial
- main function and get θ
if __name__ == '__main__':
    # organize the data set
    data2.insert(0, 'Ones', 1)
    x2 = data2.loc[:, ['Microchip Test 1', 'Microchip Test 2']]
    x2 = np.matrix(x2.values)
    y2 = data2.loc[:, ['Decision']]
    y2 = np.matrix(y2.values)
    # initialize θ
    theta_2 = np.matrix(np.zeros(features.shape[1]))
    # initialize λ
    lamda = 1
    # compute the initial cost function
    J = Cost_Function_regularized(theta_2, features, y2, lamda)
    print('J2_initial is ', J)
    # use gradient descent to find θ
    result_2 = opt.fmin_tnc(func=Cost_Function_regularized, x0=theta_2, fprime=Gradient_Descent_regularized, args=(features, y2, lamda))
    theta_2_get = result_2[0]
    print('θ_regularized = ', theta_2_get)
Output:
θ_regularized = [ 1.2544148 1.1924276 -1.36184295 -0.17096365 -1.18096092 -0.4683615
-0.93023136 0.62276762 -0.87290719 -0.35603633 -0.25080285 -0.27658998
-0.12073387 -2.00505533 -0.35536846 -0.61498651 -0.27187034 -0.32631648
0.12573809 -0.06683335 -0.06382345 0.00581066 -1.45784668 -0.20562899
-0.29695283 -0.22566868 0.01627581 -1.03247378]
- Plotting the decision boundary
def mapFeature(x1, x2, degree):
    z = np.matrix(np.ones(28))
    c = 0
    for i in range(degree + 1):
        for j in range(degree + 1 - i):
            z[0, c] = np.power(x1, i) * np.power(x2, j)
            c = c + 1
    return z
feature_x1 = np.linspace(-1, 1.5, 50)
feature_x2 = np.linspace(-1, 1.5, 50)
feature_z = np.matrix(np.zeros((len(feature_x1), len(feature_x2))))
for i in range(len(feature_x1)):
    for j in range(len(feature_x2)):
        feature_z[i, j] = np.matrix(theta_2_get) * mapFeature(feature_x1[i], feature_x2[j], degree).T
plt.figure()
accepted_plot = plt.scatter(x=accepted['Microchip Test 1'], y=accepted['Microchip Test 2'], color='k', marker='+')
rejected_plot = plt.scatter(x=rejected['Microchip Test 1'], y=rejected['Microchip Test 2'], color='y', marker='o')
plt.legend([accepted_plot, rejected_plot], ['Accepted', 'Rejected'])
plt.contour(feature_x1, feature_x2, feature_z, [0])
plt.xlabel('Microchip Test 1')
plt.ylabel('Microchip Test 2')
plt.show()
Output:
Multi-class classification exercise by python
- This exercise combines logistic regression, regularization, and multi-class classification
- read the prepared dataset
# additional imports for the multi-class exercise
from scipy.io import loadmat
from matplotlib import cm

def read_data():
    path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_3/ex3/ex3data1.mat'
    data1 = loadmat(path1)
    X = data1['X']
    y = data1['y']
    return X, y
- visualize the dataset
def visuial_data(X, y):
    rand_indices = np.random.choice(X.shape[0], 100)
    # print(rand_indices.shape)
    select = X[rand_indices, :]
    # print(select.shape)
    fig, ax = plt.subplots(nrows=10, ncols=10, sharex=True, sharey=True)
    # print(ax.shape)
    for r in range(10):
        for c in range(10):
            show = select[10*r+c, :].reshape((20, 20)).T
            ax[r, c].matshow(show, cmap=cm.binary)
    plt.xticks([])
    plt.yticks([])
    plt.show()
Output:
- vectorize the parameters in logistic regression
3.1 Sigmoid function
def sigmoidfunction(theta, X):
    theta = np.matrix(theta)
    X = np.matrix(X)
    f = 1 / (1 + np.exp(- X * theta.T))  # (5000,401)*(401,1)=(5000,1)
    return f
3.2 vectorize the cost function
def vector_costfunction(theta, X, y, learning_rate):
    # note: the argument named 'learning_rate' actually plays the role of the regularization parameter λ
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    error = - y.T * np.log(sigmoidfunction(theta, X)) - (1 - y).T * np.log(1 - sigmoidfunction(theta, X))  # (1,5000)*(5000,1)=(1,1)
    reg = (learning_rate / (2 * X.shape[0])) * np.sum(np.power(theta, 2))
    J = np.sum(error) / X.shape[0] + reg  # np.sum() here is used to transfer to 1 dimension
    return J
3.3 vectorize the gradient descent
def vector_gradientdescent(theta, X, y, learning_rate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    partial = np.matrix(np.zeros((theta.shape)))
    error = sigmoidfunction(theta, X) - y  # (5000,1)
    for i in range(X.shape[1]):
        error_multiply = error.T * X[:, i]  # (1,5000)*(5000,1)
        partial_part = np.sum(error_multiply) / X.shape[0]  # np.sum() here is used to transfer to 1 dimension
        if i == 0:
            partial[0, i] = partial_part
        else:
            reg_part = (learning_rate / X.shape[0]) * theta[0, i]
            partial[0, i] = partial_part + reg_part
    return partial.flatten().A[0]
- one-vs-all classification
def onevsall(X, y, num_labels, learning_rate):
    X_new = np.c_[np.zeros((X.shape[0], 1)), X]  # note: a column of ones is the usual intercept feature; with zeros here θ_0 stays at 0 and is effectively unused
    theta_new = np.matrix(np.zeros((num_labels, X_new.shape[1])))
    # for each number, compute its related parameters θ(1 row)
    for i in range(1, num_labels+1):
        theta_initial = np.matrix(np.zeros(X_new.shape[1]))
        y_i = np.matrix(np.zeros((y.shape[0], 1)))
        for j in range(y.shape[0]):
            if y[j, 0] == i:
                y_i[j, 0] = 1
            else:
                y_i[j, 0] = 0
        # y_i:(5000,1); X_new:(5000, 401), theta_initial:(1,401)
        result = opt.minimize(fun=vector_costfunction, x0=theta_initial, args=(X_new, y_i, learning_rate), method='CG', jac=vector_gradientdescent)
        theta_new[i-1, :] = result.x
    return theta_new
- one-vs-all prediction
def predictionfunction(theta, X):
    X_test = np.c_[np.zeros((X.shape[0], 1)), X]  # the added column must match the one used in onevsall
    # compute the possibility for each item
    possibility = sigmoidfunction(theta, X_test)  # (5000,401)*(401,10)=(5000,10)
    prediction_compute = np.argmax(possibility, axis=1) + 1  # (5000,)
    return prediction_compute
- main function
if __name__ == '__main__':
    # number of labels
    num_labels = 10  # label: 1-10
    # read dataset from file
    X, y = read_data()
    # visualize dataset
    visuial_data(X, y)
    # train the logistic regression classifiers
    learning_rate = 0.1  # regularization parameter λ (named learning_rate in the functions above)
    theta = onevsall(X, y, num_labels, learning_rate)
    print('θ_optimum = ', theta)
    # test for prediction
    prediction = predictionfunction(theta, X)
    # compute the accuracy
    correct = np.matrix(np.zeros((y.shape[0], 1)))
    for i in range(y.shape[0]):
        if prediction[i] == y[i, 0]:
            correct[i, 0] = 1
        else:
            correct[i, 0] = 0
    accuracy = (np.sum(correct) / correct.shape[0]) * 100
    print('accuracy is {0}%'.format(accuracy))
Output:
θ_optimum = [[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 7.01043611e-03
5.33357475e-08 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 1.31038512e-02
-1.44559918e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -2.65840583e-05
-3.40642117e-07 0.00000000e+00]
...
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -1.28736982e-02
1.35831831e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -3.44641313e-02
3.31666884e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -2.59495789e-04
6.17074441e-06 0.00000000e+00]]
accuracy is 95.88%