2. Logistic Regression

Classification

image

  • Representation:
    y∈{0,1}, where "0" refers to the "Negative Class" and "1" refers to the "Positive Class"

We start with classification problems that have just two classes, 0 and 1; multi-class problems are discussed later

  • Problem:
    image
    Therefore, linear regression is rarely used for classification problems: its hypothesis can output values greater than 1 or less than 0 even though the training labels are all 0 or 1, which is strange. Logistic regression, whose hypothesis always lies between 0 and 1, is applied instead as a classification algorithm

Logistic Function

To keep the output of the hypothesis between 0 and 1, i.e. 0≤h_θ(x)≤1, we introduce a new function, the Sigmoid Function (Logistic Function):
image
Then, the new hypothesis function:
image
image
Then, given a training set, we need to pick a value for θ and use the hypothesis to make predictions

  • Interpretation of Hypothesis Output
    In classification problems, the output of the hypothesis is the estimated probability that y=1 for input x
    Mathematical representation (restated in standard notation below):
    image
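For reference, the standard form of the sigmoid, the hypothesis, and the probabilistic interpretation shown in the images above is:

g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

h_\theta(x) = P(y = 1 \mid x; \theta), \qquad P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)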

Decision boundary(KEY)

  • Question: what is the logistic regression hypothesis function computing?
  • Concept: the decision boundary is a property of the hypothesis itself. We use the training set to fit the parameters θ through the cost function, but once θ is fixed, the boundary is determined by the hypothesis, not by the training set
  • Function:
    it forms a line (or, more generally, a surface) that separates the region where the hypothesis predicts "y=1" from the region where it predicts "y=0"
  • Method:
    Predict "y=1" if h_θ(x)≥0.5, which happens exactly when θᵀx≥0; predict "y=0" if h_θ(x)<0.5, which happens exactly when θᵀx<0
    So the decision boundary is the set of points where θᵀx=0 (a concrete worked case follows this list)
  • Example:
    image
    image
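As a concrete illustration (the parameter values here are a typical textbook choice, not necessarily the ones in the images above): with h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2) and \theta = [-3, 1, 1]^T, the hypothesis predicts y = 1 exactly when

\theta^T x = -3 + x_1 + x_2 \ge 0 \quad\Longleftrightarrow\quad x_1 + x_2 \ge 3

so the decision boundary is the straight line x_1 + x_2 = 3; higher-order polynomial features give curved boundaries in the same way.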

Cost function

  • m training examples:
    image

  • n features:
    image

  • Hypothesis:
    image

  • Cost function:
    image
    To simplify the analysis, we extract the main part (the per-example cost term):
    image
    But if we plug the sigmoid hypothesis into this squared-error cost, J(θ) becomes a non-convex function:
    image
    Therefore, we need to come up with a new kind of cost function

  • Logistic regression cost function:
    The main part of the logistic regression cost function is the following, which comes from the principle of maximum likelihood estimation:
    image
    image
    image
    image

  • Simplified cost function
    The per-example cost is combined into a single, more compact expression:
    image
    Then the logistic regression cost function becomes (restated below this list)
    image
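Written out, the per-example cost, its compact form, and the full logistic regression cost function are:

\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}

\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]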

How to choose parameters θ?

To fit the parameters θ, we need to find the minimum of the cost function J(θ); the method is still gradient descent, just as in linear regression
image
From the formula we can see that the update rule looks identical to the one for linear regression; the difference is that the hypothesis function h_θ(x) is now the sigmoid of θᵀx
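Written out, the update (performed simultaneously for every j) is:

\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}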

  • Tips:
    1. We can still plot the cost function J(θ) against the number of iterations to check that gradient descent is working correctly (converging)
    2. We can still use feature scaling to make gradient descent converge more quickly

  • Other optimization algorithms:
    image

Multi-class classification: One-vs-all

image

  • Method:
  1. Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y=i
  2. On a new input x, pick the class i whose classifier outputs the largest value, i.e. the most confident one (the formula below this list restates it)
    image
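In formula form, with one classifier h_\theta^{(i)}(x) trained per class:

h_\theta^{(i)}(x) = P(y = i \mid x; \theta), \qquad \hat{y} = \arg\max_i \, h_\theta^{(i)}(x)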

Exercise by python

  1. Visualizing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt

# 1.1.1 get the data
path = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_2/ex2data1.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admission'])
# 1.1.2 classify the data
positive = data[data.Admission.isin([1])]
negative = data[data.Admission.isin([0])]
# 1.1.3 plot the data
plt.figure()
pos = plt.scatter(x=positive['Exam 1'], y=positive['Exam 2'], color='k', marker='+')
neg = plt.scatter(x=negative['Exam 1'], y=negative['Exam 2'], color='y', marker='o')
plt.legend([pos, neg], ['Admitted', 'Unadmitted'])
plt.xlabel('Exam 1')
plt.ylabel('Exam 2')
plt.show()

Output:
image

  2. Implementation

2.1 sigmoid function
def Sigmoid_Function(z):
    g = 1 / (1 + np.exp(-z))
    return g

2.2 cost function

def Cost_Function(theta, x, y):
    # theta: (1,3), x: (m,3), y: (m,1); Sigmoid_Function(theta * x.T): (1,m)
    first = - y.T * np.log(Sigmoid_Function(theta * x.T)).T
    second = - (1 - y).T * np.log(1 - Sigmoid_Function(theta * x.T)).T
    J = (1 / len(x)) * np.sum(first + second)
    return J

2.3 gradient descent

  • Only compute the partial derivatives (the gradient); the parameter updates themselves are handled by the optimizer in the main function
def Gradient_Descent2(theta, x, y):
    temp = np.matrix(np.zeros(theta.shape))
    error = Sigmoid_Function(theta * x.T) - y.T   # (1,m)
    for i in range(temp.shape[1]):
        partial_part = error * x[:, i]   # (1,m)*(m,1) = (1,1)
        temp[0, i] = (1 / len(x)) * np.sum(partial_part)
    theta_partial = temp
    return theta_partial

2.4 main function

if __name__ == '__main__':
    data.insert(0, 'Ones', 1)
    x = data.loc[:, ['Ones', 'Exam 1', 'Exam 2']]
    x = np.matrix(x.values)
    y = data.loc[:, ['Admission']]
    y = np.matrix(y.values)
    theta = np.matrix(np.zeros(3))
    J0 = Cost_Function(theta, x, y)

    # do gradient descent with established function
    result = opt.fmin_tnc(func=Cost_Function, x0=theta, fprime=Gradient_Descent2, args=(x, y))
    theta2 = result[0]
    print('θ_fmin_tnc = ', theta2)

Output:

θ_fmin_tnc = [-25.16131862 0.20623159 0.20147149]

opt.fmin_tnc comes from the scipy.optimize library, which provides optimization algorithms; its usage is documented at the following url: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_tnc.html

2.5 prediction

information2 = np.matrix(np.array([1, 45, 85]))
prediction2 = Sigmoid_Function(theta2 * information2.T)
print('new prediction to [45, 85] is ', prediction2)

Output:

new prediction to [45, 85] is 0.77629062

2.6 plot the regression figure

xxx = np.linspace(30, 100, 100)
yyy = (theta2[0] + theta2[1] * xxx) / (-theta2[2])
plt.figure()
pos = plt.scatter(x=positive['Exam 1'], y=positive['Exam 2'], color='k', marker='+')
neg = plt.scatter(x=negative['Exam 1'], y=negative['Exam 2'], color='y', marker='o')
plt.legend([pos, neg], ['Admitted', 'Unadmitted'])
plt.plot(xxx, yyy, color='b')
plt.xlabel('Exam 1')
plt.ylabel('Exam 2')
plt.show()

Output:
image

2.7 check the accuracy

predictions = Sigmoid_Function(theta2 * x.T)
predictions_classification = np.matrix(np.zeros(predictions.shape))
for i in range(predictions.shape[1]):
    if predictions[0, i] >= 0.5:
        predictions_classification[0, i] = 1
    else:
        predictions_classification[0, i] = 0
correct = np.matrix(np.zeros(predictions_classification.shape))
j = 0
for i in range(predictions.shape[1]):
    if ((predictions_classification[0, i] == 1 and y[i, 0] == 1) or (predictions_classification[0, i] == 0 and y[i, 0] == 0)):
        correct[0, j] = 1
    else:
        correct[0, j] = 0
    j = j + 1
accuracy = (np.sum(correct, axis=1) / correct.shape[1]) * 100
print('accuracy = {0}%'.format(accuracy[0, 0]))

Output:

accuracy = 89.0%

Overfitting Problem

  • Underfitting: the regression doesn't fit the data set very well---high bias
  • Overfitting: the regression may pass through the training set very well, but it may be a very wiggly curve, which goes up and down all over the place---high variance
    image
    image
    image
  • Cause: if we fit a high-order polynomial, the hypothesis can fit almost any function, but this space of possible hypotheses is too large and we don't have enough data to constrain it to give us a good hypothesis (a function of θ and x)
  • Addressing:
    image

Regularization

  • Intuition:
    image
    Idea: small values for the parameters θ0, θ1, ..., θn give a simpler hypothesis that is less prone to overfitting
  • Cost Function:
    image
    λ is the regularization parameter; it balances fitting the training set well against keeping the parameters θ small, and therefore keeping the hypothesis relatively simple to avoid overfitting (the standard form for the linear-regression case is written out after this list)
    If λ is set to an extremely large value, all of θ1, ..., θn are pushed towards zero and the hypothesis degenerates to h_θ(x)≈θ0 (underfitting)
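For the linear-regression case, the regularized cost function in the image is usually written as (note that θ0 is conventionally not penalized):

J(\theta) = \frac{1}{2m}\left[ \sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]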

Regularized linear regression

Gradient descent

  • The change to the cost function and gradient descent:
    image
    image
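For reference, the update rules shown in the images above are, in standard form:

\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 1, \dots, n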

Normal equation

  • The change to the cost function and the equation of θ:
    image
    image
    If λ>0, the matrix being inverted (written out below) is guaranteed to be invertible, even when the number of features is large and there are only a few training examples
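In standard form, the regularized normal equation is:

\theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^T y

where the matrix multiplying λ is (n+1)×(n+1) with a 0 in the top-left entry, so θ0 is not regularized.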

Regularized logistic regression

Gradient descent

  • The change to the cost function and gradient descent:
    image
    image

  • Advanced optimization
    image
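For reference, the regularized logistic regression cost function and gradient used by both plain gradient descent and the advanced optimizers are:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad (j \ge 1;\ \text{the } \theta_0 \text{ update has no regularization term})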

Exercise by python

  1. Visualizing the data
path2 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_2/ex2data2.txt'
data2 = pd.read_csv(path2, header=None, names=['Microchip Test 1', 'Microchip Test 2', 'Decision'])
accepted = data2[data2['Decision'].isin([1])]
rejected = data2[data2['Decision'].isin([0])]
plt.figure()
accepted_plot = plt.scatter(x=accepted['Microchip Test 1'], y=accepted['Microchip Test 2'], color='k', marker='+')
rejected_plot = plt.scatter(x=rejected['Microchip Test 1'], y=rejected['Microchip Test 2'], color='y', marker='o')
plt.legend([accepted_plot, rejected_plot], ['Accepted', 'Rejected'])
plt.xlabel('Microchip Test 1')
plt.ylabel('Microchip Test 2')
plt.show()

Output:
image

  2. Feature mapping
degree = 6
features = np.ones((data2['Microchip Test 1'].shape))
for i in range(0, degree + 1):
    for j in range(0, degree + 1 - i):
        # 0:0-6, 1:0-5, 2:0-4, 3:0-3, 4:0-2, 5:0-1, 6:0
        features = np.c_[features, (np.power(data2['Microchip Test 1'], i) * np.power(data2['Microchip Test 2'], j))]
features = np.delete(features, 0, 1)   # drop the initial placeholder column used to seed np.c_
print(features.shape)

Output:

(118, 28) # 28 features
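In other words, the mapping builds every monomial x_1^i x_2^j with i + j ≤ 6, which gives (6+1)(6+2)/2 = 28 features; following the loop order of the code above, the mapped feature vector for one example is

\mathrm{mapFeature}(x_1, x_2) = \left[\, 1,\; x_2,\; x_2^2,\; \dots,\; x_2^6,\; x_1,\; x_1 x_2,\; \dots,\; x_1 x_2^5,\; x_1^2,\; \dots,\; x_1^6 \,\right]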

  3. Cost function and gradient
def Cost_Function_regularized(theta, x, y, lamda):
    theta = np.matrix(theta)
    x = np.matrix(x)
    y = np.matrix(y)
    first = - y.T * np.log(Sigmoid_Function(theta * x.T)).T
    second = - (1 - y).T * np.log(1 - Sigmoid_Function(theta * x.T)).T
    # θ0 (the bias term) is not regularized, consistent with the gradient below
    J = np.sum(first + second) / len(x) + (lamda / (2 * len(x))) * np.sum(np.power(theta[:, 1:], 2))
    return J

def Gradient_Descent_regularized(theta, x, y, lamda):
    theta = np.matrix(theta)
    x = np.matrix(x)
    y = np.matrix(y)
    temp = np.matrix(np.zeros((theta.shape)))
    error = Sigmoid_Function(theta * x.T) - y.T
    for i in range(theta.shape[1]):
        if i == 0:
            # no regularization term for θ0
            temp[0, i] = (error * x[:, i]) / len(x)
        else:
            temp[0, i] = (error * x[:, i]) / len(x) + (lamda / len(x)) * theta[0, i]
    theta_partial = temp
    return theta_partial
  4. Main function and get θ
if __name__ == '__main__':
    # organize the data set
    data2.insert(0, 'Ones', 1)
    x2 = data2.loc[:, ['Microchip Test 1', 'Microchip Test 2']]
    x2 = np.matrix(x2.values)
    y2 = data2.loc[:, ['Decision']]
    y2 = np.matrix(y2.values)
    # initialize θ
    theta_2 = np.matrix(np.zeros(features.shape[1]))
    # initialize λ
    lamda = 1
    # compute the initial cost function
    J = Cost_Function_regularized(theta_2, features, y2, lamda)
    print('J2_initial is ', J)
    # use gradient descent to find θ
    result_2 = opt.fmin_tnc(func=Cost_Function_regularized, x0=theta_2, fprime=Gradient_Descent_regularized, args=(features, y2, lamda))
    theta_2_get = result_2[0]
    print('θ_regularized = ', theta_2_get)

Output:

θ_regularized = [ 1.2544148 1.1924276 -1.36184295 -0.17096365 -1.18096092 -0.4683615
-0.93023136 0.62276762 -0.87290719 -0.35603633 -0.25080285 -0.27658998
-0.12073387 -2.00505533 -0.35536846 -0.61498651 -0.27187034 -0.32631648
0.12573809 -0.06683335 -0.06382345 0.00581066 -1.45784668 -0.20562899
-0.29695283 -0.22566868 0.01627581 -1.03247378]

  5. Plotting the decision boundary
def mapFeature(x1, x2, degree):
    z = np.matrix(np.ones(28))
    c = 0
    for i in range(degree + 1):
        for j in range(degree + 1 - i):
            z[0, c] = np.power(x1, i) * np.power(x2, j)
            c = c + 1
    return z
feature_x1 = np.linspace(-1, 1.5, 50)
feature_x2 = np.linspace(-1, 1.5, 50)
feature_z = np.matrix(np.zeros((len(feature_x1), len(feature_x2))))
for i in range(len(feature_x1)):
    for j in range(len(feature_x2)):
        feature_z[i, j] = np.matrix(theta_2_get) * mapFeature(feature_x1[i], feature_x2[j], degree).T
plt.figure()
accepted_plot = plt.scatter(x=accepted['Microchip Test 1'], y=accepted['Microchip Test 2'], color='k', marker='+')
rejected_plot = plt.scatter(x=rejected['Microchip Test 1'], y=rejected['Microchip Test 2'], color='y', marker='o')
plt.legend([accepted_plot, rejected_plot], ['Accepted', 'Rejected'])
plt.contour(feature_x1, feature_x2, feature_z.T, [0])   # transpose: plt.contour expects Z[j, i] to correspond to (feature_x1[i], feature_x2[j])
plt.xlabel('Microchip Test 1')
plt.ylabel('Microchip Test 2')
plt.show()

Output:
image

Multiple-class classification exercise by python

  • This exercise combines logistic regression, regularization, and multi-class classification
  1. Read the prepared dataset
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import scipy.optimize as opt
from scipy.io import loadmat

def read_data():
    path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_3/ex3/ex3data1.mat'
    data1 = loadmat(path1)
    X = data1['X']
    y = data1['y']
    return X, y
  2. Visualize the dataset
def visuial_data(X, y):
    rand_indices = np.random.choice(X.shape[0], 100)
    # print(rand_indices.shape)
    select = X[rand_indices, :]
    # print(select.shape)
    fig, ax = plt.subplots(nrows=10, ncols=10, sharex=True, sharey=True)
    # print(ax.shape)
    for r in range(10):
        for c in range(10):
            show = select[10*r+c, :].reshape((20, 20)).T
            ax[r, c].matshow(show, cmap=cm.binary)
            plt.xticks([])
            plt.yticks([])
    plt.show()

Output:
image

  3. Vectorize the parameters in logistic regression

3.1 Sigmoid function
def sigmoidfunction(theta, X):
    theta = np.matrix(theta)
    X = np.matrix(X)
    f = 1 / (1 + np.exp(- X * theta.T)) # (5000,401)*(401,1)=(5000,1)
    return f

3.2 vectorize the cost function

def vector_costfunction(theta, X, y, learning_rate):
    # learning_rate is actually the regularization parameter λ here
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    error = - y.T * np.log(sigmoidfunction(theta, X)) - (1 - y).T * np.log(1 - sigmoidfunction(theta, X))   # (1,5000)*(5000,1)=(1,1)
    # θ0 is not regularized, consistent with the gradient below
    reg = (learning_rate / (2 * X.shape[0])) * np.sum(np.power(theta[:, 1:], 2))
    J = np.sum(error) / X.shape[0] + reg  # np.sum() collapses the (1,1) matrix to a scalar
    return J

3.3 vectorize the gradient descent

def vector_gradientdescent(theta, X, y, learning_rate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    partial = np.matrix(np.zeros((theta.shape)))
    error = sigmoidfunction(theta, X) - y   # (5000,1)
    for i in range(X.shape[1]):
        error_multiply = error.T * X[:, i]  # (1,5000)*(5000,1)
        partial_part = np.sum(error_multiply) / X.shape[0] # np.sum() collapses the (1,1) matrix to a scalar
        if i == 0:
            partial[0, i] = partial_part
        else:
            reg_part = (learning_rate / X.shape[0]) * theta[0, i]
            partial[0, i] = partial_part + reg_part
    return partial.flatten().A[0]
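For reference, the vectorized quantities that the two functions above compute are the following (here λ is the value passed in as learning_rate, g is the sigmoid, and θ0 is not regularized):

J(\theta) = -\frac{1}{m}\left[ y^T \log\left( g(X\theta) \right) + (1 - y)^T \log\left( 1 - g(X\theta) \right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

\nabla_\theta J = \frac{1}{m} X^T\left( g(X\theta) - y \right) + \frac{\lambda}{m}\left[ 0, \theta_1, \dots, \theta_n \right]^T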
  4. One-vs-all classification
def onevsall(X, y, num_labels, learning_rate):
    X_new = np.c_[np.ones((X.shape[0], 1)), X]   # intercept column of ones so that θ0 acts as a bias term
    theta_new = np.matrix(np.zeros((num_labels, X_new.shape[1])))
    # for each number, compute its related parameters θ(1 row)
    for i in range(1, num_labels+1):
        theta_initial = np.matrix(np.zeros(X_new.shape[1]))
        y_i = np.matrix(np.zeros((y.shape[0], 1)))
        for j in range(y.shape[0]):
            if y[j, 0] == i:
                y_i[j, 0] = 1
            else:
                y_i[j, 0] = 0
        # y_i:(5000,1); X_new:(5000, 401), theta_initial:(1,401)
        result = opt.minimize(fun=vector_costfunction, x0=theta_initial, args=(X_new, y_i, learning_rate), method='CG', jac=vector_gradientdescent)
        theta_new[i-1, :] = result.x
    return theta_new
  5. One-vs-all prediction
def predictionfunction(theta, X):
    X_test = np.c_[np.ones((X.shape[0], 1)), X]   # intercept column of ones, matching onevsall
    # compute the possibility for each item
    possibility = sigmoidfunction(theta, X_test) # (5000,401)*(401,10)=(5000,10)
    prediction_compute = np.argmax(possibility, axis=1) + 1 # (5000,1): 1-based index of the most confident classifier
    return prediction_compute
  6. Main function
if __name__ == '__main__':
    # number of labels
    num_labels = 10 # labels: 1-10
    # read dataset from file
    X, y = read_data()
    # visualize the dataset
    visuial_data(X, y)
    # train the logistic regression
    learning_rate = 0.1
    theta = onevsall(X, y, num_labels, learning_rate)
    print('θ_optimum = ', theta)
    # test for prediction
    prediction = predictionfunction(theta, X)
    # compute the accuracy
    correct = np.matrix(np.zeros((y.shape[0], 1)))
    for i in range(y.shape[0]):
        if prediction[i] == y[i, 0]:
            correct[i, 0] = 1
        else:
            correct[i, 0] = 0
    accuracy = (np.sum(correct) / correct.shape[0]) * 100
    print('accuracy is {0}%'.format(accuracy))

Output:

θ_optimum = [[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 7.01043611e-03
5.33357475e-08 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 1.31038512e-02
-1.44559918e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -2.65840583e-05
-3.40642117e-07 0.00000000e+00]
...
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -1.28736982e-02
1.35831831e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -3.44641313e-02
3.31666884e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -2.59495789e-04
6.17074441e-06 0.00000000e+00]]

accuracy is 95.88%