2. Logistic Regression

Classification

image

  • Representation:
    y∈{0,1}, where "0" refers to the "Negative Class" and "1" refers to the "Positive Class"

We start with classification problems that have just two classes, 0 and 1; multi-class problems are discussed later

  • Problem:
    image
    Therefore, linear regression is rarely used for classification problems: its hypothesis can output values greater than 1 or less than 0 even though the training labels are all 0 or 1, which is strange. Logistic regression, whose hypothesis always lies between 0 and 1, is applied instead as a classification algorithm

Logistic Function

To keep the output of the hypothesis between 0 and 1, i.e. 0≤h_θ(x)≤1, we introduce a new function, the Sigmoid Function (Logistic Function):
image
Then, the new hypothesis function:
image
image
Then, given a training set, we need to pick a value for θ and use the hypothesis to make predictions

  • Interpretation of Hypothesis Output
    In classification problems, the output of the hypothesis is the estimated probability that y=1 for input x
    Mathematical representation (restated in standard notation below):
    image
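For reference, the standard form of the sigmoid, the hypothesis, and the probabilistic interpretation shown in the images above is:

g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

h_\theta(x) = P(y = 1 \mid x; \theta), \qquad P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)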

Decision boundary(KEY)

  • Question: what is the logistic regression hypothesis function computing?
  • Concept: the decision boundary is a property of the hypothesis itself. We use the training set to fit the parameters θ through the cost function, but once θ is fixed, the boundary is determined by the hypothesis, not by the training set
  • Function:
    it forms a line (or, more generally, a surface) that separates the region where the hypothesis predicts "y=1" from the region where it predicts "y=0"
  • Method:
    Predict "y=1" if h_θ(x)≥0.5, which happens exactly when θᵀx≥0; predict "y=0" if h_θ(x)<0.5, which happens exactly when θᵀx<0
    So the decision boundary is the set of points where θᵀx=0 (a concrete worked case follows this list)
  • Example:
    image
    image
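As a concrete illustration (the parameter values here are a typical textbook choice, not necessarily the ones in the images above): with h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2) and \theta = [-3, 1, 1]^T, the hypothesis predicts y = 1 exactly when

\theta^T x = -3 + x_1 + x_2 \ge 0 \quad\Longleftrightarrow\quad x_1 + x_2 \ge 3

so the decision boundary is the straight line x_1 + x_2 = 3; higher-order polynomial features give curved boundaries in the same way.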

Cost function

  • m training examples:
    image

  • n features:
    image

  • Hypothesis:
    image

  • Cost function:
    image
    To simplify the analysis, we extract the main part (the per-example cost term):
    image
    But if we plug the sigmoid hypothesis into this squared-error cost, J(θ) becomes a non-convex function:
    image
    Therefore, we need to come up with a new kind of cost function

  • Logistic regression cost function:
    The main part of the logistic regression cost function is the following, which comes from the principle of maximum likelihood estimation:
    image
    image
    image
    image

  • Simplified cost function
    The per-example cost is combined into a single, more compact expression:
    image
    Then the logistic regression cost function becomes (restated below this list)
    image
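Written out, the per-example cost, its compact form, and the full logistic regression cost function are:

\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}

\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]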

How to choose parameters θ?

To fit the parameters θ, we need to find the minimum of the cost function J(θ); the method is still gradient descent, just as in linear regression
image
From the formula we can see that the update rule looks identical to the one for linear regression; the difference is that the hypothesis function h_θ(x) is now the sigmoid of θᵀx
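Written out, the update (performed simultaneously for every j) is:

\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}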

  • Tips:
    1. We can still plot the cost function J(θ) against the number of iterations to check that gradient descent is working correctly (converging)
    2. We can still use feature scaling to make gradient descent converge more quickly

  • Other optimization algorithms:
    image

Multi-class classification: One-vs-all

image

  • Method:
  1. Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y=i
  2. On a new input x, pick the class i whose classifier outputs the largest value, i.e. the most confident one (the formula below this list restates it)
    image
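In formula form, with one classifier h_\theta^{(i)}(x) trained per class:

h_\theta^{(i)}(x) = P(y = i \mid x; \theta), \qquad \hat{y} = \arg\max_i \, h_\theta^{(i)}(x)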

Exercise by python

  1. Visualizing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt

# 1.1.1 get the data
path = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_2/ex2data1.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admission'])
# 1.1.2 classify the data
positive = data[data.Admission.isin([1])]
negative = data[data.Admission.isin([0])]
# 1.1.3 plot the data
plt.figure()
pos = plt.scatter(x=positive['Exam 1'], y=positive['Exam 2'], color='k', marker='+')
neg = plt.scatter(x=negative['Exam 1'], y=negative['Exam 2'], color='y', marker='o')
plt.legend([pos, neg], ['Admitted', 'Unadmitted'])
plt.xlabel('Exam 1')
plt.ylabel('Exam 2')
plt.show()

Output:
image

  2. Implementation

2.1 sigmoid function
def Sigmoid_Function(z):
    g = 1 / (1 + np.exp(-z))
    return g

2.2 cost function

def Cost_Function(theta, x, y):
    # theta: (1,3), x: (m,3), y: (m,1); Sigmoid_Function(theta * x.T): (1,m)
    first = - y.T * np.log(Sigmoid_Function(theta * x.T)).T
    second = - (1 - y).T * np.log(1 - Sigmoid_Function(theta * x.T)).T
    J = (1 / len(x)) * np.sum(first + second)
    return J

2.3 gradient descent

  • Only compute the partial derivatives (the gradient); the parameter updates themselves are handled by the optimizer in the main function
def Gradient_Descent2(theta, x, y):
    temp = np.matrix(np.zeros(theta.shape))
    error = Sigmoid_Function(theta * x.T) - y.T   # (1,m)
    for i in range(temp.shape[1]):
        partial_part = error * x[:, i]   # (1,m)*(m,1) = (1,1)
        temp[0, i] = (1 / len(x)) * np.sum(partial_part)
    theta_partial = temp
    return theta_partial

2.4 main function

if __name__ == '__main__':
    data.insert(0, 'Ones', 1)
    x = data.loc[:, ['Ones', 'Exam 1', 'Exam 2']]
    x = np.matrix(x.values)
    y = data.loc[:, ['Admission']]
    y = np.matrix(y.values)
    theta = np.matrix(np.zeros(3))
    J0 = Cost_Function(theta, x, y)

    # do gradient descent with established function
    result = opt.fmin_tnc(func=Cost_Function, x0=theta, fprime=Gradient_Descent2, args=(x, y))
    theta2 = result[0]
    print('θ_fmin_tnc = ', theta2)

Output:

θ_fmin_tnc = [-25.16131862 0.20623159 0.20147149]

opt.fmin_tnc comes from the scipy.optimize library, which provides optimization algorithms; its usage is documented at the following url: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_tnc.html

2.5 prediction

information2 = np.matrix(np.array([1, 45, 85]))
prediction2 = Sigmoid_Function(theta2 * information2.T)
print('new prediction to [45, 85] is ', prediction2)

Output:

new prediction to [45, 85] is 0.77629062

2.6 plot the regression figure

xxx = np.linspace(30, 100, 100)
yyy = (theta2[0] + theta2[1] * xxx) / (-theta2[2])
plt.figure()
pos = plt.scatter(x=positive['Exam 1'], y=positive['Exam 2'], color='k', marker='+')
neg = plt.scatter(x=negative['Exam 1'], y=negative['Exam 2'], color='y', marker='o')
plt.legend([pos, neg], ['Admitted', 'Unadmitted'])
plt.plot(xxx, yyy, color='b')
plt.xlabel('Exam 1')
plt.ylabel('Exam 2')
plt.show()

Output:
image

2.7 check the accuracy

predictions = Sigmoid_Function(theta2 * x.T)
predictions_classification = np.matrix(np.zeros(predictions.shape))
for i in range(predictions.shape[1]):
    if predictions[0, i] >= 0.5:
        predictions_classification[0, i] = 1
    else:
        predictions_classification[0, i] = 0
correct = np.matrix(np.zeros(predictions_classification.shape))
j = 0
for i in range(predictions.shape[1]):
    if ((predictions_classification[0, i] == 1 and y[i, 0] == 1) or (predictions_classification[0, i] == 0 and y[i, 0] == 0)):
        correct[0, j] = 1
    else:
        correct[0, j] = 0
    j = j + 1
accuracy = (np.sum(correct, axis=1) / correct.shape[1]) * 100
print('accuracy = {0}%'.format(accuracy[0, 0]))

Output:

accuracy = 89.0%

Overfitting Problem

  • Underfitting: the regression doesn't fit the data set very well---high bias
  • Overfitting: the regression may pass through the training set very well, but it may be a very wiggly curve, which goes up and down all over the place---high variance
    image
    image
    image
  • Cause: if we fit a high-order polynomial, the hypothesis can fit almost any function, but this space of possible hypotheses is too large and we don't have enough data to constrain it to give us a good hypothesis (a function of θ and x)
  • Addressing:
    image

Regularization

  • Intuition:
    image
    Idea: small values for the parameters θ0, θ1, ..., θn give a simpler hypothesis that is less prone to overfitting
  • Cost Function:
    image
    λ is the regularization parameter; it balances fitting the training set well against keeping the parameters θ small, and therefore keeping the hypothesis relatively simple to avoid overfitting (the standard form for the linear-regression case is written out after this list)
    If λ is set to an extremely large value, all of θ1, ..., θn are pushed towards zero and the hypothesis degenerates to h_θ(x)≈θ0 (underfitting)
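For the linear-regression case, the regularized cost function in the image is usually written as (note that θ0 is conventionally not penalized):

J(\theta) = \frac{1}{2m}\left[ \sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]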

Regularized linear regression

Gradient descent

  • The change to the cost function and gradient descent:
    image
    image
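For reference, the update rules shown in the images above are, in standard form:

\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 1, \dots, n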

Normal equation

  • The change to the cost function and the equation of θ:
    image
    image
    If λ>0, the matrix being inverted (written out below) is guaranteed to be invertible, even when the number of features is large and there are only a few training examples
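In standard form, the regularized normal equation is:

\theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^T y

where the matrix multiplying λ is (n+1)×(n+1) with a 0 in the top-left entry, so θ0 is not regularized.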

Regularized logistic regression

Gradient descent

  • The change to the cost function and gradient descent:
    image
    image

  • Advanced optimization
    image
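For reference, the regularized logistic regression cost function and gradient used by both plain gradient descent and the advanced optimizers are:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad (j \ge 1;\ \text{the } \theta_0 \text{ update has no regularization term})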

Exercise by python

  1. Visualizing the data
path2 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_2/ex2data2.txt'
data2 = pd.read_csv(path2, header=None, names=['Microchip Test 1', 'Microchip Test 2', 'Decision'])
accepted = data2[data2['Decision'].isin([1])]
rejected = data2[data2['Decision'].isin([0])]
plt.figure()
accepted_plot = plt.scatter(x=accepted['Microchip Test 1'], y=accepted['Microchip Test 2'], color='k', marker='+')
rejected_plot = plt.scatter(x=rejected['Microchip Test 1'], y=rejected['Microchip Test 2'], color='y', marker='o')
plt.legend([accepted_plot, rejected_plot], ['Accepted', 'Rejected'])
plt.xlabel('Microchip Test 1')
plt.ylabel('Microchip Test 2')
plt.show()

Output:
image

  2. Feature mapping
degree = 6
features = np.ones((data2['Microchip Test 1'].shape))
for i in range(0, degree + 1):
    for j in range(0, degree + 1 - i):
        # 0:0-6, 1:0-5, 2:0-4, 3:0-3, 4:0-2, 5:0-1, 6:0
        features = np.c_[features, (np.power(data2['Microchip Test 1'], i) * np.power(data2['Microchip Test 2'], j))]
features = np.delete(features, 0, 1)   # drop the initial placeholder column used to seed np.c_
print(features.shape)

Output:

(118, 28) # 28 features
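In other words, the mapping builds every monomial x_1^i x_2^j with i + j ≤ 6, which gives (6+1)(6+2)/2 = 28 features; following the loop order of the code above, the mapped feature vector for one example is

\mathrm{mapFeature}(x_1, x_2) = \left[\, 1,\; x_2,\; x_2^2,\; \dots,\; x_2^6,\; x_1,\; x_1 x_2,\; \dots,\; x_1 x_2^5,\; x_1^2,\; \dots,\; x_1^6 \,\right]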

  3. Cost function and gradient
def Cost_Function_regularized(theta, x, y, lamda):
    theta = np.matrix(theta)
    x = np.matrix(x)
    y = np.matrix(y)
    first = - y.T * np.log(Sigmoid_Function(theta * x.T)).T
    second = - (1 - y).T * np.log(1 - Sigmoid_Function(theta * x.T)).T
    # θ0 (the bias term) is not regularized, consistent with the gradient below
    J = np.sum(first + second) / len(x) + (lamda / (2 * len(x))) * np.sum(np.power(theta[:, 1:], 2))
    return J

def Gradient_Descent_regularized(theta, x, y, lamda):
    theta = np.matrix(theta)
    x = np.matrix(x)
    y = np.matrix(y)
    temp = np.matrix(np.zeros((theta.shape)))
    error = Sigmoid_Function(theta * x.T) - y.T
    for i in range(theta.shape[1]):
        if i == 0:
            # no regularization term for θ0
            temp[0, i] = (error * x[:, i]) / len(x)
        else:
            temp[0, i] = (error * x[:, i]) / len(x) + (lamda / len(x)) * theta[0, i]
    theta_partial = temp
    return theta_partial
  4. Main function and get θ
if __name__ == '__main__':
    # organize the data set
    data2.insert(0, 'Ones', 1)
    x2 = data2.loc[:, ['Microchip Test 1', 'Microchip Test 2']]
    x2 = np.matrix(x2.values)
    y2 = data2.loc[:, ['Decision']]
    y2 = np.matrix(y2.values)
    # initialize θ
    theta_2 = np.matrix(np.zeros(features.shape[1]))
    # initialize λ
    lamda = 1
    # compute the initial cost function
    J = Cost_Function_regularized(theta_2, features, y2, lamda)
    print('J2_initial is ', J)
    # use gradient descent to find θ
    result_2 = opt.fmin_tnc(func=Cost_Function_regularized, x0=theta_2, fprime=Gradient_Descent_regularized, args=(features, y2, lamda))
    theta_2_get = result_2[0]
    print('θ_regularized = ', theta_2_get)

Output:

θ_regularized = [ 1.2544148 1.1924276 -1.36184295 -0.17096365 -1.18096092 -0.4683615
-0.93023136 0.62276762 -0.87290719 -0.35603633 -0.25080285 -0.27658998
-0.12073387 -2.00505533 -0.35536846 -0.61498651 -0.27187034 -0.32631648
0.12573809 -0.06683335 -0.06382345 0.00581066 -1.45784668 -0.20562899
-0.29695283 -0.22566868 0.01627581 -1.03247378]

  5. Plotting the decision boundary
def mapFeature(x1, x2, degree):
    z = np.matrix(np.ones(28))
    c = 0
    for i in range(degree + 1):
        for j in range(degree + 1 - i):
            z[0, c] = np.power(x1, i) * np.power(x2, j)
            c = c + 1
    return z
feature_x1 = np.linspace(-1, 1.5, 50)
feature_x2 = np.linspace(-1, 1.5, 50)
feature_z = np.matrix(np.zeros((len(feature_x1), len(feature_x2))))
for i in range(len(feature_x1)):
    for j in range(len(feature_x2)):
        feature_z[i, j] = np.matrix(theta_2_get) * mapFeature(feature_x1[i], feature_x2[j], degree).T
plt.figure()
accepted_plot = plt.scatter(x=accepted['Microchip Test 1'], y=accepted['Microchip Test 2'], color='k', marker='+')
rejected_plot = plt.scatter(x=rejected['Microchip Test 1'], y=rejected['Microchip Test 2'], color='y', marker='o')
plt.legend([accepted_plot, rejected_plot], ['Accepted', 'Rejected'])
plt.contour(feature_x1, feature_x2, feature_z.T, [0])   # transpose: plt.contour expects Z[j, i] to correspond to (feature_x1[i], feature_x2[j])
plt.xlabel('Microchip Test 1')
plt.ylabel('Microchip Test 2')
plt.show()

Output:
image

Multiple-class classification exercise by python

  • This exercise combines logistic regression, regularization, and multi-class classification
  1. Read the prepared dataset
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import scipy.optimize as opt
from scipy.io import loadmat

def read_data():
    path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_3/ex3/ex3data1.mat'
    data1 = loadmat(path1)
    X = data1['X']
    y = data1['y']
    return X, y
  2. Visualize the dataset
def visuial_data(X, y):
    rand_indices = np.random.choice(X.shape[0], 100)
    # print(rand_indices.shape)
    select = X[rand_indices, :]
    # print(select.shape)
    fig, ax = plt.subplots(nrows=10, ncols=10, sharex=True, sharey=True)
    # print(ax.shape)
    for r in range(10):
        for c in range(10):
            show = select[10*r+c, :].reshape((20, 20)).T
            ax[r, c].matshow(show, cmap=cm.binary)
            plt.xticks([])
            plt.yticks([])
    plt.show()

Output:
image

  3. Vectorize the parameters in logistic regression

3.1 Sigmoid function
def sigmoidfunction(theta, X):
    theta = np.matrix(theta)
    X = np.matrix(X)
    f = 1 / (1 + np.exp(- X * theta.T)) # (5000,401)*(401,1)=(5000,1)
    return f

3.2 vectorize the cost function

def vector_costfunction(theta, X, y, learning_rate):
    # learning_rate is actually the regularization parameter λ here
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    error = - y.T * np.log(sigmoidfunction(theta, X)) - (1 - y).T * np.log(1 - sigmoidfunction(theta, X))   # (1,5000)*(5000,1)=(1,1)
    # θ0 is not regularized, consistent with the gradient below
    reg = (learning_rate / (2 * X.shape[0])) * np.sum(np.power(theta[:, 1:], 2))
    J = np.sum(error) / X.shape[0] + reg  # np.sum() collapses the (1,1) matrix to a scalar
    return J

3.3 vectorize the gradient descent

def vector_gradientdescent(theta, X, y, learning_rate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    partial = np.matrix(np.zeros((theta.shape)))
    error = sigmoidfunction(theta, X) - y   # (5000,1)
    for i in range(X.shape[1]):
        error_multiply = error.T * X[:, i]  # (1,5000)*(5000,1)
        partial_part = np.sum(error_multiply) / X.shape[0] # np.sum() collapses the (1,1) matrix to a scalar
        if i == 0:
            partial[0, i] = partial_part
        else:
            reg_part = (learning_rate / X.shape[0]) * theta[0, i]
            partial[0, i] = partial_part + reg_part
    return partial.flatten().A[0]
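For reference, the vectorized quantities that the two functions above compute are the following (here λ is the value passed in as learning_rate, g is the sigmoid, and θ0 is not regularized):

J(\theta) = -\frac{1}{m}\left[ y^T \log\left( g(X\theta) \right) + (1 - y)^T \log\left( 1 - g(X\theta) \right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

\nabla_\theta J = \frac{1}{m} X^T\left( g(X\theta) - y \right) + \frac{\lambda}{m}\left[ 0, \theta_1, \dots, \theta_n \right]^T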
  4. One-vs-all classification
def onevsall(X, y, num_labels, learning_rate):
    X_new = np.c_[np.ones((X.shape[0], 1)), X]   # intercept column of ones so that θ0 acts as a bias term
    theta_new = np.matrix(np.zeros((num_labels, X_new.shape[1])))
    # for each number, compute its related parameters θ(1 row)
    for i in range(1, num_labels+1):
        theta_initial = np.matrix(np.zeros(X_new.shape[1]))
        y_i = np.matrix(np.zeros((y.shape[0], 1)))
        for j in range(y.shape[0]):
            if y[j, 0] == i:
                y_i[j, 0] = 1
            else:
                y_i[j, 0] = 0
        # y_i:(5000,1); X_new:(5000, 401), theta_initial:(1,401)
        result = opt.minimize(fun=vector_costfunction, x0=theta_initial, args=(X_new, y_i, learning_rate), method='CG', jac=vector_gradientdescent)
        theta_new[i-1, :] = result.x
    return theta_new
  5. One-vs-all prediction
def predictionfunction(theta, X):
    X_test = np.c_[np.ones((X.shape[0], 1)), X]   # intercept column of ones, matching onevsall
    # compute the possibility for each item
    possibility = sigmoidfunction(theta, X_test) # (5000,401)*(401,10)=(5000,10)
    prediction_compute = np.argmax(possibility, axis=1) + 1 # (5000,1): 1-based index of the most confident classifier
    return prediction_compute
  6. Main function
if __name__ == '__main__':
    # number of labels
    num_labels = 10 # labels: 1-10
    # read dataset from file
    X, y = read_data()
    # visualize the dataset
    visuial_data(X, y)
    # train the logistic regression
    learning_rate = 0.1
    theta = onevsall(X, y, num_labels, learning_rate)
    print('θ_optimum = ', theta)
    # test for prediction
    prediction = predictionfunction(theta, X)
    # compute the accuracy
    correct = np.matrix(np.zeros((y.shape[0], 1)))
    for i in range(y.shape[0]):
        if prediction[i] == y[i, 0]:
            correct[i, 0] = 1
        else:
            correct[i, 0] = 0
    accuracy = (np.sum(correct) / correct.shape[0]) * 100
    print('accuracy is {0}%'.format(accuracy))

Output:

θ_optimum = [[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 7.01043611e-03
5.33357475e-08 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 1.31038512e-02
-1.44559918e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -2.65840583e-05
-3.40642117e-07 0.00000000e+00]
...
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -1.28736982e-02
1.35831831e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -3.44641313e-02
3.31666884e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -2.59495789e-04
6.17074441e-06 0.00000000e+00]]

accuracy is 95.88%