3. Neural Networks

Complex non-linear hypotheses

  • Problem: in non-linear classification, if there are many features, a hypothesis built from all the polynomial terms may overfit the training set and also incurs a high computational cost
  • Method: use a model that can represent complex non-linear hypotheses

Introduction

Background

  • Origins: Algorithms that try to mimic the brain
  • Development: Was very widely used in the 80s and early 90s; popularity diminished in the late 90s
  • Recent resurgence: State-of-the-art technique for many applications

Concept

  • The "one learning algorithm" hypothesis: it can process sight or sound or touch, instead of needing to implement a thousand different programs to do thousand wonderful things that the brain does

  • Examples:
    image

  • Inspiration:
    image
    image
    image
    image
    Tip: the dimensions of θ can be worked out from the matrix multiplication it has to perform: the θ that maps layer j to layer j+1 has one row per unit in layer j+1 and one column per unit (plus the bias) in layer j

  • Computation:
    Forward propagation: start off with the activations of the input units, forward-propagate them to the hidden layer and compute the activations of the hidden layer, and then forward-propagate those and compute the activations of the output layer (a minimal sketch follows the feature list below)
    image

  • Features:
    1. To say it again: what the neural network is doing is just like logistic regression, except that rather than using the original features x1, x2, x3, it uses the new features a1, a2, a3
    2. The features a1, a2, a3 are learned as functions of the input; the mapping from layer 1 to layer 2 is determined by another set of parameters θ. So, instead of being constrained to feed the raw features x1, x2, x3 into logistic regression, the neural network gets to learn the new features a1, a2, a3 to feed into logistic regression
    3. Depending on the parameters chosen, we can get interesting and complex features and therefore a better hypothesis than using the raw features x1, x2, x3 directly; the algorithm has the flexibility to learn whatever features it needs, feeding a1, a2, a3 into what is essentially a logistic regression unit at the output
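A minimal numpy sketch of this forward pass and of the "learned features" view, assuming a single hidden layer; the names sigmoid, forward_pass, Theta1 and Theta2 are illustrative (chosen to match the exercise further down), not part of any fixed API:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_pass(x, Theta1, Theta2):
    # x: input features, shape (n,); Theta1: (s2, n+1); Theta2: (K, s2+1)
    a1 = np.r_[1, x]               # input layer plus bias unit
    z2 = Theta1 @ a1               # weighted inputs to the hidden layer
    a2 = np.r_[1, sigmoid(z2)]     # hidden-layer activations: the learned features fed to the output unit
    z3 = Theta2 @ a2               # weighted inputs to the output layer
    a3 = sigmoid(z3)               # output activations = hypothesis h(x), a logistic regression on a2
    return a3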

Examples

Show how a neural network can learn complex non-linear hypotheses (see the sketch below the figures)
image
image
image
image
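A classic illustration from the lectures is building the non-linear XNOR function out of AND, (NOT x1) AND (NOT x2), and OR units with hand-picked weights (the usual illustrative ±20/±30 values, not learned parameters). A small sketch of that construction, assuming that is what the figures above depict:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hand-picked weights: each unit outputs ~1 only when its logical condition holds
theta_and = np.array([-30, 20, 20])        # x1 AND x2
theta_not_both = np.array([10, -20, -20])  # (NOT x1) AND (NOT x2)
theta_or = np.array([-10, 20, 20])         # a1 OR a2

def xnor(x1, x2):
    a_in = np.array([1, x1, x2])                     # input layer with bias
    a1 = sigmoid(theta_and @ a_in)                   # hidden unit 1
    a2 = sigmoid(theta_not_both @ a_in)              # hidden unit 2
    h = sigmoid(theta_or @ np.array([1, a1, a2]))    # output unit: x1 XNOR x2
    return round(h)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xnor(x1, x2))   # prints 1 exactly when x1 == x2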

Multi-class classification

Binary classification & Multi-class classification

  • Binary classification: one output unit; the hypothesis outputs a single real number
  • Multi-class classification: K output units; the hypothesis outputs a K-dimensional vector (see the example below)
    image
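For multi-class problems the label y is represented as a K-dimensional indicator vector, which is exactly what the costfunction code in the exercise below builds as y_vector; for example, with K = 10:

import numpy as np

K = 10
y = 3                    # class label in 1..K
y_vector = np.zeros(K)
y_vector[y - 1] = 1      # gives [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]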

Cost function

image
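For reference, the regularized cost function used in the course (m training examples, K output units, L layers, regularization parameter λ; the bias terms are excluded from the regularization sum):

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[\, y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k + \big(1-y_k^{(i)}\big)\log\Big(1-\big(h_\Theta(x^{(i)})\big)_k\Big)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{ji}^{(l)}\big)^2$$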

Backpropagation algorithm

  • Forward propagation:
    image

  • Back propagation (the update equations are reproduced below): image
    image
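For reference, the standard backpropagation equations the figures describe (g' is the sigmoid gradient and ∘ the element-wise product; the bias column of Θ is excluded from the regularization of the gradient):

$$\delta^{(L)} = a^{(L)} - y, \qquad \delta^{(l)} = \big(\Theta^{(l)}\big)^T \delta^{(l+1)} \circ g'\big(z^{(l)}\big), \qquad g'(z) = g(z)\big(1-g(z)\big)$$

$$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}\big(a^{(l)}\big)^T, \qquad D^{(l)} = \frac{1}{m}\Delta^{(l)} + \frac{\lambda}{m}\Theta^{(l)}$$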

Gradient checking

  • Problem: a buggy implementation of backpropagation may seem to work (the cost decreases during training), but its results are actually not as good as those of a bug-free implementation
  • Method: check the gradient numerically before training (see the formula below)
    image
    image
  • When implementing backpropagation or a similar gradient-based algorithm for a complicated model, gradient checking is essential to make sure the code is correct
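The check compares each partial derivative computed by backpropagation against the two-sided numerical estimate

$$\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta_1,\ldots,\theta_i+\varepsilon,\ldots,\theta_n) - J(\theta_1,\ldots,\theta_i-\varepsilon,\ldots,\theta_n)}{2\varepsilon}$$

with a small ε such as 10⁻⁴; the two values should agree to several decimal places.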

Random initialization

  • Problem:
    Initializing all of the parameters θ to 0 doesn't work when training a neural network, because every hidden unit then computes the same function of its inputs and the symmetry is never broken
    image
  • Method:
    image

To summarize, to train a neural network we first randomly initialize the weights to small values between -ε and ε, then implement backpropagation, do gradient checking, and finally use either gradient descent or one of the advanced optimization algorithms to find the parameters θ that minimize J(θ)
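A minimal sketch of that initialization step, assuming a layer with L_in incoming and L_out outgoing units (epsilon_init = 0.12 matches the value used in the exercise below):

import numpy as np

def rand_initialize_weights(L_in, L_out, epsilon_init=0.12):
    # uniform values in [-epsilon_init, epsilon_init), one row per outgoing unit, plus a bias column
    return np.random.rand(L_out, L_in + 1) * 2 * epsilon_init - epsilon_init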


Procedure

  1. Pick a network architecture (connectivity pattern between neurons)
    No. of input units: dimension of the features x
    No. of output units: number of classes in y
    Reasonable default: 1 hidden layer; if >1 hidden layer, use the same number of hidden units in every layer (usually, the more hidden units the better)

  2. Randomly initialize weights

  3. Implement forward propagation to get h(x) for any x

  4. Implement code to compute cost function J(θ)

  5. Implement back-propagation to compute partial derivatives

  6. Use gradient checking to compare partial derivatives computed using backpropagation vs. using numerical estimate of gradient of J(θ)
    Then disable gradient checking code

  7. Use gradient descent or an advanced optimization method with backpropagation to try to minimize J(θ) (a non-convex function) as a function of the parameters θ

Exercise in Python

1. Neural Networks by the feedforward propagation algorithm

1.1 read the prepared dataset

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.io import loadmat


def read_data():
    path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_3/ex3/ex3data1.mat'
    data1 = loadmat(path1)
    X = data1['X']
    y = data1['y']
    return X, y

1.2 visualize the dataset

def visuial_data(X, y):
    rand_indices = np.random.choice(X.shape[0], 100)
    # print(rand_indices.shape)
    select = X[rand_indices, :]
    # print(select.shape)
    fig, ax = plt.subplots(nrows=10, ncols=10, sharex=True, sharey=True)
    # print(ax.shape)
    for r in range(10):
        for c in range(10):
            show = select[10*r+c, :].reshape((20, 20)).T
            ax[r, c].matshow(show, cmap=cm.binary)
            plt.xticks([])
            plt.yticks([])
    plt.show()

Output:
image

1.3 read the trained parameter θ

def read_weights():
    path2 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_3/ex3/ex3weights.mat'
    data2 = loadmat(path2)
    theta1 = data2['Theta1']
    theta2 = data2['Theta2']
    return theta1, theta2

1.4 feedforward propagation
1.4.1 sigmoid function

def sigmoidfunction(theta, X):
    theta = np.matrix(theta)
    X = np.matrix(X)
    f = 1 / (1 + np.exp(- X * theta.T)) # (5000,401)*(401,25)=(5000,25); (5000,26)*(26,10)=(5000,10)
    return f

1.4.2 feedforward propagation classification

def classification(theta1, theta2, X):
    X_new = np.c_[np.matrix(np.ones((X.shape[0], 1))), X]
    a1 = X_new  # (5000,401)
    # z2 = theta1 * a1.T    # (25,401)*(401,5000)
    a2 = np.c_[np.matrix(np.ones((X.shape[0], 1))), sigmoidfunction(theta1, a1)]    # (5000,26)
    a3 = sigmoidfunction(theta2, a2)    # (5000,10)
    prediction_compute = np.argmax(a3, axis=1) + 1  # labels are 1-10, so shift the 0-based argmax
    # a3 is returned as well because the cost function in the main block needs it
    return a3, prediction_compute

1.4.3 check the accuracy

def check_accuracy(y, prediction):
    correct = np.matrix(np.zeros((y.shape[0], 1)))
    for i in range(y.shape[0]):
        if prediction[i, 0] == y[i, 0]:
            correct[i, 0] = 1
        else:
            correct[i, 0] = 0
    accuracy_compute = (np.sum(correct) / correct.shape[0]) * 100
    return accuracy_compute
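For reference, the same accuracy can be computed in a single vectorized line (a sketch, assuming prediction and y are column arrays of the same shape):

accuracy_compute = float(np.mean(prediction == y)) * 100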

1.4.4 regularized cost function

def costfunction(theta1, theta2, a3, y, learning_rate):
    # here learning_rate plays the role of the regularization parameter λ
    theta1 = np.matrix(theta1)
    theta2 = np.matrix(theta2)
    a3 = np.matrix(a3)
    y = np.matrix(y)
    y_vector = np.matrix(np.zeros((y.shape[0], 10)))    # convert labels into 10-dimensional indicator vectors
    for i in range(y_vector.shape[0]):
        y_vector[i, y[i, 0]-1] = 1
    # element-wise products of (5000,10) matrices, summed over all examples and classes
    first = np.sum(- np.multiply(y_vector, np.log(a3)) - np.multiply((1 - y_vector), np.log(1 - a3))) / a3.shape[0]
    # regularization term (strictly, the bias columns theta[:, 0] should be excluded here)
    second = (learning_rate / (2 * a3.shape[0])) * (np.sum(np.power(theta1, 2)) + np.sum(np.power(theta2, 2)))
    J = first + second
    return J

1.5 main function

if __name__ == '__main__':
    # number of labels
    num_labels = 10 # labels: 1-10
    # read the dataset from the file
    X, y = read_data()
    # visualize the dataset
    visuial_data(X, y)
    # read the trained parameter θ
    theta1, theta2 = read_weights()
    # classification (a3 is also returned because the cost function needs it)
    a3, prediction = classification(theta1, theta2, X)
    # compute the accuracy
    accuracy = check_accuracy(y, prediction)
    print('accuracy is {0}%'.format(accuracy))
    # compute the cost function (learning_rate is the regularization parameter λ)
    learning_rate = 1
    J = costfunction(theta1, theta2, a3, y, learning_rate)
    print(J)

Output:

accuracy is 97.52%
J = 0.384487796242894

2. Neural Networks by backpropagation

The code below seems to be correct, but the results are not satisfactory, so it has only been tidied up temporarily; the comments in section 2.6 point out the most likely problems
2.1 read the dataset

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.io import loadmat
import scipy.optimize as opt
from sklearn.metrics import classification_report


def read_data():
    path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_4/ex4/ex4data1.mat'
    data1 = loadmat(path1)
    X = data1['X']
    y = data1['y']
    return X, y

2.2 visualize the dataset

def visiualize_data(X):
    rand_indices = np.random.choice(X.shape[0], 100)
    X_select = X[rand_indices, :]
    fig, ax = plt.subplots(nrows=10, ncols=10, sharex=True, sharey=True)
    for r in range(10):
        for c in range(10):
            show = X_select[10*r+c, :].reshape((20, 20)).T
            ax[r, c].matshow(show, cmap=cm.binary)
            plt.xticks([])
            plt.yticks([])
    plt.show()

Output:
image

2.3 randomly initialize weights

def initialize_weights(input_units_size, hidden_units_size, output_units_size, epsilon):
    theta1_initial = np.matrix(np.random.random((hidden_units_size, (input_units_size + 1))) * 2 * epsilon - epsilon)    # (25,401)
    theta2_initial = np.matrix(np.random.random((output_units_size, (hidden_units_size + 1))) * 2 * epsilon - epsilon)    # (10,26)
    return theta1_initial, theta2_initial

2.4 implement forward propagation to get h(x) for any x

# 2.4.1 sigmoid function
def sigmoidfunction(theta, X):
    theta = np.matrix(theta)
    X = np.matrix(X)
    f = 1 / (1 + np.exp(- X * theta.T)) # (5000,401)*(401,25)=(5000,25), (5000,26)*(26,10)=(5000,10)
    return f
# 2.4.2 feedforward propagation
def feedforward_propagation(theta1, theta2, X):
    X_new = np.c_[np.matrix(np.ones((X.shape[0], 1))), X]
    a1 = X_new  # (5000,401)
    z2 = a1 * theta1.T  # (5000,401)*(401,25)=(5000,25)
    a2 = np.c_[np.matrix(np.ones((X.shape[0], 1))), sigmoidfunction(theta1, a1)]    # (5000,26)
    z3 = a2 * theta2.T  # (5000,26)*(26,10)=(5000,10)
    a3 = sigmoidfunction(theta2, a2)    # (5000,10)
    prediction_compute = np.argmax(a3, axis=1) + 1
    return a1, z2, a2, z3, a3, prediction_compute

2.5 compute cost function

def costfunction(theta1, theta2, a3, y, learning_rate):
    theta1 = np.matrix(theta1)
    theta2 = np.matrix(theta2)
    a3 = np.matrix(a3)
    y = np.matrix(y)
    y_vector = np.matrix(np.zeros((y.shape[0], 10)))    # convert labels into 10-dimensional indicator vectors
    for i in range(y_vector.shape[0]):
        y_vector[i, y[i, 0]-1] = 1
    # element-wise products of (5000,10) matrices, summed over all examples and classes
    first = np.sum(- np.multiply(y_vector, np.log(a3)) - np.multiply((1 - y_vector), np.log(1 - a3))) / a3.shape[0]
    # regularization term (strictly, the bias columns theta[:, 0] should be excluded here)
    second = (learning_rate / (2 * a3.shape[0])) * (np.sum(np.power(theta1, 2)) + np.sum(np.power(theta2, 2)))
    J = first + second
    return J

2.6 implement backpropagation to compute partial derivatives

# 2.6.1 sigmoid gradient
def sigmoid_gradient(z):
    f = 1 / (1 + np.exp(- z))
    f_derivative = np.multiply(f, (1 - f))
    return f_derivative
# backpropagation for training the neural network: a comprehensive function
# in the backpropagation, this function does all the work, including computing the cost function and the gradients
def backpropagation_minimize(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate):
    y = np.matrix(y)
    # separate the flattened thetas into theta1 and theta2
    theta1 = np.matrix(np.reshape(thetas[:(hidden_units_size * (input_units_size + 1))], (hidden_units_size, (input_units_size + 1))))
    theta2 = np.matrix(np.reshape(thetas[(hidden_units_size * (input_units_size + 1)):], (output_units_size, (hidden_units_size + 1))))
    # implement the forward propagation
    a1, z2, a2, z3, a3, prediction_compute = feedforward_propagation(theta1, theta2, X)
    # compute the cost function
    J = costfunction(theta1, theta2, a3, y, learning_rate)
    # initialize the gradient accumulators delta1 and delta2
    delta1 = np.matrix(np.zeros(theta1.shape))
    delta2 = np.matrix(np.zeros(theta2.shape))
    for t in range(y.shape[0]):
        # NOTE: y[t, :] is the raw label (a single number), but δ3 should be a3 minus the 10-dimensional indicator vector of the label
        delta3_s = a3[t, :] - y[t, :]  # (1,10)
        a1_t = a1[t, :]  # (1,401)
        z2_t = np.c_[np.matrix(np.zeros((1, 1))), z2[t, :]]  # (1,26) bias slot added so the indices line up; it is dropped again below
        a2_t = a2[t, :]  # (1,26)
        delta2_s = np.multiply((theta2.T * delta3_s.T).T, sigmoid_gradient(z2_t))  # (1,26)
        delta2 = delta2 + delta3_s.T * a2_t  # (10,1)*(1,26)=(10,26)
        delta1 = delta1 + delta2_s[0, 1:].T * a1_t  # (25,1)*(1,401)=(25,401)
    D1 = delta1 / y.shape[0]  # (25,401)
    D2 = delta2 / y.shape[0]  # (10,26)
    # regularized neural networks
    # NOTE: the regularization term of the gradient should be (λ/m)·θ, not (λ/m)·θ², so np.power(..., 2) is probably a bug here
    D1[:, 1:] = D1[:, 1:] + ((learning_rate) / (y.shape[0])) * np.power(theta1[:, 1:], 2)
    D2[:, 1:] = D2[:, 1:] + ((learning_rate) / (y.shape[0])) * np.power(theta2[:, 1:], 2)
    # NOTE: this flattens the initial weights theta1_initial/theta2_initial (globals from the main block) instead of the
    # gradients D1/D2, so the optimizer never receives the real gradient; it should flatten D1 and D2 instead
    D_matrix = np.r_[np.reshape(theta1_initial, ((theta1_initial.shape[0] * theta1_initial.shape[1]), 1)), np.reshape(theta2_initial, ((theta2_initial.shape[0] *theta2_initial.shape[1]), 1))]
    # print(D_matrix.shape)
    D = np.ravel(D_matrix)
    return J, D

2.7 gradient checking
I haven't completed this part of the code yet; a possible way to finish it is sketched after section 2.7.2

# 2.7.1 compute the numerical gradients
def computenumericalgradients(thetas):
    epsilon_check = 0.0001
    thetas_check = np.matrix(np.zeros(thetas.shape))
    numgradients = np.matrix(np.zeros(thetas.shape))
    for i in range(thetas.shape[0]):
        thetas_check[i, 1] = epsilon_check
        # J_plus and J_minus can't be computed this way yet, because costfunction takes different parameters than a single flattened thetas vector
        J_plus = costfunction((thetas + thetas_check))
        J_minus = costfunction((thetas - thetas_check))
        numgradients[i, 1] = (J_plus - J_minus) / (2 * epsilon_check)
        thetas_check[i, 1] = 0
    return numgradients
# 2.7.2 compute the difference
def checkgradients(theta1, theta2, D1, D2):
    thetas = np.c_[theta1[:], theta2[:]]
    gradients_backpropagation = np.c_[D1[:], D2[:]]
    numgradients = computenumericalgradients(thetas)
    difference = gradients_backpropagation - numgradients
    return difference
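A minimal sketch of how the check could be completed by reusing backpropagation_minimize (which already returns both J and the flattened gradient); checking only a random subset of parameters keeps it fast. This is an illustration under those assumptions, not the exercise's reference implementation:

def check_gradients(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate, num_checks=10):
    epsilon_check = 1e-4
    thetas = np.ravel(thetas)
    # gradient computed by backpropagation
    _, grad_backprop = backpropagation_minimize(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate)
    grad_backprop = np.asarray(grad_backprop).ravel()   # make sure it is a flat 1-D array
    # two-sided numerical estimate for a few randomly chosen parameters
    for i in np.random.choice(thetas.shape[0], num_checks, replace=False):
        perturb = np.zeros(thetas.shape[0])
        perturb[i] = epsilon_check
        J_plus, _ = backpropagation_minimize(thetas + perturb, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate)
        J_minus, _ = backpropagation_minimize(thetas - perturb, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate)
        numgrad = (J_plus - J_minus) / (2 * epsilon_check)
        print(i, numgrad, grad_backprop[i])   # the two values should agree to several decimal places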

2.8 gradient descent to find the optimum parameter θ

def training_function(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate):
    # result = opt.minimize(fun=backpropagation_minimize, x0=(thetas), args=(input_units_size, hidden_units_size, output_units_size, X, y, learning_rate), method='CG', jac=True, options={'maxiter': 250})
    result = opt.minimize(fun=backpropagation_minimize, x0=(thetas),args=(input_units_size, hidden_units_size, output_units_size, X, y, learning_rate), method='CG', jac=True, options={'maxiter': 1000})
    print(result)
    theta = result.x
    theta1 = np.matrix(np.reshape(theta[:(hidden_units_size * (input_units_size + 1))], (hidden_units_size, (input_units_size + 1))))
    theta2 = np.matrix(np.reshape(theta[(hidden_units_size * (input_units_size + 1)):], (output_units_size, (hidden_units_size + 1))))
    return theta1, theta2

2.9 test for prediction

def prediction_function(theta1, theta2, X):
    a1, z2, a2, z3, a3, prediction_test = feedforward_propagation(theta1, theta2, X)
    return prediction_test

2.10 main function

if __name__ == '__main__':
    # number of labels
    num_labels = 10
    # read the dataset
    X, y = read_data()
    # visualize the dataset
    visiualize_data(X)
    # randomly initialize weights
    input_units_size = 400
    hidden_units_size = 25
    output_units_size = 10
    epsilon = 0.12
    theta1_initial, theta2_initial = initialize_weights(input_units_size, hidden_units_size, output_units_size, epsilon)
    # implement forward propagation
    a1_initial, z2_initial, a2_initial, z3_initial, a3_initial, prediction_initial = feedforward_propagation(theta1_initial, theta2_initial, X)
    # compute the cost function
    learning_rate = 1
    J_initial = costfunction(theta1_initial, theta2_initial, a3_initial, y, learning_rate)
    print('J_initial = ', J_initial)
    thetas = np.r_[np.reshape(theta1_initial, ((theta1_initial.shape[0] * theta1_initial.shape[1]), 1)), np.reshape(theta2_initial, ((theta2_initial.shape[0] *theta2_initial.shape[1]), 1))]
    # gradient descent to find the optimum parameter θ
    theta1, theta2 = training_function(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate)
    print('θ_1 = ', theta1)
    print('θ_2 = ', theta2)
    # test for prediction
    prediction_test = prediction_function(theta1, theta2, X)
    # compute the accuracy
    print(classification_report(y, prediction_test))
    correct = np.matrix(np.zeros((y.shape[0], 1)))
    for i in range(y.shape[0]):
        if prediction_test[i] == y[i, 0]:
            correct[i, 0] = 1
        else:
            correct[i, 0] = 0
    accuracy = (np.sum(correct) / correct.shape[0]) * 100
    print('accuracy is {0}%'.format(accuracy))

Output:

θ_1 = [[ 0.00093676 0.0327831 -0.08774771 ... 0.01469842 -0.11366838
0.04732772]
[ 0.00541461 0.00313528 -0.01978898 ... -0.06623027 -0.03099227
-0.0627574 ]
[-0.09962305 -0.03641729 0.07920927 ... -0.02946693 -0.09628543
-0.03908287]
...
[-0.06869626 0.06521359 0.00053157 ... 0.00708912 -0.0402725
0.04879262]
[ 0.09867833 0.04986986 0.07670727 ... 0.04147549 0.09081925
0.11860844]
[ 0.03973169 -0.08732779 0.00238041 ... 0.11474445 -0.01584497
-0.02824676]]
θ_2 = [[-0.0751349 0.01220147 -0.00187009 0.10926564 0.1157885 0.09799796
-0.05254812 -0.02862972 0.08970652 -0.0280278 -0.1128644 -0.03723926
-0.11910176 -0.1099583 0.11375186 0.11434061 0.09047654 0.10998851
0.03165789 -0.03080743 0.06411663 -0.07459747 -0.10853152 0.09204101
0.11070508 -0.11126005]
[ 0.03740018 -0.03794814 -0.11878201 0.09821531 0.03147092 0.05811775
0.11927267 0.08902646 -0.02155084 0.00913861 0.01274126 -0.07575387
0.02722365 0.07587161 0.07568325 -0.11079793 -0.0898476 0.09887415
-0.04784729 -0.07902443 -0.00064729 -0.05502922 -0.04877157 0.00964495
-0.09990133 -0.04410787]
[-0.09276947 -0.02575454 -0.05379186 0.01018542 -0.02337626 -0.05306278
0.08630749 0.10712696 -0.05829341 0.09117723 0.03219847 0.09626431
-0.06929835 -0.08147247 -0.06141049 -0.05800682 -0.0667897 -0.11884382
0.05850218 -0.04898403 0.01673019 -0.0828654 0.05606018 -0.06753032
0.11130325 0.1120622 ]
[ 0.06434803 -0.0461836 0.10898068 -0.03483853 -0.0786496 0.06239861
-0.08276314 -0.10419294 0.01761108 0.11485193 -0.10611348 0.02844327
0.01238147 -0.11832113 -0.00106211 0.0896159 0.00494072 0.11194076
0.11991444 0.07693659 0.07436995 0.10781727 0.07065063 0.08087252
-0.08475401 0.04492129]
[ 0.00617256 -0.10020298 0.03600673 -0.00121347 0.01537988 0.08469746
0.11490091 0.01088844 -0.11466824 -0.11164414 -0.09614668 0.00844582
0.01413117 -0.01783809 -0.05669911 -0.03689826 0.0185126 0.08850692
0.06920852 0.05764374 -0.1097547 -0.03367873 0.05408404 -0.1171858
0.04675774 -0.07240418]
[ 0.05956696 -0.01751503 -0.05076935 0.00227092 0.09314919 -0.08683572
-0.00115794 -0.01464717 -0.08990213 -0.1091756 0.10848487 -0.00037829
-0.03633219 -0.01343996 0.0047237 -0.06841683 -0.00717001 0.00165824
-0.05505802 -0.04723638 0.07363739 0.00664931 -0.04622215 0.10349759
0.11233158 0.07261834]
[ 0.02088182 0.08646296 0.09306419 -0.08968333 -0.08750483 -0.04287509
0.08689379 0.0216704 -0.10112271 -0.06484703 -0.10443928 0.08688154
0.00647283 0.00514767 0.03129209 -0.04528363 -0.02947967 -0.09140532
-0.03190937 0.01607406 -0.11620516 0.11551845 -0.05828492 0.09838211
0.08790569 -0.01507423]
[ 0.00880364 0.03918421 -0.11664108 -0.01804836 0.00962997 -0.04744811
0.10806898 0.03297559 -0.11835414 -0.09431796 -0.10174431 0.05553892
-0.08492161 -0.02644175 0.04786368 0.01133686 -0.00406577 -0.10515024
0.10377942 -0.03291552 -0.05442823 0.0770617 0.03870792 -0.04150963
0.03215438 -0.05773026]
[-0.01929414 -0.00929312 -0.08156575 -0.07252875 0.02766874 -0.05009246
0.09333048 0.01501196 0.09234671 0.08167456 -0.04593689 -0.00252267
-0.08298099 0.00891856 0.05796965 0.06729885 -0.10021811 -0.07282444
0.01665613 -0.09640974 -0.07075257 0.05663896 0.00590476 -0.06053229
-0.023461 -0.11240952]
[-0.03532614 0.02855168 0.00041987 -0.07064003 -0.11732709 0.11933669
-0.01733265 -0.10903647 -0.0939109 0.00279991 0.09956265 0.02014653
0.05208287 -0.0077789 0.04401294 0.08964575 -0.08900355 -0.0125574
0.04025017 0.11615695 0.0341768 -0.01047185 -0.06442586 0.02192697
0.08802504 0.07518197]]

          precision    recall  f1-score   support
       1       0.00      0.00      0.00       500
       2       0.00      0.00      0.00       500
       3       0.00      0.00      0.00       500
       4       0.10      1.00      0.18       500
       5       0.00      0.00      0.00       500
       6       0.00      0.00      0.00       500
       7       0.00      0.00      0.00       500
       8       0.00      0.00      0.00       500
       9       0.00      0.00      0.00       500
      10       0.00      0.00      0.00       500
accuracy                           0.10      5000

   macro avg       0.01      0.10      0.02      5000
weighted avg       0.01      0.10      0.02      5000

accuracy is 10.0%