Neural Networks
Complex non-linear hypotheses
- Problem: in non-linear classification, when the number of features is large, including all the high-order polynomial terms blows up the feature set, so the hypothesis easily overfits the training set and the computation becomes very expensive
- Method: use a model that can represent complex non-linear hypotheses directly, i.e. a neural network
Introduction
Background
- Origins: Algorithms that try to mimic the brain
- Development: Was very widely used in the 80s and early 90s; its popularity diminished in the late 90s
- Recent resurgence: State-of-the-art technique for many applications
Concept
- The "one learning algorithm" hypothesis: a single learning algorithm can learn to process sight, sound, or touch, instead of the brain needing to implement a thousand different programs for the thousand wonderful things it does
- Examples:
- Inspiration:
  Tip: the dimensions of θ can be understood from the matrix multiplication it performs: θ^(j) has one row for each unit in layer j+1 and one column for each unit in layer j, plus one extra column for the bias unit
- Computation:
  Forward propagation: start with the activations of the input units, forward-propagate them to compute the activations of the hidden layer, and then forward-propagate those to compute the activations of the output layer (a minimal sketch appears after this list)
- Features:
  1. To say it again: what the neural network is doing is just like logistic regression, except that rather than using the original features x1, x2, x3, it uses the new features a1, a2, a3
  2. The features a1, a2, a3 are themselves learned as functions of the input; the mapping from layer 1 to layer 2 is determined by another set of parameters θ, so instead of being constrained to feed the raw features x1, x2, x3 into logistic regression, the network gets to learn the new features a1, a2, a3 to feed into logistic regression
  3. Depending on the parameters that are chosen, the network can form interesting and complex features and thus a better hypothesis than using the raw features x1, x2, x3 directly; the algorithm has the flexibility to learn whatever features work best as the inputs a1, a2, a3 of the last unit, which is essentially a logistic regression unit
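As a concrete illustration of the forward-propagation computation described above, here is a minimal sketch for a 3-layer network (input layer, one hidden layer, output layer). The layer sizes and the randomly generated weight matrices Theta1 and Theta2 are placeholders for illustration only, not values from the exercise below.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# placeholder sizes: 3 input features, 5 hidden units, 1 output unit
Theta1 = np.random.randn(5, 4) * 0.1   # maps layer 1 -> layer 2, shape (s2, s1 + 1)
Theta2 = np.random.randn(1, 6) * 0.1   # maps layer 2 -> layer 3, shape (s3, s2 + 1)

def forward_propagate(x, Theta1, Theta2):
    a1 = np.insert(x, 0, 1)            # input activations with the bias unit added
    z2 = Theta1 @ a1                   # weighted inputs of the hidden layer
    a2 = np.insert(sigmoid(z2), 0, 1)  # hidden-layer activations plus bias unit
    z3 = Theta2 @ a2                   # weighted inputs of the output layer
    a3 = sigmoid(z3)                   # output-layer activations, i.e. h_θ(x)
    return a3

print(forward_propagate(np.array([1.0, 0.5, -1.2]), Theta1, Theta2))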
Examples
Show how a neural network can learn complex non-linear hypotheses, for example by combining simple AND, OR, and NOT-like units into the XNOR function, as in the sketch below
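The classic course example builds XNOR from simpler logical units with hand-picked weights (roughly ±20 with a large bias). The values below are the usual illustrative ones, not learned parameters; the sketch just shows that stacking sigmoid units yields a non-linear hypothesis.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hand-picked weights, each vector is [bias, w1, w2]
theta_and = np.array([-30.0, 20.0, 20.0])    # hidden unit 1 ≈ x1 AND x2
theta_nor = np.array([10.0, -20.0, -20.0])   # hidden unit 2 ≈ (NOT x1) AND (NOT x2)
theta_or = np.array([-10.0, 20.0, 20.0])     # output unit ≈ OR of the two hidden units

def xnor(x1, x2):
    a1 = np.array([1.0, x1, x2])                                            # input activations with bias
    a2 = np.array([1.0, sigmoid(theta_and @ a1), sigmoid(theta_nor @ a1)])  # hidden activations with bias
    return sigmoid(theta_or @ a2)                                           # h ≈ x1 XNOR x2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(float(xnor(x1, x2))))   # prints 1 for (0,0) and (1,1), 0 otherwise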
Multi-class classification
Binary classification & Multi-class classification
- Binary classification: one output unit, so the hypothesis h_θ(x) is a single real number
- Multi-class classification: K output units, so the hypothesis h_θ(x) is a K-dimensional vector (each class is represented by an indicator vector, as in the small example below)
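For K classes, the label y is usually recoded as a K-dimensional indicator vector before computing the cost; a small illustrative example (the array of labels here is made up):

import numpy as np

y = np.array([3, 1, 2])              # example integer labels in the range 1..K
K = 3
y_vector = np.zeros((y.size, K))     # one row per example, one column per class
y_vector[np.arange(y.size), y - 1] = 1
print(y_vector)                      # [[0,0,1], [1,0,0], [0,1,0]]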
Cost function
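Written out for m training examples, K output units, L layers, and layer sizes s_l, the regularized cost function from the course (which the exercise code below implements) is:

J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[\, y_k^{(i)}\log\big((h_\Theta(x^{(i)}))_k\big) + \big(1-y_k^{(i)}\big)\log\big(1-(h_\Theta(x^{(i)}))_k\big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{ji}^{(l)}\big)^2

Note that the regularization term sums only over the non-bias weights (the index i starts at 1, skipping the bias column of each Θ).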
Backpropagation algorithm
- Forward propagation:
- Back propagation: (the update equations are written out below)
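For the 3-layer network used in the exercise below, the standard formulation from the course computes an "error" term δ for every layer after the input and accumulates the gradients from it (∘ denotes the element-wise product and g the sigmoid):

\delta^{(3)} = a^{(3)} - y, \qquad \delta^{(2)} = \big(\Theta^{(2)}\big)^{T}\delta^{(3)} \circ g'\big(z^{(2)}\big), \qquad g'(z) = g(z)\big(1-g(z)\big)

\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}\big(a^{(l)}\big)^{T}, \qquad D^{(l)} = \frac{1}{m}\Delta^{(l)} + \frac{\lambda}{m}\Theta^{(l)}\ \text{(bias column not regularized)}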
Gradient checking
- Problem: the implementation may appear to work, yet because of a subtle bug the result ends up noticeably worse than what a bug-free implementation would achieve
- Method: check the gradients numerically before training (a minimal sketch follows below)
- When implementing back-propagation or a similar gradient-descent algorithm for a complicated model, gradient checking is essential to make sure the code is correct
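A minimal sketch of the two-sided numerical estimate used for gradient checking; here cost stands for any function that returns J(θ) for an unrolled parameter vector (both names are placeholders, not part of the exercise code):

import numpy as np

def numerical_gradient(cost, theta, eps=1e-4):
    # two-sided difference: (J(θ + ε) - J(θ - ε)) / (2ε), one component at a time
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        grad[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * eps)
    return grad

# sanity check: the gradient of J(θ) = Σ θ² should be close to 2θ
theta = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda t: np.sum(t ** 2), theta))   # approximately [2.0, -4.0, 1.0]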
Random initialization
- Problem:
  Initializing all of the parameters θ to 0 does not work when training a neural network: every hidden unit then computes the same function of the input and keeps doing so after every update, so the symmetry is never broken
- Method:
  To summarize, to train a neural network we should first randomly initialize the weights to small values between -ε and ε (a minimal sketch follows below), then implement back-propagation, do gradient checking, and finally use either gradient descent or one of the advanced optimization algorithms to find the parameters θ that minimize J(θ)
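A minimal sketch of this initialization for one weight matrix (the sizes and ε are simply the values used later in the exercise, shown here for illustration):

import numpy as np

# uniform random values in [-ε, ε) for a weight matrix of shape (s_next, s_prev + 1)
epsilon = 0.12
s_prev, s_next = 400, 25             # e.g. 400 input units feeding 25 hidden units
Theta1_init = np.random.rand(s_next, s_prev + 1) * 2 * epsilon - epsilon
print(Theta1_init.shape, Theta1_init.min(), Theta1_init.max())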
Procedure
- Pick a network architecture (connectivity pattern between neurons)
  ① No. of input units: dimension of the features x
  ② No. of output units: number of classes y
  ③ Reasonable default: 1 hidden layer; if >1 hidden layer, use the same number of hidden units in every layer (usually, the more hidden units the better)
- Randomly initialize the weights
- Implement forward propagation to get h_θ(x) for any x
- Implement code to compute the cost function J(θ)
- Implement back-propagation to compute the partial derivatives
- Use gradient checking to compare the partial derivatives computed with backpropagation against a numerical estimate of the gradient of J(θ), then disable the gradient-checking code
- Use gradient descent or an advanced optimization method with backpropagation to try to minimize J(θ) (a non-convex function) as a function of the parameters θ
Exercise in Python
1. Neural Networks with the feedforward propagation algorithm
1.1 read the prepared dataset
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.io import loadmat

def read_data():
    path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_3/ex3/ex3data1.mat'
    data1 = loadmat(path1)
    X = data1['X']
    y = data1['y']
    return X, y
1.2 visualize the dataset
def visuial_data(X, y):
rand_indices = np.random.choice(X.shape[0], 100)
# print(rand_indices.shape)
select = X[rand_indices, :]
# print(select.shape)
fig, ax = plt.subplots(nrows=10, ncols=10, sharex=True, sharey=True)
# print(ax.shape)
for r in range(10):
for c in range(10):
show = select[10*r+c, :].reshape((20, 20)).T
ax[r, c].matshow(show, cmap=cm.binary)
plt.xticks([])
plt.yticks([])
plt.show()
Output:
1.3 read the trained parameter θ
def read_weights():
path2 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_3/ex3/ex3weights.mat'
data2 = loadmat(path2)
theta1 = data2['Theta1']
theta2 = data2['Theta2']
return theta1, theta2
1.4 feedforward propagation
1.4.1 sigmoid function
def sigmoidfunction(theta, X):
theta = np.matrix(theta)
X = np.matrix(X)
f = 1 / (1 + np.exp(- X * theta.T)) # (5000,401)*(401,25)=(5000,25); (5000,26)*(26,10)=(5000,10)
return f
1.4.2 feedforward propagation classification
def classification(theta1, theta2, X):
    X_new = np.c_[np.matrix(np.ones((X.shape[0], 1))), X]
    a1 = X_new  # (5000,401), input activations with the bias column added
    # z2 = theta1 * a1.T  # (25,401)*(401,5000)
    a2 = np.c_[np.matrix(np.ones((X.shape[0], 1))), sigmoidfunction(theta1, a1)]  # (5000,26), hidden activations with a bias column
    a3 = sigmoidfunction(theta2, a2)  # (5000,10), output activations h_θ(x)
    prediction_compute = np.argmax(a3, axis=1) + 1  # labels are 1-10, so shift the 0-based argmax by 1
    return prediction_compute, a3  # a3 is returned as well so that the cost function can be evaluated in the main function
1.4.3 check the accuracy
def check_accuracy(y, prediction):
correct = np.matrix(np.zeros((y.shape[0], 1)))
for i in range(y.shape[0]):
if prediction[i, 0] == y[i, 0]:
correct[i, 0] = 1
else:
correct[i, 0] = 0
accuracy_compute = (np.sum(correct) / correct.shape[0]) * 100
return accuracy_compute
1.4.4 regularized cost function
def costfunction(theta1, theta2, a3, y, learning_rate):
    theta1 = np.matrix(theta1)
    theta2 = np.matrix(theta2)
    a3 = np.matrix(a3)
    y = np.matrix(y)
    y_vector = np.matrix(np.zeros((y.shape[0], 10)))  # convert the labels into 10-dimensional indicator vectors
    for i in range(y_vector.shape[0]):
        y_vector[i, y[i, 0]-1] = 1
    first = np.sum(- np.multiply(y_vector, np.log(a3)) - np.multiply((1 - y_vector), np.log(1 - a3))) / a3.shape[0]  # element-wise products of (5000,10) matrices, summed over all examples and classes
    second = (learning_rate / (2 * a3.shape[0])) * (np.sum(np.power(theta1, 2)) + np.sum(np.power(theta2, 2)))  # note: this also regularizes the bias columns; strictly, only theta1[:, 1:] and theta2[:, 1:] should be penalized
    J = first + second
    return J
1.5 main function
if __name__ == '__main__':
    # number of labels
    num_labels = 10  # labels: 1-10
    # read the dataset from the file
    X, y = read_data()
    # visualize the dataset
    visuial_data(X, y)
    # read the trained parameters θ
    theta1, theta2 = read_weights()
    # classification (also returns the output activations a3 for the cost computation below)
    prediction, a3 = classification(theta1, theta2, X)
    # compute the accuracy
    accuracy = check_accuracy(y, prediction)
    print('accuracy is {0}%'.format(accuracy))
    # compute the cost function
    learning_rate = 1
    J = costfunction(theta1, theta2, a3, y, learning_rate)
    print(J)
Output:
accuracy is 97.52%
J = 0.384487796242894
2. Neural Networks with backpropagation
The code below seems to be correct, but the result is not satisfactory, so it has only been tidied up temporarily
2.1 read the dataset
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.io import loadmat
import scipy.optimize as opt
from sklearn.metrics import classification_report

def read_data():
    path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_4/ex4/ex4data1.mat'
    data1 = loadmat(path1)
    X = data1['X']
    y = data1['y']
    return X, y
2.2 visualize the dataset
def visiualize_data(X):
rand_indices = np.random.choice(X.shape[0], 100)
X_select = X[rand_indices, :]
fig, ax = plt.subplots(nrows=10, ncols=10, sharex=True, sharey=True)
for r in range(10):
for c in range(10):
show = X_select[10*r+c, :].reshape((20, 20)).T
ax[r, c].matshow(show, cmap=cm.binary)
plt.xticks([])
plt.yticks([])
plt.show()
Output:
2.3 randomly initialize weights
def initialize_weights(input_units_size, hidden_units_size, output_units_size, epsilon):
theta1_initial = np.matrix(np.random.random((hidden_units_size, (input_units_size + 1))) * 2 * epsilon - epsilon) # (25,401)
theta2_initial = np.matrix(np.random.random((output_units_size, (hidden_units_size + 1))) * 2 * epsilon - epsilon) # (10,26)
return theta1_initial, theta2_initial
2.4 implement forward propagation to get h(x) for any x
# 2.4.1 sigmoid function
def sigmoidfunction(theta, X):
theta = np.matrix(theta)
X = np.matrix(X)
f = 1 / (1 + np.exp(- X * theta.T)) # (5000,401)*(401,25)=(5000,25), (5000,26)*(26,10)=(5000,10)
return f
# 2.4.2 feedforward propagation
def feedforward_propagation(theta1, theta2, X):
X_new = np.c_[np.matrix(np.ones((X.shape[0], 1))), X]
a1 = X_new # (5000,401)
z2 = a1 * theta1.T # (5000,401)*(401,25)=(5000,25)
a2 = np.c_[np.matrix(np.ones((X.shape[0], 1))), sigmoidfunction(theta1, a1)] # (5000,26)
z3 = a2 * theta2.T # (5000,26)*(26,10)=(5000,10)
a3 = sigmoidfunction(theta2, a2) # (5000,10)
prediction_compute = np.argmax(a3, axis=1) + 1
return a1, z2, a2, z3, a3, prediction_compute
2.5 compute cost function
def costfunction(theta1, theta2, a3, y, learning_rate):
    theta1 = np.matrix(theta1)
    theta2 = np.matrix(theta2)
    a3 = np.matrix(a3)
    y = np.matrix(y)
    y_vector = np.matrix(np.zeros((y.shape[0], 10)))  # convert the labels into 10-dimensional indicator vectors
    for i in range(y_vector.shape[0]):
        y_vector[i, y[i, 0]-1] = 1
    first = np.sum(- np.multiply(y_vector, np.log(a3)) - np.multiply((1 - y_vector), np.log(1 - a3))) / a3.shape[0]  # element-wise products of (5000,10) matrices, summed over all examples and classes
    second = (learning_rate / (2 * a3.shape[0])) * (np.sum(np.power(theta1, 2)) + np.sum(np.power(theta2, 2)))  # note: this also regularizes the bias columns; strictly, only theta1[:, 1:] and theta2[:, 1:] should be penalized
    J = first + second
    return J
2.6 implement backpropagation to compute partial derivatives
# 2.6.1 sigmoid gradient
def sigmoid_gradient(z):
f = 1 / (1 + np.exp(- z))
f_derivative = np.multiply(f, (1 - f))
return f_derivative
# backpropagation for training the neural network: a comprehensive function
# in the backpropagation, this function does all the jobs, including computing the cost function and the gradients
def backpropagation_minimize(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate):
y = np.matrix(y)
# separate the thetas into theta1 and theta2
theta1 = np.matrix(np.reshape(thetas[:(hidden_units_size * (input_units_size + 1))], (hidden_units_size, (input_units_size + 1))))
theta2 = np.matrix(np.reshape(thetas[(hidden_units_size * (input_units_size + 1)):], (output_units_size, (hidden_units_size + 1))))
# implement the forward propagation
a1, z2, a2, z3, a3, prediction_compute = feedforward_propagation(theta1, theta2, X)
# compute the costfunction
J = costfunction(theta1, theta2, a3, y, learning_rate)
# initialize the delta1 and delta2
delta1 = np.matrix(np.zeros(theta1.shape))
delta2 = np.matrix(np.zeros(theta2.shape))
for t in range(y.shape[0]):
delta3_s = a3[t, :] - y[t, :] # (1,10)
a1_t = a1[t, :] # (1,401)
z2_t = np.c_[np.matrix(np.zeros((1, 1))), z2[t, :]] # (1,26)
a2_t = a2[t, :] # (1,26)
delta2_s = np.multiply((theta2.T * delta3_s.T).T, sigmoid_gradient(z2_t)) # (1,26)
delta2 = delta2 + delta3_s.T * a2_t # (10,1)*(1,26)=(10,26)
delta1 = delta1 + delta2_s[0, 1:].T * a1_t # (25,1)*(1,401)=(25,401)
D1 = delta1 / y.shape[0] # (25,401)
D2 = delta2 / y.shape[0] # (10,26)
    # regularized neural network: add (λ/m)·θ to the non-bias columns of the gradients
    D1[:, 1:] = D1[:, 1:] + ((learning_rate) / (y.shape[0])) * theta1[:, 1:]
    D2[:, 1:] = D2[:, 1:] + ((learning_rate) / (y.shape[0])) * theta2[:, 1:]
    # unroll the gradients D1 and D2 into a single vector for the optimizer
    D_matrix = np.r_[np.reshape(D1, ((D1.shape[0] * D1.shape[1]), 1)), np.reshape(D2, ((D2.shape[0] * D2.shape[1]), 1))]
    # print(D_matrix.shape)
    D = np.ravel(D_matrix)
    return J, D
2.7 gradient checking
I haven't completed this part of the code (a possible way to finish it is sketched after the two functions below)
# 2.7.1 compute the numerical gradients
def computenumericalgradients(thetas):
epsilon_check = 0.0001
thetas_check = np.matrix(np.zeros(thetas.shape))
numgradients = np.matrix(np.zeros(thetas.shape))
for i in range(thetas.shape[0]):
thetas_check[i, 1] = epsilon_check
# J_plus and J_minus haven't been completed due to the different parameters with the costfunction
J_plus = costfunction((thetas + thetas_check))
J_minus = costfunction((thetas - thetas_check))
numgradients[i, 1] = (J_plus - J_minus) / (2 * epsilon_check)
thetas_check[i, 1] = 0
return numgradients
# 2.7.2 compute the difference
def checkgradients(theta1, theta2, D1, D2):
thetas = np.c_[theta1[:], theta2[:]]
gradients_backpropagation = np.c_[D1[:], D2[:]]
numgradients = computenumericalgradients(thetas)
difference = gradients_backpropagation - numgradients
return difference
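A minimal sketch of one way to finish this step, assuming the backpropagation_minimize function above is reused both for the cost value and for the backpropagation gradient (the function names here are hypothetical, and gradient checking is slow, so it would normally be run on a small network or a subset of the data):

def compute_numerical_gradients_sketch(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate):
    epsilon_check = 0.0001
    thetas = np.ravel(thetas)
    numgradients = np.zeros(thetas.shape[0])
    for i in range(thetas.shape[0]):
        thetas_plus = thetas.copy()
        thetas_minus = thetas.copy()
        thetas_plus[i] = thetas_plus[i] + epsilon_check
        thetas_minus[i] = thetas_minus[i] - epsilon_check
        # reuse backpropagation_minimize only for the cost value J it returns
        J_plus, _ = backpropagation_minimize(thetas_plus, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate)
        J_minus, _ = backpropagation_minimize(thetas_minus, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate)
        numgradients[i] = (J_plus - J_minus) / (2 * epsilon_check)
    return numgradients

def check_gradients_sketch(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate):
    # the backpropagation gradient is the second value returned by backpropagation_minimize
    J, D = backpropagation_minimize(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate)
    numgradients = compute_numerical_gradients_sketch(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate)
    return np.ravel(D) - numgradients    # every entry should be very small if backpropagation is correct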
2.8 gradient descent to find the optimal parameters θ
def training_function(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate):
# result = opt.minimize(fun=backpropagation_minimize, x0=(thetas), args=(input_units_size, hidden_units_size, output_units_size, X, y, learning_rate), method='CG', jac=True, options={'maxiter': 250})
result = opt.minimize(fun=backpropagation_minimize, x0=(thetas),args=(input_units_size, hidden_units_size, output_units_size, X, y, learning_rate), method='CG', jac=True, options={'maxiter': 1000})
print(result)
theta = result.x
theta1 = np.matrix(np.reshape(theta[:(hidden_units_size * (input_units_size + 1))], (hidden_units_size, (input_units_size + 1))))
theta2 = np.matrix(np.reshape(theta[(hidden_units_size * (input_units_size + 1)):], (output_units_size, (hidden_units_size + 1))))
return theta1, theta2
2.9 test for prediction
def prediction_function(theta1, theta2, X):
a1, z2, a2, z3, a3, prediction_test = feedforward_propagation(theta1, theta2, X)
return prediction_test
2.10 main function
if __name__ == '__main__':
# number of labels
num_labels = 10
# read the dataset
X, y = read_data()
    # visualize the dataset
visiualize_data(X)
# randomly initialize weights
input_units_size = 400
hidden_units_size = 25
output_units_size = 10
epsilon = 0.12
theta1_initial, theta2_initial = initialize_weights(input_units_size, hidden_units_size, output_units_size, epsilon)
# implement forward propagation
a1_initial, z2_initial, a2_initial, z3_initial, a3_initial, prediction_initial = feedforward_propagation(theta1_initial, theta2_initial, X)
# compute the cost function
learning_rate = 1
J_initial = costfunction(theta1_initial, theta2_initial, a3_initial, y, learning_rate)
print('J_initial = ', J_initial)
thetas = np.r_[np.reshape(theta1_initial, ((theta1_initial.shape[0] * theta1_initial.shape[1]), 1)), np.reshape(theta2_initial, ((theta2_initial.shape[0] *theta2_initial.shape[1]), 1))]
# gradient descent to find the optimum parameter θ
theta1, theta2 = training_function(thetas, input_units_size, hidden_units_size, output_units_size, X, y, learning_rate)
print('θ_1 = ', theta1)
print('θ_2 = ', theta2)
# test for prediction
prediction_test = prediction_function(theta1, theta2, X)
# compute the accuracy
print(classification_report(y, prediction_test))
correct = np.matrix(np.zeros((y.shape[0], 1)))
for i in range(y.shape[0]):
if prediction_test[i] == y[i, 0]:
correct[i, 0] = 1
else:
correct[i, 0] = 0
accuracy = (np.sum(correct) / correct.shape[0]) * 100
print('accuracy is {0}%'.format(accuracy))
Output:
θ_1 = [[ 0.00093676 0.0327831 -0.08774771 ... 0.01469842 -0.11366838
0.04732772]
[ 0.00541461 0.00313528 -0.01978898 ... -0.06623027 -0.03099227
-0.0627574 ]
[-0.09962305 -0.03641729 0.07920927 ... -0.02946693 -0.09628543
-0.03908287]
...
[-0.06869626 0.06521359 0.00053157 ... 0.00708912 -0.0402725
0.04879262]
[ 0.09867833 0.04986986 0.07670727 ... 0.04147549 0.09081925
0.11860844]
[ 0.03973169 -0.08732779 0.00238041 ... 0.11474445 -0.01584497
-0.02824676]]
θ_2 = [[-0.0751349 0.01220147 -0.00187009 0.10926564 0.1157885 0.09799796
-0.05254812 -0.02862972 0.08970652 -0.0280278 -0.1128644 -0.03723926
-0.11910176 -0.1099583 0.11375186 0.11434061 0.09047654 0.10998851
0.03165789 -0.03080743 0.06411663 -0.07459747 -0.10853152 0.09204101
0.11070508 -0.11126005]
[ 0.03740018 -0.03794814 -0.11878201 0.09821531 0.03147092 0.05811775
0.11927267 0.08902646 -0.02155084 0.00913861 0.01274126 -0.07575387
0.02722365 0.07587161 0.07568325 -0.11079793 -0.0898476 0.09887415
-0.04784729 -0.07902443 -0.00064729 -0.05502922 -0.04877157 0.00964495
-0.09990133 -0.04410787]
[-0.09276947 -0.02575454 -0.05379186 0.01018542 -0.02337626 -0.05306278
0.08630749 0.10712696 -0.05829341 0.09117723 0.03219847 0.09626431
-0.06929835 -0.08147247 -0.06141049 -0.05800682 -0.0667897 -0.11884382
0.05850218 -0.04898403 0.01673019 -0.0828654 0.05606018 -0.06753032
0.11130325 0.1120622 ]
[ 0.06434803 -0.0461836 0.10898068 -0.03483853 -0.0786496 0.06239861
-0.08276314 -0.10419294 0.01761108 0.11485193 -0.10611348 0.02844327
0.01238147 -0.11832113 -0.00106211 0.0896159 0.00494072 0.11194076
0.11991444 0.07693659 0.07436995 0.10781727 0.07065063 0.08087252
-0.08475401 0.04492129]
[ 0.00617256 -0.10020298 0.03600673 -0.00121347 0.01537988 0.08469746
0.11490091 0.01088844 -0.11466824 -0.11164414 -0.09614668 0.00844582
0.01413117 -0.01783809 -0.05669911 -0.03689826 0.0185126 0.08850692
0.06920852 0.05764374 -0.1097547 -0.03367873 0.05408404 -0.1171858
0.04675774 -0.07240418]
[ 0.05956696 -0.01751503 -0.05076935 0.00227092 0.09314919 -0.08683572
-0.00115794 -0.01464717 -0.08990213 -0.1091756 0.10848487 -0.00037829
-0.03633219 -0.01343996 0.0047237 -0.06841683 -0.00717001 0.00165824
-0.05505802 -0.04723638 0.07363739 0.00664931 -0.04622215 0.10349759
0.11233158 0.07261834]
[ 0.02088182 0.08646296 0.09306419 -0.08968333 -0.08750483 -0.04287509
0.08689379 0.0216704 -0.10112271 -0.06484703 -0.10443928 0.08688154
0.00647283 0.00514767 0.03129209 -0.04528363 -0.02947967 -0.09140532
-0.03190937 0.01607406 -0.11620516 0.11551845 -0.05828492 0.09838211
0.08790569 -0.01507423]
[ 0.00880364 0.03918421 -0.11664108 -0.01804836 0.00962997 -0.04744811
0.10806898 0.03297559 -0.11835414 -0.09431796 -0.10174431 0.05553892
-0.08492161 -0.02644175 0.04786368 0.01133686 -0.00406577 -0.10515024
0.10377942 -0.03291552 -0.05442823 0.0770617 0.03870792 -0.04150963
0.03215438 -0.05773026]
[-0.01929414 -0.00929312 -0.08156575 -0.07252875 0.02766874 -0.05009246
0.09333048 0.01501196 0.09234671 0.08167456 -0.04593689 -0.00252267
-0.08298099 0.00891856 0.05796965 0.06729885 -0.10021811 -0.07282444
0.01665613 -0.09640974 -0.07075257 0.05663896 0.00590476 -0.06053229
-0.023461 -0.11240952]
[-0.03532614 0.02855168 0.00041987 -0.07064003 -0.11732709 0.11933669
-0.01733265 -0.10903647 -0.0939109 0.00279991 0.09956265 0.02014653
0.05208287 -0.0077789 0.04401294 0.08964575 -0.08900355 -0.0125574
0.04025017 0.11615695 0.0341768 -0.01047185 -0.06442586 0.02192697
0.08802504 0.07518197]]
              precision    recall  f1-score   support
           1       0.00      0.00      0.00       500
           2       0.00      0.00      0.00       500
           3       0.00      0.00      0.00       500
           4       0.10      1.00      0.18       500
           5       0.00      0.00      0.00       500
           6       0.00      0.00      0.00       500
           7       0.00      0.00      0.00       500
           8       0.00      0.00      0.00       500
           9       0.00      0.00      0.00       500
          10       0.00      0.00      0.00       500
    accuracy                           0.10      5000
   macro avg       0.01      0.10      0.02      5000
weighted avg       0.01      0.10      0.02      5000
accuracy is 10.0%