8. Dimensionality Reduction - ZYL-Harry/Machine_Learning_study GitHub Wiki
Function
- data compression:
1. reduce memory/disk needed to store data
2. speed up the learning algorithm
- data visualization
Dimensionality Reduction by K-means
- treat every pixel in the original image as a data example and use the K-means algorithm to find the 16 colors that best group (cluster) the pixels in the 3-dimensional RGB space
- use the 16 colors to replace the pixels in the original image
- The original image required 24 bits for each one of the 128×128 pixel locations, resulting in a total size of 128 × 128 × 24 = 393,216 bits. The new representation requires some overhead storage in the form of a dictionary of 16 colors, each of which requires 24 bits, but the image itself then only requires 4 bits per pixel location. The final number of bits used is therefore 16 × 24 + 128 × 128 × 4 = 65,920 bits, which corresponds to compressing the original image by about a factor of 6.
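The arithmetic can be checked with a few lines of Python (a minimal sketch that simply restates the numbers above):
# storage before and after compression for a 128x128 image with 24-bit RGB pixels
height, width = 128, 128
original_bits = height * width * 24                     # 393,216 bits
dictionary_bits = 16 * 24                               # the 16-colour palette
compressed_bits = dictionary_bits + height * width * 4  # 4 bits index one of 16 colours
print(original_bits, compressed_bits, original_bits / compressed_bits)  # 393216 65920 ~5.97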
The original image:
main function:
'''K-Means Clustering on Pixels'''
import numpy as np
import matplotlib.pyplot as plt
import cv2
from scipy.io import loadmat
# read the picture
path2 = 'bird_small.png'
image1 = cv2.imread(path2)
# convert BGR to RGB and show the picture
image1_RGB = cv2.cvtColor(image1, cv2.COLOR_BGR2RGB)
plt.figure()
plt.imshow(image1_RGB)
plt.show()
# read the pixel data of the image
path3 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_7/ex7/bird_small.mat'
data_image1 = loadmat(path3)
A = data_image1['A']
A = A / 255
X_image = np.reshape(A, ((A.shape[0] * A.shape[1]), 3))
# initialize parameters
K_image = 16
# initialize the centroids randomly
initial_centroids_image = initialize_centroids(X_image, K_image)
# run the k-means algorithm
max_iter = 10
new_centroids_image, X_image_index = k_means(X_image, K_image, initial_centroids_image, max_iter)
print('the primary colours of the image are \n', new_centroids_image)
# visualize the new picture with the primary colours
X_new_image = new_centroids_image[X_image_index.flatten().A[0].astype(int), :]
X_new_image_ndarray = X_new_image.A
X_new_image_show = np.reshape(X_new_image_ndarray, (A.shape[0], A.shape[1], A.shape[2]))
plt.figure()
plt.imshow(X_new_image_show)
plt.show()
initialize the centroids:
# initialize the centroids randomly---choosing the data points as the initial centroids
def initialize_centroids(X, K):
    initial_centroids_index = np.random.randint(0, X.shape[0], K)
    initial_centroids_image = X[initial_centroids_index, :]
    return initial_centroids_image
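The k_means helper called in the main function comes from the K-means part of this exercise and is not repeated on this page; a minimal sketch that is consistent with the call k_means(X_image, K_image, initial_centroids_image, max_iter) and with the np.matrix-style indexing used above could look like this (an illustrative sketch, not the original implementation):
# minimal k-means sketch: returns the centroids and the cluster index of every example,
# both as np.matrix so that the .flatten().A[0] indexing in the main function works
def k_means(X, K, initial_centroids, max_iter):
    centroids = np.matrix(initial_centroids, dtype=float)
    m = X.shape[0]
    idx = np.zeros(m)
    for _ in range(max_iter):
        # assignment step: give every example the index of its closest centroid
        for i in range(m):
            distances = np.sum(np.square(np.asarray(centroids) - X[i, :]), axis=1)
            idx[i] = np.argmin(distances)
        # update step: move every centroid to the mean of the examples assigned to it
        for k in range(K):
            assigned = X[idx == k]
            if assigned.shape[0] > 0:
                centroids[k, :] = assigned.mean(axis=0)
    return centroids, np.matrix(idx).T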
Output:
the primary colours of the image are
[[0.53579899 0.41471567 0.28958293]
[0.39888947 0.28238186 0.18749494]
[0.32387002 0.33175611 0.37674676]
[0.07484301 0.08048187 0.06643599]
[0.15902783 0.16201949 0.15319479]
[0.223894 0.21537059 0.21139567]
[0.85642302 0.70817128 0.46666813]
[0.12074435 0.13021107 0.1212539 ]
[0.0536679 0.05920746 0.0495955 ]
[0.75721517 0.55372908 0.27725737]
[0.10075236 0.1083573 0.09786492]
[0.93647348 0.88152819 0.7562851 ]
[0.08736567 0.09471283 0.08289522]
[0.59247699 0.56876084 0.58917967]
[0.0758476 0.08286445 0.07866745]
[0.06369265 0.06948119 0.06020265]]
Principal Component Analysis
- Goal: find a lower-dimensional surface (spanned by k vectors u^{1}, u^{2}, ..., u^{k}) onto which to project the data so as to minimize the projection error (the sum of the squared distances from the data points to their projections)
- PCA vs. Linear Regression:
Linear Regression: fit a straight line so as to minimize the squared error (the vertical distance, along the y axis) between the data points and the line
PCA: find a straight line (or lower-dimensional surface) so as to minimize the projection error (the shortest, orthogonal distances from the data points to the line)
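Written as cost functions (a restatement in the course's notation, with m training examples):
Linear Regression: minimize (1/m) * sum_{i=1}^{m} (h_{theta}(x^{i}) - y^{i})^2, i.e. the vertical distances to the fitted line
PCA: minimize (1/m) * sum_{i=1}^{m} ||x^{i} - x_{approx}^{i}||^2, i.e. the orthogonal distances from each point to its projection x_{approx}^{i} onto the surface spanned by u^{1}, ..., u^{k}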
Principal Component Analysis Algorithm
- Data preprocessing: mean normalization and feature scaling (as in feature_normalized in the exercise below)
- Idea:
1. x^{i}: the original features (i = 1, 2, ..., m; each x^{i} is n-dimensional)
2. u^{j}: the eigenvectors spanning the selected low-dimensional surface (j = 1, 2, ..., k)
3. z^{i}: the new features in the low-dimensional surface (i = 1, 2, ..., m; each z^{i} is k-dimensional)
- Process: see the sketch below
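In brief, the standard procedure (which the pca function in the exercise below follows) is:
1. After mean normalization, compute the covariance matrix Sigma = (1/m) * X^T * X (an n x n matrix)
2. Compute its eigenvectors with the singular value decomposition: [U, S, V] = svd(Sigma)
3. Keep the first k columns of U as U_reduce (n x k)
4. Project every example onto the low-dimensional surface: z^{i} = U_reduce^T * x^{i} (k-dimensional)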
Reconstruction from compressed representation
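Given a projection z^{i}, an approximation of the original point can be recovered as x_{approx}^{i} = U_reduce * z^{i}; this point lies on the surface spanned by u^{1}, ..., u^{k} and is exactly what reconstruct_data computes in the exercise below.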
Choosing the number of principal components
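A common rule from the course is to choose the smallest k that retains most of the variance: using the singular values S_{11}, ..., S_{nn} returned by the SVD of Sigma, pick the smallest k such that (sum_{i=1}^{k} S_{ii}) / (sum_{i=1}^{n} S_{ii}) >= 0.99, i.e. 99% of the variance is retained.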
Tips for applying PCA
- Supervised learning speedup:
extract the inputs x^{i} (without the labels) to get an unlabeled dataset, run PCA on it to obtain the lower-dimensional z^{i}, and use (z^{i}, y^{i}) as the new training set
- Although having fewer features might seem to make overfitting less likely, applying PCA may throw away valuable information (it never looks at the labels), so even when PCA retains most of the variance it is a bad idea to use it to prevent overfitting instead of adding a regularization term
- Be careful when using PCA:
Only when training on the raw dataset causes problems, such as the algorithm running too slowly or the memory/disk requirements being too large, should we consider using PCA
Exercise in Python
- Mission: use PCA to reduce the data from 2D to 1D
- read the dataset
def read_data(path):
    data = loadmat(path)
    X = data['X']
    return X
- visualize the dataset
def visualize_data(X):
    x1 = X[:, 0]
    x2 = X[:, 1]
    plt.figure()
    plt.scatter(x=x1, y=x2, color='b', marker='o')
    plt.show()
Output:
- normalize the dataset
def feature_normalized(X):
    X_normalized = np.matrix(np.zeros(X.shape))
    for i in range(X.shape[1]):
        X_normalized[:, i] = np.matrix((X[:, i] - X[:, i].mean()) / X[:, i].std()).T
    return X_normalized
- PCA
def pca(X, K):
    Sigma = (1 / X.shape[0]) * (X.T * X)  # covariance matrix: (2,50)*(50,2)=(2,2)
    U, S, V = np.linalg.svd(Sigma)
    U_reduce = U[:, :K]  # first K eigenvectors: (2,1)
    Z = (U_reduce.T * X.T).T  # new features (principal components): ((1,2)*(2,50)).T=(50,1)
    return Z, U_reduce
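The singular values S computed inside pca are not used above; a small additional sketch (not part of the original exercise, with the hypothetical name variance_retained) shows how they give the fraction of variance kept for a chosen K, matching the 99% rule mentioned earlier (X is the normalized data as an np.matrix, as above):
# fraction of variance retained when keeping the first K principal components
def variance_retained(X, K):
    Sigma = (1 / X.shape[0]) * (X.T * X)
    U, S, V = np.linalg.svd(Sigma)
    return np.sum(S[:K]) / np.sum(S)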
Output:
projections of the original features on the principal components are
[[ 1.49631261]
[-0.92218067]
[ 1.22439232]
[ 1.64386173]
[ 1.2732206 ]
[-0.97681976]
[ 1.26881187]
[-2.34148278]
[-0.02999141]
[-0.78171789]
[-0.6316777 ]
[-0.55280135]
[-0.0896816 ]
[-0.5258541 ]
[ 1.56415455]
[-1.91610366]
[-0.88679735]
[ 0.95607375]
[-2.32995679]
[-0.47793862]
[-2.21747195]
[ 0.38900633]
[-1.78482346]
[ 0.05175486]
[ 1.66512392]
[ 0.50813572]
[-1.23711018]
[-1.17198677]
[ 0.84221686]
[-0.00693174]
[-0.22794195]
[-1.51309518]
[ 1.33874082]
[-0.5925244 ]
[ 0.67907605]
[-1.35298 ]
[ 1.68749495]
[-1.39235931]
[ 2.55992598]
[-0.27850702]
[-0.97677692]
[ 0.88820006]
[ 1.29666127]
[-0.98966774]
[ 1.81272352]
[-0.27196356]
[ 3.19297722]
[ 1.21299151]
[ 0.36792871]
[-1.44264131]]
- reconstruct the dataset
def reconstruct_data(Z, U_reduce):
    X_approximate = (U_reduce * Z.T).T  # ((2,1)*(1,50)).T=(50,2)
    return X_approximate
Output:
approximations of the original features are:
[[-1.05805279 -1.05805279]
[ 0.65208021 0.65208021]
[-0.86577611 -0.86577611]
[-1.16238578 -1.16238578]
[-0.90030292 -0.90030292]
[ 0.69071588 0.69071588]
[-0.89718548 -0.89718548]
[ 1.65567835 1.65567835]
[ 0.02120713 0.02120713]
[ 0.55275802 0.55275802]
[ 0.44666359 0.44666359]
[ 0.39088959 0.39088959]
[ 0.06341447 0.06341447]
[ 0.371835 0.371835 ]
[-1.10602429 -1.10602429]
[ 1.35488989 1.35488989]
[ 0.62706042 0.62706042]
[-0.67604623 -0.67604623]
[ 1.64752825 1.64752825]
[ 0.33795364 0.33795364]
[ 1.56798945 1.56798945]
[-0.27506901 -0.27506901]
[ 1.26206077 1.26206077]
[-0.03659622 -0.03659622]
[-1.17742041 -1.17742041]
[-0.35930621 -0.35930621]
[ 0.874769 0.874769 ]
[ 0.82871979 0.82871979]
[-0.59553725 -0.59553725]
[ 0.00490148 0.00490148]
[ 0.1611793 0.1611793 ]
[ 1.06991986 1.06991986]
[-0.94663271 -0.94663271]
[ 0.41897802 0.41897802]
[-0.48017928 -0.48017928]
[ 0.95670134 0.95670134]
[-1.19323912 -1.19323912]
[ 0.98454671 0.98454671]
[-1.81014102 -1.81014102]
[ 0.1969342 0.1969342 ]
[ 0.69068559 0.69068559]
[-0.62805228 -0.62805228]
[-0.91687797 -0.91687797]
[ 0.69980077 0.69980077]
[-1.28178909 -1.28178909]
[ 0.19230728 0.19230728]
[-2.25777584 -2.25777584]
[-0.85771452 -0.85771452]
[-0.26016489 -0.26016489]
[ 1.02010145 1.02010145]]
- draw connecting lines
def draw_connections(X_normalized, X_approximate):
    plt.figure()
    plt.scatter(x=X_normalized[:, 0].flatten().A[0], y=X_normalized[:, 1].flatten().A[0], marker='o', facecolors='none', edgecolors='b')
    plt.scatter(x=X_approximate[:, 0].flatten().A[0], y=X_approximate[:, 1].flatten().A[0], marker='o', facecolors='none', edgecolors='r')
    for i in range(X_normalized.shape[0]):
        x1 = X_normalized[i, 0]
        y1 = X_normalized[i, 1]
        x2 = X_approximate[i, 0]
        y2 = X_approximate[i, 1]
        plt.plot([x1, x2], [y1, y2], color='k')
    plt.axis([-4, 3, -4, 3])
    plt.show()
Output:
- main function
if __name__ == '__main__':
    # read the dataset
    path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_7/ex7/ex7data1.mat'
    X1 = read_data(path1)
    # visualize the dataset
    visualize_data(X1)
    # data preprocessing
    # normalization
    X1_normalized = feature_normalized(X1)
    # run the PCA algorithm
    K1 = 1
    Z1, U1_reduce = pca(X1_normalized, K1)
    print('projections of the original features on the principal components are \n', Z1)
    # reconstruct the approximation of the data
    X1_approximate = reconstruct_data(Z1, U1_reduce)
    print('approximations of the original features are: \n', X1_approximate)
    # draw lines connecting the projections to the original data points
    draw_connections(X1_normalized, X1_approximate)