8. Dimensionality Reduction - ZYL-Harry/Machine_Learning_study GitHub Wiki
Function
- data compression:
1. reduce memory/disk needed to store data
2. speed up the learning algorithm
- data visualization
Dimensionality Reduction by K-means
- treat every pixel in the original image as a data example and use the K-means algorithm to find the 16 colors that best group (cluster) the pixels in the 3-dimensional RGB space
- use the 16 colors to replace the pixels in the original image
- The original image required 24 bits for each one of the 128×128 pixel locations, resulting in a total size of 128 × 128 × 24 = 393,216 bits. The new representation requires some overhead storage in the form of a dictionary of 16 colors, each of which requires 24 bits, but the image itself then only requires 4 bits per pixel location. The final number of bits used is therefore 16 × 24 + 128 × 128 × 4 = 65,920 bits, which corresponds to compressing the original image by about a factor of 6.
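The arithmetic can be checked with a few lines of Python (a minimal sketch that simply restates the numbers above):
# storage before and after compression for a 128x128 image with 24-bit RGB pixels
height, width = 128, 128
original_bits = height * width * 24                     # 393,216 bits
dictionary_bits = 16 * 24                               # the 16-colour palette
compressed_bits = dictionary_bits + height * width * 4  # 4 bits index one of 16 colours
print(original_bits, compressed_bits, original_bits / compressed_bits)  # 393216 65920 ~5.97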
The original image:
main function:
'''K-Means Clustering on Pixels'''
import numpy as np
import matplotlib.pyplot as plt
import cv2
from scipy.io import loadmat
# read the picture
path2 = 'bird_small.png'
image1 = cv2.imread(path2)
# convert BGR to RGB and show the picture
image1_RGB = cv2.cvtColor(image1, cv2.COLOR_BGR2RGB)
plt.figure()
plt.imshow(image1_RGB)
plt.show()
# read the pixel data of the image
path3 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_7/ex7/bird_small.mat'
data_image1 = loadmat(path3)
A = data_image1['A']
A = A / 255
X_image = np.reshape(A, ((A.shape[0] * A.shape[1]), 3))
# initialize parameters
K_image = 16
# initialize the centroids randomly
initial_centroids_image = initialize_centroids(X_image, K_image)
# run the k-means algorithm
max_iter = 10
new_centroids_image, X_image_index = k_means(X_image, K_image, initial_centroids_image, max_iter)
print('the primary colours of the image are \n', new_centroids_image)
# visualize the new picture with the primary colours
X_new_image = new_centroids_image[X_image_index.flatten().A[0].astype(int), :]
X_new_image_ndarray = X_new_image.A
X_new_image_show = np.reshape(X_new_image_ndarray, (A.shape[0], A.shape[1], A.shape[2]))
plt.figure()
plt.imshow(X_new_image_show)
plt.show()
initialize the centroids:
# initialize the centroids randomly---choosing the data points as the initial centroids
def initialize_centroids(X, K):
    initial_centroids_index = np.random.randint(0, X.shape[0], K)
    initial_centroids_image = X[initial_centroids_index, :]
    return initial_centroids_image
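The k_means helper called in the main function comes from the K-means part of this exercise and is not repeated on this page; a minimal sketch that is consistent with the call k_means(X_image, K_image, initial_centroids_image, max_iter) and with the np.matrix-style indexing used above could look like this (an illustrative sketch, not the original implementation):
# minimal k-means sketch: returns the centroids and the cluster index of every example,
# both as np.matrix so that the .flatten().A[0] indexing in the main function works
def k_means(X, K, initial_centroids, max_iter):
    centroids = np.matrix(initial_centroids, dtype=float)
    m = X.shape[0]
    idx = np.zeros(m)
    for _ in range(max_iter):
        # assignment step: give every example the index of its closest centroid
        for i in range(m):
            distances = np.sum(np.square(np.asarray(centroids) - X[i, :]), axis=1)
            idx[i] = np.argmin(distances)
        # update step: move every centroid to the mean of the examples assigned to it
        for k in range(K):
            assigned = X[idx == k]
            if assigned.shape[0] > 0:
                centroids[k, :] = assigned.mean(axis=0)
    return centroids, np.matrix(idx).T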
Output:
the primary colours of the image are
[[0.53579899 0.41471567 0.28958293]
[0.39888947 0.28238186 0.18749494]
[0.32387002 0.33175611 0.37674676]
[0.07484301 0.08048187 0.06643599]
[0.15902783 0.16201949 0.15319479]
[0.223894 0.21537059 0.21139567]
[0.85642302 0.70817128 0.46666813]
[0.12074435 0.13021107 0.1212539 ]
[0.0536679 0.05920746 0.0495955 ]
[0.75721517 0.55372908 0.27725737]
[0.10075236 0.1083573 0.09786492]
[0.93647348 0.88152819 0.7562851 ]
[0.08736567 0.09471283 0.08289522]
[0.59247699 0.56876084 0.58917967]
[0.0758476 0.08286445 0.07866745]
[0.06369265 0.06948119 0.06020265]]
Principal Component Analysis
- Goal: find a lower-dimensional surface (spanned by k vectors u^{1}, u^{2}, ..., u^{k}) onto which to project the data so as to minimize the projection error (the sum of the squared distances from the data points to their projections)
- PCA vs. Linear Regression:
Linear Regression: fit a straight line so as to minimize the squared error (the vertical distance, along the y axis) between the data points and the line
PCA: find a straight line (or lower-dimensional surface) so as to minimize the projection error (the shortest, orthogonal distances from the data points to the line)
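Written as cost functions (a restatement in the course's notation, with m training examples):
Linear Regression: minimize (1/m) * sum_{i=1}^{m} (h_{theta}(x^{i}) - y^{i})^2, i.e. the vertical distances to the fitted line
PCA: minimize (1/m) * sum_{i=1}^{m} ||x^{i} - x_{approx}^{i}||^2, i.e. the orthogonal distances from each point to its projection x_{approx}^{i} onto the surface spanned by u^{1}, ..., u^{k}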
Principal Component Analysis Algorithm
- Data preprocessing: mean normalization and feature scaling (as in feature_normalized in the exercise below)
- Idea:
1. x^{i}: the original features (i = 1, 2, ..., m; each x^{i} is n-dimensional)
2. u^{j}: the eigenvectors spanning the selected low-dimensional surface (j = 1, 2, ..., k)
3. z^{i}: the new features in the low-dimensional surface (i = 1, 2, ..., m; each z^{i} is k-dimensional)
- Process: see the sketch below
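In brief, the standard procedure (which the pca function in the exercise below follows) is:
1. After mean normalization, compute the covariance matrix Sigma = (1/m) * X^T * X (an n x n matrix)
2. Compute its eigenvectors with the singular value decomposition: [U, S, V] = svd(Sigma)
3. Keep the first k columns of U as U_reduce (n x k)
4. Project every example onto the low-dimensional surface: z^{i} = U_reduce^T * x^{i} (k-dimensional)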
Reconstruction from compressed representation
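Given a projection z^{i}, an approximation of the original point can be recovered as x_{approx}^{i} = U_reduce * z^{i}; this point lies on the surface spanned by u^{1}, ..., u^{k} and is exactly what reconstruct_data computes in the exercise below.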
Choosing the number of principal components
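A common rule from the course is to choose the smallest k that retains most of the variance: using the singular values S_{11}, ..., S_{nn} returned by the SVD of Sigma, pick the smallest k such that (sum_{i=1}^{k} S_{ii}) / (sum_{i=1}^{n} S_{ii}) >= 0.99, i.e. 99% of the variance is retained.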
Tips for applying PCA
- Supervised learning speedup:
extract the inputs x^{i} (without the labels) to get an unlabeled dataset, run PCA on it to obtain the lower-dimensional z^{i}, and use (z^{i}, y^{i}) as the new training set
- Although having fewer features might seem to make overfitting less likely, applying PCA may throw away valuable information (it never looks at the labels), so even when PCA retains most of the variance it is a bad idea to use it to prevent overfitting instead of adding a regularization term
- Be careful when using PCA:
Only when training on the raw dataset causes problems, such as the algorithm running too slowly or the memory/disk requirements being too large, should we consider using PCA
Exercise in Python
- Mission: use PCA to reduce the data from 2D to 1D
- read the dataset
def read_data(path):
    data = loadmat(path)
    X = data['X']
    return X
- visualize the dataset
def visualize_data(X):
    x1 = X[:, 0]
    x2 = X[:, 1]
    plt.figure()
    plt.scatter(x=x1, y=x2, color='b', marker='o')
    plt.show()
Output:
- normalize the dataset
def feature_normalized(X):
    X_normalized = np.matrix(np.zeros(X.shape))
    for i in range(X.shape[1]):
        X_normalized[:, i] = np.matrix((X[:, i] - X[:, i].mean()) / X[:, i].std()).T
    return X_normalized
- PCA
def pca(X, K):
    Sigma = (1 / X.shape[0]) * (X.T * X)  # covariance matrix: (2,50)*(50,2)=(2,2)
    U, S, V = np.linalg.svd(Sigma)
    U_reduce = U[:, :K]  # first K eigenvectors: (2,1)
    Z = (U_reduce.T * X.T).T  # new features (principal components): ((1,2)*(2,50)).T=(50,1)
    return Z, U_reduce
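The singular values S computed inside pca are not used above; a small additional sketch (not part of the original exercise, with the hypothetical name variance_retained) shows how they give the fraction of variance kept for a chosen K, matching the 99% rule mentioned earlier (X is the normalized data as an np.matrix, as above):
# fraction of variance retained when keeping the first K principal components
def variance_retained(X, K):
    Sigma = (1 / X.shape[0]) * (X.T * X)
    U, S, V = np.linalg.svd(Sigma)
    return np.sum(S[:K]) / np.sum(S)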
Output:
projections of the original features on the principal components are
[[ 1.49631261]
[-0.92218067]
[ 1.22439232]
[ 1.64386173]
[ 1.2732206 ]
[-0.97681976]
[ 1.26881187]
[-2.34148278]
[-0.02999141]
[-0.78171789]
[-0.6316777 ]
[-0.55280135]
[-0.0896816 ]
[-0.5258541 ]
[ 1.56415455]
[-1.91610366]
[-0.88679735]
[ 0.95607375]
[-2.32995679]
[-0.47793862]
[-2.21747195]
[ 0.38900633]
[-1.78482346]
[ 0.05175486]
[ 1.66512392]
[ 0.50813572]
[-1.23711018]
[-1.17198677]
[ 0.84221686]
[-0.00693174]
[-0.22794195]
[-1.51309518]
[ 1.33874082]
[-0.5925244 ]
[ 0.67907605]
[-1.35298 ]
[ 1.68749495]
[-1.39235931]
[ 2.55992598]
[-0.27850702]
[-0.97677692]
[ 0.88820006]
[ 1.29666127]
[-0.98966774]
[ 1.81272352]
[-0.27196356]
[ 3.19297722]
[ 1.21299151]
[ 0.36792871]
[-1.44264131]]
- reconstruct the dataset
def reconstruct_data(Z, U_reduce):
    X_approximate = (U_reduce * Z.T).T  # ((2,1)*(1,50)).T=(50,2)
    return X_approximate
Output:
approximations of the original features are:
[[-1.05805279 -1.05805279]
[ 0.65208021 0.65208021]
[-0.86577611 -0.86577611]
[-1.16238578 -1.16238578]
[-0.90030292 -0.90030292]
[ 0.69071588 0.69071588]
[-0.89718548 -0.89718548]
[ 1.65567835 1.65567835]
[ 0.02120713 0.02120713]
[ 0.55275802 0.55275802]
[ 0.44666359 0.44666359]
[ 0.39088959 0.39088959]
[ 0.06341447 0.06341447]
[ 0.371835 0.371835 ]
[-1.10602429 -1.10602429]
[ 1.35488989 1.35488989]
[ 0.62706042 0.62706042]
[-0.67604623 -0.67604623]
[ 1.64752825 1.64752825]
[ 0.33795364 0.33795364]
[ 1.56798945 1.56798945]
[-0.27506901 -0.27506901]
[ 1.26206077 1.26206077]
[-0.03659622 -0.03659622]
[-1.17742041 -1.17742041]
[-0.35930621 -0.35930621]
[ 0.874769 0.874769 ]
[ 0.82871979 0.82871979]
[-0.59553725 -0.59553725]
[ 0.00490148 0.00490148]
[ 0.1611793 0.1611793 ]
[ 1.06991986 1.06991986]
[-0.94663271 -0.94663271]
[ 0.41897802 0.41897802]
[-0.48017928 -0.48017928]
[ 0.95670134 0.95670134]
[-1.19323912 -1.19323912]
[ 0.98454671 0.98454671]
[-1.81014102 -1.81014102]
[ 0.1969342 0.1969342 ]
[ 0.69068559 0.69068559]
[-0.62805228 -0.62805228]
[-0.91687797 -0.91687797]
[ 0.69980077 0.69980077]
[-1.28178909 -1.28178909]
[ 0.19230728 0.19230728]
[-2.25777584 -2.25777584]
[-0.85771452 -0.85771452]
[-0.26016489 -0.26016489]
[ 1.02010145 1.02010145]]
- draw connecting lines
def draw_connections(X_normalized, X_approximate):
    plt.figure()
    plt.scatter(x=X_normalized[:, 0].flatten().A[0], y=X_normalized[:, 1].flatten().A[0], marker='o', facecolors='none', edgecolors='b')
    plt.scatter(x=X_approximate[:, 0].flatten().A[0], y=X_approximate[:, 1].flatten().A[0], marker='o', facecolors='none', edgecolors='r')
    for i in range(X_normalized.shape[0]):
        x1 = X_normalized[i, 0]
        y1 = X_normalized[i, 1]
        x2 = X_approximate[i, 0]
        y2 = X_approximate[i, 1]
        plt.plot([x1, x2], [y1, y2], color='k')
    plt.axis([-4, 3, -4, 3])
    plt.show()
Output:
- main function
if __name__ == '__main__':
    # read the dataset
    path1 = 'D:/新建文件夹/机器学习/Machine_Learning_exercise/exercise_7/ex7/ex7data1.mat'
    X1 = read_data(path1)
    # visualize the dataset
    visualize_data(X1)
    # data preprocessing
    # normalization
    X1_normalized = feature_normalized(X1)
    # run the PCA algorithm
    K1 = 1
    Z1, U1_reduce = pca(X1_normalized, K1)
    print('projections of the original features on the principal components are \n', Z1)
    # reconstruct the approximation of the data
    X1_approximate = reconstruct_data(Z1, U1_reduce)
    print('approximations of the original features are: \n', X1_approximate)
    # draw lines connecting the projections to the original data points
    draw_connections(X1_normalized, X1_approximate)