Module 1_ICP 6: Clustering Techniques - acikgozmehmet/PythonDeepLearning GitHub Wiki
Clustering Techniques
Objectives:
The following topics are covered.
- Clustering using KMeans
- Dimension reduction using PCA
Overview
a. K-means Clustering
K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. It is popular for cluster analysis in data mining. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
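As a minimal sketch of the algorithm's interface in scikit-learn (the blobs here are synthetic stand-in data, not the course data set):

```python
# Minimal K-means example; make_blobs generates synthetic 2-D stand-in data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)        # cluster index assigned to each observation
print(km.cluster_centers_.shape)  # one centroid per cluster: (3, 2)
print(km.inertia_)                # within-cluster sum of squared distances
```

`inertia_` is exactly the within-cluster squared-distance objective described above, which is why it is also the quantity tracked by the elbow method later in this exercise.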
b. Principal Component Analysis (PCA)
Given a collection of points in two, three, or higher dimensional space, a "best fitting" line can be defined as one that minimizes the average squared distance from a point to the line. The next best-fitting line can be similarly chosen from directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called principal components, and the procedure of finding them is called principal component analysis (PCA).
PCA is mostly used as a tool in exploratory data analysis and for building predictive models. It is often used to visualize genetic distance and relatedness between populations. PCA is performed either by singular value decomposition of a design matrix or by the following two steps:
- calculating the data covariance (or correlation) matrix of the original data
- performing eigenvalue decomposition on the covariance matrix
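The two steps above can be sketched directly in NumPy; the data here is a random correlated toy matrix, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # correlated toy data

# Step 1: covariance matrix of the mean-centered data
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Step 2: eigendecomposition; the eigenvectors are the principal components
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]   # sort components by descending variance
components = eigvecs[:, order]

# Projecting onto the components decorrelates the dimensions:
scores = Xc @ components
print(np.round(np.cov(scores, rowvar=False), 6))  # off-diagonals ~ 0
```

In practice `sklearn.decomposition.PCA` wraps this (via SVD) behind a `fit_transform` call, which is what the exercise below uses.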
In Class Programming
1. Apply K-means clustering to the data set provided below:
- Replace any null values with the column mean.
- Use the elbow method to find a good number of clusters for the KMeans algorithm.
The diagonal of the following pairplot shows that the features are not normally distributed; this is why the silhouette score remains poor even after applying PCA.
We can conclude that K=3 will be a good value for approximation.
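A sketch of task 1 (mean imputation plus the elbow method); since the actual data file is not linked here, the iris data is used as a stand-in:

```python
# Elbow method: plot KMeans inertia (WCSS) against k and look for the bend.
# iris stands in for the course data set (an assumption).
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

df = pd.DataFrame(load_iris().data)
df = df.fillna(df.mean())  # replace any null values with the column mean
X = df.values

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster SSE)")
plt.savefig("elbow.png")  # the bend in this curve suggests a good k
```

On iris the curve flattens around k = 3, matching the conclusion above.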
2. Calculate the silhouette score for the above clustering
3. Try feature scaling to see if it will improve the Silhouette score
4. Apply PCA on the same dataset.
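Tasks 2 through 4 can be sketched together, again with iris as a stand-in data set; `silhouette_score`, `StandardScaler`, and `PCA` are the scikit-learn utilities involved:

```python
# Silhouette score for raw KMeans, after standard scaling, and after PCA.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # stand-in for the course data set
km = KMeans(n_clusters=3, n_init=10, random_state=42)

# Task 2: silhouette score of the raw clustering
score_raw = silhouette_score(X, km.fit_predict(X))

# Task 3: same clustering after feature scaling
X_scaled = StandardScaler().fit_transform(X)
score_scaled = silhouette_score(X_scaled, km.fit_predict(X_scaled))

# Task 4: same clustering after reducing to two principal components
X_pca = PCA(n_components=2).fit_transform(X)
score_pca = silhouette_score(X_pca, km.fit_predict(X_pca))

print(f"raw: {score_raw:.3f}  scaled: {score_scaled:.3f}  pca: {score_pca:.3f}")
```

The silhouette score ranges from -1 to 1; higher means clusters are denser and better separated, which makes it a convenient single number for comparing the variations below.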
Bonus points
1. Apply the KMeans algorithm to the PCA result and report whether the score improved.
a. You can try different variations such as PCA+KMeans and Scaling+PCA+KMeans.
2. Visualize the clustering of first bonus question
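The bonus variations can be compared in one loop; iris again stands in for the course data set, and the saved figure names are arbitrary:

```python
# Compare PCA+KMeans against Scaling+PCA+KMeans and plot each clustering.
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # stand-in for the course data set
km = KMeans(n_clusters=3, n_init=10, random_state=42)

variants = {
    "PCA+KMeans": PCA(n_components=2).fit_transform(X),
    "Scaling+PCA+KMeans": PCA(n_components=2).fit_transform(
        StandardScaler().fit_transform(X)),
}

results = {}
for name, Xv in variants.items():
    labels = km.fit_predict(Xv)
    results[name] = silhouette_score(Xv, labels)
    print(name, round(results[name], 3))
    plt.figure()
    plt.scatter(Xv[:, 0], Xv[:, 1], c=labels, s=15)  # color by cluster
    plt.title(name)
    plt.savefig(name.replace("+", "_") + ".png")
```

Because both variants are 2-D after PCA, the scatter plots directly visualize the clustering asked for in bonus question 2.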
SCALING+PCA+KMEANS
Visualization:
PCA+KMEANS
Visualization:
We can conclude that the features are not normally distributed, which is why scaling the features does not improve the silhouette score. We can also try MinMaxScaler to compare the results.
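The MinMaxScaler comparison mentioned above might look like this (iris as a stand-in for the course data set):

```python
# Rescale each feature to [0, 1] with MinMaxScaler, then cluster and score.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(load_iris().data)  # all features in [0, 1]
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 3))
```

MinMaxScaler preserves the shape of each feature's distribution (unlike StandardScaler it does not center on the mean), so it can behave differently on skewed features like these.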