Module 1_ICP 6: Clustering Techniques - acikgozmehmet/PythonDeepLearning GitHub Wiki
Clustering Techniques
Objectives:
The following topics are covered.
- Clustering using KMeans
- Dimension reduction using PCA
Overview
a. K-means Clustering
K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. It is popular for cluster analysis in data mining. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
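As a minimal sketch of the algorithm's interface in scikit-learn (the blobs here are synthetic stand-in data, not the course data set):

```python
# Minimal K-means example; make_blobs generates synthetic 2-D stand-in data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)        # cluster index assigned to each observation
print(km.cluster_centers_.shape)  # one centroid per cluster: (3, 2)
print(km.inertia_)                # within-cluster sum of squared distances
```

`inertia_` is exactly the within-cluster squared-distance objective described above, which is why it is also the quantity tracked by the elbow method later in this exercise.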
b. Principal Component Analysis (PCA)
Given a collection of points in two, three, or higher dimensional space, a "best fitting" line can be defined as one that minimizes the average squared distance from a point to the line. The next best-fitting line can be similarly chosen from directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called principal components, and the procedure of finding them is called principal component analysis (PCA).
PCA is mostly used as a tool in exploratory data analysis and for building predictive models. It is often used to visualize genetic distance and relatedness between populations. PCA is performed either by singular value decomposition of a design matrix or by the following two steps:
- calculating the data covariance (or correlation) matrix of the original data
- performing eigenvalue decomposition on the covariance matrix
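The two steps above can be sketched directly in NumPy; the data here is a random correlated toy matrix, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # correlated toy data

# Step 1: covariance matrix of the mean-centered data
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Step 2: eigendecomposition; the eigenvectors are the principal components
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]   # sort components by descending variance
components = eigvecs[:, order]

# Projecting onto the components decorrelates the dimensions:
scores = Xc @ components
print(np.round(np.cov(scores, rowvar=False), 6))  # off-diagonals ~ 0
```

In practice `sklearn.decomposition.PCA` wraps this (via SVD) behind a `fit_transform` call, which is what the exercise below uses.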
In Class Programming
1. Apply K-means clustering to the data set provided below:
- Replace any null values with the column mean.
- Use the elbow method to find a good number of clusters for the KMeans algorithm.
The diagonal of the following pairplot shows that the features are not normally distributed; this is why the silhouette score remains poor even after applying PCA.
We can conclude that K=3 will be a good value for approximation.
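A sketch of task 1 (mean imputation plus the elbow method); since the actual data file is not linked here, the iris data is used as a stand-in:

```python
# Elbow method: plot KMeans inertia (WCSS) against k and look for the bend.
# iris stands in for the course data set (an assumption).
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

df = pd.DataFrame(load_iris().data)
df = df.fillna(df.mean())  # replace any null values with the column mean
X = df.values

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster SSE)")
plt.savefig("elbow.png")  # the bend in this curve suggests a good k
```

On iris the curve flattens around k = 3, matching the conclusion above.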
2. Calculate the silhouette score for the above clustering
3. Try feature scaling to see if it will improve the Silhouette score
4. Apply PCA on the same dataset.
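Tasks 2 through 4 can be sketched together, again with iris as a stand-in data set; `silhouette_score`, `StandardScaler`, and `PCA` are the scikit-learn utilities involved:

```python
# Silhouette score for raw KMeans, after standard scaling, and after PCA.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # stand-in for the course data set
km = KMeans(n_clusters=3, n_init=10, random_state=42)

# Task 2: silhouette score of the raw clustering
score_raw = silhouette_score(X, km.fit_predict(X))

# Task 3: same clustering after feature scaling
X_scaled = StandardScaler().fit_transform(X)
score_scaled = silhouette_score(X_scaled, km.fit_predict(X_scaled))

# Task 4: same clustering after reducing to two principal components
X_pca = PCA(n_components=2).fit_transform(X)
score_pca = silhouette_score(X_pca, km.fit_predict(X_pca))

print(f"raw: {score_raw:.3f}  scaled: {score_scaled:.3f}  pca: {score_pca:.3f}")
```

The silhouette score ranges from -1 to 1; higher means clusters are denser and better separated, which makes it a convenient single number for comparing the variations below.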
Bonus points
1. Apply the KMeans algorithm to the PCA result and report whether the score improved.
a. You can try different variations such as PCA+KMeans and Scaling+PCA+KMeans.
2. Visualize the clustering of first bonus question
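The bonus variations can be compared in one loop; iris again stands in for the course data set, and the saved figure names are arbitrary:

```python
# Compare PCA+KMeans against Scaling+PCA+KMeans and plot each clustering.
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # stand-in for the course data set
km = KMeans(n_clusters=3, n_init=10, random_state=42)

variants = {
    "PCA+KMeans": PCA(n_components=2).fit_transform(X),
    "Scaling+PCA+KMeans": PCA(n_components=2).fit_transform(
        StandardScaler().fit_transform(X)),
}

results = {}
for name, Xv in variants.items():
    labels = km.fit_predict(Xv)
    results[name] = silhouette_score(Xv, labels)
    print(name, round(results[name], 3))
    plt.figure()
    plt.scatter(Xv[:, 0], Xv[:, 1], c=labels, s=15)  # color by cluster
    plt.title(name)
    plt.savefig(name.replace("+", "_") + ".png")
```

Because both variants are 2-D after PCA, the scatter plots directly visualize the clustering asked for in bonus question 2.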
SCALING+PCA+KMEANS
Visualization:
PCA+KMEANS
Visualization:
We can conclude that the features are not normally distributed, which is why scaling the features does not improve the silhouette score. We can also try MinMaxScaler to compare the results.
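The MinMaxScaler comparison mentioned above might look like this (iris as a stand-in for the course data set):

```python
# Rescale each feature to [0, 1] with MinMaxScaler, then cluster and score.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(load_iris().data)  # all features in [0, 1]
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 3))
```

MinMaxScaler preserves the shape of each feature's distribution (unlike StandardScaler it does not center on the mean), so it can behave differently on skewed features like these.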