Wiki Report for ICP6 - NagaSurendraBethapudi/Python-ICP GitHub Wiki
Video link : https://drive.google.com/file/d/1e7d9VdqepMZizdDZy4Bdg_AhS30KttYJ/view?usp=sharing
Question 1 :
Apply K-means clustering to the data set provided below: https://umkc.box.com/s/a9lzu9qoqfkbhjwk5nz9m6dyybhl1wqy
- Replace any null values with the mean.
- Use the elbow method to find a good number of clusters with the KMeans algorithm.
- Calculate the silhouette score for the above clustering.
Explanation :
- Replaced the null values with the column mean.
- Found null values in the CREDIT_LIMIT and MINIMUM_PAYMENTS features.
- Replaced those null values with the mean of each column using the following logic:
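mean_value = Credit_Card_Data['MINIMUM_PAYMENTS'].mean()
Credit_Card_Data['MINIMUM_PAYMENTS'] = Credit_Card_Data['MINIMUM_PAYMENTS'].fillna(mean_value)  # assumed fill step
mean_value_CREDIT_LIMIT = Credit_Card_Data['CREDIT_LIMIT'].mean()
Credit_Card_Data['CREDIT_LIMIT'] = Credit_Card_Data['CREDIT_LIMIT'].fillna(mean_value_CREDIT_LIMIT)  # assumed fill step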
- Used the elbow method to find a good value of k.
- Calculated the silhouette score.
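The three steps above can be sketched end to end as follows. This is a minimal illustration, not the code used for the report: the data is a synthetic stand-in (four well-separated blobs with a few injected nulls) rather than the credit card file, and the range of k values, `n_init`, and `random_state` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the credit card data (assumption: the real file is not read here).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10], [10, 0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])
data = pd.DataFrame(points, columns=["CREDIT_LIMIT", "MINIMUM_PAYMENTS"])
data.iloc[::25, 0] = np.nan   # inject a few nulls so there is something to impute
data.iloc[5::40, 1] = np.nan

# Step 1: replace nulls in each column with that column's mean.
data = data.fillna(data.mean())

# Step 2: elbow method. Record the within-cluster sum of squares (inertia) per k.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias[k] = km.inertia_

# Step 3: silhouette score for the k picked at the elbow (4 for this synthetic data).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data)
score = silhouette_score(data, labels)

print(inertias)
print(f"silhouette score at k=4: {score:.3f}")
```

Plotting `inertias` against k and looking for the point where the curve flattens gives the elbow; the silhouette score is then computed only for that k.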
Question 2 :
Try feature scaling and then apply KMeans on the scaled features. Did that improve the silhouette score? If yes, can you justify why?
Answers :
- Applied feature scaling using the logic below:
import pandas as pd
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(x)
X_scaled_array = scaler.transform(x)
X_scaled = pd.DataFrame(X_scaled_array, columns = x.columns)
- Applied KMeans and computed the silhouette score:
from sklearn import metrics
from sklearn.cluster import KMeans
nclusters = 4 # k chosen from the elbow plot above
km = KMeans(n_clusters=nclusters)
km.fit(X_scaled)
y_cluster_kmeans = km.predict(X_scaled)
score = metrics.silhouette_score(X_scaled, y_cluster_kmeans)
No, the silhouette score did not improve.
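The score can move in either direction after scaling: the silhouette is computed from distances in whichever space was clustered, and StandardScaler changes those distances so that features with large ranges no longer dominate. A rough sketch of the comparison is below. The helper `kmeans_silhouette` and the two-feature synthetic data are hypothetical, chosen only to show one large-scale and one small-scale feature side by side.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def kmeans_silhouette(X, k, seed=0):
    """Fit KMeans with k clusters and return the silhouette score measured on X."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)


rng = np.random.default_rng(1)
# Hypothetical data: one feature on a large scale, one on a small scale.
X_raw = np.column_stack([
    rng.normal(loc=5000, scale=2000, size=300),
    rng.normal(loc=0.5, scale=0.2, size=300),
])

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_raw)

raw_score = kmeans_silhouette(X_raw, k=4)
scaled_score = kmeans_silhouette(X_scaled, k=4)
print(f"raw: {raw_score:.3f}  scaled: {scaled_score:.3f}")
```

Because each score is measured in its own feature space, a lower score after scaling does not by itself mean the scaled clustering is worse; it may only mean that the raw clustering was driven by one dominant feature.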
Question 3 :
Apply PCA to the same dataset. Apply the KMeans algorithm to the PCA result and report whether the silhouette score improved.
Answers :
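- Applied PCA:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # scale the features before PCA
scaler.fit(x)
x_scaler = scaler.transform(x)
pca = PCA(2) # keep the first two principal components
x_pca = pca.fit_transform(x_scaler)
newdata = pd.DataFrame(data=x_pca)
- Applied KMeans:
from sklearn.cluster import KMeans
nclusters = 4 # k chosen from the elbow plot above
km = KMeans(n_clusters=nclusters)
km.fit(x_pca)
After applying PCA and KMeans, we got a silhouette score of 0.47.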
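The snippet above stops at fitting KMeans, so the step that produces the score is sketched here. It is an illustration under assumptions: the six-feature synthetic matrix stands in for the real dataset, and `n_init` and `random_state` are chosen for repeatability. The score is measured on the PCA-reduced data, which is the space that was clustered.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with six features (assumption: not the real dataset).
rng = np.random.default_rng(2)
x = rng.normal(size=(200, 6)) * np.array([1, 5, 10, 50, 100, 500])

# Scale first so that PCA is not dominated by the largest-variance feature.
x_scaled = StandardScaler().fit_transform(x)

# Project onto the first two principal components.
pca = PCA(n_components=2)
x_pca = pca.fit_transform(x_scaled)
explained = pca.explained_variance_ratio_

# Cluster in the reduced space and score in that same space.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(x_pca)
score = silhouette_score(x_pca, labels)

print(f"variance kept by 2 components: {explained.sum():.2%}")
print(f"silhouette score on PCA data: {score:.3f}")
```

Checking `explained_variance_ratio_` alongside the score shows how much of the original variance the two components retain, which helps when interpreting a change in the silhouette score.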
Question 4 :
Visualize the clustering
Answer :
Plotted the features most correlated with the TENURE feature.
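One way to pick and plot those features is sketched below. The helpers `top_correlated` and `plot_clusters`, the NOISE column, and the synthetic values are all hypothetical; only the column names TENURE, CREDIT_LIMIT, and MINIMUM_PAYMENTS come from the dataset described above.

```python
import numpy as np
import pandas as pd


def top_correlated(df, target, n=2):
    """Return the n columns with the largest absolute correlation to target."""
    corr = df.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False).head(n).index.tolist()


def plot_clusters(df, x_col, y_col, labels):
    """Scatter two features against each other, coloured by cluster label."""
    import matplotlib.pyplot as plt  # lazy import: ranking works without matplotlib
    fig, ax = plt.subplots()
    ax.scatter(df[x_col], df[y_col], c=labels, cmap="viridis", s=10)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    return fig


# Synthetic stand-in data (assumption): CREDIT_LIMIT is built to track TENURE.
rng = np.random.default_rng(3)
tenure = rng.integers(6, 13, size=200).astype(float)
df = pd.DataFrame({
    "TENURE": tenure,
    "CREDIT_LIMIT": 500 * tenure + rng.normal(scale=200, size=200),
    "MINIMUM_PAYMENTS": rng.normal(loc=800, scale=300, size=200),
    "NOISE": rng.normal(size=200),  # hypothetical column
})

features = top_correlated(df, "TENURE", n=2)
print(features)
```

With cluster labels from KMeans in hand, a call such as `plot_clusters(df, "TENURE", features[0], labels)` would draw the scatter plot for the strongest feature.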
Conclusion :
The silhouette score was best with the raw data. After applying PCA (feature extraction) followed by KMeans, the silhouette score decreased.
Challenges:
NA