Wiki Report for ICP6 - NagaSurendraBethapudi/Python-ICP GitHub Wiki

Video link : https://drive.google.com/file/d/1e7d9VdqepMZizdDZy4Bdg_AhS30KttYJ/view?usp=sharing


Question 1 :

Apply K-means clustering to the data set provided below: https://umkc.box.com/s/a9lzu9qoqfkbhjwk5nz9m6dyybhl1wqy

  1. Replace any null values with the mean.
  2. Use the elbow method to find a good number of clusters with the KMeans algorithm.
  3. Calculate the silhouette score for the above clustering.

Explanation :

  1. Replaced the null values with the mean
  • Found null values in the CREDIT_LIMIT and MINIMUM_PAYMENTS features
  • Replaced those null values with each column's mean, computed as follows:
    • mean_value=Credit_Card_Data['MINIMUM_PAYMENTS'].mean()
    • mean_value_CREDIT_LIMIT=Credit_Card_Data['CREDIT_LIMIT'].mean()
  2. Used the elbow method to find a good value of k.

  3. Calculated the silhouette score.
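
The three steps above can be sketched as follows. This is a minimal illustration, not the exact code used for the report: the helper names `fill_nulls_with_mean` and `elbow_inertias`, the synthetic stand-in data, and the choices `n_init=10` and `random_state=0` are all assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def fill_nulls_with_mean(df, columns):
    """Return a copy of df where nulls in the given columns are replaced by the column mean."""
    filled = df.copy()
    for col in columns:
        filled[col] = filled[col].fillna(filled[col].mean())
    return filled


def elbow_inertias(x, k_values, seed=0):
    """Fit KMeans once per k and return the inertia (within-cluster sum of squares) for each."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(x).inertia_
        for k in k_values
    ]


# Hypothetical stand-in data; the report uses the credit card data set instead.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "CREDIT_LIMIT": rng.normal(4000, 1000, 200),
    "MINIMUM_PAYMENTS": rng.normal(800, 200, 200),
})
demo.loc[[3, 17], "CREDIT_LIMIT"] = np.nan
demo.loc[[5], "MINIMUM_PAYMENTS"] = np.nan

# Step 1: mean imputation.
clean = fill_nulls_with_mean(demo, ["CREDIT_LIMIT", "MINIMUM_PAYMENTS"])

# Step 2: inertia for k = 1..7; plotting these against k shows where the curve bends.
inertias = elbow_inertias(clean, range(1, 8))

# Step 3: silhouette score for the chosen k (4 here, as in the report).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(clean)
score = silhouette_score(clean, labels)
print(inertias)
print(score)
```

The elbow is read off by eye from the inertia curve, so the chosen k is a judgment call; the silhouette score then gives a single number in [-1, 1] for that choice.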


Question 2 :

Try feature scaling and then apply KMeans on the scaled features. Did that improve the silhouette score? If yes, can you justify why?

Answers :

  1. Applied feature scaling using the logic below:
  • scaler = preprocessing.StandardScaler()
  • scaler.fit(x)
  • X_scaled_array = scaler.transform(x)
  • X_scaled = pd.DataFrame(X_scaled_array, columns = x.columns)

  2. Applied KMeans and computed the silhouette score:
  • nclusters = 4 # this is the k in kmeans from above graph
  • km = KMeans(n_clusters=nclusters)
  • km.fit(X_scaled)
  • y_cluster_kmeans = km.predict(X_scaled)
  • score = metrics.silhouette_score(X_scaled, y_cluster_kmeans)

No, feature scaling did not improve the silhouette score.
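
One way to check this kind of result is to compute both scores with the same helper. The sketch below does that on synthetic data. The helper `kmeans_silhouette`, the column names, and the data are hypothetical and only illustrate the comparison; they do not reproduce the report's numbers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def kmeans_silhouette(x, k, seed=0):
    """Cluster x with KMeans and return the silhouette score measured in the same space."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(x)
    return silhouette_score(x, labels)


# Hypothetical data: one feature with a large range, one with a small range.
rng = np.random.default_rng(1)
x = pd.DataFrame({
    "large_scale_feature": rng.normal(1500, 2000, 300),
    "small_scale_feature": rng.uniform(0, 1, 300),
})

# Standardize each column to zero mean and unit variance.
x_scaled = pd.DataFrame(StandardScaler().fit_transform(x), columns=x.columns)

raw_score = kmeans_silhouette(x, k=4)
scaled_score = kmeans_silhouette(x_scaled, k=4)
print("raw:", raw_score)
print("scaled:", scaled_score)
```

Each score is measured in the space that was clustered. On raw data a single large-range feature can dominate the Euclidean distances, so the clusters are effectively formed along one axis and may look well separated. After scaling, every feature contributes equally, which can lower the score even when the clustering is more meaningful.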


Question 3 :

Apply PCA on the same dataset. Apply the KMeans algorithm to the PCA result and report whether the silhouette score improved.

Answers :

  1. Applied PCA :
  • scaler = StandardScaler() #Applying scaling
  • scaler.fit(x)
  • x_scaler = scaler.transform(x)
  • pca = PCA(2)
  • x_pca = pca.fit_transform(x_scaler)
  • newdata = pd.DataFrame(data=x_pca)

  2. Applied KMeans :
  • nclusters = 4 # this is the k in kmeans from above graph
  • km = KMeans(n_clusters=nclusters)
  • km.fit(x_pca)

After applying PCA and KMeans, we got a silhouette score of 0.47.
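
The steps listed above stop at fitting KMeans; a minimal sketch of the full pipeline, including the silhouette computation on the PCA output, might look like this. The data come from `make_blobs` as a hypothetical stand-in for the credit card features, and `n_init=10` and `random_state=0` are assumptions added for repeatability.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the feature matrix x used in the report.
x, _ = make_blobs(n_samples=300, n_features=6, centers=4, random_state=0)

# Scale first so that no single feature dominates the principal components.
x_scaled = StandardScaler().fit_transform(x)

# Project onto the first two principal components.
pca = PCA(n_components=2)
x_pca = pca.fit_transform(x_scaled)

# Cluster in the reduced space and score in that same space.
km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(x_pca)
score = silhouette_score(x_pca, labels)

explained = pca.explained_variance_ratio_.sum()
print("variance kept by 2 components:", explained)
print("silhouette on PCA output:", score)
```

Printing the explained variance alongside the score helps when interpreting it: if two components keep only a small share of the variance, a score computed in the reduced space describes a much simpler picture than the original data.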


Question 4 :

Visualize the clustering

Answer :

Plotted the features most strongly correlated with the TENURE feature.
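
A plot of this kind can be sketched as below. The helper `top_correlated`, the synthetic columns `feature_a` to `feature_c`, the output file name `clusters.png`, and the choice to color points by KMeans label are all assumptions for illustration; the report does not show its plotting code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def top_correlated(df, target, n=2):
    """Return the n column names with the largest absolute correlation to target."""
    corr = df.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False).head(n).index.tolist()


# Hypothetical data: feature_a tracks TENURE closely, feature_c weakly, feature_b not at all.
rng = np.random.default_rng(2)
tenure = rng.integers(6, 13, 200)
df = pd.DataFrame({
    "TENURE": tenure,
    "feature_a": 100 * tenure + rng.normal(0, 10, 200),
    "feature_b": rng.normal(0, 1, 200),
    "feature_c": -5 * tenure + rng.normal(0, 20, 200),
})

top = top_correlated(df, "TENURE", n=2)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(df)

# Scatter the two selected features, one color per cluster.
fig, ax = plt.subplots()
ax.scatter(df[top[0]], df[top[1]], c=labels, cmap="viridis", s=15)
ax.set_xlabel(top[0])
ax.set_ylabel(top[1])
ax.set_title("KMeans clusters on the features most correlated with TENURE")
fig.savefig("clusters.png")
```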


Conclusion :

The silhouette score was better with the raw data. After applying PCA (feature extraction) and KMeans, the silhouette score decreased.


Challenges:

NA