ICP 5 - Joshmitha307/Python GitHub Wiki

Name : Joshmitha Tammareddy

Class id : 36

Mail : [email protected]


Aim :

1. To find the correlation between Survived (the target column) and the Sex column, and to decide whether the Sex column should be kept.

2. To apply K means clustering to the College.csv data set.

3. To calculate the silhouette score for the above clustering.


Code Explanation :

The pandas library is imported. The female and male values in the Sex column are mapped to numbers, and then the correlation between the Survived and Sex columns is computed.

Correlation measures how strongly two columns in the data set vary together. It ranges between -1 and 1. The closer the value is to 1 or -1, the stronger the relationship, and a value near 0 means there is little linear relationship.
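The mapping and correlation steps could be sketched roughly as follows. This is only an illustration: the small DataFrame is a made-up stand-in for the real data, and the choice of 0 for male and 1 for female is an assumption (swapping the two only flips the sign of the result).

```python
import pandas as pd

# Hypothetical miniature stand-in for the real data set;
# the actual ICP reads the full data from a CSV file.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
})

# Map the text labels to numbers so a correlation can be computed.
# The 0/1 assignment is an assumption, not taken from the original code.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Pearson correlation between the two columns; always within [-1, 1].
corr = df["Survived"].corr(df["Sex"])
print(f"Correlation between Survived and Sex: {corr:.2f}")
```

The value printed for this toy data is not the result reported in the Output section; it only shows the mechanics.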

KMeans is imported from the scikit-learn library. iloc is used to leave out the columns that are not required, and the remaining columns are printed. The number of clusters is 3, and seed is the random state. The model is fit to the data, and then the cluster of each row is predicted.
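A minimal sketch of the clustering step described above is given here. The column names, the synthetic values, the iloc slice and the seed value are all assumptions made for illustration; the real College.csv has more rows and columns.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for College.csv: two label columns followed by
# numeric feature columns filled with random values.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Name": [f"College {i}" for i in range(30)],
    "Private": ["Yes", "No"] * 15,
    "Apps": rng.integers(100, 5000, size=30),
    "Accept": rng.integers(50, 3000, size=30),
    "Enroll": rng.integers(20, 1500, size=30),
})

# iloc selects columns by position; here the first two (non-numeric)
# columns are skipped and only the numeric features are kept.
x = data.iloc[:, 2:]
print(x.head())

nclusters = 3  # number of clusters, as stated in the text
seed = 0       # random state; the value 0 is an assumption

km = KMeans(n_clusters=nclusters, random_state=seed, n_init=10)
km.fit(x)               # learn the cluster centers
labels = km.predict(x)  # assign each row to its nearest center
print(labels)
```

Fixing random_state makes the result repeatable from run to run, since K means starts from randomly chosen centers.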

Using data.drop, we remove the Name and Private columns. For each number of clusters in the range 2 to 7, we compute the silhouette score.

Silhouette Score : a measure of how similar an object is to its own cluster compared to other clusters. The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
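The drop-and-score loop described above might look roughly like this. The data is again a made-up stand-in for College.csv, and reading "2 to 7" as including 7 is an assumption; range(2, 7) would stop at 6 instead.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for College.csv with random numeric features.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Name": [f"College {i}" for i in range(40)],
    "Private": ["Yes", "No"] * 20,
    "Apps": rng.integers(100, 5000, size=40),
    "Accept": rng.integers(50, 3000, size=40),
})

# Remove the non-numeric columns before clustering.
x = data.drop(columns=["Name", "Private"])

scores = {}
for k in range(2, 8):  # 2..7 inclusive (an assumption about the range)
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(x)
    # Mean silhouette over all rows for this number of clusters.
    scores[k] = silhouette_score(x, labels)
    print(f"clusters = {k}, silhouette score = {scores[k]:.3f}")

# The number of clusters with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
print("Best number of clusters:", best_k)
```

silhouette_score returns the mean silhouette over all rows, so one number is obtained per choice of cluster count.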


Output :

The correlation is 0.54, which is a fairly strong positive value, so we can keep the Sex column.

K Means clustering is performed on the College.csv data.

For all the clusters we got positive silhouette scores, and the 2nd cluster has the highest silhouette score, which means its items are more similar to the other items in their own cluster.


Conclusion :

Finding the correlation, K means clustering, and calculating the silhouette score have been done successfully.