17 01 Clustering for dataset exploration

01 Unsupervised Learning

  • Finds patterns in data
    • e.g. clustering customers by their purchases
  • Compresses the data using purchase patterns (dimension reduction)
  • No specific prediction task in mind
  • Dimension = number of features

k-means clustering

  • Finds clusters of samples
  • Number of clusters must be specified
  • Implemented in sklearn
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters = 3)

# Fit model to points
model.fit(points)

Cluster labels for new samples

  • New samples can be assigned to existing clusters
  • k-means remembers the mean of each cluster (the "centroids")
    • centroids = model.cluster_centers_
  • Finds the nearest centroid to each new sample (verified by hand in the sketch after the code below)
# Import pyplot
import matplotlib.pyplot as plt

# Predict the cluster labels of new_points: labels
labels = model.predict(new_points)

# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha=0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()
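Under the hood, .predict() just assigns each sample to its nearest centroid. A minimal hand-rolled version with NumPy (a sketch, assuming model and new_points from the snippets above; manual_labels is a name introduced here):

# Import numpy
import numpy as np

# Distances from every new sample to every centroid: shape (n_samples, n_clusters)
diffs = new_points[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))

# Each sample's label is the index of its nearest centroid
manual_labels = distances.argmin(axis=1)

# Should agree with model.predict(new_points)
print((manual_labels == model.predict(new_points)).all())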

02 Evaluating a clustering

Check correspondence

  • pd.crosstab() cross-tabulates cluster labels against known categories (e.g. grain varieties)
# Import pandas
import pandas as pd

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)

Measuring clustering quality

  • Ground-truth information (such as species) is often not available
    • Use only the samples and their cluster labels
    • A good clustering has tight clusters
  • Inertia measures clustering quality (LOWER is better)
    • Sum of squared distances from each sample to the centroid of its cluster
    • Available after fitting as .inertia_ (see the sketch after the code below)
  • More clusters means lower inertia
    • but not too many
    • Choose the 'elbow' in the inertia plot, where inertia begins to decrease more slowly
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters = k)
    
    # Fit model to samples
    model.fit(samples)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
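Inertia is just the sum of squared distances from each sample to its own cluster's centroid, so it is easy to verify by hand. A minimal sketch (assuming model is the last fitted KMeans from the loop above and samples is a NumPy array):

# Centroid assigned to each sample
assigned = model.cluster_centers_[model.labels_]

# Sum of squared distances from each sample to its assigned centroid
manual_inertia = ((samples - assigned) ** 2).sum()

# Should agree with model.inertia_ (up to floating-point rounding)
print(manual_inertia, model.inertia_)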

03 Transforming features for better clustering

Feature variances

  • In k-means, feature variance = feature influence: high-variance features dominate the distance computation (demonstrated in the sketch after the pipeline code below)
  • StandardScaler transforms each feature to have mean 0 and variance 1
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters = 4)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)

# Fit the pipeline to samples
pipeline.fit(samples)

# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels': labels, 'species': species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])

# Display ct
print(ct)
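To see why scaling matters, here is a hypothetical toy example (not from the course): the true groups differ along a low-variance feature, while a high-variance noise feature dominates the unscaled distances:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two groups separated along feature 0 (std 1); feature 1 is pure noise (std 100)
group_a = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 100, 100)])
group_b = np.column_stack([rng.normal(5, 1, 100), rng.normal(0, 100, 100)])
X = np.vstack([group_a, group_b])
true_groups = np.array([0] * 100 + [1] * 100)

# Without scaling, k-means splits along the noisy feature
labels_raw = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# With scaling, both features contribute equally and the true groups emerge
labels_scaled = KMeans(n_clusters=2, n_init=10).fit_predict(StandardScaler().fit_transform(X))

# Agreement with the true groups (cluster ids may be swapped, so take the max)
def agreement(labels):
    m = (labels == true_groups).mean()
    return max(m, 1 - m)

print(agreement(labels_raw), agreement(labels_scaled))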
  • Note that Normalizer() is different from StandardScaler(). StandardScaler() standardizes features (such as the features of the fish data) by removing each feature's mean and scaling to unit variance; Normalizer() rescales each sample (here, each company's stock price movements) independently of the others.
    • Normalizer acts row-wise and StandardScaler column-wise: Normalizer does not remove the mean or scale by deviation, but scales each whole row to unit norm.
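A tiny sketch (toy numbers, not from the course) to make the row-wise vs column-wise distinction concrete:

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# StandardScaler: each COLUMN ends up with mean 0 and variance 1
print(StandardScaler().fit_transform(X))
# [[-1. -1.]
#  [ 1.  1.]]

# Normalizer: each ROW is scaled to unit L2 norm; no centering
print(Normalizer().fit_transform(X))
# [[0.447 0.894]   (row [1, 2] divided by sqrt(5))
#  [0.6   0.8  ]]  (row [3, 4] divided by 5)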