19 04 Clustering in Real World

01 Dominant colors in images

  • All images consist of pixels
  • Each pixel has three values: Red, Green and Blue
  • Pixel color: combination of these RGB values
  • Perform k-means on standardized RGB values to find cluster centers
  • Uses: Identifying features in satellite images

Tools to find dominant colors

  • Convert image to pixels: matplotlib.image.imread
  • Display colors of cluster centers: matplotlib.pyplot.imshow

Extract RGB values from image

# Import image class of matplotlib
import matplotlib.image as img

# Read batman image and print dimensions
batman_image = img.imread('batman.jpg')
print(batman_image.shape)

# Store RGB values of all pixels in lists r, g and b
r, g, b = [], [], []
for row in batman_image:
    for temp_r, temp_g, temp_b in row:
        r.append(temp_r)
        g.append(temp_g)
        b.append(temp_b)
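
  • The snippets below assume a DataFrame batman_df with standardized channel columns; a minimal sketch of that intermediate step, using SciPy's whiten() (the column names match those used below):
import pandas as pd
from scipy.cluster.vq import whiten

# Build a DataFrame of raw channel values and standardize each column
batman_df = pd.DataFrame({'red': r, 'green': g, 'blue': b})
batman_df['scaled_red'] = whiten(batman_df['red'])
batman_df['scaled_green'] = whiten(batman_df['green'])
batman_df['scaled_blue'] = whiten(batman_df['blue'])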

Number of dominant colors

# Import kmeans and the plotting libraries
from scipy.cluster.vq import kmeans
import matplotlib.pyplot as plt
import seaborn as sns

distortions = []
num_clusters = range(1, 7)

# Create a list of distortions from the kmeans function
for i in num_clusters:
    cluster_centers, distortion = kmeans(
        batman_df[['scaled_red', 'scaled_blue', 'scaled_green']], i)
    distortions.append(distortion)

# Create a data frame with two lists, num_clusters and distortions
elbow_plot = pd.DataFrame({'num_clusters':num_clusters, 'distortions':distortions})

# Create a line plot of num_clusters and distortions
sns.lineplot(x='num_clusters', y='distortions', data=elbow_plot)
plt.xticks(num_clusters)
plt.show()
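
  • The display step in the next section assumes cluster_centers from a final kmeans() run at the chosen k; a sketch, taking k=3 as an illustrative read of the elbow plot:
# Re-run kmeans with the number of clusters chosen from the elbow plot
cluster_centers, distortion = kmeans(
    batman_df[['scaled_red', 'scaled_blue', 'scaled_green']], 3)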

Display dominant colors

  • To display the dominant colors, convert the standardized cluster center values back to raw RGB values, then scale them to the 0-1 range that imshow expects.
# Get standard deviations of each color
r_std, g_std, b_std = batman_df[['red', 'green', 'blue']].std()
colors = []

for cluster_center in cluster_centers:
    scaled_r, scaled_g, scaled_b = cluster_center
    # Convert each standardized value to scaled value
    colors.append((
        scaled_r * r_std / 255,
        scaled_g * g_std / 255,
        scaled_b * b_std / 255
    ))

# Display colors of cluster centers
plt.imshow([colors])
plt.show()

02 Document clustering

Steps

  1. Clean data before processing
  2. Determine the importance of the terms in a document (using a TF-IDF matrix)
  3. Cluster the TF-IDF matrix
  4. Find the top terms and documents in each cluster

Clean and tokenize data

  • Convert text into smaller parts called tokens and clean the data for processing
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')
import re

def remove_noise(text, stop_words=[]):
    tokens = word_tokenize(text)
    cleaned_tokens = []
    for token in tokens:
        # Strip out non-alphanumeric characters
        token = re.sub('[^A-Za-z0-9]+', '', token)
        if len(token) > 1 and token.lower() not in stop_words:
            # Keep the lowercase form of the token
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
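
  • A quick usage sketch (the sentence and stop words here are illustrative):
print(remove_noise("It is lovely to see you!", stop_words=['is', 'to']))
# Output: ['it', 'lovely', 'see', 'you']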

Document-term matrix, sparse matrices, and TF-IDF

# Import TfidfVectorizer class from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.1, max_features=50, max_df=0.75, tokenizer=remove_noise)

# Use the .fit_transform() method on the list plots
tfidf_matrix = tfidf_vectorizer.fit_transform(plots)

Clustering with sparse matrix

  • kmeans() in SciPy does not support sparse matrices
  • Use .todense() to convert the sparse matrix to a dense one (as shown in the next section)

Top terms per cluster

  • Cluster centers: lists with a size equal to the number of terms
  • Each value in a cluster center is the importance of the corresponding term
  • Create a dictionary and print top terms
num_clusters = 2

# Generate cluster centers through the kmeans function
cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)

# Generate terms from the tfidf_vectorizer object
terms = tfidf_vectorizer.get_feature_names_out()  # use .get_feature_names() on scikit-learn < 1.0

for i in range(num_clusters):
    # Sort the terms by importance and print the top 3 for each cluster
    center_terms = dict(zip(terms, list(cluster_centers[i])))
    sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
    print(sorted_terms[:3])

More considerations

  • Handle hyperlinks, emoticons, etc.
  • Normalize words (e.g. reduce "ran" and "running" to "run")
  • .todense() may not work with large datasets
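  • A possible workaround (not covered in these notes): scikit-learn's KMeans accepts a sparse TF-IDF matrix directly, so no dense conversion is needed.
from sklearn.cluster import KMeans

# KMeans in scikit-learn handles sparse input directly
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(tfidf_matrix)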

03 Clustering with multiple features

Basic checks

  • Check how the cluster centers vary with respect to the overall data. If the cluster centers of some features do not vary significantly, that may be an indication that you can drop those features in the next run.
print(fifa.groupby('cluster_labels')[['scaled_heading_accuracy', 'scaled_volleys', 'scaled_finishing']].mean())
  • Look at the sizes of the clusters formed. If one or more clusters are significantly smaller than the rest, double-check whether their cluster centers are similar to those of other clusters.
fifa.groupby('cluster_labels')[scaled_features].mean().plot(kind='bar')
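  • A sketch of the size check itself (assuming fifa has one row per player):
print(fifa.groupby('cluster_labels').size())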

Top items in clusters

# Print the first five player names in each cluster
for cluster in fifa['cluster_labels'].unique():
    print(cluster, fifa[fifa['cluster_labels'] == cluster]['names'].values[:5])