17 02 Visualization with hierarchical clustering and t SNE - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

In this chapter, you'll learn about two unsupervised learning techniques for data visualization, hierarchical clustering and t-SNE. Hierarchical clustering merges the data samples into ever-coarser clusters, yielding a tree visualization of the resulting cluster hierarchy. t-SNE maps the data samples into 2d space so that the proximity of the samples to one another can be visualized.

01 Visualizing hierarchies

Hierachical clustering (of voting countries)

Every country begins in a separate cluster
At each step, the two closest clusters are merged
Continue until all countries in a single cluster
Agglomerative hierarchical cluster

Hierarchical clustering with `Scipy`

Given samples, and label_names

# Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Calculate the linkage: mergings
mergings = linkage(samples, method='complete')

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.show()

02 Cluster labels in hierarchical clustering

Cluster labels at any intermediate stage can be recovered
For use in e.g. cross-tabulations

Intermediate clusterings & height on dendrogram

Height = distance between merging clusters
Distance is defined by a 'linkage method'
- 'Complete' linkage: max. distance between samples (method = 'complete')
- 'Single' linkage: min. distance

Extracting cluster labels

fcluster() function
- Returns a NumPy array of cluster labels
Align cluster labels with e.g. varieties

# Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster

# Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion='distance')

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)

varieties  Canadian wheat  Kama wheat  Rosa wheat
    labels                                           
    1                      14           3           0
    2                       0           0          14
    3                       0          11           0

03 t-SNE for 2-dimensional maps

t-SNE for 2-dimensional maps

t-SNE = "t-distributed stochastic neighbor embedding"
Maps samples to 2D space (or 3D)
Map approximately preserves nearness of samples
Great for inspecting datasets

t-SNE has only `fit_transform()`

Simultaneously fits the model and transforms the data
Can't extend the map to include new data samples

learning rate

Wrong choice can lead to points bunching together
Try values between 50 and 200

Different every time

The axes of a t-SNE plot do not have any interpretable meaning.
They are different every time t-SNE is applied, even on the same data

# Import TSNE
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate = 50)

# Apply fit_transform to normalized_movements: tsne_features
tsne_features = model.fit_transform(normalized_movements)

# Select the 0th feature: xs
xs = tsne_features[:,0]

# Select the 1th feature: ys
ys = tsne_features[:,1]

# Scatter plot
plt.scatter(xs, ys, alpha=0.5)

# Annotate the points
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
plt.show()

17 02 Visualization with hierarchical clustering and t SNE - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

01 Visualizing hierarchies

Hierachical clustering (of voting countries)

Hierarchical clustering with Scipy

02 Cluster labels in hierarchical clustering

Intermediate clusterings & height on dendrogram

Extracting cluster labels

03 t-SNE for 2-dimensional maps

t-SNE for 2-dimensional maps

t-SNE has only fit_transform()

learning rate

Different every time

Hierarchical clustering with `Scipy`

t-SNE has only `fit_transform()`