17 04 Discovering interpretable features - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

01 Non-negative matrix factorization (NMF)

  • Dimension reduction technique
  • NMF models are interpretable (unlike PCA)
  • However, NMF can only be applied when all sample features are non-negative, e.g.:
    • Word frequencies in each document
    • Images encoded as arrays
    • Audio spectrograms
    • Purchase histories on e-commerce site
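
A minimal sketch of the non-negativity requirement, assuming a small made-up array (`samples` here is hypothetical, not course data). In the scikit-learn versions I am aware of, fitting NMF on data with a negative entry raises a `ValueError`, so checking the sign up front is cheap insurance.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy data: the entry -0.3 violates the non-negativity requirement
samples = np.array([[1.0, 0.5],
                    [0.2, -0.3],
                    [0.0, 2.0]])

# Quick check before fitting
print((samples >= 0).all())  # False

try:
    NMF(n_components=1).fit(samples)
    fitted = True
except ValueError as err:
    # scikit-learn refuses input containing negative values
    fitted = False
    print("NMF rejected the data:", err)
```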

Interpretable parts

  • NMF expresses documents as combinations of topics (or "themes")

Using scikit-learn NMF

  • Follows the fit()/transform() pattern
  • Must specify n_components
  • Works with NumPy arrays and csr_matrix
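
A short sketch of the fit()/transform() pattern on sparse input. The word-frequency matrix below is made up for illustration, and `random_state` and `max_iter` are only set to keep the toy run repeatable.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# Hypothetical word-frequency matrix: 4 documents x 5 words
dense = np.array([
    [2, 1, 0, 0, 0],
    [3, 2, 0, 0, 1],
    [0, 0, 1, 2, 0],
    [0, 0, 2, 3, 0],
], dtype=float)
sparse_samples = csr_matrix(dense)

# n_components must be chosen by the user
model = NMF(n_components=2, random_state=0, max_iter=500)
model.fit(sparse_samples)
nmf_features = model.transform(sparse_samples)

print(model.components_.shape)  # (2, 5)
print(nmf_features.shape)       # (4, 2)
```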

NMF components & features

  • Dimension of components = dimension of samples (available as model.components_)
    • Entries are non-negative
  • NMF feature values are non-negative
    • Can be used to approximately reconstruct the samples: samples ≈ matrix product of features and components
    • nmf_features = model.transform(samples)
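
The reconstruction can be sketched as follows, assuming synthetic data built to have exactly two underlying non-negative parts (all names and values here are illustrative). The product of features and components only approximates the samples, so the example prints the largest error rather than claiming an exact match.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative data built from 2 hidden parts (assumption for the demo)
rng = np.random.default_rng(0)
true_features = rng.random((6, 2))
true_components = rng.random((2, 4))
samples = true_features @ true_components

model = NMF(n_components=2, random_state=0, max_iter=2000)
nmf_features = model.fit_transform(samples)

# Reconstruct: (6, 2) features times (2, 4) components gives a (6, 4) array
reconstructed = nmf_features @ model.components_

print(reconstructed.shape)
print("largest absolute error:", np.abs(samples - reconstructed).max())
```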

NMF applied to Wikipedia articles

# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components = 6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Import pandas
import pandas as pd

# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])
 output:
    0    0.003845
    1    0.000000
    2    0.000000
    3    0.575711
    4    0.000000
    5    0.000000
    Name: Anne Hathaway, dtype: float64
    0    0.000000
    1    0.005601
    2    0.000000
    3    0.422380
    4    0.000000
    5    0.000000
    Name: Denzel Washington, dtype: float64

02 NMF learns interpretable parts

Example: NMF learns interpretable parts

  • Word-frequency array: 20000 articles, 800 words, shape (20000, 800)
  • NMF components are topics
    • With 10 components, the components array has shape (10, 800)
  • NMF features combine topics into documents
  • For images, NMF components are parts of images.
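
One way to see that components behave like topics is to label the columns of `model.components_` with the words and list the largest entries of each row. This is a sketch on a tiny made-up corpus; `words` and `articles` are hypothetical stand-ins for the real vocabulary and word-frequency array.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Hypothetical vocabulary and word counts (4 articles x 6 words)
words = ["film", "actor", "award", "guitar", "album", "band"]
articles = np.array([
    [5, 4, 3, 0, 0, 0],
    [4, 5, 2, 0, 0, 0],
    [0, 0, 0, 5, 4, 3],
    [0, 0, 1, 4, 5, 4],
], dtype=float)

model = NMF(n_components=2, random_state=0, max_iter=1000)
model.fit(articles)

# One row per component (topic), one column per word
components_df = pd.DataFrame(model.components_, columns=words)

# Words with the largest weights in each topic
for topic_index, component in components_df.iterrows():
    print(topic_index, list(component.nlargest(3).index))
```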

Explore the LED digits dataset

  • Grayscale images
    • Measure pixel brightness
    • Represent with value between 0 and 1
  • Encoding a collection of images
    • Collection of images of the same size
    • Encode as 2D array
    • Each row corresponds to an image
    • Each column corresponds to a pixel
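
The encoding above can be sketched with plain NumPy, assuming three made-up 2x3 grayscale images: flattening gives one row per image and one column per pixel, and reshaping a row recovers the original bitmap.

```python
import numpy as np

# Three hypothetical 2x3 grayscale images, brightness between 0 and 1
images = np.array([
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
    [[0.5, 0.0, 0.5], [0.5, 0.0, 0.5]],
])

# Flatten each image into a row: shape (n_images, n_pixels)
samples = images.reshape(len(images), -1)
print(samples.shape)  # (3, 6)

# Reshaping a row restores the image
restored = samples[0].reshape(2, 3)
print(restored)
```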
# Import pyplot
from matplotlib import pyplot as plt

# Select the 0th row: digit
digit = samples[0,:]

# Print digit
print(digit)

# Reshape digit to a 13x8 array: bitmap
bitmap = digit.reshape(13,8)

# Print bitmap
print(bitmap)

# Use plt.imshow to display bitmap
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()  # show the digit 7

03 Building recommender systems using NMF

Strategy

  • NMF feature values describe the topics, so similar documents have similar NMF feature values
  • But NMF feature values can't be compared directly.
    • Different versions of the same document have the same topic proportions
    • But their exact feature values may differ (e.g. one version uses many meaningless words)
    • All versions still lie on the same line through the origin.
  • Cosine similarity
    • Uses the angle between the lines
    • Higher values mean more similar (maximum value is 1, when the angle is 0)
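
A small sketch of why the angle is used, with made-up feature vectors: scaling a vector (same topic proportions, weaker values) leaves its cosine similarity with the original at 1, while a vector with a different topic mix scores lower. The helper `cosine_similarity` is defined here for illustration only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical NMF feature vectors
doc = np.array([0.1, 0.0, 0.4])
wordy_doc = 0.5 * doc                # same proportions, smaller values
other_doc = np.array([0.4, 0.3, 0.0])  # different topic mix

same_line = cosine_similarity(doc, wordy_doc)
different = cosine_similarity(doc, other_doc)
print(same_line)
print(different)
```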

Calculate the cosine similarity

  • Normalize the features:
    • from sklearn.preprocessing import normalize
    • norm_features = normalize(nmf_features)
  • Calculate the cosine similarity (dot product of the normalized-feature DataFrame with the row of the chosen article)
    • similarities = df.dot(current_article)
    • Highest similarities: similarities.nlargest()
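
The two steps can be sketched end to end on a toy feature array. The titles and values are invented for the example; after `normalize` every row has unit length, so the dot product with one row equals the cosine similarity.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

# Hypothetical article titles and NMF features
titles = ["A", "B", "C", "D"]
nmf_features = np.array([
    [3.0, 4.0],
    [6.0, 8.0],   # same direction as A, larger values
    [4.0, 3.0],
    [0.0, 5.0],
])

# Scale each row to unit (L2) length
norm_features = normalize(nmf_features)
df = pd.DataFrame(norm_features, index=titles)

# Dot product of every row with the chosen article's row
current_article = df.loc["A"]
similarities = df.dot(current_article)

print(similarities.nlargest(3))
```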

Recommend musical artists

# Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline
import pandas as pd

# Create a MaxAbsScaler: scaler
# This gives all users the same influence on the model, regardless of how many different artists they've listened to
scaler = MaxAbsScaler()

# Create an NMF model: nmf
nmf = NMF(n_components=20)

# Create a Normalizer: normalizer
normalizer = Normalizer()

# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)

# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=artist_names)

# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen']

# Compute cosine similarities: similarities
similarities = df.dot(artist)

# Display those with highest cosine similarity
print(similarities.nlargest())
<script.py> output:
    Bruce Springsteen    1.000000
    Neil Young           0.955896
    Van Morrison         0.872452
    Leonard Cohen        0.864763
    Bob Dylan            0.859047
    dtype: float64