17 04 Discovering interpretable features
01 Non-negative matrix factorization (NMF)
Dimension reduction technique
NMF models are interpretable (unlike PCA)
HOWEVER, can only be applied when all features are non-negative!
Word frequencies in each document
Images encoded as arrays
Audio spectrograms
Purchase histories on e-commerce site
Interpretable parts
NMF expresses documents as combinations of topics (or "themes")
Using scikit-learn NMF
Follows the fit()/transform() pattern
Must specify the number of components, e.g. NMF(n_components=6)
Works with NumPy arrays and with csr_matrix (sparse matrices)
NMF components & features
Dimension of each component = dimension of the samples
Entries of the components are non-negative
NMF feature values are also non-negative
Can be used to reconstruct the samples: samples ≈ matrix product of the NMF features and the components (see the sketch below)
nmf_features = model.transform(samples)
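A minimal sketch of the reconstruction claim, using nothing beyond scikit-learn and a small made-up non-negative array (the values below are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data: 6 samples, 4 features (illustrative values)
samples = np.array([[1.0, 0.0, 2.0, 0.0],
                    [0.0, 1.0, 0.0, 2.0],
                    [2.0, 0.0, 4.0, 0.0],
                    [0.0, 2.0, 0.0, 4.0],
                    [1.0, 1.0, 2.0, 2.0],
                    [2.0, 1.0, 4.0, 2.0]])

model = NMF(n_components=2)
nmf_features = model.fit_transform(samples)

# The samples are (approximately) the matrix product
# of the NMF features and the components
reconstruction = nmf_features @ model.components_
print(np.round(reconstruction, 2))
```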
NMF applied to Wikipedia articles
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components=6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Import pandas
import pandas as pd

# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])
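The components themselves are what make the features interpretable: each component is a topic, and its largest entries are the words that characterize it. A short sketch, assuming a list `words` holding the column labels of `articles` (`words` is not defined in the notes above; it comes with the course data):

```python
import pandas as pd

# Each row of components_ is a topic; the columns line up with the words
components_df = pd.DataFrame(model.components_, columns=words)

# The largest entries of a component reveal its topic
print(components_df.iloc[3].nlargest())
```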
02 NMF learns interpretable parts

# Import pyplot
from matplotlib import pyplot as plt

# Select the 0th row: digit
digit = samples[0, :]

# Print digit
print(digit)

# Reshape digit to a 13x8 array: bitmap
bitmap = digit.reshape(13, 8)

# Print bitmap
print(bitmap)

# Use plt.imshow to display bitmap
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()  # shows the digit 7
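Fitting NMF to the whole digits dataset makes the "interpretable parts" concrete: each component is itself a 13x8 image, and it turns out to be one cell of the LED display. A sketch, assuming `samples` is the LED digits array used above:

```python
from sklearn.decomposition import NMF
from matplotlib import pyplot as plt

# One component per LED cell
model = NMF(n_components=7)
features = model.fit_transform(samples)

# Display each 104-entry component as a 13x8 bitmap
for i, component in enumerate(model.components_):
    plt.subplot(1, 7, i + 1)
    plt.imshow(component.reshape(13, 8), cmap='gray', interpolation='nearest')
    plt.axis('off')
plt.show()
```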
03 Building recommender systems using NMF
Strategy
NMF feature values describe the topics; similar documents have similar NMF feature values.
But NMF feature values can't be compared directly:
Different versions of the same document have the same topic proportions,
while the exact feature values may differ (e.g. one version uses many meaningless words).
Still, all versions lie on the same line through the origin.
Cosine similarity
Uses the angle between the lines
Higher values mean more similar (maximum value is 1, when the angle is 0)
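In symbols, the cosine similarity of two feature vectors is their dot product divided by the product of their norms. A quick NumPy check (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = (a . b) / (||a|| * ||b||)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.5])
print(cosine_similarity(a, 2 * a))                      # 1.0: same line through the origin
print(cosine_similarity(a, np.array([0.5, 1.0, 3.0])))  # less than 1: different directions
```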
Calculate the cosine similarity
Normalize the features:
from sklearn.preprocessing import normalize
norm_features = normalize(nmf_features)
Compute the similarities (dot product of the feature DataFrame with the features of one article):
similarities = df.dot(current_article)
Highest similarities: similarities.nlargest() (see the full sketch below)
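Putting the steps together, a minimal end-to-end sketch, assuming nmf_features and titles from the Wikipedia example above (the article title is just an illustrative index label):

```python
import pandas as pd
from sklearn.preprocessing import normalize

# Normalize the NMF features so that dot products are cosine similarities
norm_features = normalize(nmf_features)

# DataFrame of normalized features, indexed by article title
df = pd.DataFrame(norm_features, index=titles)

# The article to compare against
current_article = df.loc['Cristiano Ronaldo']

# Cosine similarity of every article with the current one
similarities = df.dot(current_article)

# The most similar articles
print(similarities.nlargest())
```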
Recommend musical artists
# Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline
import pandas as pd

# Create a MaxAbsScaler: scaler
# (makes all users have the same influence on the model,
# regardless of how many different artists they've listened to)
scaler = MaxAbsScaler()

# Create an NMF model: nmf
nmf = NMF(n_components=20)

# Create a Normalizer: normalizer
# (normalizes the rows, so that dot products give cosine similarities)
normalizer = Normalizer()

# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)

# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=artist_names)

# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen']

# Compute cosine similarities: similarities
similarities = df.dot(artist)

# Display those with highest cosine similarity
print(similarities.nlargest())
<script.py> output:
Bruce Springsteen 1.000000
Neil Young 0.955896
Van Morrison 0.872452
Leonard Cohen 0.864763
Bob Dylan 0.859047
dtype: float64