17 04 Discovering interpretable features
01 Non-negative matrix factorization (NMF)
Dimension reduction technique
NMF models are interpretable (unlike PCA)
HOWEVER, can only be applied when all features are non-negative!
Word frequencies in each document
Images encoded as arrays
Audio spectrograms
Purchase histories on e-commerce site
Interpretable parts
NMF expresses documents as combinations of topics (or "themes")
Using scikit-learn NMF
Follows the fit()/transform() pattern
Must specify the number of components, e.g. NMF(n_components=6)
Works with NumPy arrays and with csr_matrix (sparse matrices)
NMF components & features
Dimension of each component = dimension of the samples
Entries of the components are non-negative
NMF feature values are also non-negative
Can be used to reconstruct the samples: samples ≈ matrix product of the NMF features and the components (see the sketch below)
nmf_features = model.transform(samples)
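A minimal sketch of the reconstruction claim, using nothing beyond scikit-learn and a small made-up non-negative array (the values below are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data: 6 samples, 4 features (illustrative values)
samples = np.array([[1.0, 0.0, 2.0, 0.0],
                    [0.0, 1.0, 0.0, 2.0],
                    [2.0, 0.0, 4.0, 0.0],
                    [0.0, 2.0, 0.0, 4.0],
                    [1.0, 1.0, 2.0, 2.0],
                    [2.0, 1.0, 4.0, 2.0]])

model = NMF(n_components=2)
nmf_features = model.fit_transform(samples)

# The samples are (approximately) the matrix product
# of the NMF features and the components
reconstruction = nmf_features @ model.components_
print(np.round(reconstruction, 2))
```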
NMF applied to Wikipedia articles
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components=6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Import pandas
import pandas as pd

# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])
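The components themselves are what make the features interpretable: each component is a topic, and its largest entries are the words that characterize it. A short sketch, assuming a list `words` holding the column labels of `articles` (`words` is not defined in the notes above; it comes with the course data):

```python
import pandas as pd

# Each row of components_ is a topic; the columns line up with the words
components_df = pd.DataFrame(model.components_, columns=words)

# The largest entries of a component reveal its topic
print(components_df.iloc[3].nlargest())
```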
02 NMF learns interpretable parts

# Import pyplot
from matplotlib import pyplot as plt

# Select the 0th row: digit
digit = samples[0, :]

# Print digit
print(digit)

# Reshape digit to a 13x8 array: bitmap
bitmap = digit.reshape(13, 8)

# Print bitmap
print(bitmap)

# Use plt.imshow to display bitmap
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()  # shows the digit 7
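Fitting NMF to the whole digits dataset makes the "interpretable parts" concrete: each component is itself a 13x8 image, and it turns out to be one cell of the LED display. A sketch, assuming `samples` is the LED digits array used above:

```python
from sklearn.decomposition import NMF
from matplotlib import pyplot as plt

# One component per LED cell
model = NMF(n_components=7)
features = model.fit_transform(samples)

# Display each 104-entry component as a 13x8 bitmap
for i, component in enumerate(model.components_):
    plt.subplot(1, 7, i + 1)
    plt.imshow(component.reshape(13, 8), cmap='gray', interpolation='nearest')
    plt.axis('off')
plt.show()
```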
03 Building recommender systems using NMF
Strategy
NMF feature values describe the topics; similar documents have similar NMF feature values.
But NMF feature values can't be compared directly:
Different versions of the same document have the same topic proportions,
while the exact feature values may differ (e.g. one version uses many meaningless words).
Still, all versions lie on the same line through the origin.
Cosine similarity
Uses the angle between the lines
Higher values mean more similar (maximum value is 1, when the angle is 0)
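In symbols, the cosine similarity of two feature vectors is their dot product divided by the product of their norms. A quick NumPy check (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = (a . b) / (||a|| * ||b||)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.5])
print(cosine_similarity(a, 2 * a))                      # 1.0: same line through the origin
print(cosine_similarity(a, np.array([0.5, 1.0, 3.0])))  # less than 1: different directions
```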
Calculate the cosine similarity
Normalize the features:
from sklearn.preprocessing import normalize
norm_features = normalize(nmf_features)
Compute the similarities (dot product of the feature DataFrame with the features of one article):
similarities = df.dot(current_article)
Highest similarities: similarities.nlargest() (see the full sketch below)
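Putting the steps together, a minimal end-to-end sketch, assuming nmf_features and titles from the Wikipedia example above (the article title is just an illustrative index label):

```python
import pandas as pd
from sklearn.preprocessing import normalize

# Normalize the NMF features so that dot products are cosine similarities
norm_features = normalize(nmf_features)

# DataFrame of normalized features, indexed by article title
df = pd.DataFrame(norm_features, index=titles)

# The article to compare against
current_article = df.loc['Cristiano Ronaldo']

# Cosine similarity of every article with the current one
similarities = df.dot(current_article)

# The most similar articles
print(similarities.nlargest())
```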
Recommend musical artists
# Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline
import pandas as pd

# Create a MaxAbsScaler: scaler
# (makes all users have the same influence on the model,
# regardless of how many different artists they've listened to)
scaler = MaxAbsScaler()

# Create an NMF model: nmf
nmf = NMF(n_components=20)

# Create a Normalizer: normalizer
# (normalizes the rows, so that dot products give cosine similarities)
normalizer = Normalizer()

# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)

# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=artist_names)

# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen']

# Compute cosine similarities: similarities
similarities = df.dot(artist)

# Display those with highest cosine similarity
print(similarities.nlargest())
<script.py> output:
Bruce Springsteen 1.000000
Neil Young 0.955896
Van Morrison 0.872452
Leonard Cohen 0.864763
Bob Dylan 0.859047
dtype: float64