# Module 6: Topic Modeling & Document Clustering
This module shows how to discover latent topics in a collection of documents and how to cluster documents based on their content.
## 6.1 Topic Modeling with LDA (gensim)
Latent Dirichlet Allocation (LDA) is a generative probabilistic model that represents each document as a mixture of topics, and each topic as a distribution over words.
```python
import gensim
from gensim import corpora
from pprint import pprint

# 1. Toy corpus
docs = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user-perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Width of trees and well quasi ordering",
    "Graph minors A survey",
]

# 2. Preprocess: simple tokenization & lowercasing
texts = [[w.lower() for w in doc.split()] for doc in docs]

# 3. Build dictionary & corpus
dictionary = corpora.Dictionary(texts)
corpus_bow = [dictionary.doc2bow(text) for text in texts]

# 4. Train LDA model: 2 topics
lda = gensim.models.LdaModel(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=2,
    random_state=42,
    passes=10
)

# 5. Print topics
pprint(lda.print_topics(num_words=5))
```
Output:

```
[(0,
  '0.070*"graph" + 0.051*"trees" + 0.049*"minors" + 0.049*"of" + '
  '0.031*"interface"'),
 (1,
  '0.094*"of" + 0.076*"system" + 0.042*"response" + 0.042*"time" + '
  '0.042*"user"')]
```
The `lda.print_topics()` output shows two topics, each represented by its top words and their weights (i.e. the probability of each word within the topic):
- **Topic 0**
  - Top terms: `graph` (0.070), `trees` (0.051), `minors` (0.049), `of` (0.049), `interface` (0.031)
  - Interpretation: this topic clusters documents about graph-theoretic concepts and tree structures (e.g. “Graph minors”, “trees”), with “interface” pulled in from phrases like “interface of EPS”.
- **Topic 1**
  - Top terms: `of` (0.094), `system` (0.076), `response` (0.042), `time` (0.042), `user` (0.042)
  - Interpretation: this topic centres on system performance and user-centric discussions (e.g. “system response time”, “user opinion”).
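Because LDA models each document as a mixture of topics, it is also worth inspecting the per-document topic proportions. A minimal sketch, reusing the `lda`, `corpus_bow` and `docs` objects defined above:

```python
# Per-document topic mixtures (the probabilities sum to 1 for each document)
for i, bow in enumerate(corpus_bow):
    mixture = lda.get_document_topics(bow, minimum_probability=0.0)
    formatted = ", ".join(f"topic {t}: {p:.2f}" for t, p in mixture)
    print(f"Doc {i}: {formatted}  |  {docs[i][:40]}")
```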
Note: The high weight on the stop-word “of” indicates that stop-word removal can help produce cleaner, more coherent topics.
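A minimal sketch of that preprocessing step, using gensim's built-in `STOPWORDS` set before rebuilding the dictionary (the choice of stop-word list is an assumption; NLTK's English list would work just as well):

```python
from gensim.parsing.preprocessing import STOPWORDS

# Drop common English stop words during tokenisation
texts_clean = [
    [w.lower() for w in doc.split() if w.lower() not in STOPWORDS]
    for doc in docs
]
dictionary_clean = corpora.Dictionary(texts_clean)
corpus_clean = [dictionary_clean.doc2bow(text) for text in texts_clean]
```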
## 6.2 Interactive Topic Visualization (pyLDAvis)
After training an LDA model, the pyLDAvis toolkit provides an interactive, browser-based view of your topics, as shown in the image below. Here’s how to interpret the three main panels:
```python
# 1. Install pyLDAvis into this notebook’s environment
%pip install --quiet pyLDAvis

# 2. Imports
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# 3. Prepare the LDA vis data
#    - lda:        your trained Gensim LdaModel
#    - corpus_bow: your corpus in BoW format
#    - dictionary: your Gensim Dictionary
vis_data = gensimvis.prepare(lda, corpus_bow, dictionary)

# 4. Display inline in Jupyter
pyLDAvis.display(vis_data)

# 5. Optionally, save to HTML for embedding or sharing
#    (make sure the 'images/' folder exists or change the path)
pyLDAvis.save_html(vis_data, 'images/module6_2_pyldavis.html')
print("pyLDAvis visualization ready (and saved to images/module6_2_pyldavis.html)")
```
### 1. Intertopic Distance Map
- **Circles**: each circle represents a topic. The number inside is the topic ID (e.g. “1”, “2”).
- **Position**: circles are positioned via multidimensional scaling on their word distributions:
  - close together → the topics share many high-probability words
  - far apart → the topics are more distinct
- **Size**: proportional to the topic’s overall prevalence in the corpus (a larger circle → more tokens assigned to that topic).
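To get an intuition for what “close” means here, you can compute the pairwise distances between the topic-word distributions yourself. A small sketch using SciPy's Jensen–Shannon distance (pyLDAvis performs its own distance computation and scaling, so treat this as illustrative only):

```python
from itertools import combinations
from scipy.spatial.distance import jensenshannon

topic_word = lda.get_topics()          # shape: (num_topics, vocab_size)
for i, j in combinations(range(lda.num_topics), 2):
    dist = jensenshannon(topic_word[i], topic_word[j])
    print(f"Topic {i} vs Topic {j}: JS distance = {dist:.3f}")
```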
### 2. Top-Terms Bar Chart
*(Figure: Top Terms for Topic 2)*
Once a topic is selected (here, Topic 2), three visual cues appear:
- **Light-blue bars**: indicate each term’s overall frequency in the entire corpus, $P(w)$.
- **Red bars**: indicate the estimated frequency of that term within the selected topic, $P(w \mid \text{topic})$.
- **Dark-blue highlighting on the term labels**: marks the top-$K$ terms ranked by the current relevance metric. The λ slider adjusts the balance between raw frequency and topical lift, computed on a log scale:

$$\text{relevance}(w, t) = \lambda \log P(w \mid t) + (1 - \lambda) \log \frac{P(w \mid t)}{P(w)}$$

At intermediate values such as $\lambda = 0.33$, the ranking balances general frequency against topic-specific distinctiveness (a small worked example follows).
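A minimal sketch of this relevance calculation, assuming $P(w \mid t)$ is taken from `lda.get_topics()` and $P(w)$ is estimated from raw term counts in `corpus_bow` (pyLDAvis computes these quantities internally, so its numbers will differ slightly):

```python
import numpy as np

lam = 0.33           # λ chosen for illustration
topic_id = 1         # topic to inspect (assumed)

# P(w | t): the topic-word distribution of the chosen topic
p_w_t = lda.get_topics()[topic_id]

# P(w): overall term frequencies estimated from the BoW corpus
counts = np.zeros(len(dictionary))
for bow in corpus_bow:
    for term_id, freq in bow:
        counts[term_id] += freq
p_w = counts / counts.sum()

# relevance(w, t) = λ·log P(w|t) + (1−λ)·log(P(w|t)/P(w))
eps = 1e-12          # guard against log(0)
relevance = lam * np.log(p_w_t + eps) + (1 - lam) * np.log((p_w_t + eps) / (p_w + eps))

top_ids = relevance.argsort()[::-1][:5]
print("Most relevant terms:", [dictionary[int(i)] for i in top_ids])
```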
### 3. Marginal Topic Distribution
*(Figure: Marginal Topic Distribution legend)*
- Displays a bubble legend (2%, 5%, 10%) indicating what the circle sizes correspond to in terms of token share.
- Use this to gauge whether a topic is very common or represents only a small slice of the corpus; a sketch of how to approximate these shares from the model follows below.
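As a rough check, the same token shares can be approximated directly from the trained model by weighting each document's topic mixture by its length (a minimal sketch reusing `lda` and `corpus_bow`; pyLDAvis's own estimates may differ slightly):

```python
import numpy as np

# Accumulate each topic's expected share of corpus tokens
topic_mass = np.zeros(lda.num_topics)
for bow in corpus_bow:
    n_tokens = sum(freq for _, freq in bow)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        topic_mass[topic_id] += prob * n_tokens

topic_share = topic_mass / topic_mass.sum()
for t, share in enumerate(topic_share):
    print(f"Topic {t}: {share:.1%} of tokens")
```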
### How to use this visualisation
- Explore topic similarity on the distance map to identify overlapping themes or clear outliers.
- Select a topic (via the text box or “Previous/Next Topic” buttons) to update the Top-Terms view.
- Adjust λ to shift the focus between high-frequency words and those with high lift (uniquely topical).
- Export the full interactive HTML (`module6_2_pyldavis.html`) to share or embed in your repository/wiki.
This interactive visualisation helps us understand how topics relate, which words define each one, and how prominent they are in the collection.
## 6.3 Document Clustering with KMeans
Cluster documents in TF–IDF space using K-Means.
```python
# kmeans_clustering_demo.ipynb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# 0. Define your corpus of documents
docs = [
    "The cat sat on the mat.",
    "Dogs and cats living together.",
    "The quick brown fox jumps over the lazy dog.",
    "I love my pet cat.",
    "My neighbour has three dogs.",
    "Foxes are wild animals."
]

# 1. Vectorise with TF–IDF (automatically strips English stopwords)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# 2. Fit KMeans with 2 clusters
km = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = km.fit_predict(X)

# 3. View cluster assignments
for i, (doc, label) in enumerate(zip(docs, clusters)):
    snippet = doc if len(doc) < 50 else doc[:47] + "..."
    print(f"Doc {i:2d} → Cluster {label}: {snippet}")

# 4. (Optional) Inspect top terms per cluster
print("\nTop terms per cluster:")
terms = vectorizer.get_feature_names_out()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for label in range(km.n_clusters):
    top_terms = [terms[idx] for idx in order_centroids[label, :5]]
    print(f"  Cluster {label}: {', '.join(top_terms)}")
```
Output:

```
Doc  0 → Cluster 0: The cat sat on the mat.
Doc  1 → Cluster 1: Dogs and cats living together.
Doc  2 → Cluster 0: The quick brown fox jumps over the lazy dog.
Doc  3 → Cluster 0: I love my pet cat.
Doc  4 → Cluster 1: My neighbour has three dogs.
Doc  5 → Cluster 0: Foxes are wild animals.

Top terms per cluster:
  Cluster 0: cat, sat, pet, mat, love
  Cluster 1: dogs, neighbour, cats, living, wild
```
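Once fitted, the same vectoriser and model can assign a previously unseen document to a cluster. A small sketch (the example sentence is made up):

```python
new_doc = ["My cat chased the neighbour's dog."]   # hypothetical unseen document
new_vec = vectorizer.transform(new_doc)            # reuse the fitted TF–IDF vocabulary
print("Predicted cluster:", km.predict(new_vec)[0])
```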
## 6.4 Evaluation Metrics
### Topic Coherence (gensim)
```python
# lda_coherence_demo.ipynb
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
import nltk

nltk.download('punkt', quiet=True)  # tokeniser models needed by nltk.word_tokenize

# --- 0. Example raw documents (replace these with your own preprocessed texts) ---
raw_docs = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement"
]

# --- 1. Tokenise / preprocess (very basic) ---
texts = [
    [word.lower() for word in nltk.word_tokenize(doc) if word.isalpha()]
    for doc in raw_docs
]

# --- 2. Build the dictionary and BoW corpus ---
dictionary = corpora.Dictionary(texts)
corpus_bow = [dictionary.doc2bow(text) for text in texts]

# --- 3. Train an LDA model ---
lda = models.LdaModel(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=3,
    random_state=42,
    update_every=1,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

# --- 4. Compute coherence using the c_v metric ---
coherence_model = CoherenceModel(
    model=lda,
    texts=texts,
    dictionary=dictionary,
    coherence='c_v'
)
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score (c_v): {coherence_score:.4f}")
```
Output:

```
Coherence Score (c_v): 0.3562
```
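Coherence is most useful for comparing models, for example when choosing the number of topics. A hedged sketch that retrains the model for a few candidate values and reports c_v for each (the candidate range is arbitrary):

```python
# Compare coherence across candidate topic counts
for k in (2, 3, 4):
    lda_k = models.LdaModel(
        corpus=corpus_bow, id2word=dictionary,
        num_topics=k, random_state=42, passes=10
    )
    cm = CoherenceModel(model=lda_k, texts=texts,
                        dictionary=dictionary, coherence='c_v')
    print(f"num_topics={k}: c_v = {cm.get_coherence():.4f}")
```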
### Clustering Silhouette Score
```python
from sklearn.metrics import silhouette_score

# X and clusters come from the K-Means example in Section 6.3
sil_score = silhouette_score(X, clusters)
print("Silhouette Score:", sil_score)
```
Output:

```
Silhouette Score: 0.07305875685814943
```
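A score this close to 0 suggests the clusters overlap considerably in TF–IDF space. A common use of the metric is to sweep the number of clusters and keep the value of k with the highest score; a minimal sketch on the same `X`:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Try a few cluster counts and report the silhouette score for each
for k in range(2, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.4f}")
```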
## 6.5 Alternative: NMF for Topic Extraction
Non-negative Matrix Factorization (NMF) can also extract topics:
```python
from sklearn.decomposition import NMF

# X and vectorizer come from the TF–IDF example in Section 6.3

# 1. Fit NMF with 2 components
nmf = NMF(n_components=2, random_state=42)
W = nmf.fit_transform(X)   # document-topic matrix
H = nmf.components_        # topic-word matrix

# 2. Display top words per topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top_words = [feature_names[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {topic_idx}:", top_words)
```
Output:

```
Topic 0: ['dogs', 'neighbour', 'cats', 'living', 'jumps']
Topic 1: ['cat', 'sat', 'pet', 'mat', 'love']
```
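Because `W` holds one row of topic weights per document, each document's dominant topic falls out directly. A small sketch, reusing `W` and `docs` from above:

```python
# Assign each document to its highest-weighted NMF topic
dominant_topic = W.argmax(axis=1)
for i, (doc, topic) in enumerate(zip(docs, dominant_topic)):
    print(f"Doc {i} → Topic {topic}: {doc}")
```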