# Module 6: Topic Modeling & Document Clustering
This module shows how to discover latent topics in a collection of documents and how to cluster documents based on their content.
## 6.1 Topic Modeling with LDA (gensim)
Latent Dirichlet Allocation (LDA) is a generative probabilistic model that represents each document as a mixture of topics, and each topic as a distribution over words.
```python
import gensim
from gensim import corpora
from pprint import pprint

# 1. Toy corpus
docs = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user-perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Width of trees and well quasi ordering",
    "Graph minors A survey",
]

# 2. Preprocess: simple tokenization & lowercasing
texts = [[w.lower() for w in doc.split()] for doc in docs]

# 3. Build dictionary & corpus
dictionary = corpora.Dictionary(texts)
corpus_bow = [dictionary.doc2bow(text) for text in texts]

# 4. Train LDA model: 2 topics
lda = gensim.models.LdaModel(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=2,
    random_state=42,
    passes=10
)

# 5. Print topics
pprint(lda.print_topics(num_words=5))
```
Output:

```
[(0,
  '0.070*"graph" + 0.051*"trees" + 0.049*"minors" + 0.049*"of" + '
  '0.031*"interface"'),
 (1,
  '0.094*"of" + 0.076*"system" + 0.042*"response" + 0.042*"time" + '
  '0.042*"user"')]
```
The `lda.print_topics()` output shows two topics, each represented by its top words and their weights (i.e. the probability of each word within the topic):
- **Topic 0**
  - Top terms: `graph` (0.070), `trees` (0.051), `minors` (0.049), `of` (0.049), `interface` (0.031)
  - Interpretation: this topic clusters documents about graph-theoretic concepts and tree structures (e.g. “Graph minors”, “trees”), with “interface” pulled in from phrases like “interface of EPS”.
- **Topic 1**
  - Top terms: `of` (0.094), `system` (0.076), `response` (0.042), `time` (0.042), `user` (0.042)
  - Interpretation: this topic centres on system performance and user-centric discussions (e.g. “system response time”, “user opinion”).
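Because LDA models each document as a mixture of topics, it is also worth inspecting the per-document topic proportions. A minimal sketch, reusing the `lda`, `corpus_bow` and `docs` objects defined above:

```python
# Per-document topic mixtures (the probabilities sum to 1 for each document)
for i, bow in enumerate(corpus_bow):
    mixture = lda.get_document_topics(bow, minimum_probability=0.0)
    formatted = ", ".join(f"topic {t}: {p:.2f}" for t, p in mixture)
    print(f"Doc {i}: {formatted}  |  {docs[i][:40]}")
```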
Note: The high weight on the stop-word “of” indicates that stop-word removal can help produce cleaner, more coherent topics.
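A minimal sketch of that preprocessing step, using gensim's built-in `STOPWORDS` set before rebuilding the dictionary (the choice of stop-word list is an assumption; NLTK's English list would work just as well):

```python
from gensim.parsing.preprocessing import STOPWORDS

# Drop common English stop words during tokenisation
texts_clean = [
    [w.lower() for w in doc.split() if w.lower() not in STOPWORDS]
    for doc in docs
]
dictionary_clean = corpora.Dictionary(texts_clean)
corpus_clean = [dictionary_clean.doc2bow(text) for text in texts_clean]
```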
## 6.2 Interactive Topic Visualization (pyLDAvis)
After training an LDA model, the pyLDAvis toolkit provides an interactive, browser-based view of your topics, as shown in the image below. Here’s how to interpret the three main panels:
```python
# 1. Install pyLDAvis into this notebook’s environment
%pip install --quiet pyLDAvis

# 2. Imports
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# 3. Prepare the LDA vis data
#    - lda:        your trained Gensim LdaModel
#    - corpus_bow: your corpus in BoW format
#    - dictionary: your Gensim Dictionary
vis_data = gensimvis.prepare(lda, corpus_bow, dictionary)

# 4. Display inline in Jupyter
pyLDAvis.display(vis_data)

# 5. Optionally, save to HTML for embedding or sharing
#    (make sure the 'images/' folder exists or change the path)
pyLDAvis.save_html(vis_data, 'images/module6_2_pyldavis.html')
print("pyLDAvis visualization ready (and saved to images/module6_2_pyldavis.html)")
```
### 1. Intertopic Distance Map
- **Circles**: each circle represents a topic. The number inside is the topic ID (e.g. “1”, “2”).
- **Position**: circles are positioned via multidimensional scaling on their word distributions:
  - close together → the topics share many high-probability words
  - far apart → the topics are more distinct
- **Size**: proportional to the topic’s overall prevalence in the corpus (a larger circle → more tokens assigned to that topic).
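To get an intuition for what “close” means here, you can compute the pairwise distances between the topic-word distributions yourself. A small sketch using SciPy's Jensen–Shannon distance (pyLDAvis performs its own distance computation and scaling, so treat this as illustrative only):

```python
from itertools import combinations
from scipy.spatial.distance import jensenshannon

topic_word = lda.get_topics()          # shape: (num_topics, vocab_size)
for i, j in combinations(range(lda.num_topics), 2):
    dist = jensenshannon(topic_word[i], topic_word[j])
    print(f"Topic {i} vs Topic {j}: JS distance = {dist:.3f}")
```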
### 2. Top-Terms Bar Chart
*(Figure: Top Terms for Topic 2)*
Once a topic is selected (here, Topic 2), three visual cues appear:
- **Light-blue bars**: indicate each term’s overall frequency in the entire corpus, $P(w)$.
- **Red bars**: indicate the estimated frequency of that term within the selected topic, $P(w \mid \text{topic})$.
- **Dark-blue highlighting on the term labels**: marks the top-$K$ terms ranked by the current relevance metric. The λ slider adjusts the balance between raw frequency and topical lift, computed on a log scale:

$$\text{relevance}(w, t) = \lambda \log P(w \mid t) + (1 - \lambda) \log \frac{P(w \mid t)}{P(w)}$$

At intermediate values such as $\lambda = 0.33$, the ranking balances general frequency against topic-specific distinctiveness (a small worked example follows).
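A minimal sketch of this relevance calculation, assuming $P(w \mid t)$ is taken from `lda.get_topics()` and $P(w)$ is estimated from raw term counts in `corpus_bow` (pyLDAvis computes these quantities internally, so its numbers will differ slightly):

```python
import numpy as np

lam = 0.33           # λ chosen for illustration
topic_id = 1         # topic to inspect (assumed)

# P(w | t): the topic-word distribution of the chosen topic
p_w_t = lda.get_topics()[topic_id]

# P(w): overall term frequencies estimated from the BoW corpus
counts = np.zeros(len(dictionary))
for bow in corpus_bow:
    for term_id, freq in bow:
        counts[term_id] += freq
p_w = counts / counts.sum()

# relevance(w, t) = λ·log P(w|t) + (1−λ)·log(P(w|t)/P(w))
eps = 1e-12          # guard against log(0)
relevance = lam * np.log(p_w_t + eps) + (1 - lam) * np.log((p_w_t + eps) / (p_w + eps))

top_ids = relevance.argsort()[::-1][:5]
print("Most relevant terms:", [dictionary[int(i)] for i in top_ids])
```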
### 3. Marginal Topic Distribution
*(Figure: Marginal Topic Distribution legend)*
- Displays a bubble legend (2%, 5%, 10%) indicating what the circle sizes correspond to in terms of token share.
- Use this to gauge whether a topic is very common or represents only a small slice of the corpus; a sketch of how to approximate these shares from the model follows below.
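As a rough check, the same token shares can be approximated directly from the trained model by weighting each document's topic mixture by its length (a minimal sketch reusing `lda` and `corpus_bow`; pyLDAvis's own estimates may differ slightly):

```python
import numpy as np

# Accumulate each topic's expected share of corpus tokens
topic_mass = np.zeros(lda.num_topics)
for bow in corpus_bow:
    n_tokens = sum(freq for _, freq in bow)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        topic_mass[topic_id] += prob * n_tokens

topic_share = topic_mass / topic_mass.sum()
for t, share in enumerate(topic_share):
    print(f"Topic {t}: {share:.1%} of tokens")
```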
### How to use this visualisation
- Explore topic similarity on the distance map to identify overlapping themes or clear outliers.
- Select a topic (via the text box or “Previous/Next Topic” buttons) to update the Top-Terms view.
- Adjust λ to shift the focus between high-frequency words and those with high lift (uniquely topical).
- Export the full interactive HTML (`module6_2_pyldavis.html`) to share or embed in your repository/wiki.
This interactive visualisation helps us understand how topics relate, which words define each one, and how prominent they are in the collection.
## 6.3 Document Clustering with KMeans
Cluster documents in TF–IDF space using K-Means.
```python
# kmeans_clustering_demo.ipynb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# 0. Define your corpus of documents
docs = [
    "The cat sat on the mat.",
    "Dogs and cats living together.",
    "The quick brown fox jumps over the lazy dog.",
    "I love my pet cat.",
    "My neighbour has three dogs.",
    "Foxes are wild animals."
]

# 1. Vectorise with TF–IDF (automatically strips English stopwords)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# 2. Fit KMeans with 2 clusters
km = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = km.fit_predict(X)

# 3. View cluster assignments
for i, (doc, label) in enumerate(zip(docs, clusters)):
    snippet = doc if len(doc) < 50 else doc[:47] + "..."
    print(f"Doc {i:2d} → Cluster {label}: {snippet}")

# 4. (Optional) Inspect top terms per cluster
print("\nTop terms per cluster:")
terms = vectorizer.get_feature_names_out()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for label in range(km.n_clusters):
    top_terms = [terms[idx] for idx in order_centroids[label, :5]]
    print(f"  Cluster {label}: {', '.join(top_terms)}")
```
Output:

```
Doc  0 → Cluster 0: The cat sat on the mat.
Doc  1 → Cluster 1: Dogs and cats living together.
Doc  2 → Cluster 0: The quick brown fox jumps over the lazy dog.
Doc  3 → Cluster 0: I love my pet cat.
Doc  4 → Cluster 1: My neighbour has three dogs.
Doc  5 → Cluster 0: Foxes are wild animals.

Top terms per cluster:
  Cluster 0: cat, sat, pet, mat, love
  Cluster 1: dogs, neighbour, cats, living, wild
```
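Once fitted, the same vectoriser and model can assign a previously unseen document to a cluster. A small sketch (the example sentence is made up):

```python
new_doc = ["My cat chased the neighbour's dog."]   # hypothetical unseen document
new_vec = vectorizer.transform(new_doc)            # reuse the fitted TF–IDF vocabulary
print("Predicted cluster:", km.predict(new_vec)[0])
```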
## 6.4 Evaluation Metrics
### Topic Coherence (gensim)
```python
# lda_coherence_demo.ipynb
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
import nltk

nltk.download('punkt', quiet=True)  # tokeniser models needed by nltk.word_tokenize

# --- 0. Example raw documents (replace these with your own preprocessed texts) ---
raw_docs = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement"
]

# --- 1. Tokenise / preprocess (very basic) ---
texts = [
    [word.lower() for word in nltk.word_tokenize(doc) if word.isalpha()]
    for doc in raw_docs
]

# --- 2. Build the dictionary and BoW corpus ---
dictionary = corpora.Dictionary(texts)
corpus_bow = [dictionary.doc2bow(text) for text in texts]

# --- 3. Train an LDA model ---
lda = models.LdaModel(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=3,
    random_state=42,
    update_every=1,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

# --- 4. Compute coherence using the c_v metric ---
coherence_model = CoherenceModel(
    model=lda,
    texts=texts,
    dictionary=dictionary,
    coherence='c_v'
)
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score (c_v): {coherence_score:.4f}")
```
Output:

```
Coherence Score (c_v): 0.3562
```
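Coherence is most useful for comparing models, for example when choosing the number of topics. A hedged sketch that retrains the model for a few candidate values and reports c_v for each (the candidate range is arbitrary):

```python
# Compare coherence across candidate topic counts
for k in (2, 3, 4):
    lda_k = models.LdaModel(
        corpus=corpus_bow, id2word=dictionary,
        num_topics=k, random_state=42, passes=10
    )
    cm = CoherenceModel(model=lda_k, texts=texts,
                        dictionary=dictionary, coherence='c_v')
    print(f"num_topics={k}: c_v = {cm.get_coherence():.4f}")
```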
### Clustering Silhouette Score
```python
from sklearn.metrics import silhouette_score

# X and clusters come from the K-Means example in Section 6.3
sil_score = silhouette_score(X, clusters)
print("Silhouette Score:", sil_score)
```
Output:

```
Silhouette Score: 0.07305875685814943
```
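A score this close to 0 suggests the clusters overlap considerably in TF–IDF space. A common use of the metric is to sweep the number of clusters and keep the value of k with the highest score; a minimal sketch on the same `X`:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Try a few cluster counts and report the silhouette score for each
for k in range(2, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.4f}")
```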
## 6.5 Alternative: NMF for Topic Extraction
Non-negative Matrix Factorization (NMF) can also extract topics:
```python
from sklearn.decomposition import NMF

# X and vectorizer come from the TF–IDF example in Section 6.3

# 1. Fit NMF with 2 components
nmf = NMF(n_components=2, random_state=42)
W = nmf.fit_transform(X)   # document-topic matrix
H = nmf.components_        # topic-word matrix

# 2. Display top words per topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top_words = [feature_names[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {topic_idx}:", top_words)
```
Output:

```
Topic 0: ['dogs', 'neighbour', 'cats', 'living', 'jumps']
Topic 1: ['cat', 'sat', 'pet', 'mat', 'love']
```
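Because `W` holds one row of topic weights per document, each document's dominant topic falls out directly. A small sketch, reusing `W` and `docs` from above:

```python
# Assign each document to its highest-weighted NMF topic
dominant_topic = W.argmax(axis=1)
for i, (doc, topic) in enumerate(zip(docs, dominant_topic)):
    print(f"Doc {i} → Topic {topic}: {doc}")
```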