# Module 3.4: TF–IDF Representation
TF–IDF (Term Frequency–Inverse Document Frequency) weights each term by how important it is to a document relative to the rest of the corpus.
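In its classic form (the one computed manually in Section 2 below), the weight of term $t$ in document $d$ is the length-normalised term frequency scaled by the log inverse document frequency:

$$
\operatorname{tfidf}(t,d) \;=\; \frac{\operatorname{count}(t,d)}{|d|} \times \log\frac{N}{\operatorname{df}(t)}
$$

where $N$ is the number of documents in the corpus and $\operatorname{df}(t)$ is the number of documents containing $t$. Note that scikit-learn's `TfidfVectorizer` (Section 1) uses a smoothed variant of the IDF and L2-normalises each document vector, so its values differ slightly from this classic formula.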
## 1. Using scikit-learn's `TfidfVectorizer`

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

# Initialize and fit-transform
tfidf = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True)
X_tfidf = tfidf.fit_transform(corpus)

# Convert the sparse matrix to a DataFrame
df = pd.DataFrame(
    X_tfidf.toarray(),
    index=[f"doc{i+1}" for i in range(len(corpus))],
    columns=tfidf.get_feature_names_out(),
)
print(df.round(3))
```
Output:
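Once fitted, the vectorizer can score unseen text against the same vocabulary and IDF weights via `transform`. A minimal sketch (the query sentence is made up for illustration):

```python
# Score a new document with the already-fitted vectorizer.
# Out-of-vocabulary terms are simply ignored.
query = ["the cat saw the log"]  # hypothetical unseen document
q_vec = tfidf.transform(query)   # re-uses the IDF weights learned above
print(pd.DataFrame(q_vec.toarray(),
                   columns=tfidf.get_feature_names_out()).round(3))
```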
## 2. Manual TF–IDF Computation

```python
import math
from collections import Counter

# Re-uses `corpus` from Section 1

# 1. Build document frequencies
N = len(corpus)
df = Counter()
for doc in corpus:
    terms = set(doc.split())
    for t in terms:
        df[t] += 1

# 2. Compute TF–IDF per document
def compute_tfidf(doc: str):
    tf = Counter(doc.split())
    doc_len = sum(tf.values())
    return {
        t: (tf[t] / doc_len) * math.log(N / df[t])
        for t in tf
    }

# 3. Display TF–IDF for each document
for i, doc in enumerate(corpus, 1):
    scores = compute_tfidf(doc)
    # Round for readability
    rounded = {t: round(s, 3) for t, s in scores.items()}
    print(f"doc{i} TF–IDF:", rounded)
```
Output:
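The manual values above will not match Section 1 exactly: `TfidfVectorizer` uses raw counts for TF, the smoothed IDF $\ln\frac{1+N}{1+\operatorname{df}(t)} + 1$, and then L2-normalises each vector. A self-contained sketch of that variant for `doc1`, reconstructed from scikit-learn's documented formula:

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]
N = len(corpus)
df = Counter(t for doc in corpus for t in set(doc.split()))

# Smoothed IDF as used by TfidfVectorizer: ln((1 + N) / (1 + df)) + 1
tf = Counter(corpus[0].split())  # doc1
weights = {t: c * (math.log((1 + N) / (1 + df[t])) + 1) for t, c in tf.items()}

# L2-normalise so the document vector has unit length
norm = math.sqrt(sum(w * w for w in weights.values()))
print({t: round(w / norm, 3) for t, w in sorted(weights.items())})
# This should reproduce the doc1 row printed in Section 1.
```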
## 3. Notes on Normalization

- `norm='l2'` (the default) scales each document vector to unit length.
- `smooth_idf=True` adds 1 to document frequencies to avoid division by zero.
- TF–IDF highlights terms that are frequent in a document but rare across the corpus.
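A quick check of the unit-length property, reusing `X_tfidf` from Section 1 (a minimal sketch):

```python
import numpy as np

# Every row of the L2-normalised TF–IDF matrix has Euclidean length 1
row_norms = np.linalg.norm(X_tfidf.toarray(), axis=1)
print(row_norms)  # -> [1. 1. 1.]
```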
Continue to 3.5 Word Embeddings (Word2Vec & GloVe)