
Module 3.4: TF–IDF Representation

TF–IDF (Term Frequency–Inverse Document Frequency) weights each term by how often it appears in a document, discounted by how common it is across the corpus, so terms that are distinctive of a document receive the highest scores.
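For a term t in a document d drawn from a corpus of N documents, the classic (unsmoothed) weighting, the same variant computed by hand in section 2 below, is

$$\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \log \frac{N}{\operatorname{df}(t)}$$

where tf(t, d) is the relative frequency of t in d and df(t) is the number of documents that contain t. scikit-learn (section 1) uses a smoothed variant of the idf factor, described in the notes at the end.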


1. Using scikit-learn’s TfidfVectorizer

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog"
]

# Initialize and fit-transform (these settings are scikit-learn's defaults, spelled out)
tfidf = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True)
X_tfidf = tfidf.fit_transform(corpus)

# Convert the sparse matrix to a DataFrame for inspection
df = pd.DataFrame(
    X_tfidf.toarray(),
    index=[f"doc{i+1}" for i in range(len(corpus))],
    columns=tfidf.get_feature_names_out()
)

print(df.round(3))
```

Output:

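With this corpus (and the default smoothed idf, see the notes below), the printed DataFrame comes out as follows, rounded to three decimals:

```
      cat    dog    log    mat     on    sat    saw    the
doc1  0.374  0.000  0.000  0.492  0.374  0.374  0.000  0.581
doc2  0.000  0.374  0.492  0.000  0.374  0.374  0.000  0.581
doc3  0.404  0.404  0.000  0.000  0.000  0.000  0.531  0.627
```

Note that the ubiquitous "the" still receives the largest weight in every row: scikit-learn multiplies raw term counts by a smoothed idf that never drops to zero. Once fitted, the vectorizer can also score unseen text against the same vocabulary, e.g. `tfidf.transform(["the dog saw the mat"])`.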

2. Manual TF–IDF Computation

```python
import math
from collections import Counter

# 1. Build document frequencies: in how many documents does each term occur?
#    (named doc_freq to avoid clobbering the DataFrame df from section 1)
N = len(corpus)
doc_freq = Counter()
for doc in corpus:
    for t in set(doc.split()):  # set(): count each term once per document
        doc_freq[t] += 1

# 2. Compute TF–IDF per document: relative term frequency × unsmoothed idf
def compute_tfidf(doc: str):
    tf = Counter(doc.split())
    doc_len = sum(tf.values())
    return {
        t: (tf[t] / doc_len) * math.log(N / doc_freq[t])
        for t in tf
    }

# 3. Display TF–IDF for each document
for i, doc in enumerate(corpus, 1):
    scores = compute_tfidf(doc)
    rounded = {t: round(s, 3) for t, s in scores.items()}  # round for readability
    print(f"doc{i} TF–IDF:", rounded)
```

Output:

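Running this on the same corpus should print:

```
doc1 TF–IDF: {'the': 0.0, 'cat': 0.068, 'sat': 0.068, 'on': 0.068, 'mat': 0.183}
doc2 TF–IDF: {'the': 0.0, 'dog': 0.068, 'sat': 0.068, 'on': 0.068, 'log': 0.183}
doc3 TF–IDF: {'the': 0.0, 'cat': 0.081, 'saw': 0.22, 'dog': 0.081}
```

Here "the" scores exactly 0, since it occurs in all three documents and log(3/3) = 0: the unsmoothed idf fully discounts corpus-wide terms, unlike the smoothed variant in section 1.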

3. Notes on Normalization

  • norm='l2' (the default) scales each document vector to unit Euclidean length, so the dot product of two rows is their cosine similarity.
  • smooth_idf=True (also the default) computes idf(t) = ln((1 + N) / (1 + df(t))) + 1, adding one to both counts as if one extra document contained every term; this avoids division by zero and keeps every idf positive (see the check below).
  • The manual computation in section 2 uses relative term frequencies and the unsmoothed idf log(N / df(t)) with no normalization, which is why its numbers differ from scikit-learn's.
  • TF–IDF highlights terms that are frequent within a document but rare across the corpus.
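
As a quick check of that formula, the idf values can be recomputed by hand and compared against the `idf_` attribute of the vectorizer fitted in section 1 (a minimal sketch reusing `corpus` and `tfidf` from above):

```python
import numpy as np

# Document frequency of each vocabulary term, counted from the raw corpus
terms = tfidf.get_feature_names_out()
N = len(corpus)
df_t = np.array([sum(term in doc.split() for doc in corpus) for term in terms])

# scikit-learn's smoothed idf: ln((1 + N) / (1 + df)) + 1
manual_idf = np.log((1 + N) / (1 + df_t)) + 1

print(np.allclose(manual_idf, tfidf.idf_))  # True
```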

Continue to 3.5 Word Embeddings (Word2Vec & GloVe)