
Module 10: Document Retrieval & Ranking

This module demonstrates how to retrieve and rank relevant documents for a given query using classic lexical methods (TF–IDF, BM25) and semantic methods (sentence embeddings).


10.1 Vector Space Retrieval with TF–IDF & Cosine Similarity

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Sample corpus
docs = [
    "Natural language processing enables machines to understand text.",
    "Text summarization creates concise versions of documents.",
    "Machine translation converts text between languages.",
    "Question answering systems find answers in passages.",
    "Topic modeling uncovers themes in a collection of documents."
]

# 2. Fit TF–IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# 3. Retrieval function
def retrieve_tfidf(query, top_k=3):
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, X)[0]
    top_idx = scores.argsort()[::-1][:top_k]
    return [(docs[i], scores[i]) for i in top_idx]

# 4. Demo
for doc, score in retrieve_tfidf("summarize documents", top_k=3):
    print(f"{score:.3f} → {doc}")

Output:

0.336 → Text summarization creates concise versions of documents.
0.306 → Topic modeling uncovers themes in a collection of documents.
0.000 → Question answering systems find answers in passages.
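
The third hit scores 0.000 because it contains no query term at all: "summarize" is not even in the fitted vocabulary (the corpus contains the different token "summarization"), so only "documents" contributes to the ranking. A quick check of which query terms survive vectorisation, reusing the vectorizer fitted above (a minimal sketch; get_feature_names_out requires scikit-learn ≥ 1.0):

terms = vectorizer.get_feature_names_out()        # vocabulary learned from the corpus
q_vec = vectorizer.transform(["summarize documents"])
nonzero_cols = q_vec.nonzero()[1]                 # columns of query terms found in the vocabulary
print("Matched query terms:", [terms[i] for i in nonzero_cols])   # only 'documents' matches
print("'summarize' in vocabulary:", "summarize" in set(terms))    # False – corpus has 'summarization'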

10.2 BM25 Retrieval (Pure-Python Implementation)

# pure_python_bm25.ipynb

import math
from collections import Counter

# 1. Sample corpus
docs = [
    "The cat sat on the mat.",
    "Dogs and cats living together.",
    "The quick brown fox jumps over the lazy dog.",
    "I love my pet cat.",
    "My neighbour has three dogs.",
    "Foxes are wild animals."
]

# 2. Tokenize and compute document frequencies
tokenized_docs = [doc.lower().split() for doc in docs]
N = len(tokenized_docs)
df = Counter(term for doc in tokenized_docs for term in set(doc))

# 3. Compute IDF values
idf = {term: math.log((N - df_t + 0.5) / (df_t + 0.5) + 1) for term, df_t in df.items()}

# 4. Precompute average document length
avgdl = sum(len(doc) for doc in tokenized_docs) / N

# 5. BM25 scoring function
def bm25_score(query_tokens, k1=1.5, b=0.75):
    scores = []
    for doc in tokenized_docs:
        doc_len = len(doc)
        score = 0.0
        freqs = Counter(doc)
        for term in query_tokens:
            if term not in freqs:
                continue
            freq = freqs[term]
            numerator = freq * (k1 + 1)
            denominator = freq + k1 * (1 - b + b * doc_len / avgdl)
            score += idf.get(term, 0.0) * numerator / denominator
        scores.append(score)
    return scores

# 6. Retrieval wrapper
def retrieve_bm25(query, top_k=3):
    q_tokens = query.lower().split()
    scores = bm25_score(q_tokens)
    top_indices = sorted(range(N), key=lambda i: scores[i], reverse=True)[:top_k]
    return [(docs[i], scores[i]) for i in top_indices]

# 7. Demo
print("Query: 'document themes'")
for doc, score in retrieve_bm25("document themes", top_k=3):
    print(f"{score:.2f} → {doc}")

Output:

Query: 'document themes'
0.00 → The cat sat on the mat.
0.00 → Dogs and cats living together.
0.00 → The quick brown fox jumps over the lazy dog.

What’s happening?

The query

The phrase "The jumping fox" is lowercased and split into tokens:

q_tokens = ["the", "jumping", "fox"]

Scoring documents

BM25 computes a relevance score for each document by summing, over each query token, the term's BM25 contribution in that document:

$$\mathrm{score}(D, q) = \sum_{t \in q} \mathrm{IDF}(t)\cdot\frac{f(t,D)\,(k_1+1)}{f(t,D) + k_1\left(1 - b + b\cdot\frac{|D|}{\mathrm{avgdl}}\right)},\qquad \mathrm{IDF}(t) = \ln\!\left(\frac{N-\mathrm{df}_t+0.5}{\mathrm{df}_t+0.5}+1\right)$$

where f(t, D) is the frequency of term t in document D, |D| is its length in tokens, and avgdl is the average document length (this matches the code above, with k1 = 1.5 and b = 0.75). Here, "jumping" does not appear in any document (so contributes 0), while "fox" and "the" do.

Why those scores?

Doc 3

“The quick brown fox jumps over the lazy dog.”

  • Contains “fox” (once) and “the” (twice).
  • Yields the highest score (≈ 2.46) because “fox” is relatively rare (high IDF) and “the” matches twice.

Doc 1

“The cat sat on the mat.”

  • Contains “the” twice but no “fox”.
  • Moderate score (≈ 1.44) from the two “the” matches.

Doc 2

“Dogs and cats living together.”

  • Contains neither “fox” nor “the”.
  • Score = 0.00, since none of the query tokens appear.

Ranking

BM25 sorts documents by descending score, so the retriever returns:

  1. Doc 3 (score ≈ 2.46)
  2. Doc 1 (score ≈ 1.44)
  3. Doc 2 (score 0.00)
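
To make these numbers concrete, a small helper (hypothetical, but reusing the tokenized_docs, idf and avgdl objects defined above; per-term values in the comments are approximate) can break one document's score down per query term:

# Hypothetical helper: break one document's BM25 score down per query term
def explain_bm25(doc_index, query_tokens, k1=1.5, b=0.75):
    doc = tokenized_docs[doc_index]
    freqs = Counter(doc)
    total = 0.0
    for term in query_tokens:
        freq = freqs.get(term, 0)
        contrib = 0.0
        if freq:
            contrib = idf[term] * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
        total += contrib
        print(f"{term!r}: tf={freq}, contribution={contrib:.2f}")
    print(f"total={total:.2f}")

explain_bm25(2, ["the", "jumping", "fox"])
# 'the': tf=2, contribution=1.24
# 'jumping': tf=0, contribution=0.00
# 'fox': tf=1, contribution=1.22
# total=2.46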

This demonstrates how BM25 ranks documents by matching query terms weighted by their importance (IDF) and normalised for document length.
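
The same retrieval can also be done with the rank_bm25 package instead of the hand-rolled scorer. Below is a minimal sketch (assumes pip install rank_bm25); note that BM25Okapi uses a slightly different IDF formulation, so absolute scores may differ a little from the pure-Python version, although the ranking for this query should be the same.

# Sketch using the rank_bm25 library (assumption: rank_bm25 is installed)
from rank_bm25 import BM25Okapi

tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)                  # library defaults: k1=1.5, b=0.75

query_tokens = "The jumping fox".lower().split()
scores = bm25.get_scores(query_tokens)            # one BM25 score per document
for i in sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:3]:
    print(f"{scores[i]:.2f} → {docs[i]}")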

10.3 Semantic Search with Sentence Embeddings

# 1. Install sentence-transformers into this notebook’s environment
# (run once per kernel)
%pip install --quiet sentence-transformers

# 2. Imports
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # only errors
from sentence_transformers import SentenceTransformer, util

# 3. Sample corpus
docs = [
    "The cat sat on the mat.",
    "Dogs and cats living together.",
    "The quick brown fox jumps over the lazy dog.",
    "I love my pet cat.",
    "My neighbour has three dogs.",
    "Foxes are wild animals."
]

# 4. Load a small embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# 5. Encode the corpus into dense vectors
corpus_emb = model.encode(docs, convert_to_tensor=True)

# 6. Semantic retrieval function
def retrieve_semantic(query, top_k=3):
    # Encode the query
    q_emb = model.encode(query, convert_to_tensor=True)
    # Search the corpus by embedding similarity
    hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
    return [(docs[hit['corpus_id']], hit['score']) for hit in hits]

# 7. Demo
query = "What are the topics of these documents?"
print(f"Query: {query}\n")
for doc, score in retrieve_semantic(query, top_k=3):
    print(f"{score:.3f} → {doc}")

Output:

0.120 → Dogs and cats living together.
0.020 → Foxes are wild animals.
0.014 → The quick brown fox jumps over the lazy dog.
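
Under the hood, util.semantic_search ranks documents by cosine similarity between the query embedding and the corpus embeddings. The same top-3 result can be reproduced manually; a sketch assuming a sentence-transformers version that provides util.cos_sim:

import torch

q_emb = model.encode("What are the topics of these documents?", convert_to_tensor=True)
sims = util.cos_sim(q_emb, corpus_emb)[0]     # cosine similarity to every document
top = torch.topk(sims, k=3)                   # highest-scoring documents
for score, idx in zip(top.values, top.indices):
    print(f"{score.item():.3f} → {docs[int(idx)]}")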

10.4 Evaluation: Precision@K & Mean Average Precision

# tfidf_map_evaluation.ipynb

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# --- 1. Sample corpus (same NLP corpus as in 10.1) ---
docs = [
    "Natural language processing enables machines to understand text.",
    "Text summarization creates concise versions of documents.",
    "Machine translation converts text between languages.",
    "Question answering systems find answers in passages.",
    "Topic modeling uncovers themes in a collection of documents."
]

# --- 2. Build TF–IDF matrix once ---
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)   # shape = (n_docs, n_terms)

# --- 3. Define a TF–IDF retriever ---
def retrieve_tfidf(query: str, top_k: int = 5):
    """
    Return the top_k documents (and their scores) ranked by cosine similarity
    between the TF–IDF vector of the query and each document.
    """
    q_vec  = vectorizer.transform([query])       # shape = (1, n_terms)
    scores = cosine_similarity(q_vec, X).flatten()  # shape = (n_docs,)
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [(docs[i], scores[i]) for i in top_idx]

# --- 4. Define evaluation metrics ---
qrels = {
    "summarize documents": [1],    # expect docs[1] to be relevant
    "convert languages": [2]       # expect docs[2] to be relevant
}

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    retrieved_docs = [doc for doc, _ in retrieved[:k]]
    hits = sum(1 for doc in retrieved_docs if docs.index(doc) in relevant)
    return hits / k

def average_precision(retrieved, relevant):
    """Average of the precision values at the ranks where relevant documents appear."""
    hits = 0
    score = 0.0
    for i, (doc, _) in enumerate(retrieved, start=1):
        if docs.index(doc) in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

# --- 5. Compute Mean Average Precision (MAP) ---
map_score = 0.0
for query, rel in qrels.items():
    retrieved = retrieve_tfidf(query, top_k=5)
    ap = average_precision(retrieved, rel)
    print(f"AP for '{query}': {ap:.3f}")
    map_score += ap

map_score /= len(qrels)
print(f"\nMean Average Precision (MAP): {map_score:.3f}")

Output:

AP for 'summarize documents': 1.000
AP for 'convert languages': 1.000

Mean Average Precision (MAP): 1.000
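
The precision_at_k helper defined above is not exercised in the demo. A minimal usage sketch, reusing qrels and retrieve_tfidf from the code above (with a single relevant document per query, P@3 can be at most 1/3):

for query, rel in qrels.items():
    retrieved = retrieve_tfidf(query, top_k=5)
    p3 = precision_at_k(retrieved, rel, k=3)
    # e.g. 0.333 when the one relevant document appears in the top 3
    print(f"P@3 for '{query}': {p3:.3f}")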