# Module 3.3: Bag-of-Words & Count Vectors
Bag-of-Words (BoW) represents each document as a vector of term counts, ignoring word order. The resulting Document-Term Matrix (DTM) has shape `(n_docs, n_terms)`.
## 1. Using scikit-learn's `CountVectorizer`

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

# Initialize and fit
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Convert to DataFrame for readability
df = pd.DataFrame(
    X.toarray(),
    index=[f"doc{i+1}" for i in range(len(corpus))],
    columns=vectorizer.get_feature_names_out(),
)
print(df)
```
Output:

```text
      cat  dog  log  mat  on  sat  saw  the
doc1    1    0    0    1   1    1    0    2
doc2    0    1    1    0   1    1    0    2
doc3    1    1    0    0   0    0    1    2
```
## 2. Manual Construction of a Count Matrix

```python
from collections import Counter
import pandas as pd

# Build vocabulary from the whole corpus
vocab = sorted({word for doc in corpus for word in doc.split()})

# Vectorize one document: count each vocabulary word
def doc_to_count_vec(doc: str):
    cnt = Counter(doc.split())
    return [cnt[word] for word in vocab]

# Build the full document-term matrix
count_matrix = [doc_to_count_vec(doc) for doc in corpus]

# Display with pandas
df_manual = pd.DataFrame(
    count_matrix,
    index=[f"doc{i+1}" for i in range(len(corpus))],
    columns=vocab,
)
print(df_manual)
```
Output:

```text
      cat  dog  log  mat  on  sat  saw  the
doc1    1    0    0    1   1    1    0    2
doc2    0    1    1    0   1    1    0    2
doc3    1    1    0    0   0    0    1    2
```
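As a sanity check (an addition, not part of the handbook's example), the manual matrix can be compared against scikit-learn's. For this corpus the default tokenizer and the sorted whitespace vocabulary happen to coincide, so the matrices match exactly:

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

# Manual count matrix, as in the section above
vocab = sorted({word for doc in corpus for word in doc.split()})
count_matrix = [[Counter(doc.split())[w] for w in vocab] for doc in corpus]

# scikit-learn's matrix
X = CountVectorizer().fit_transform(corpus).toarray()

# The two agree for this corpus (identical vocabularies and counts)
print((X == count_matrix).all())  # True
```

In general they can differ: `CountVectorizer` lowercases and, by default, drops single-character tokens, while the manual split keeps every whitespace-separated token.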
## 3. Notes

- The vocabulary is typically built on the training set only; words unseen during fitting are ignored at transform time.
- Stop-word filtering, n-grams, and minimum/maximum document-frequency thresholds can be applied via `CountVectorizer` parameters (`stop_words`, `ngram_range`, `min_df`, `max_df`).
- The BoW model ignores word order, syntax, and polysemy, but it is simple and fast.
Continue to 3.4 TF–IDF Representation