Module 3.3: Bag-of-Words & Count Vectors

The Bag-of-Words (BoW) model represents each document as a vector of term counts over a fixed vocabulary, ignoring word order. Stacking these count vectors row-wise gives a Document-Term Matrix (DTM) of shape (n_docs, n_terms).
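
As a minimal worked example (the two short documents and tiny vocabulary here are made up for illustration, separate from the corpus used below), counting each vocabulary term per document yields one row of the DTM:

# Hypothetical two-document corpus over the vocabulary ["cat", "sat"]
toy_docs = ["cat sat", "cat cat"]
toy_vocab = ["cat", "sat"]

# Count each vocabulary term in each document -> a 2 x 2 DTM
dtm = [[doc.split().count(term) for term in toy_vocab] for doc in toy_docs]
print(dtm)  # [[1, 1], [2, 0]]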


1. Using scikit-learn’s CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog"
]

# Initialize and fit
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Convert to DataFrame for readability
df = pd.DataFrame(
    X.toarray(),
    index=[f"doc{i+1}" for i in range(len(corpus))],
    columns=vectorizer.get_feature_names_out()
)

print(df)

Output:

      cat  dog  log  mat  on  sat  saw  the
doc1    1    0    0    1   1    1    0    2
doc2    0    1    1    0   1    1    0    2
doc3    1    1    0    0   0    0    1    2
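
Once the vectorizer is fitted, it can also encode documents that were not in the corpus; tokens outside the learned vocabulary are simply dropped. A minimal sketch (the new sentence is invented for illustration):

# Encode an unseen document with the already-fitted vocabulary
new_doc = ["the dog ran on the mat"]
print(vectorizer.transform(new_doc).toarray())
# [[0 1 0 1 1 0 0 2]]  -> dog=1, mat=1, on=1, the=2; "ran" is out-of-vocabulary and ignored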

2. Manual Construction of a Count Matrix

from collections import Counter

# Build vocabulary
vocab = sorted({word for doc in corpus for word in doc.split()})

# Function to vectorize one document
def doc_to_count_vec(doc: str):
    cnt = Counter(doc.split())
    return [cnt[word] for word in vocab]

# Build full matrix
count_matrix = [doc_to_count_vec(doc) for doc in corpus]

# Display with pandas
import pandas as pd
df_manual = pd.DataFrame(
    count_matrix,
    index=[f"doc{i+1}" for i in range(len(corpus))],
    columns=vocab
)
print(df_manual)

Output:

      cat  dog  log  mat  on  sat  saw  the
doc1    1    0    0    1   1    1    0    2
doc2    0    1    1    0   1    1    0    2
doc3    1    1    0    0   0    0    1    2
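
As a quick sanity check (assuming X and vectorizer from Section 1 are still in scope), the hand-built matrix matches the CountVectorizer result on this toy corpus, where whitespace splitting and scikit-learn's default tokenization happen to agree:

import numpy as np

# Hand-built counts vs. CountVectorizer counts (same sorted vocabulary here)
print(np.array_equal(np.array(count_matrix), X.toarray()))  # True
print(list(vectorizer.get_feature_names_out()) == vocab)    # True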

3. Notes

  • Vocabulary is typically built on the training set only.
  • Stop-word filtering, n-grams, and minimum/maximum document-frequency thresholds can be applied via CountVectorizer parameters (see the sketch below).
  • The BoW model ignores word order, syntax, and polysemy, but is simple and fast.
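
The CountVectorizer options mentioned above can be combined directly in the constructor; a minimal sketch (the parameter values are illustrative, not recommendations):

# Stop-word removal, unigrams + bigrams, and document-frequency thresholds
vec = CountVectorizer(
    stop_words="english",   # drop common English function words
    ngram_range=(1, 2),     # extract unigrams and bigrams
    min_df=1,               # keep terms appearing in at least 1 document
    max_df=0.95             # drop terms appearing in more than 95% of documents
)
X_filtered = vec.fit_transform(corpus)
print(vec.get_feature_names_out())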

Continue to 3.4 TF–IDF Representation