Module 4.1: Feature Engineering – n-grams & POS Tags
Feature engineering transforms raw text into numerical features for classical ML models. This section covers:
- n-grams: contiguous sequences of tokens
- POS tags: part-of-speech annotations as features
1. n-grams
An n-gram is a contiguous sequence of n tokens.
- Unigrams (n=1): single words
- Bigrams (n=2): pairs of consecutive words
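- Trigrams (n=3): triples of consecutive words, and so on for larger n
n-grams capture local word order and collocations (e.g., "sat on") that single-word counts lose. The number of distinct n-grams grows quickly with n, so higher-order features are sparser.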
1.1. scikit-learn CountVectorizer with n-grams
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]
# Create unigram + bigram features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
df = pd.DataFrame(
    X.toarray(),
    index=[f"doc{i+1}" for i in range(len(corpus))],
    columns=vectorizer.get_feature_names_out(),
)
print(df)
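Output: a 3 × 18 count table with one row per document (doc1 to doc3) and one column per feature (8 unigrams and 10 bigrams, sorted alphabetically).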
1.2. Manual n-gram Extraction
def extract_ngrams(tokens, n):
    """Return list of n-grams (as tuples) from a list of tokens."""
    if len(tokens) < n:
        return []
    seqs = []
    # Slide a window of width n across the token list
    for i in range(len(tokens) - n + 1):
        seqs.append(tuple(tokens[i:i+n]))
    return seqs
tokens = "the cat sat on the mat".split()
print("Bigrams:", extract_ngrams(tokens, 2))
print("Trigrams:", extract_ngrams(tokens, 3))
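Output:
Bigrams: [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
Trigrams: [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat')]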
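To connect the manual extractor with the count matrix from section 1.1, the sketch below counts unigrams and bigrams per document and lines them up against a shared, sorted vocabulary. It is a rough stand-in for what CountVectorizer does, not its actual implementation: it assumes plain whitespace tokenisation, and the helper name count_ngrams and the n_min/n_max parameters are made up for this example.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return all contiguous n-grams in `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(doc, n_min=1, n_max=2):
    """Count every n-gram with n_min <= n <= n_max in one document.

    Assumption: lowercasing plus whitespace splitting is enough here;
    a real tokenizer would also handle punctuation.
    """
    tokens = doc.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(" ".join(gram) for gram in extract_ngrams(tokens, n))
    return counts

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

# One Counter per document, then a shared sorted vocabulary
doc_counts = [count_ngrams(doc) for doc in corpus]
vocabulary = sorted(set().union(*doc_counts))

# Dense document-term matrix: rows are documents, columns are n-grams
matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]

print(len(vocabulary), "features")
for term, column in zip(vocabulary, zip(*matrix)):
    print(f"{term:10s} {column}")
```

On this toy corpus the vocabulary has 18 entries (8 unigrams and 10 bigrams). A dense list of lists is fine at this size; for real corpora a sparse representation, such as the one CountVectorizer returns, is the more practical choice.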
2. Part-of-Speech (POS) Tags
POS tags label each token with its grammatical role (noun, verb, adjective, etc.). They can serve as features (e.g., POS unigrams, POS n-grams, or combined word+POS).
2.1. NLTK POS Tagging
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
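# Download once:
# (recent NLTK releases may instead ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'; follow the LookupError message if one appears)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')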
text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
2.2. spaCy POS Tagging
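# 1. Install spaCy and the small English model (notebook cell syntax)
%pip install --quiet spacy
!python -m spacy download en_core_web_sm --quiet
# 2. Import and load
import spacy
nlp = spacy.load('en_core_web_sm')
# 3. Your example
doc = nlp("The quick brown fox jumps over the lazy dog.")
# token.pos_ is the coarse Universal POS tag (e.g., NOUN);
# token.tag_ holds the fine-grained tag (e.g., NN)
print([(token.text, token.pos_) for token in doc])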
3. Creating POS-Based Features
- POS Unigrams: count each tag (e.g., DT, NN, VBZ) per document.
- POS n-grams: extract tag sequences (e.g., DT-JJ, JJ-NN).
- Combined Word+POS: join token and tag (e.g., fox_NN).
from collections import Counter
# Example: POS unigram counts
# (reuses `tagged` from section 2.1 and `extract_ngrams` from section 1.2)
tags = [tag for _, tag in tagged]
pos_counts = Counter(tags)
print(pos_counts)
# Example: POS bigrams
pos_bigrams = extract_ngrams(tags, 2)
print(Counter(pos_bigrams))
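The combined word+POS feature from the list above has no code yet, so here is a minimal sketch of it, alongside POS bigrams joined with a hyphen as in the DT-JJ example. The tagged sentence is written by hand in Penn Treebank style so the snippet runs without a tagger; a real tagger may label some tokens differently. The function names word_pos_features and pos_ngram_features are illustrative, not part of NLTK or spaCy.

```python
from collections import Counter

# Hand-written (token, tag) pairs for illustration only; this is an
# assumption about what a tagger could return, not actual tagger output.
tagged_example = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), (".", "."),
]

def word_pos_features(tagged_tokens, lowercase=True):
    """Join each token with its tag, e.g. ('fox', 'NN') -> 'fox_NN'."""
    feats = []
    for word, tag in tagged_tokens:
        word = word.lower() if lowercase else word
        feats.append(f"{word}_{tag}")
    return feats

def pos_ngram_features(tagged_tokens, n=2):
    """Join n consecutive tags with '-', e.g. 'DT-JJ'."""
    tags = [tag for _, tag in tagged_tokens]
    return ["-".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

word_pos_counts = Counter(word_pos_features(tagged_example))
pos_bigram_counts = Counter(pos_ngram_features(tagged_example, n=2))

print(word_pos_counts.most_common(3))
print(pos_bigram_counts.most_common(3))
```

Because both helpers return plain strings, each document can also be turned into a space-joined string of such features and passed to a whitespace-splitting vectorizer. Lowercasing before joining merges "The_DT" and "the_DT" into one feature; whether that is desirable depends on the task.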
Continue to Module 4.2: Naïve Bayes Classifier