Module 4.1: Feature Engineering – n-grams & POS Tags

Feature engineering transforms raw text into numerical features for classical ML models. This section covers:

  1. n-grams: contiguous sequences of tokens
  2. POS tags: part-of-speech annotations as features

1. n-grams

An n-gram is a contiguous sequence of n tokens.

  • Unigrams (n=1): single words
  • Bigrams (n=2): pairs of consecutive words
  • Trigrams (n=3): triples of consecutive words, and so on for larger n

n-grams with n ≥ 2 capture local word order and collocations (e.g., the bigram "sat on") that unigram counts alone cannot represent.
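
As a small illustration of why local context matters, the sketch below compares two sentences that contain the same words in a different order. The sentences and the helper name `ngram_set` are made up for this example.

```python
def ngram_set(text, n):
    """Return the set of n-grams (as tuples) in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

a = "dog bites man"
b = "man bites dog"

# Unigrams ignore word order, so the two sentences look identical.
print(ngram_set(a, 1) == ngram_set(b, 1))  # True

# Bigrams keep adjacent-word order, so the sentences become distinguishable.
print(ngram_set(a, 2) == ngram_set(b, 2))  # False
print(sorted(ngram_set(a, 2)))  # [('bites', 'man'), ('dog', 'bites')]
```

A model that sees only unigram counts cannot tell these two sentences apart, while one that also sees bigram counts can.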

1.1. scikit-learn CountVectorizer with n-grams

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog"
]

# Create unigram + bigram features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

df = pd.DataFrame(
    X.toarray(),
    index=[f"doc{i+1}" for i in range(len(corpus))],
    columns=vectorizer.get_feature_names_out()
)

print(df)

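Output (shown as an image in the original): a 3 × 18 count table with one row per document (doc1 to doc3) and one column per unigram or bigram in the vocabulary, sorted alphabetically.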

1.2. Manual n-gram Extraction

def extract_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a list of tokens."""
    # range() is empty when len(tokens) < n, so short inputs yield [].
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print("Bigrams:", extract_ngrams(tokens, 2))
print("Trigrams:", extract_ngrams(tokens, 3))

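Output:

Bigrams: [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
Trigrams: [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat')]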
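
To connect the manual extraction with the count table from Section 1.1, the sketch below builds a small bag-of-n-grams matrix by hand. It assumes plain whitespace tokenization, and the helper names `ngram_counts` and `build_matrix` are hypothetical. It illustrates the idea and is not a reproduction of how `CountVectorizer` is implemented.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, n_min=1, n_max=2):
    """Count all n-grams with n_min <= n <= n_max in one document."""
    tokens = text.split()  # assumption: whitespace tokenization
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(" ".join(gram) for gram in extract_ngrams(tokens, n))
    return counts

def build_matrix(corpus, n_min=1, n_max=2):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in corpus[i]."""
    doc_counts = [ngram_counts(doc, n_min, n_max) for doc in corpus]
    vocabulary = sorted(set().union(*doc_counts))
    rows = [[counts[term] for term in vocabulary] for counts in doc_counts]
    return vocabulary, rows

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

vocabulary, rows = build_matrix(corpus)
print(len(vocabulary))                    # 18 distinct unigrams and bigrams
print(rows[0][vocabulary.index("the")])   # 2: "the" occurs twice in doc1
```

Each row is a fixed-length count vector over the shared vocabulary, which is the form classical models expect as input.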

2. Part-of-Speech (POS) Tags

POS tags label each token with its grammatical role (noun, verb, adjective, etc.). They can serve as features (e.g., POS unigrams, POS n-grams, or combined word+POS).

2.1. NLTK POS Tagging

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

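# Download the tokenizer and tagger models once.
# Note: recent NLTK releases (3.9+) may instead ask for
# 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')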

text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)

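Output (shown as an image in the original): a list of (token, tag) pairs, one per token including the final period, with Penn Treebank tags such as DT, JJ, NN, and VBZ.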

2.2. spaCy POS Tagging

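# 1. Install spaCy and the small English model
#    (%pip and ! are Jupyter/IPython syntax; run these in a notebook cell)
%pip install --quiet spacy
!python -m spacy download en_core_web_sm --quiet

# 2. Import spaCy and load the model
import spacy
nlp = spacy.load('en_core_web_sm')

# 3. Tag an example sentence; token.pos_ is the coarse-grained tag,
#    token.tag_ would give the fine-grained (Penn Treebank style) tag
doc = nlp("The quick brown fox jumps over the lazy dog.")
print([(token.text, token.pos_) for token in doc])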

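Output (shown as an image in the original): a list of (token, POS) pairs using spaCy's coarse-grained Universal POS labels such as DET, ADJ, NOUN, and VERB.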

3. Creating POS-Based Features

  • POS Unigrams: count each tag (e.g., DT, NN, VBZ) per document.
  • POS n-grams: extract tag sequences (e.g., DT-JJ, JJ-NN).
  • Combined Word+POS: join token and tag (e.g., fox_NN).

from collections import Counter

# Example: POS unigram counts
# (reuses `tagged` from Section 2.1 and `extract_ngrams` from Section 1.2)
tags = [tag for _, tag in tagged]
pos_counts = Counter(tags)
print(pos_counts)

# Example: POS bigrams
pos_bigrams = extract_ngrams(tags, 2)
print(Counter(pos_bigrams))

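Output (shown as an image in the original): two Counter objects, the first mapping each tag to its frequency and the second mapping each pair of adjacent tags to its frequency.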
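
The three feature types listed in this section can also be collected into a single feature dictionary per document. The sketch below is a minimal illustration under stated assumptions: the `tagged` list is hard-coded so the example runs without a tagger (real tagger output may differ), and the helper name `pos_features` and the `pos=`, `pos2=`, and `word=` key prefixes are conventions invented here, not part of NLTK or spaCy.

```python
from collections import Counter

# Illustrative (token, tag) pairs in Penn Treebank style, hard-coded for the
# example. Real tagger output may differ.
tagged = [("The", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("sleeps", "VBZ"), (".", ".")]

def pos_features(tagged_tokens):
    """Return a Counter of POS unigram, POS bigram, and word+POS features."""
    tags = [tag for _, tag in tagged_tokens]
    features = Counter()
    # POS unigrams, e.g. "pos=NN"
    features.update(f"pos={tag}" for tag in tags)
    # POS bigrams joined with a hyphen, e.g. "pos2=DT-JJ"
    features.update(f"pos2={a}-{b}" for a, b in zip(tags, tags[1:]))
    # Combined word+POS, e.g. "word=dog_NN" (lowercased to reduce sparsity)
    features.update(f"word={tok.lower()}_{tag}" for tok, tag in tagged_tokens)
    return features

features = pos_features(tagged)
print(features["pos=NN"])       # 1
print(features["pos2=DT-JJ"])   # 1
print(features["word=dog_NN"])  # 1
```

A list of such dictionaries, one per document, can then be converted to a feature matrix, for example with scikit-learn's `DictVectorizer`.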

Continue to Module 4.2: Naïve Bayes Classifier