[function] stem_tokens() - P3chys/textmining GitHub Wiki

Function: `stem_tokens()`

Purpose

Applies stemming to reduce words to their root form, normalizing different word variations for consistent text analysis.

Syntax

stem_tokens(tokens)

Parameters

Parameter	Type	Default	Description
`tokens`	list	Required	List of word tokens to be stemmed

Returns

Type: list
Content: List of stemmed tokens with words reduced to their root forms

Required Import

from nltk.stem import PorterStemmer

Stemming Algorithm

Uses Porter Stemmer: One of the most widely used stemming algorithms for English
Suffix removal: Strips common endings such as -s, -es, -ed, -ing
Rule-based: Applies a fixed sequence of suffix-rewriting rules, with no dictionary lookup
Language: Designed for English text
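
This page does not show the function body. A minimal sketch of how `stem_tokens()` might be written, assuming it simply applies NLTK's `PorterStemmer.stem()` to each token (the actual implementation in the repository may differ):

```python
from nltk.stem import PorterStemmer

# One shared stemmer instance, reused across calls.
_stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Return a new list with each token replaced by its Porter stem.

    Assumed behavior for illustration, not the repository's confirmed code.
    """
    return [_stemmer.stem(token) for token in tokens]


print(stem_tokens(["running", "runs", "products"]))
# ['run', 'run', 'product']
```

Returning a new list keeps the input untouched, which matters when the same token list is reused by other pipeline steps.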

How Stemming Works

The Porter Stemmer removes common suffixes to find word roots:

Plural → Singular: "products" → "product"
Tense variations: "running", "runs" → "run" (the irregular form "ran" is left unchanged)
Comparatives: "better" and "best" are left unchanged; the Porter rules do not reduce them to a common root
Gerunds: "shipping" → "ship"
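
To make the rule-based idea concrete, here is a small sketch of a single step of the original Porter algorithm (step 1a, which handles plural endings). The function name `porter_step_1a` is made up for this illustration; the full algorithm chains several such steps, and NLTK's default mode includes some deviations from the original rules, so its output can differ for some words.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter algorithm (plural endings).

    Rules are tried longest suffix first; only the first match is applied.
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # products -> product
    return word


for w in ["caresses", "ponies", "caress", "products"]:
    print(w, "->", porter_step_1a(w))
```

Checking "ss" before "s" is what prevents words like "caress" from losing their final letter.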

Usage Examples

# Basic stemming
tokens = ["running", "runner", "easily", "fairly", "products"]
stemmed = stem_tokens(tokens)
# Result: ["run", "runner", "easili", "fairli", "product"]

# Sentiment words stemming
sentiment_tokens = ["amazing", "terrible", "disappointing", "recommended"]
stemmed = stem_tokens(sentiment_tokens)
# Result: ["amaz", "terribl", "disappoint", "recommend"]

# Apply to a pandas DataFrame (assumes df has a 'filtered_tokens' column of token lists)
df['stemmed_tokens'] = df['filtered_tokens'].apply(stem_tokens)

Advantages

Normalization: Groups related words together
Reduced vocabulary: Smaller feature space for ML models
Performance: Fast processing with simple rules
No dictionary required: Rules run without any word list or lexicon lookup (the rules themselves are English-specific)
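
The vocabulary reduction can be checked directly. The sketch below uses NLTK's `PorterStemmer` on a small made-up token list and compares the number of distinct tokens before and after stemming:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative tokens only, not taken from the project's data.
tokens = ["ship", "ships", "shipped", "shipping", "product", "products"]
stems = [stemmer.stem(t) for t in tokens]

print(len(set(tokens)), "distinct tokens")  # 6 distinct tokens
print(len(set(stems)), "distinct stems")    # 2 distinct stems
print(sorted(set(stems)))                   # ['product', 'ship']
```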

Limitations

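Over-stemming: May remove too much and merge unrelated words (e.g., "university" and "universe" both → "univers")
Under-stemming: May not catch all variations (e.g., "ran" is not reduced to "run")
Loss of readability: Stems are often not dictionary words (e.g., "easili", "amaz")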
Context insensitive: Same suffix removal regardless of word meaning
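
Both failure modes are easy to observe by grouping words under their stems. This sketch uses NLTK's `PorterStemmer` with a hand-picked word list chosen for illustration:

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked words: one over-stemming pair, one under-stemming pair.
words = ["university", "universe", "run", "ran"]

groups = defaultdict(list)
for w in words:
    groups[stemmer.stem(w)].append(w)

for stem, members in groups.items():
    print(stem, "<-", members)
# univers <- ['university', 'universe']   (over-stemming: unrelated words merged)
# run <- ['run']
# ran <- ['ran']                          (under-stemming: related forms kept apart)
```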

Common Stemming Results

Original	Stemmed
products, product	product
running, runner, runs	run, runner, run
better, best	better, best
amazing, amazingly	amaz, amazingli
disappointed, disappointing	disappoint, disappoint

Prerequisites

Requires the NLTK library (for example, pip install nltk); PorterStemmer needs no additional NLTK data downloads
Input should be filtered tokens (output from remove_stopwords())

Pipeline Position

Typically used after:

Text cleaning (clean_text())
Tokenization (tokenize_text())
Stopword removal (remove_stopwords())

Alternative: Lemmatization

Consider using lemmatization instead for:

Better readability of processed text
More accurate word normalization
Preservation of word meaning