[function] stem_tokens() - P3chys/textmining GitHub Wiki
Function: stem_tokens()
Purpose
Applies stemming to reduce words to their root form, normalizing different word variations for consistent text analysis.
Syntax
stem_tokens(tokens)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| tokens | list | Required | List of word tokens to be stemmed |
Returns
- Type: list
- Content: List of stemmed tokens with words reduced to their root forms
Required Import
from nltk.stem import PorterStemmer
Stemming Algorithm
- Uses Porter Stemmer: One of the most widely used stemming algorithms for English
- Suffix removal: Strips common endings like -s, -es, -ed, -ing (the superlative -est is not handled)
- Rule-based: Applies predetermined linguistic rules
- Language: Optimized for English text
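The wiki does not show the function body, so the following is only a minimal sketch of how stem_tokens() might be written, assuming it simply applies NLTK's PorterStemmer.stem() to each token. The module-level `_stemmer` name is an assumption made for illustration.

```python
from nltk.stem import PorterStemmer

# Assumption: a single shared stemmer instance; the actual function
# in the repository may construct its own.
_stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Return a new list with each token replaced by its Porter stem."""
    return [_stemmer.stem(token) for token in tokens]


print(stem_tokens(["products", "shipping", "recommended"]))
# ['product', 'ship', 'recommend']
```

Note that NLTK's PorterStemmer.stem() lowercases its input by default, so mixed-case tokens come back in lowercase.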
How Stemming Works
The Porter Stemmer removes common suffixes to find word roots:
- Plural → Singular: "products" → "product"
- Tense variations: "running", "runs" → "run" (irregular forms such as "ran" are left unchanged)
- Comparatives: "better", "best" → "better", "best" (irregular comparatives are not normalized)
- Gerunds: "shipping" → "ship"
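To make the rule-based idea concrete, here is a toy sketch of just the first Porter step (step 1a in the original algorithm description), which handles plural endings. It is an illustration only, not the full algorithm, and NLTK's implementation adds a few extensions of its own. The function name `strip_plural` is hypothetical.

```python
def strip_plural(word):
    """Toy version of Porter step 1a: rewrite plural endings only."""
    if word.endswith("sses"):
        return word[:-2]   # "caresses" -> "caress"
    if word.endswith("ies"):
        return word[:-2]   # "ponies" -> "poni"
    if word.endswith("ss"):
        return word        # "caress" stays "caress"
    if word.endswith("s"):
        return word[:-1]   # "products" -> "product"
    return word


for w in ["caresses", "ponies", "caress", "products", "ship"]:
    print(w, "->", strip_plural(w))
```

The rules are tried in order and the first match wins, which is why "caress" is protected by the "ss" rule before the plain "s" rule can fire.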
Usage Examples
# Basic stemming
tokens = ["running", "runner", "easily", "fairly", "products"]
stemmed = stem_tokens(tokens)
# Result: ["run", "runner", "easili", "fairli", "product"]
# Sentiment words stemming
sentiment_tokens = ["amazing", "terrible", "disappointing", "recommended"]
stemmed = stem_tokens(sentiment_tokens)
# Result: ["amaz", "terribl", "disappoint", "recommend"]
# Apply to DataFrame
df['stemmed_tokens'] = df['filtered_tokens'].apply(stem_tokens)
Advantages
- Normalization: Groups related words together
- Reduced vocabulary: Smaller feature space for ML models
- Performance: Fast processing with simple rules
- No dictionary required: Rule-based, so it needs no word list or lexicon lookup (the rules themselves are English-specific)
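The vocabulary reduction can be checked on a small sample. This is a sketch using NLTK's PorterStemmer directly; the token list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative tokens: three word families in seven surface forms.
tokens = ["product", "products", "ship", "shipping", "shipped",
          "recommend", "recommended"]
stems = [stemmer.stem(t) for t in tokens]

vocab_before = set(tokens)
vocab_after = set(stems)
print(len(vocab_before), "->", len(vocab_after))  # 7 -> 3
```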
Limitations
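- Over-stemming: May merge unrelated words (e.g., "university" and "universe" both become "univers")
- Under-stemming: May miss related forms, especially irregular ones (e.g., "ran" is not reduced to "run")
- Loss of readability: Stems are often not real words (e.g., "amaz", "terribl")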
- Context insensitive: Same suffix removal regardless of word meaning
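The first two limitations are easy to reproduce. The sketch below uses NLTK's PorterStemmer directly; the variable names `over` and `under` are chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: two distinct words collapse to the same stem.
over = {w: stemmer.stem(w) for w in ["university", "universe"]}
print(over)   # {'university': 'univers', 'universe': 'univers'}

# Under-stemming: the irregular past tense is not linked to its base form.
under = {w: stemmer.stem(w) for w in ["run", "running", "ran"]}
print(under)  # {'run': 'run', 'running': 'run', 'ran': 'ran'}
```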
Common Stemming Results
| Original | Stemmed |
|---|---|
| products, product | product |
| running, runner, runs | run, runner, run |
| better, best | better, best |
| amazing, amazingly | amaz, amazingli |
| disappointed, disappointing | disappoint, disappoint |
Prerequisites
- Requires NLTK library installation
- Input should be filtered tokens (output from remove_stopwords())
Pipeline Position
Typically used after:
- Text cleaning (clean_text())
- Tokenization (tokenize_text())
- Stopword removal (remove_stopwords())
Alternative: Lemmatization
Consider using lemmatization instead for:
- Better readability of processed text
- More accurate word normalization
- Preservation of word meaning