[function] stem_tokens() - P3chys/textmining GitHub Wiki

Function: stem_tokens()

Purpose

Applies stemming to reduce words to their root form, normalizing different word variations for consistent text analysis.

Syntax

stem_tokens(tokens)

Parameters

| Parameter | Type | Default  | Description                       |
|-----------|------|----------|-----------------------------------|
| tokens    | list | Required | List of word tokens to be stemmed |

Returns

  • Type: list
  • Content: List of stemmed tokens with words reduced to their root forms

Required Import

from nltk.stem import PorterStemmer
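
This page does not show the function body. The following is a minimal sketch of how such a function could be written with NLTK's PorterStemmer. The module-level `_stemmer` name and the type hints are assumptions made for illustration, and the repository's actual implementation may differ.

```python
from nltk.stem import PorterStemmer

# Assumption: one module-level stemmer instance is reused across calls.
_stemmer = PorterStemmer()


def stem_tokens(tokens: list[str]) -> list[str]:
    """Return a new list with each token replaced by its Porter stem."""
    return [_stemmer.stem(token) for token in tokens]


stemmed = stem_tokens(["running", "products", "shipping"])
print(stemmed)  # ['run', 'product', 'ship']
```

Building the stemmer once, rather than inside the function, avoids re-creating it for every row when the function is applied across a DataFrame column.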

Stemming Algorithm

  • Uses the Porter stemmer: One of the most widely used stemming algorithms for English
  • Suffix removal: Strips common endings such as -s, -es, -ing, and -ed
  • Rule-based: Applies predetermined linguistic rules
  • Language: Optimized for English text
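
To make the rule-based idea concrete, the sketch below implements only step 1a of Porter's original algorithm (plural handling) in plain Python. The function name `porter_step_1a` is hypothetical and chosen for this example. It is a fragment for illustration, not a replacement for PorterStemmer.

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of the original Porter algorithm (plural suffixes)."""
    # Rules are tried in order; the first matching suffix wins.
    rules = [
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies   -> poni
        ("ss", "ss"),    # caress   -> caress (unchanged)
        ("s", ""),       # cats     -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

NLTK's PorterStemmer applies several further steps and, by default, its own extensions to the original rules, so its output can differ from this fragment.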

How Stemming Works

The Porter Stemmer removes common suffixes to find word roots:

  • Plural → Singular: "products" → "product"
  • Tense variations: "running", "runs" → "run" (the irregular form "ran" is left unchanged)
  • Comparatives: "better" and "best" are left unchanged, because irregular forms are not covered by the suffix rules
  • Gerunds: "shipping" → "ship"
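
These cases can be checked directly against NLTK. The snippet below is a small sketch using PorterStemmer; the word list and variable names are chosen here for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["products", "running", "runs", "ran", "better", "best", "shipping"]

# Map each word to the stem that NLTK's Porter stemmer produces for it.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Regular suffixes are stripped ("products" becomes "product", "running" becomes "run", "shipping" becomes "ship"), while irregular forms such as "ran", "better", and "best" come back unchanged.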

Usage Examples

# Basic stemming
tokens = ["running", "runner", "easily", "fairly", "products"]
stemmed = stem_tokens(tokens)
# Result: ["run", "runner", "easili", "fairli", "product"]

# Sentiment words stemming
sentiment_tokens = ["amazing", "terrible", "disappointing", "recommended"]
stemmed = stem_tokens(sentiment_tokens)
# Result: ["amaz", "terribl", "disappoint", "recommend"]

# Apply to DataFrame
df['stemmed_tokens'] = df['filtered_tokens'].apply(stem_tokens)

Advantages

  • Normalization: Groups related words together
  • Reduced vocabulary: Smaller feature space for ML models
  • Performance: Fast processing with simple rules
  • No dictionary required: Works from suffix rules alone, without a word list (the rules themselves are specific to English)

Limitations

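  • Over-stemming: May remove too much and merge unrelated words (e.g., "university" and "universe" both become "univers")
  • Under-stemming: May not catch all variations (e.g., "ran" is not reduced to "run")
  • Loss of readability: Stems are often not dictionary words (e.g., "easili", "amaz")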
  • Context insensitive: Same suffix removal regardless of word meaning
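
Both failure modes are easy to reproduce. The following sketch uses NLTK's PorterStemmer directly; the example words and the `over` and `under` variable names are chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: words with different meanings collapse to one stem.
over = {w: stemmer.stem(w) for w in ["university", "universe", "universal"]}

# Under-stemming: related forms that end up with different stems.
under = {w: stemmer.stem(w) for w in ["run", "ran"]}

print(over)   # {'university': 'univers', 'universe': 'univers', 'universal': 'univers'}
print(under)  # {'run': 'run', 'ran': 'ran'}
```

If distinctions like these matter for the analysis, inspecting the stems of a sample of the vocabulary before training a model can reveal unwanted merges early.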

Common Stemming Results

| Original                    | Stemmed                |
|-----------------------------|------------------------|
| products, product           | product                |
| running, runner, runs       | run, runner, run       |
| better, best                | better, best           |
| amazing, amazingly          | amaz, amazingli        |
| disappointed, disappointing | disappoint, disappoint |

Prerequisites

  • Requires the NLTK library to be installed (PorterStemmer itself needs no additional corpus downloads)
  • Input should be filtered tokens (output from remove_stopwords())

Pipeline Position

Typically used after:

  1. Text cleaning (clean_text())
  2. Tokenization (tokenize_text())
  3. Stopword removal (remove_stopwords())
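
The ordering above can be sketched end to end as follows. The bodies of `clean_text()`, `tokenize_text()`, and `remove_stopwords()` below are simplified stand-ins written for this example, not the project's actual implementations, and the small `STOPWORDS` set is an assumption that lets the example run without downloading NLTK corpora.

```python
import re

from nltk.stem import PorterStemmer

# Assumption: a tiny hand-written stopword set, for illustration only.
STOPWORDS = {"the", "is", "and", "a", "was", "were"}
_stemmer = PorterStemmer()


def clean_text(text: str) -> str:
    """Stand-in: lowercase and replace non-letter characters with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize_text(text: str) -> list[str]:
    """Stand-in: split on whitespace."""
    return text.split()


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Stand-in: drop tokens found in STOPWORDS."""
    return [t for t in tokens if t not in STOPWORDS]


def stem_tokens(tokens: list[str]) -> list[str]:
    """Sketch of the stemming step using NLTK's Porter stemmer."""
    return [_stemmer.stem(t) for t in tokens]


review = "The products were shipped quickly and the packaging was amazing!"
tokens = remove_stopwords(tokenize_text(clean_text(review)))
result = stem_tokens(tokens)
print(result)  # ['product', 'ship', 'quickli', 'packag', 'amaz']
```

Stemming runs last so that cleaning and stopword removal operate on the original word forms, which is what stopword lists are normally matched against.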

Alternative: Lemmatization

Consider using lemmatization instead for:

  • Better readability of processed text
  • More accurate word normalization
  • Preservation of word meaning