Function: lemmatize_tokens()

Purpose

Applies lemmatization to reduce words to their dictionary base form (lemma), providing more accurate word normalization than stemming.

Syntax

lemmatize_tokens(tokens)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| tokens | list | Required | List of word tokens to be lemmatized |

Returns

  • Type: list
  • Content: List of lemmatized tokens with words reduced to their dictionary base forms

Required Import

from nltk.stem import WordNetLemmatizer

Lemmatization Algorithm

  • Uses WordNet Lemmatizer: Dictionary-based approach
  • Morphological analysis: Considers word structure and meaning
  • POS-aware: Can use part-of-speech information for better accuracy
  • Dictionary lookup: Returns actual dictionary words (lemmas); a sketch of the function appears below
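The function body is not shown on this page; a minimal sketch consistent with the behavior documented here (WordNetLemmatizer with its default noun POS) might look like:

# Hypothetical reconstruction of lemmatize_tokens()
from nltk.stem import WordNetLemmatizer

def lemmatize_tokens(tokens):
    """Reduce each token to its WordNet lemma (dictionary base form)."""
    lemmatizer = WordNetLemmatizer()
    # lemmatize() defaults to pos='n', so only noun inflections are normalized
    return [lemmatizer.lemmatize(token) for token in tokens]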

How Lemmatization Works

WordNet Lemmatizer uses linguistic knowledge to find proper word roots (demonstrated in the snippet after this list):

  • Plural → Singular: "products" → "product"
  • Verb forms: "running", "ran", "runs" → "run" (requires verb POS)
  • Adjective forms: "better" → "good" (requires adjective POS; "best" is itself a WordNet lemma, so it is returned unchanged)
  • Preserves meaning: Returns readable dictionary words
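These standard NLTK calls illustrate the cases above, including how the pos argument changes the result:

# POS-sensitive behavior of NLTK's WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("products")          # 'product'  (default POS is 'n', noun)
lemmatizer.lemmatize("running", pos="v")  # 'run'      (verb POS)
lemmatizer.lemmatize("ran", pos="v")      # 'run'      (irregular verb, exception list)
lemmatizer.lemmatize("better", pos="a")   # 'good'     (adjective exception list)
lemmatizer.lemmatize("running")           # 'running'  (as a noun, left unchanged)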

Usage Examples

# Basic lemmatization (default noun POS: verb and adverb forms pass through unchanged)
tokens = ["running", "runner", "easily", "fairly", "products"]
lemmatized = lemmatize_tokens(tokens)
# Result: ["running", "runner", "easily", "fairly", "product"]

# Comparison with stemming
stemmed = ["run", "runner", "easili", "fairli", "product"]
lemmatized = ["running", "runner", "easily", "fairly", "product"]

# Sentiment words
sentiment_tokens = ["amazing", "terrible", "disappointing", "recommended"]
lemmatized = lemmatize_tokens(sentiment_tokens)
# Result: ["amazing", "terrible", "disappointing", "recommended"]

# Apply to DataFrame
df['lemmatized_tokens'] = df['filtered_tokens'].apply(lemmatize_tokens)

Lemmatization vs Stemming Comparison

| Aspect | Stemming | Lemmatization |
|--------|----------|---------------|
| Method | Rule-based suffix removal | Dictionary-based lookup |
| Output | Word stems (may not be real words) | Valid dictionary words |
| Accuracy | Less accurate | More accurate |
| Readability | Poor ("amaz", "disappoint") | Excellent ("amazing", "disappointing") |
| Processing speed | Faster | Slower |
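A quick side-by-side check of the two approaches on the same tokens, using NLTK's standard stemmer and lemmatizer:

# Stemming vs lemmatization on identical input
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["amazing", "disappointing", "easily", "products"]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

[stemmer.stem(w) for w in words]          # ['amaz', 'disappoint', 'easili', 'product']
[lemmatizer.lemmatize(w) for w in words]  # ['amazing', 'disappointing', 'easily', 'product']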

Common Lemmatization Results

| Original | Stemmed | Lemmatized (with correct POS) |
|----------|---------|-------------------------------|
| running, runs, ran | run, run, ran | run, run, run (verb POS) |
| better, best | better, best | good, best (adjective POS; "best" is its own lemma) |
| mice | mice | mouse |
| feet | feet | foot |
| amazing | amaz | amazing |
| children | children | child |
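Irregular noun plurals are handled through WordNet's exception lists, even with the default noun POS:

# Irregular plurals resolve via WordNet's noun exception list
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
[lemmatizer.lemmatize(w) for w in ["mice", "feet", "children"]]
# ['mouse', 'foot', 'child']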

Advantages

  • Readable output: Returns actual dictionary words
  • Context awareness: Better handling of irregular words
  • Meaning preservation: Maintains semantic relationships
  • Quality normalization: More accurate than stemming

Limitations

  • Slower processing: Requires dictionary lookups
  • POS dependency: Works better with part-of-speech tags
  • Language specific: Requires language-specific resources
  • Incomplete coverage: Some words may not be in the dictionary (see the snippet below)
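When a token has no WordNet entry, lemmatize() simply returns it unchanged, so typos, domain jargon, and made-up words pass through as-is (the token below is deliberately nonsensical):

# Out-of-vocabulary tokens are returned unchanged
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("blorgs")  # 'blorgs' (invented word, not in WordNet)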

Enhanced Usage with POS Tags

# More accurate lemmatization with POS tags
import nltk
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

def enhanced_lemmatize(tokens):
    lemmatizer = WordNetLemmatizer()
    pos_tags = pos_tag(tokens)
    
    # Map Penn Treebank tag prefixes to WordNet POS codes
    tag_map = {'N': 'n', 'V': 'v', 'R': 'r', 'J': 'a'}
    
    lemmatized = []
    for token, tag in pos_tags:
        wordnet_tag = tag_map.get(tag[0], 'n')  # Default to noun
        lemmatized.append(lemmatizer.lemmatize(token, pos=wordnet_tag))
    
    return lemmatized
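A quick check (the POS tagger model must be downloaded first; the expected output assumes typical tagger behavior on this phrase):

nltk.download('averaged_perceptron_tagger')  # model used by pos_tag (newer NLTK versions name it 'averaged_perceptron_tagger_eng')

enhanced_lemmatize(["the", "children", "were", "running"])
# Typically: ['the', 'child', 'be', 'run']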

Prerequisites

  • Requires NLTK library installation
  • Download required data: nltk.download('wordnet') and nltk.download('omw-1.4') (see the snippet below)
  • Input should be filtered tokens (output from remove_stopwords())
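One-time setup, run once per environment:

# Download the WordNet data NLTK's lemmatizer depends on
import nltk

nltk.download('wordnet')   # WordNet lemma database
nltk.download('omw-1.4')   # Open Multilingual WordNet (needed by newer NLTK versions)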

Pipeline Position

An alternative to stem_tokens() in the preprocessing pipeline (a full-pipeline sketch follows the list):

  1. Text cleaning (clean_text())
  2. Tokenization (tokenize_text())
  3. Stopword removal (remove_stopwords())
  4. Either stemming OR lemmatization (not both)
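Putting the steps together on a DataFrame (a sketch: the other functions are as documented on their own wiki pages, and the source column name review_text is a hypothetical placeholder):

# Full preprocessing pipeline, ending in lemmatization
df['clean_text'] = df['review_text'].apply(clean_text)   # 'review_text' is a placeholder column name
df['tokens'] = df['clean_text'].apply(tokenize_text)
df['filtered_tokens'] = df['tokens'].apply(remove_stopwords)
df['lemmatized_tokens'] = df['filtered_tokens'].apply(lemmatize_tokens)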

Recommendation

Choose lemmatization when:

  • Readability of processed text is important
  • Accuracy is prioritized over speed
  • Working with smaller datasets
  • Need to preserve word meaning for interpretation