Function: lemmatize_tokens()

Purpose

Applies lemmatization to reduce words to their dictionary base form (lemma), providing more accurate word normalization than stemming.

Syntax

lemmatize_tokens(tokens)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| tokens | list | Required | List of word tokens to be lemmatized |

Returns

  • Type: list
  • Content: List of lemmatized tokens with words reduced to their dictionary base forms

Required Import

from nltk.stem import WordNetLemmatizer

Lemmatization Algorithm

  • Uses WordNet Lemmatizer: Dictionary-based approach
  • Morphological analysis: Considers word structure and meaning
  • POS-aware: Can use part-of-speech information for better accuracy
  • Dictionary lookup: Returns actual dictionary words (lemmas); a sketch of the function appears below
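The function body is not shown on this page; a minimal sketch consistent with the behavior documented here (WordNetLemmatizer with its default noun POS) might look like:

# Hypothetical reconstruction of lemmatize_tokens()
from nltk.stem import WordNetLemmatizer

def lemmatize_tokens(tokens):
    """Reduce each token to its WordNet lemma (dictionary base form)."""
    lemmatizer = WordNetLemmatizer()
    # lemmatize() defaults to pos='n', so only noun inflections are normalized
    return [lemmatizer.lemmatize(token) for token in tokens]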

How Lemmatization Works

WordNet Lemmatizer uses linguistic knowledge to find proper word roots (demonstrated in the snippet after this list):

  • Plural → Singular: "products" → "product"
  • Verb forms: "running", "ran", "runs" → "run" (requires verb POS)
  • Adjective forms: "better" → "good" (requires adjective POS; "best" is itself a WordNet lemma, so it is returned unchanged)
  • Preserves meaning: Returns readable dictionary words
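These standard NLTK calls illustrate the cases above, including how the pos argument changes the result:

# POS-sensitive behavior of NLTK's WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("products")          # 'product'  (default POS is 'n', noun)
lemmatizer.lemmatize("running", pos="v")  # 'run'      (verb POS)
lemmatizer.lemmatize("ran", pos="v")      # 'run'      (irregular verb, exception list)
lemmatizer.lemmatize("better", pos="a")   # 'good'     (adjective exception list)
lemmatizer.lemmatize("running")           # 'running'  (as a noun, left unchanged)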

Usage Examples

# Basic lemmatization (default noun POS: verb and adverb forms pass through unchanged)
tokens = ["running", "runner", "easily", "fairly", "products"]
lemmatized = lemmatize_tokens(tokens)
# Result: ["running", "runner", "easily", "fairly", "product"]

# Comparison with stemming
stemmed = ["run", "runner", "easili", "fairli", "product"]
lemmatized = ["running", "runner", "easily", "fairly", "product"]

# Sentiment words
sentiment_tokens = ["amazing", "terrible", "disappointing", "recommended"]
lemmatized = lemmatize_tokens(sentiment_tokens)
# Result: ["amazing", "terrible", "disappointing", "recommended"]

# Apply to DataFrame
df['lemmatized_tokens'] = df['filtered_tokens'].apply(lemmatize_tokens)

Lemmatization vs Stemming Comparison

| Aspect | Stemming | Lemmatization |
|--------|----------|---------------|
| Method | Rule-based suffix removal | Dictionary-based lookup |
| Output | Word stems (may not be real words) | Valid dictionary words |
| Accuracy | Less accurate | More accurate |
| Readability | Poor ("amaz", "disappoint") | Excellent ("amazing", "disappointing") |
| Processing speed | Faster | Slower |
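A quick side-by-side check of the two approaches on the same tokens, using NLTK's standard stemmer and lemmatizer:

# Stemming vs lemmatization on identical input
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["amazing", "disappointing", "easily", "products"]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

[stemmer.stem(w) for w in words]          # ['amaz', 'disappoint', 'easili', 'product']
[lemmatizer.lemmatize(w) for w in words]  # ['amazing', 'disappointing', 'easily', 'product']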

Common Lemmatization Results

| Original | Stemmed | Lemmatized (with correct POS) |
|----------|---------|-------------------------------|
| running, runs, ran | run, run, ran | run, run, run (verb POS) |
| better, best | better, best | good, best (adjective POS; "best" is its own lemma) |
| mice | mice | mouse |
| feet | feet | foot |
| amazing | amaz | amazing |
| children | children | child |
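Irregular noun plurals are handled through WordNet's exception lists, even with the default noun POS:

# Irregular plurals resolve via WordNet's noun exception list
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
[lemmatizer.lemmatize(w) for w in ["mice", "feet", "children"]]
# ['mouse', 'foot', 'child']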

Advantages

  • Readable output: Returns actual dictionary words
  • Context awareness: Better handling of irregular words
  • Meaning preservation: Maintains semantic relationships
  • Quality normalization: More accurate than stemming

Limitations

  • Slower processing: Requires dictionary lookups
  • POS dependency: Works better with part-of-speech tags
  • Language specific: Requires language-specific resources
  • Incomplete coverage: Some words may not be in the dictionary (see the snippet below)
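When a token has no WordNet entry, lemmatize() simply returns it unchanged, so typos, domain jargon, and made-up words pass through as-is (the token below is deliberately nonsensical):

# Out-of-vocabulary tokens are returned unchanged
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("blorgs")  # 'blorgs' (invented word, not in WordNet)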

Enhanced Usage with POS Tags

# More accurate lemmatization with POS tags
import nltk
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

def enhanced_lemmatize(tokens):
    lemmatizer = WordNetLemmatizer()
    pos_tags = pos_tag(tokens)
    
    # Map Penn Treebank tag prefixes to WordNet POS codes
    tag_map = {'N': 'n', 'V': 'v', 'R': 'r', 'J': 'a'}
    
    lemmatized = []
    for token, tag in pos_tags:
        wordnet_tag = tag_map.get(tag[0], 'n')  # Default to noun
        lemmatized.append(lemmatizer.lemmatize(token, pos=wordnet_tag))
    
    return lemmatized
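A quick check (the POS tagger model must be downloaded first; the expected output assumes typical tagger behavior on this phrase):

nltk.download('averaged_perceptron_tagger')  # model used by pos_tag (newer NLTK versions name it 'averaged_perceptron_tagger_eng')

enhanced_lemmatize(["the", "children", "were", "running"])
# Typically: ['the', 'child', 'be', 'run']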

Prerequisites

  • Requires NLTK library installation
  • Download required data: nltk.download('wordnet') and nltk.download('omw-1.4') (see the snippet below)
  • Input should be filtered tokens (output from remove_stopwords())
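One-time setup, run once per environment:

# Download the WordNet data NLTK's lemmatizer depends on
import nltk

nltk.download('wordnet')   # WordNet lemma database
nltk.download('omw-1.4')   # Open Multilingual WordNet (needed by newer NLTK versions)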

Pipeline Position

An alternative to stem_tokens() in the preprocessing pipeline (a full-pipeline sketch follows the list):

  1. Text cleaning (clean_text())
  2. Tokenization (tokenize_text())
  3. Stopword removal (remove_stopwords())
  4. Either stemming OR lemmatization (not both)
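Putting the steps together on a DataFrame (a sketch: the other functions are as documented on their own wiki pages, and the source column name review_text is a hypothetical placeholder):

# Full preprocessing pipeline, ending in lemmatization
df['clean_text'] = df['review_text'].apply(clean_text)   # 'review_text' is a placeholder column name
df['tokens'] = df['clean_text'].apply(tokenize_text)
df['filtered_tokens'] = df['tokens'].apply(remove_stopwords)
df['lemmatized_tokens'] = df['filtered_tokens'].apply(lemmatize_tokens)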

Recommendation

Choose lemmatization when:

  • Readability of processed text is important
  • Accuracy is prioritized over speed
  • Working with smaller datasets
  • Need to preserve word meaning for interpretation