Function: lemmatize_tokens()
Purpose
Applies lemmatization to reduce words to their dictionary base form (lemma), providing more accurate word normalization than stemming.
Syntax
lemmatize_tokens(tokens)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| tokens | list | Required | List of word tokens to be lemmatized |
Returns
- Type: list
- Content: List of lemmatized tokens with words reduced to their dictionary base forms
Required Import
from nltk.stem import WordNetLemmatizer
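The wiki documents the interface but not the function body. A minimal sketch consistent with the behaviour described on this page (simple per-token lookup, default noun POS) could look like the following; the actual implementation in the repository may differ:
# Hypothetical sketch only - the repository's implementation may differ
from nltk.stem import WordNetLemmatizer

def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    # Replace each token with its WordNet lemma (default POS: noun)
    return [lemmatizer.lemmatize(token) for token in tokens]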
Lemmatization Algorithm
- Uses WordNet Lemmatizer: Dictionary-based approach
- Morphological analysis: Considers word structure and meaning
- POS-aware: Can use part-of-speech information for better accuracy
- Dictionary lookup: Returns actual dictionary words (lemmas)
How Lemmatization Works
WordNet Lemmatizer uses linguistic knowledge to find proper word roots:
- Plural → Singular: "products" → "product"
- Verb forms: "running", "ran", "runs" → "run" (when the verb POS is supplied)
- Adjective forms: "better" → "good" (when the adjective POS is supplied; see the snippet after this list)
- Preserves meaning: Returns readable dictionary words
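These mappings can be reproduced with WordNetLemmatizer directly. Note that the verb and adjective forms only collapse when the matching POS is passed, because the default POS is noun:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("products")          # 'product' (default POS is noun)
lemmatizer.lemmatize("running", pos="v")  # 'run'
lemmatizer.lemmatize("ran", pos="v")      # 'run'
lemmatizer.lemmatize("better", pos="a")   # 'good'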
Usage Examples
# Basic lemmatization
tokens = ["running", "runner", "easily", "fairly", "products"]
lemmatized = lemmatize_tokens(tokens)
# Result: ["running", "runner", "easily", "fairly", "product"]
# Comparison with stemming
stemmed = ["run", "runner", "easili", "fairli", "product"]
lemmatized = ["running", "runner", "easily", "fairly", "product"]
# Sentiment words
sentiment_tokens = ["amazing", "terrible", "disappointing", "recommended"]
lemmatized = lemmatize_tokens(sentiment_tokens)
# Result: ["amazing", "terrible", "disappointing", "recommended"]
# Apply to DataFrame
df['lemmatized_tokens'] = df['filtered_tokens'].apply(lemmatize_tokens)
Lemmatization vs Stemming Comparison
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Method | Rule-based suffix removal | Dictionary-based lookup |
| Output | Word stems (may not be real words) | Valid dictionary words |
| Accuracy | Less accurate (approximate stems) | More accurate (true lemmas) |
| Readability | Poor ("amaz", "disappoint") | Excellent ("amazing", "disappointing") |
| Processing Speed | Faster | Slower |
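For a concrete side-by-side check, the two approaches can be compared directly with NLTK (PorterStemmer is used here as a representative stemmer; the wiki's stem_tokens() may be built on a different one):
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["running", "easily", "amazing", "products", "children"]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(w) for w in words])
# ['run', 'easili', 'amaz', 'product', 'children']
print([lemmatizer.lemmatize(w) for w in words])
# ['running', 'easily', 'amazing', 'product', 'child']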
Common Lemmatization Results
| Original | Stemmed | Lemmatized |
|---|---|---|
| running, runs, ran | run, run, ran | run, run, run (verb POS) |
| better | better | good (adjective POS) |
| mice | mice | mouse |
| feet | feet | foot |
| amazing | amaz | amazing |
| children | children | child |
Note: the verb and adjective rows assume the matching POS tag is passed to the lemmatizer; with the default noun POS, forms such as "running", "ran" and "better" are returned unchanged.
Advantages
- Readable output: Returns actual dictionary words
- Context awareness: Better handling of irregular words
- Meaning preservation: Maintains semantic relationships
- Quality normalization: More accurate than stemming
Limitations
- Slower processing: Requires dictionary lookups
- POS dependency: Works better with part-of-speech tags
- Language specific: Requires language-specific resources
- Incomplete coverage: Some words may not be in dictionary
Enhanced Usage with POS Tags
# More accurate lemmatization with POS tags
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

def enhanced_lemmatize(tokens):
    lemmatizer = WordNetLemmatizer()
    pos_tags = pos_tag(tokens)
    # Map Penn Treebank POS tags to WordNet POS codes
    tag_map = {'N': 'n', 'V': 'v', 'R': 'r', 'J': 'a'}
    lemmatized = []
    for token, tag in pos_tags:
        wordnet_tag = tag_map.get(tag[0], 'n')  # Default to noun
        lemmatized.append(lemmatizer.lemmatize(token, pos=wordnet_tag))
    return lemmatized
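A small usage example for the POS-aware variant (pos_tag() additionally needs the NLTK tagger model, e.g. nltk.download('averaged_perceptron_tagger'); the exact output depends on the tagger):
tokens = ["the", "children", "were", "running"]
print(enhanced_lemmatize(tokens))
# expected: ['the', 'child', 'be', 'run']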
Prerequisites
- Requires NLTK library installation
- Download required data: nltk.download('wordnet'), nltk.download('omw-1.4')
- Input should be filtered tokens (output from remove_stopwords())
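A one-time setup sketch covering the downloads listed above (the tagger is only needed for the POS-aware example; resource names may vary slightly between NLTK versions):
import nltk

nltk.download('wordnet')   # WordNet dictionary used by the lemmatizer
nltk.download('omw-1.4')   # Open Multilingual WordNet data
nltk.download('averaged_perceptron_tagger')  # only for pos_tag() in the POS-aware example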
Pipeline Position
Alternative to stem_tokens() in the preprocessing pipeline:
1. Text cleaning (clean_text())
2. Tokenization (tokenize_text())
3. Stopword removal (remove_stopwords())
4. Either stemming OR lemmatization (not both)
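Putting the steps together on a DataFrame (clean_text(), tokenize_text() and remove_stopwords() are documented on their own wiki pages; the source column name is illustrative):
# Illustrative pipeline; 'review_text' is a placeholder column name
df['clean_text'] = df['review_text'].apply(clean_text)
df['tokens'] = df['clean_text'].apply(tokenize_text)
df['filtered_tokens'] = df['tokens'].apply(remove_stopwords)
df['lemmatized_tokens'] = df['filtered_tokens'].apply(lemmatize_tokens)  # instead of stem_tokens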
Recommendation
Choose lemmatization when:
- Readability of processed text is important
- Accuracy is prioritized over speed
- Working with smaller datasets
- Need to preserve word meaning for interpretation