# Module 1.1 Morphology: Stems and Affixes
## Key Concepts

### Stem
The minimal form of a word that carries its core meaning.

- running → run
- happily → happi (as produced by a simple stemmer)
### Affix
A bound morpheme attached to a stem to modify its meaning or function.

- **Prefix**: added before the stem (e.g., un- + happy → unhappy); prefixes are not handled by the suffix stemmers on this page, so a small sketch follows this list
- **Suffix**: added after the stem (e.g., happi + -ness → happiness)
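Since the rest of this page only strips suffixes, here is a minimal sketch of the mirror-image operation for prefixes. The `PREFIXES` tuple and the length-3 threshold are illustrative assumptions for this sketch, not a standard inventory:

```python
# Illustrative prefix stripper; PREFIXES and the length threshold
# are assumptions for this sketch, not a standard list.
PREFIXES = ('un', 'dis', 're', 'pre')

def strip_prefix(word):
    """Remove one known prefix if at least 3 characters remain."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word

print(strip_prefix('unhappy'))  # → happy
print(strip_prefix('rerun'))    # → run
print(strip_prefix('under'))    # → der  (false positive: 'under' has no real prefix)
```

As the last example shows, naive prefix stripping shares the over-stripping pitfalls of naive suffix stripping.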
### Stemming vs. Lemmatization

- **Stemming**: a heuristic process that crudely strips affixes from words, which may yield non-words or incomplete roots.
- **Lemmatization**: uses vocabulary and part-of-speech (POS) tags to return the dictionary or canonical form of a word (the lemma), guaranteeing valid words.

The short sketch below contrasts the two on the same words.
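This comparison uses NLTK's `PorterStemmer` and `WordNetLemmatizer` (assuming NLTK and its `wordnet` data are installed); the word list is arbitrary:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # lexicon backing the lemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (word, POS) pairs; the POS tag guides the lemmatizer only
for word, pos in [('studies', 'n'), ('running', 'v'), ('better', 'a')]:
    print(f"{word:<10} stem: {porter.stem(word):<10} "
          f"lemma: {lemmatizer.lemmatize(word, pos=pos)}")
```

Note how the stemmer can emit a non-word ('studies' → 'studi') while the lemmatizer returns 'study', and how only the lemmatizer, given the adjective tag, maps 'better' to 'good'.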
## 1. Rule-Based Stemmer (Toy Example)

A simple stemmer that strips common English suffixes:
```python
# simple_stemmer_demo.ipynb

# 1. Define the stemmer
def simple_stemmer(word):
    """Strip common suffixes if word length remains ≥ 3."""
    for suffix in ('ing', 'ly', 'ed', 'ness', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# 2. Sample paragraph
paragraph = (
    "Learning natural language processing can be challenging, "
    "especially when understanding the underlying patterns requires "
    "careful thinking and extensive testing. However, with consistent "
    "practice and thoughtful analysis, researchers and students alike "
    "find themselves improving steadily and achieving success."
)

# 3. Tokenise into words (simple split, stripping punctuation)
import string

tokens = []
for raw in paragraph.split():
    # Remove surrounding punctuation and convert to lowercase
    token = raw.strip(string.punctuation).lower()
    if token:
        tokens.append(token)

# 4. Stem each token
stems = [simple_stemmer(tok) for tok in tokens]

# 5. Display results
print(f"{'Original':<15} → Stemmed")
print("-" * 28)
for orig, stem in zip(tokens, stems):
    print(f"{orig:<15} → {stem}")
```
## 2. Using NLTK’s Stemmer Implementations

NLTK provides established stemmers such as the Porter and Snowball stemmers:
```python
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer

nltk.download('punkt')  # for tokenization, if needed

words = ['running', 'happily', 'tested', 'kindness', 'cats', 'play']

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Header widths match the 10-character column widths below
print(f"{'Word':<10} {'Porter':<10} {'Snowball':<10}")
for w in words:
    print(f"{w:<10} {porter.stem(w):<10} {snowball.stem(w):<10}")
```
## 3. Extending to Affix Lists

You can maintain a custom affix list for domain-specific needs:
```python
# Affix Stemmer Example for Jupyter Notebook

# 1. Define your custom affix-replacement list
AFFIXES = {
    'ing':     '',     # running → runn
    'ly':      '',     # happily → happi
    'ed':      '',     # tested → test
    'ization': 'ize',  # modernization → modernize
    'ies':     'y',    # studies → study
}

# 2. Define the stemmer function
def affix_stemmer(word):
    """
    Strip or replace domain-specific affixes if the remaining stem
    is ≥ 3 characters.
    """
    for affix, replacement in AFFIXES.items():
        if word.endswith(affix) and len(word) - len(affix) >= 3:
            return word[:-len(affix)] + replacement
    return word

# 3. Demo usage
if __name__ == "__main__":
    words = ['modernization', 'studies', 'running', 'happily', 'tested', 'cats']
    stems = [affix_stemmer(w) for w in words]

    # Print side-by-side
    print(f"{'Original':<15} → Stemmed")
    print("-" * 28)
    for orig, stem in zip(words, stems):
        print(f"{orig:<15} → {stem}")
```
## 4. When to Lemmatize Instead

For more accurate normalization, lemmatization uses vocabulary and part-of-speech tags:
```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')                     # lexicon backing the lemmatizer
nltk.download('averaged_perceptron_tagger')  # POS tagger, for tagged input

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # → run
print(lemmatizer.lemmatize('studies', pos='n'))  # → study
```
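In practice you rarely know the POS tag in advance. A common pattern, sketched below with NLTK's `pos_tag` (the tag-mapping helper `wordnet_pos` is our own illustration, not an NLTK function), maps Penn Treebank tags onto WordNet's four POS classes before lemmatizing:

```python
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

def wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to one of WordNet's four POS classes."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default: covers 'N' tags and everything else

lemmatizer = WordNetLemmatizer()
sentence = "The students were studying better methods"
for token, tag in pos_tag(word_tokenize(sentence)):
    print(token, '→', lemmatizer.lemmatize(token.lower(), wordnet_pos(tag)))
```

With the verb tag supplied automatically, 'were' lemmatizes to 'be' and 'studying' to 'study', which the noun-default call alone would miss.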
**Next:** Continue to *1.2 Tokenization: Regex & Rule-based Methods*