Module 1.1 Morphology: Stems and Affixes

Key Concepts

Stem

The minimal form of a word that carries its core meaning.

  • running → run
  • happily → happi (as produced by a simple stemmer)

Affix

A bound morpheme attached to a stem to modify its meaning or function.

  • Prefix: Added before the stem (e.g., un- + happy → unhappy)
  • Suffix: Added after the stem (e.g., happi + -ness → happiness)

Stemming vs. Lemmatization

  • Stemming
    A heuristic process that crudely strips affixes from words, which may result in non-words or incomplete roots.

  • Lemmatization
    Uses vocabulary and part-of-speech (POS) tags to return the dictionary or canonical form of a word (lemma), ensuring valid words.
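
The difference is easiest to see side by side. Below is a minimal sketch (assuming NLTK is installed and the WordNet data can be downloaded) that runs both on the same words:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # the lemmatizer looks words up in WordNet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ('running', 'studies', 'was'):
    # stemming chops affixes; lemmatization maps to a dictionary form
    print(f"{word:10} stem={stemmer.stem(word):8} lemma={lemmatizer.lemmatize(word, pos='v')}")

Note how 'was' is stemmed to the non-word 'wa' but lemmatized to 'be': stemming cannot handle irregular forms, while lemmatization can.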

1. Rule-Based Stemmer (Toy Example)

A simple stemmer that strips common English suffixes:

# simple_stemmer_demo.ipynb

# 1. Define the stemmer
def simple_stemmer(word):
    """Strip common suffixes if word length remains ≥ 3."""
    for suffix in ('ing', 'ly', 'ed', 'ness', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# 2. Sample paragraph
paragraph = (
    "Learning natural language processing can be challenging, "
    "especially when understanding the underlying patterns requires "
    "careful thinking and extensive testing. However, with consistent "
    "practice and thoughtful analysis, researchers and students alike "
    "find themselves improving steadily and achieving success."
)

# 3. Tokenise into words (simple split, stripping punctuation)
import string

tokens = []
for raw in paragraph.split():
    # Remove surrounding punctuation and convert to lowercase
    token = raw.strip(string.punctuation).lower()
    if token:
        tokens.append(token)

# 4. Stem each token
stems = [simple_stemmer(tok) for tok in tokens]

# 5. Display results
print(f"{'Original':<15} → Stemmed")
print("-" * 28)
for orig, stem in zip(tokens, stems):
    print(f"{orig:<15} → {stem}")


2. Using NLTK’s Stemmer Implementations

NLTK provides established stemmers such as the Porter and Snowball stemmers:

import nltk
from nltk.stem import PorterStemmer, SnowballStemmer

nltk.download('punkt')  # for tokenization if needed

words = ['running', 'happily', 'tested', 'kindness', 'cats', 'play']
porter = PorterStemmer()
snowball = SnowballStemmer('english')

print("Word       Porter     Snowball")
for w in words:
    print(f"{w:10} {porter.stem(w):10} {snowball.stem(w):10}")


3. Extending to Affix Lists

You can maintain a custom affix list for domain-specific needs:

# Affix Stemmer Example for Jupyter Notebook

# 1. Define your custom affix-replacement list
AFFIXES = {
    'ing':     '',      # running → runn
    'ly':      '',      # happily → happi
    'ed':      '',      # tested → test
    'ization': 'ize',   # modernization → modernize
    'ies':     'y',     # studies → study
}

# 2. Define the stemmer function
def affix_stemmer(word):
    """
    Strip or replace domain-specific affixes if the remaining stem is ≥ 3 characters.
    """
    for affix, replacement in AFFIXES.items():
        if word.endswith(affix) and len(word) - len(affix) >= 3:
            return word[:-len(affix)] + replacement
    return word

# 3. Demo usage
if __name__ == "__main__":
    words = ['modernization', 'studies', 'running', 'happily', 'tested', 'cats']
    stems = [affix_stemmer(w) for w in words]
    
    # Print side-by-side
    print(f"{'Original':<15} → Stemmed")
    print("-" * 28)
    for orig, stem in zip(words, stems):
        print(f"{orig:<15} → {stem}")


4. When to Lemmatize Instead

For more accurate normalization, lemmatization consults a vocabulary (WordNet) and the word's part of speech to return valid dictionary forms:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')                     # lemma dictionary
nltk.download('averaged_perceptron_tagger')  # POS tagger (used in the sketch below)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # → run
print(lemmatizer.lemmatize('studies',  pos='n'))  # → study
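
The averaged_perceptron_tagger download pays off when the POS is not known in advance. A common pattern, sketched below assuming punkt is also available for word_tokenize, tags each token first and maps its Penn Treebank tag onto WordNet's POS codes:

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS constant (noun by default)."""
    return {'J': wordnet.ADJ, 'V': wordnet.VERB, 'R': wordnet.ADV}.get(
        treebank_tag[0], wordnet.NOUN)

tokens = word_tokenize("The striped bats were hanging on their feet")
for word, tag in pos_tag(tokens):
    print(word, '→', lemmatizer.lemmatize(word, pos=to_wordnet_pos(tag)))
# e.g. 'bats' → 'bat', 'were' → 'be', 'hanging' → 'hang', 'feet' → 'foot'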


Next: Continue to 1.2 Tokenization: Regex & Rule-based Methods