Function: preprocess_text()

Purpose

Complete end-to-end text preprocessing pipeline that combines all cleaning and normalization steps into a single configurable function.

Syntax

preprocess_text(text, remove_stops=True, stem=False, lemmatize=True, 
                extra_stopwords=None, keep_sentiment_words=True)

Parameters

Parameter             Type       Default   Description
text                  str        Required  Raw text string to be processed
remove_stops          bool       True      Whether to remove stopwords from tokens
stem                  bool       False     Whether to apply stemming (Porter Stemmer)
lemmatize             bool       True      Whether to apply lemmatization (WordNet)
extra_stopwords       list/set   None      Additional custom stopwords to remove
keep_sentiment_words  bool       True      Whether to preserve sentiment-bearing stopwords

Returns

  • Type: list
  • Content: Fully processed list of tokens ready for analysis

Processing Pipeline

The function executes these steps in sequence (a sketch of how they compose follows the list):

  1. Text Cleaning (clean_text())

    • Convert to lowercase
    • Remove HTML tags, URLs
    • Filter special characters
    • Normalize whitespace
  2. Tokenization (tokenize_text())

    • Split text into individual word tokens
    • Handle punctuation and contractions
  3. Stopword Removal (remove_stopwords() - optional)

    • Remove common English stopwords
    • Add custom stopwords if provided
    • Preserve sentiment words if specified
  4. Stemming (stem_tokens() - optional)

    • Apply Porter Stemmer algorithm
    • Reduce words to root forms
  5. Lemmatization (lemmatize_tokens() - optional)

    • Apply WordNet Lemmatizer
    • Reduce words to dictionary base forms
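
Putting the steps together, the following is a minimal sketch of how the pipeline might compose the helper functions documented on their own pages (the helper names come from this wiki; their exact keyword arguments are assumptions here, not confirmed signatures):

def preprocess_text(text, remove_stops=True, stem=False, lemmatize=True,
                    extra_stopwords=None, keep_sentiment_words=True):
    # Steps 1-2: clean the raw string, then split it into word tokens
    tokens = tokenize_text(clean_text(text))
    # Step 3 (optional): drop stopwords, honoring custom and sentiment options
    if remove_stops:
        tokens = remove_stopwords(tokens,
                                  extra_stopwords=extra_stopwords,
                                  keep_sentiment_words=keep_sentiment_words)
    # Step 4 (optional): Porter stemming
    if stem:
        tokens = stem_tokens(tokens)
    # Step 5 (optional): WordNet lemmatization
    if lemmatize:
        tokens = lemmatize_tokens(tokens)
    return tokens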

Configuration Options

Default Configuration (Recommended for Sentiment Analysis):

# Optimized for sentiment analysis
result = preprocess_text(text)  # Uses lemmatization, keeps sentiment words

Speed-Optimized Configuration:

# Faster processing, lower accuracy
result = preprocess_text(text, stem=True, lemmatize=False)

Minimal Preprocessing:

# Only cleaning and tokenization
result = preprocess_text(text, remove_stops=False, lemmatize=False)

Domain-Specific Configuration:

# E-commerce specific preprocessing
custom_stops = ['amazon', 'product', 'item', 'seller']
result = preprocess_text(text, extra_stopwords=custom_stops)

Usage Examples

# Basic usage
raw_review = "This product is ABSOLUTELY amazing! I don't recommend it."
processed = preprocess_text(raw_review)
# Result: ['product', 'absolutely', 'amazing', 'not', 'recommend']

# With stemming instead of lemmatization
processed_stem = preprocess_text(raw_review, stem=True, lemmatize=False)
# Result: ['product', 'absolut', 'amaz', 'not', 'recommend']

# Without stopword removal
processed_all = preprocess_text(raw_review, remove_stops=False)
# Result: ['this', 'product', 'be', 'absolutely', 'amazing', 'i', 'do', 'not', 'recommend', 'it']

# Apply to DataFrame
df['processed_reviews'] = df['review_body'].apply(
    lambda x: preprocess_text(x, extra_stopwords=['amazon', 'product'])
)

Design Decisions

Default Parameters Rationale:

  • Lemmatization over stemming: Better readability and accuracy
  • Keep sentiment words: Essential for sentiment analysis
  • Remove stopwords: Reduces noise while preserving meaning
  • Flexible configuration: Allows customization for different use cases

Performance Considerations

Memory Usage:

  • Processing large datasets: consider chunking (see the sketch after this list)
  • Each function call creates intermediate objects (cleaned strings, token lists)
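
For example, a large column can be processed in fixed-size chunks so that only one slice of intermediate objects is alive at a time (pandas is assumed; the chunk size is illustrative):

# Process a large column in chunks to limit peak memory
chunk_size = 10_000
processed = []
for start in range(0, len(df), chunk_size):
    chunk = df['review_body'].iloc[start:start + chunk_size]
    processed.extend(preprocess_text(text) for text in chunk)
df['processed'] = processed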

Processing Speed:

  • Fastest: Clean + Tokenize only
  • Medium: With stemming
  • Slowest: With lemmatization (dictionary lookups)
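
To check this ordering on your own data, the standard-library timeit module can time each configuration (the sample string and repetition count below are illustrative):

import timeit

sample = "This product is absolutely amazing and I would buy it again!"
configs = {
    'clean + tokenize only': dict(remove_stops=False, stem=False, lemmatize=False),
    'with stemming':         dict(stem=True, lemmatize=False),
    'with lemmatization':    dict(stem=False, lemmatize=True),
}
for name, kwargs in configs.items():
    seconds = timeit.timeit(lambda: preprocess_text(sample, **kwargs), number=1000)
    print(f'{name}: {seconds:.2f}s per 1,000 calls')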

Prerequisites

All imports and NLTK downloads required by the individual functions:

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Required downloads
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
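
To avoid re-downloading on every run, a common idiom is to fetch each resource only when it is missing (nltk.data.find() raises LookupError for absent resources):

# Download each NLTK resource only if it is not already installed
for pkg, path in [('punkt', 'tokenizers/punkt'),
                  ('stopwords', 'corpora/stopwords'),
                  ('wordnet', 'corpora/wordnet'),
                  ('omw-1.4', 'corpora/omw-1.4')]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(pkg)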

Error Handling

  • Handles non-string inputs gracefully via clean_text()
  • Returns empty list for invalid inputs
  • Continues processing even if individual steps encounter issues
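
Assuming the documented fallback behavior, invalid inputs simply come back as an empty token list:

# Illustrative calls based on the behavior described above
preprocess_text(None)   # -> []
preprocess_text(3.14)   # -> []
preprocess_text("")     # -> []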

Common Patterns

Batch Processing:

# Process a whole column of reviews
processed_reviews = [preprocess_text(review) for review in df['review_body']]
df['processed'] = processed_reviews

Pipeline Customization:

# Create domain-specific preprocessor
def ecommerce_preprocess(text):
    return preprocess_text(
        text, 
        extra_stopwords=['amazon', 'product', 'item', 'seller'],
        keep_sentiment_words=True
    )
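
The custom preprocessor can then be applied across a DataFrame column just like the base function:

# Apply the domain-specific preprocessor to every review
df['processed'] = df['review_body'].apply(ecommerce_preprocess)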