Function: preprocess_text()
Purpose
Complete end-to-end text preprocessing pipeline that combines all cleaning and normalization steps into a single configurable function.
Syntax
```python
preprocess_text(text, remove_stops=True, stem=False, lemmatize=True,
                extra_stopwords=None, keep_sentiment_words=True)
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | str | Required | Raw text string to be processed |
| `remove_stops` | bool | True | Whether to remove stopwords from tokens |
| `stem` | bool | False | Whether to apply stemming (Porter Stemmer) |
| `lemmatize` | bool | True | Whether to apply lemmatization (WordNet) |
| `extra_stopwords` | list/set or None | None | Additional custom stopwords to remove |
| `keep_sentiment_words` | bool | True | Whether to preserve sentiment-bearing stopwords |
Returns
- Type: list
- Content: Fully processed list of tokens ready for analysis
Processing Pipeline
The function executes these steps in sequence:
1. Text Cleaning (`clean_text()`)
   - Convert to lowercase
   - Remove HTML tags and URLs
   - Filter special characters
   - Normalize whitespace
2. Tokenization (`tokenize_text()`)
   - Split text into individual word tokens
   - Handle punctuation and contractions
3. Stopword Removal (`remove_stopwords()`, optional)
   - Remove common English stopwords
   - Add custom stopwords if provided
   - Preserve sentiment words if specified
4. Stemming (`stem_tokens()`, optional)
   - Apply the Porter Stemmer algorithm
   - Reduce words to root forms
5. Lemmatization (`lemmatize_tokens()`, optional)
   - Apply the WordNet Lemmatizer
   - Reduce words to dictionary base forms
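The composition of these steps might look like the following minimal sketch. This is a reading of the documented sequence, not the repository's exact code; the helper signatures are assumptions based on the individual function pages.

```python
def preprocess_text(text, remove_stops=True, stem=False, lemmatize=True,
                    extra_stopwords=None, keep_sentiment_words=True):
    cleaned = clean_text(text)       # lowercase, strip HTML/URLs, normalize whitespace
    tokens = tokenize_text(cleaned)  # split into word tokens
    if remove_stops:
        tokens = remove_stopwords(tokens, extra_stopwords=extra_stopwords,
                                  keep_sentiment_words=keep_sentiment_words)
    if stem:
        tokens = stem_tokens(tokens)           # Porter Stemmer
    if lemmatize:
        tokens = lemmatize_tokens(tokens)      # WordNet Lemmatizer
    return tokens
```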
Configuration Options
Default Configuration (Recommended for Sentiment Analysis):
```python
# Optimized for sentiment analysis
result = preprocess_text(text)  # uses lemmatization, keeps sentiment words
```
Speed-Optimized Configuration:
```python
# Faster processing, less accuracy
result = preprocess_text(text, stem=True, lemmatize=False)
```
Minimal Preprocessing:
```python
# Only cleaning and tokenization
result = preprocess_text(text, remove_stops=False, lemmatize=False)
```
Domain-Specific Configuration:
```python
# E-commerce specific preprocessing
custom_stops = ['amazon', 'product', 'item', 'seller']
result = preprocess_text(text, extra_stopwords=custom_stops)
```
Usage Examples
```python
# Basic usage
raw_review = "This product is ABSOLUTELY amazing! I don't recommend it."
processed = preprocess_text(raw_review)
# Result: ['product', 'absolutely', 'amazing', 'not', 'recommend']

# With stemming instead of lemmatization
processed_stem = preprocess_text(raw_review, stem=True, lemmatize=False)
# Result: ['product', 'absolut', 'amaz', 'not', 'recommend']

# Without stopword removal
processed_all = preprocess_text(raw_review, remove_stops=False)
# Result: ['this', 'product', 'be', 'absolutely', 'amazing', 'i', 'do', 'not', 'recommend', 'it']

# Apply to a DataFrame column
df['processed_reviews'] = df['review_body'].apply(
    lambda x: preprocess_text(x, extra_stopwords=['amazon', 'product'])
)
```
Design Decisions
Default Parameters Rationale:
- Lemmatization over stemming: Better readability and accuracy
- Keep sentiment words: Essential for sentiment analysis
- Remove stopwords: Reduces noise while preserving meaning
- Flexible configuration: Allows customization for different use cases
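The lemmatization-versus-stemming trade-off is easy to see with NLTK directly. This is a standalone illustration, independent of this wiki's helper functions; it requires the downloads listed under Prerequisites.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["absolutely", "amazing", "recommended"]:
    # Stemming chops suffixes; lemmatization maps to dictionary forms.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
# absolutely -> absolut / absolutely
# amazing -> amaz / amaze
# recommended -> recommend / recommend
```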
Performance Considerations
Memory Usage:
- Processing large datasets: Consider chunking (see the sketch below)
- Each function call creates intermediate objects
Processing Speed:
- Fastest: Clean + Tokenize only
- Medium: With stemming
- Slowest: With lemmatization (dictionary lookups)
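For large datasets, a chunked pattern like this sketch keeps memory bounded (the file name, column name, and chunk size are illustrative):

```python
import pandas as pd

# Read and process the data in chunks instead of loading it all at once.
chunks = []
for chunk in pd.read_csv('reviews.csv', chunksize=10_000):
    chunk['processed'] = chunk['review_body'].apply(preprocess_text)
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
```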
Prerequisites
The pipeline needs every import and NLTK download required by the individual functions:

```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Required downloads
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
```
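Note: on newer NLTK releases (3.8.2+), `word_tokenize` may also require the `punkt_tab` resource, fetched with `nltk.download('punkt_tab')`.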
Error Handling
- Handles non-string inputs gracefully via `clean_text()`
- Returns an empty list for invalid inputs
- Continues processing even if individual steps encounter issues
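Based on those guarantees, invalid inputs should fail soft rather than raise (expected results shown as comments):

```python
preprocess_text(None)  # -> [] (non-string input)
preprocess_text("")    # -> [] (nothing left after cleaning)
```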
Common Patterns
Batch Processing:
```python
# Process multiple reviews efficiently
processed_reviews = []
for review in df['review_body']:
    processed_reviews.append(preprocess_text(review))
df['processed'] = processed_reviews
```
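An explicit loop like this is mainly useful when you want per-row progress reporting or error handling; otherwise `df['review_body'].apply(preprocess_text)`, as in the Usage Examples above, is equivalent and more idiomatic pandas.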
Pipeline Customization:
```python
# Create domain-specific preprocessor
def ecommerce_preprocess(text):
    return preprocess_text(
        text,
        extra_stopwords=['amazon', 'product', 'item', 'seller'],
        keep_sentiment_words=True
    )
```
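The custom preprocessor then drops in anywhere the base function would (the column name here is illustrative):

```python
df['processed'] = df['review_body'].apply(ecommerce_preprocess)
```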