Function: preprocess_text()

Purpose

Complete end-to-end text preprocessing pipeline that combines all cleaning and normalization steps into a single configurable function.

Syntax

preprocess_text(text, remove_stops=True, stem=False, lemmatize=True, 
                extra_stopwords=None, keep_sentiment_words=True)

Parameters

Parameter             Type       Default   Description
text                  str        Required  Raw text string to be processed
remove_stops          bool       True      Whether to remove stopwords from tokens
stem                  bool       False     Whether to apply stemming (Porter Stemmer)
lemmatize             bool       True      Whether to apply lemmatization (WordNet)
extra_stopwords       list/set   None      Additional custom stopwords to remove
keep_sentiment_words  bool       True      Whether to preserve sentiment-bearing stopwords

Returns

  • Type: list
  • Content: Fully processed list of tokens ready for analysis

Processing Pipeline

The function executes these steps in sequence (a sketch of how they compose follows the list):

  1. Text Cleaning (clean_text())

    • Convert to lowercase
    • Remove HTML tags, URLs
    • Filter special characters
    • Normalize whitespace
  2. Tokenization (tokenize_text())

    • Split text into individual word tokens
    • Handle punctuation and contractions
  3. Stopword Removal (remove_stopwords() - optional)

    • Remove common English stopwords
    • Add custom stopwords if provided
    • Preserve sentiment words if specified
  4. Stemming (stem_tokens() - optional)

    • Apply Porter Stemmer algorithm
    • Reduce words to root forms
  5. Lemmatization (lemmatize_tokens() - optional)

    • Apply WordNet Lemmatizer
    • Reduce words to dictionary base forms
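
Putting the steps together, the following is a minimal sketch of how the pipeline might compose the helper functions documented on their own pages (the helper names come from this wiki; their exact keyword arguments are assumptions here, not confirmed signatures):

def preprocess_text(text, remove_stops=True, stem=False, lemmatize=True,
                    extra_stopwords=None, keep_sentiment_words=True):
    # Steps 1-2: clean the raw string, then split it into word tokens
    tokens = tokenize_text(clean_text(text))
    # Step 3 (optional): drop stopwords, honoring custom and sentiment options
    if remove_stops:
        tokens = remove_stopwords(tokens,
                                  extra_stopwords=extra_stopwords,
                                  keep_sentiment_words=keep_sentiment_words)
    # Step 4 (optional): Porter stemming
    if stem:
        tokens = stem_tokens(tokens)
    # Step 5 (optional): WordNet lemmatization
    if lemmatize:
        tokens = lemmatize_tokens(tokens)
    return tokens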

Configuration Options

Default Configuration (Recommended for Sentiment Analysis):

# Optimized for sentiment analysis
result = preprocess_text(text)  # Uses lemmatization, keeps sentiment words

Speed-Optimized Configuration:

# Faster processing, lower accuracy
result = preprocess_text(text, stem=True, lemmatize=False)

Minimal Preprocessing:

# Only cleaning and tokenization
result = preprocess_text(text, remove_stops=False, lemmatize=False)

Domain-Specific Configuration:

# E-commerce specific preprocessing
custom_stops = ['amazon', 'product', 'item', 'seller']
result = preprocess_text(text, extra_stopwords=custom_stops)

Usage Examples

# Basic usage
raw_review = "This product is ABSOLUTELY amazing! I don't recommend it."
processed = preprocess_text(raw_review)
# Result: ['product', 'absolutely', 'amazing', 'not', 'recommend']

# With stemming instead of lemmatization
processed_stem = preprocess_text(raw_review, stem=True, lemmatize=False)
# Result: ['product', 'absolut', 'amaz', 'not', 'recommend']

# Without stopword removal
processed_all = preprocess_text(raw_review, remove_stops=False)
# Result: ['this', 'product', 'be', 'absolutely', 'amazing', 'i', 'do', 'not', 'recommend', 'it']

# Apply to DataFrame
df['processed_reviews'] = df['review_body'].apply(
    lambda x: preprocess_text(x, extra_stopwords=['amazon', 'product'])
)

Design Decisions

Default Parameters Rationale:

  • Lemmatization over stemming: Better readability and accuracy
  • Keep sentiment words: Essential for sentiment analysis
  • Remove stopwords: Reduces noise while preserving meaning
  • Flexible configuration: Allows customization for different use cases

Performance Considerations

Memory Usage:

  • Processing large datasets: consider chunking (see the sketch after this list)
  • Each function call creates intermediate objects (cleaned strings, token lists)
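
For example, a large column can be processed in fixed-size chunks so that only one slice of intermediate objects is alive at a time (pandas is assumed; the chunk size is illustrative):

# Process a large column in chunks to limit peak memory
chunk_size = 10_000
processed = []
for start in range(0, len(df), chunk_size):
    chunk = df['review_body'].iloc[start:start + chunk_size]
    processed.extend(preprocess_text(text) for text in chunk)
df['processed'] = processed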

Processing Speed:

  • Fastest: Clean + Tokenize only
  • Medium: With stemming
  • Slowest: With lemmatization (dictionary lookups)
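
To check this ordering on your own data, the standard-library timeit module can time each configuration (the sample string and repetition count below are illustrative):

import timeit

sample = "This product is absolutely amazing and I would buy it again!"
configs = {
    'clean + tokenize only': dict(remove_stops=False, stem=False, lemmatize=False),
    'with stemming':         dict(stem=True, lemmatize=False),
    'with lemmatization':    dict(stem=False, lemmatize=True),
}
for name, kwargs in configs.items():
    seconds = timeit.timeit(lambda: preprocess_text(sample, **kwargs), number=1000)
    print(f'{name}: {seconds:.2f}s per 1,000 calls')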

Prerequisites

All imports and NLTK downloads required by the individual functions:

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Required downloads
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
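
To avoid re-downloading on every run, a common idiom is to fetch each resource only when it is missing (nltk.data.find() raises LookupError for absent resources):

# Download each NLTK resource only if it is not already installed
for pkg, path in [('punkt', 'tokenizers/punkt'),
                  ('stopwords', 'corpora/stopwords'),
                  ('wordnet', 'corpora/wordnet'),
                  ('omw-1.4', 'corpora/omw-1.4')]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(pkg)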

Error Handling

  • Handles non-string inputs gracefully via clean_text()
  • Returns empty list for invalid inputs
  • Continues processing even if individual steps encounter issues
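
Assuming the documented fallback behavior, invalid inputs simply come back as an empty token list:

# Illustrative calls based on the behavior described above
preprocess_text(None)   # -> []
preprocess_text(3.14)   # -> []
preprocess_text("")     # -> []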

Common Patterns

Batch Processing:

# Process a whole column of reviews
processed_reviews = [preprocess_text(review) for review in df['review_body']]
df['processed'] = processed_reviews

Pipeline Customization:

# Create domain-specific preprocessor
def ecommerce_preprocess(text):
    return preprocess_text(
        text, 
        extra_stopwords=['amazon', 'product', 'item', 'seller'],
        keep_sentiment_words=True
    )
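
The custom preprocessor can then be applied across a DataFrame column just like the base function:

# Apply the domain-specific preprocessor to every review
df['processed'] = df['review_body'].apply(ecommerce_preprocess)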