
Function: preprocess_dataframe()

Purpose

Applies the complete text preprocessing pipeline to an entire DataFrame column, processing every review in a single call while preserving the original data structure.

Syntax

preprocess_dataframe(df, text_column='review_body', new_column='processed_text',
                    remove_stops=True, stem=False, lemmatize=True,
                    extra_stopwords=None, keep_sentiment_words=True)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| df | pandas.DataFrame | Required | DataFrame containing review data |
| text_column | str | 'review_body' | Name of the column containing raw text to process |
| new_column | str | 'processed_text' | Name of the new column storing processed tokens |
| remove_stops | bool | True | Whether to remove stopwords |
| stem | bool | False | Whether to apply stemming |
| lemmatize | bool | True | Whether to apply lemmatization |
| extra_stopwords | list/set or None | None | Additional custom stopwords |
| keep_sentiment_words | bool | True | Whether to preserve sentiment-bearing stopwords |

Returns

  • Type: pandas.DataFrame
  • Content: Copy of original DataFrame with two new columns:
    • new_column: List of processed tokens for each review
    • 'processed_text_string': Space-joined string version of tokens

Processing Workflow

  1. DataFrame Duplication: Creates copy to preserve original data
  2. Batch Processing: Applies preprocess_text() to each row via .apply()
  3. Token Storage: Stores processed tokens as lists in new column
  4. String Conversion: Creates readable string version for easy inspection
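
The sketch below illustrates this workflow. It is not the actual source: the preprocess_text() helper (documented separately) and the keyword pass-through are assumptions based on the signature and behavior described on this page.

# Illustrative sketch of the internal workflow (not the actual source)
def preprocess_dataframe_sketch(df, text_column='review_body',
                                new_column='processed_text', **kwargs):
    result = df.copy()                               # 1. duplicate DataFrame
    result[new_column] = result[text_column].apply(  # 2-3. per-row token lists
        lambda text: preprocess_text(text, **kwargs))
    result['processed_text_string'] = result[new_column].apply(' '.join)  # 4. string version
    return result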

Data Structure Output

Original DataFrame:

| review_id | review_body                  | star_rating |
|-----------|------------------------------|-------------|
| 1         | "This product is amazing!"   | 5           |
| 2         | "Not worth the money"        | 2           |

After Processing:

| review_id | review_body                | star_rating | processed_text            | processed_text_string |
|-----------|----------------------------|-------------|---------------------------|-----------------------|
| 1         | "This product is amazing!" | 5           | ['product', 'amazing']    | "product amazing"     |
| 2         | "Not worth the money"      | 2           | ['not', 'worth', 'money'] | "not worth money"     |

Usage Examples

# Basic usage on standard column
df_processed = preprocess_dataframe(df)

# Custom column names
df_processed = preprocess_dataframe(
    df, 
    text_column='review_headline', 
    new_column='processed_headline'
)

# Domain-specific preprocessing
ecommerce_stops = ['amazon', 'product', 'item', 'seller', 'purchase']
df_processed = preprocess_dataframe(
    df,
    extra_stopwords=ecommerce_stops,
    keep_sentiment_words=True
)

# Speed-optimized processing
df_processed = preprocess_dataframe(
    df,
    stem=True,
    lemmatize=False,
    remove_stops=True
)

# Minimal preprocessing for special analysis
df_processed = preprocess_dataframe(
    df,
    remove_stops=False,
    lemmatize=False,
    new_column='minimal_processed'
)

Memory and Performance Considerations

Memory Usage:

  • Creates full copy of DataFrame
  • Stores both token lists and string versions
  • Memory requirements: ~3x original DataFrame size
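
To measure the actual overhead on your own data, you can compare deep memory usage before and after processing:

# Compare deep memory usage before and after processing
before_mb = df.memory_usage(deep=True).sum() / 1e6
after_mb = preprocess_dataframe(df).memory_usage(deep=True).sum() / 1e6
print(f"Before: {before_mb:.1f} MB, after: {after_mb:.1f} MB ({after_mb / before_mb:.1f}x)")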

Processing Time:

  • Linear with DataFrame size
  • Each row processed independently
  • Time complexity: O(n) where n = number of rows
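
A rough way to sanity-check the linear scaling on your own data (absolute timings will vary with text length):

# Per-row cost should stay roughly constant as the row count grows
import time

for n in (1000, 2000, 4000):
    sample = df.head(n)
    start = time.perf_counter()
    preprocess_dataframe(sample)
    elapsed = time.perf_counter() - start
    print(f"{n} rows: {elapsed:.2f}s ({elapsed / n * 1000:.2f} ms/row)")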

Optimization Strategies:

# Process in chunks for large datasets to bound peak memory
import pandas as pd

def process_large_dataframe(df, chunk_size=1000):
    processed_chunks = []
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i+chunk_size]
        processed_chunks.append(preprocess_dataframe(chunk))
    return pd.concat(processed_chunks, ignore_index=True)

Error Handling and Robustness

Input Validation:

  • Handles missing values in text column gracefully
  • Converts input to DataFrame format automatically
  • Preserves non-text columns unchanged

Processing Errors:

  • Individual row failures don't crash entire operation
  • Invalid text entries return empty token lists
  • Maintains DataFrame structure integrity
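
A quick way to confirm the documented missing-value behavior on a toy example (the exact tokens depend on your preprocess_text() settings):

# NaN input should yield an empty token list, per the behavior described above
import pandas as pd

tiny = pd.DataFrame({'review_body': ['Great product!', None]})
out = preprocess_dataframe(tiny)
print(out['processed_text'].tolist())  # e.g. [['great', 'product'], []]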

Quality Assurance Features

Preview Functionality:

# Check preprocessing results
df_sample = df.head(5)
processed_sample = preprocess_dataframe(df_sample)
print(processed_sample[['review_body', 'processed_text_string']])

Validation Methods:

# Verify processing worked correctly
def validate_preprocessing(df_processed, original_col, processed_col):
    # Check for empty processed texts
    empty_count = df_processed[processed_col].apply(len).eq(0).sum()
    print(f"Empty processed texts: {empty_count}")
    
    # Show sample transformations
    sample = df_processed[[original_col, 'processed_text_string']].head()
    print(sample.to_string())
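
For example, against the default column names:

# Example call using the standard column names
validate_preprocessing(df_processed, 'review_body', 'processed_text')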

Integration with Analysis Pipeline

Common Next Steps:

# 1. Text analysis
df_processed = preprocess_dataframe(df)

# 2. Feature extraction for ML
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(df_processed['processed_text_string'])

# 3. Sentiment analysis (sentiment_analyzer is a user-supplied function)
df_processed['sentiment'] = df_processed['processed_text'].apply(sentiment_analyzer)

# 4. Topic modeling (topic_classifier is a user-supplied function)
df_processed['topic'] = df_processed['processed_text'].apply(topic_classifier)

Prerequisites

  • All requirements from preprocess_text() function
  • pandas library for DataFrame operations
  • Sufficient memory for DataFrame duplication

Best Practices

Column Naming:

  • Use descriptive names for processed columns
  • Include processing parameters in column names for tracking
  • Example: 'processed_text_lemma_no_stops'
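
A lightweight way to follow this convention; the helper below is hypothetical, not part of the library:

# Build a descriptive column name from the parameters used (hypothetical helper)
def make_column_name(stem=False, lemmatize=True, remove_stops=True):
    parts = ['processed_text']
    if lemmatize:
        parts.append('lemma')
    if stem:
        parts.append('stem')
    parts.append('no_stops' if remove_stops else 'with_stops')
    return '_'.join(parts)

print(make_column_name())  # 'processed_text_lemma_no_stops'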

Parameter Documentation:

# Document preprocessing parameters used
processing_params = {
    'remove_stops': True,
    'stem': False,
    'lemmatize': True,
    'extra_stopwords': ['amazon', 'product'],
    'keep_sentiment_words': True
}

# Store in DataFrame metadata or separate file for reproducibility
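
One possible way to do that, assuming nothing about the project layout: write the dictionary to a JSON file next to the processed data, or attach it via DataFrame.attrs (available since pandas 1.0, though not preserved by every pandas operation).

# Persist parameters as JSON for reproducibility (one possible approach)
import json

with open('preprocessing_params.json', 'w') as f:
    json.dump(processing_params, f, indent=2)

# Or attach to the DataFrame itself; note that .attrs is not preserved
# by every pandas operation
df_processed.attrs['preprocessing_params'] = processing_params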