
Function: preprocess_dataframe()

Purpose

Applies the complete text preprocessing pipeline to an entire DataFrame column, processing every review in a single call while preserving the original data structure.

Syntax

preprocess_dataframe(df, text_column='review_body', new_column='processed_text',
                    remove_stops=True, stem=False, lemmatize=True,
                    extra_stopwords=None, keep_sentiment_words=True)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| df | pandas.DataFrame | Required | DataFrame containing review data |
| text_column | str | 'review_body' | Name of the column containing raw text to process |
| new_column | str | 'processed_text' | Name of the new column storing processed tokens |
| remove_stops | bool | True | Whether to remove stopwords |
| stem | bool | False | Whether to apply stemming |
| lemmatize | bool | True | Whether to apply lemmatization |
| extra_stopwords | list/set or None | None | Additional custom stopwords |
| keep_sentiment_words | bool | True | Whether to preserve sentiment-bearing stopwords |

Returns

  • Type: pandas.DataFrame
  • Content: Copy of original DataFrame with two new columns:
    • new_column: List of processed tokens for each review
    • 'processed_text_string': Space-joined string version of tokens

Processing Workflow

  1. DataFrame Duplication: Creates copy to preserve original data
  2. Batch Processing: Applies preprocess_text() to each row via .apply()
  3. Token Storage: Stores processed tokens as lists in new column
  4. String Conversion: Creates readable string version for easy inspection
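
The sketch below illustrates this workflow. It is not the actual source: the preprocess_text() helper (documented separately) and the keyword pass-through are assumptions based on the signature and behavior described on this page.

# Illustrative sketch of the internal workflow (not the actual source)
def preprocess_dataframe_sketch(df, text_column='review_body',
                                new_column='processed_text', **kwargs):
    result = df.copy()                               # 1. duplicate DataFrame
    result[new_column] = result[text_column].apply(  # 2-3. per-row token lists
        lambda text: preprocess_text(text, **kwargs))
    result['processed_text_string'] = result[new_column].apply(' '.join)  # 4. string version
    return result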

Data Structure Output

Original DataFrame:

| review_id | review_body                  | star_rating |
|-----------|------------------------------|-------------|
| 1         | "This product is amazing!"   | 5           |
| 2         | "Not worth the money"        | 2           |

After Processing:

| review_id | review_body                | star_rating | processed_text            | processed_text_string |
|-----------|----------------------------|-------------|---------------------------|-----------------------|
| 1         | "This product is amazing!" | 5           | ['product', 'amazing']    | "product amazing"     |
| 2         | "Not worth the money"      | 2           | ['not', 'worth', 'money'] | "not worth money"     |

Usage Examples

# Basic usage on standard column
df_processed = preprocess_dataframe(df)

# Custom column names
df_processed = preprocess_dataframe(
    df, 
    text_column='review_headline', 
    new_column='processed_headline'
)

# Domain-specific preprocessing
ecommerce_stops = ['amazon', 'product', 'item', 'seller', 'purchase']
df_processed = preprocess_dataframe(
    df,
    extra_stopwords=ecommerce_stops,
    keep_sentiment_words=True
)

# Speed-optimized processing
df_processed = preprocess_dataframe(
    df,
    stem=True,
    lemmatize=False,
    remove_stops=True
)

# Minimal preprocessing for special analysis
df_processed = preprocess_dataframe(
    df,
    remove_stops=False,
    lemmatize=False,
    new_column='minimal_processed'
)

Memory and Performance Considerations

Memory Usage:

  • Creates full copy of DataFrame
  • Stores both token lists and string versions
  • Memory requirements: ~3x original DataFrame size
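
To measure the actual overhead on your own data, you can compare deep memory usage before and after processing:

# Compare deep memory usage before and after processing
before_mb = df.memory_usage(deep=True).sum() / 1e6
after_mb = preprocess_dataframe(df).memory_usage(deep=True).sum() / 1e6
print(f"Before: {before_mb:.1f} MB, after: {after_mb:.1f} MB ({after_mb / before_mb:.1f}x)")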

Processing Time:

  • Linear with DataFrame size
  • Each row processed independently
  • Time complexity: O(n) where n = number of rows
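
A rough way to sanity-check the linear scaling on your own data (absolute timings will vary with text length):

# Per-row cost should stay roughly constant as the row count grows
import time

for n in (1000, 2000, 4000):
    sample = df.head(n)
    start = time.perf_counter()
    preprocess_dataframe(sample)
    elapsed = time.perf_counter() - start
    print(f"{n} rows: {elapsed:.2f}s ({elapsed / n * 1000:.2f} ms/row)")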

Optimization Strategies:

# Process in chunks for large datasets to bound peak memory
import pandas as pd

def process_large_dataframe(df, chunk_size=1000):
    processed_chunks = []
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i+chunk_size]
        processed_chunks.append(preprocess_dataframe(chunk))
    return pd.concat(processed_chunks, ignore_index=True)

Error Handling and Robustness

Input Validation:

  • Handles missing values in text column gracefully
  • Converts input to DataFrame format automatically
  • Preserves non-text columns unchanged

Processing Errors:

  • Individual row failures don't crash entire operation
  • Invalid text entries return empty token lists
  • Maintains DataFrame structure integrity
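
A quick way to confirm the documented missing-value behavior on a toy example (the exact tokens depend on your preprocess_text() settings):

# NaN input should yield an empty token list, per the behavior described above
import pandas as pd

tiny = pd.DataFrame({'review_body': ['Great product!', None]})
out = preprocess_dataframe(tiny)
print(out['processed_text'].tolist())  # e.g. [['great', 'product'], []]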

Quality Assurance Features

Preview Functionality:

# Check preprocessing results
df_sample = df.head(5)
processed_sample = preprocess_dataframe(df_sample)
print(processed_sample[['review_body', 'processed_text_string']])

Validation Methods:

# Verify processing worked correctly
def validate_preprocessing(df_processed, original_col, processed_col):
    # Check for empty processed texts
    empty_count = df_processed[processed_col].apply(len).eq(0).sum()
    print(f"Empty processed texts: {empty_count}")
    
    # Show sample transformations
    sample = df_processed[[original_col, 'processed_text_string']].head()
    print(sample.to_string())
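
For example, against the default column names:

# Example call using the standard column names
validate_preprocessing(df_processed, 'review_body', 'processed_text')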

Integration with Analysis Pipeline

Common Next Steps:

# 1. Text analysis
df_processed = preprocess_dataframe(df)

# 2. Feature extraction for ML
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(df_processed['processed_text_string'])

# 3. Sentiment analysis (sentiment_analyzer is a user-supplied function)
df_processed['sentiment'] = df_processed['processed_text'].apply(sentiment_analyzer)

# 4. Topic modeling (topic_classifier is a user-supplied function)
df_processed['topic'] = df_processed['processed_text'].apply(topic_classifier)

Prerequisites

  • All requirements from preprocess_text() function
  • pandas library for DataFrame operations
  • Sufficient memory for DataFrame duplication

Best Practices

Column Naming:

  • Use descriptive names for processed columns
  • Include processing parameters in column names for tracking
  • Example: 'processed_text_lemma_no_stops'
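
A lightweight way to follow this convention; the helper below is hypothetical, not part of the library:

# Build a descriptive column name from the parameters used (hypothetical helper)
def make_column_name(stem=False, lemmatize=True, remove_stops=True):
    parts = ['processed_text']
    if lemmatize:
        parts.append('lemma')
    if stem:
        parts.append('stem')
    parts.append('no_stops' if remove_stops else 'with_stops')
    return '_'.join(parts)

print(make_column_name())  # 'processed_text_lemma_no_stops'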

Parameter Documentation:

# Document preprocessing parameters used
processing_params = {
    'remove_stops': True,
    'stem': False,
    'lemmatize': True,
    'extra_stopwords': ['amazon', 'product'],
    'keep_sentiment_words': True
}

# Store in DataFrame metadata or separate file for reproducibility
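
One possible way to do that, assuming nothing about the project layout: write the dictionary to a JSON file next to the processed data, or attach it via DataFrame.attrs (available since pandas 1.0, though not preserved by every pandas operation).

# Persist parameters as JSON for reproducibility (one possible approach)
import json

with open('preprocessing_params.json', 'w') as f:
    json.dump(processing_params, f, indent=2)

# Or attach to the DataFrame itself; note that .attrs is not preserved
# by every pandas operation
df_processed.attrs['preprocessing_params'] = processing_params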