# Function: preprocess_dataframe()
## Purpose
Applies the complete text preprocessing pipeline to an entire DataFrame column, efficiently processing multiple reviews at once while preserving the original data structure.
## Syntax

```python
preprocess_dataframe(df, text_column='review_body', new_column='processed_text',
                     remove_stops=True, stem=False, lemmatize=True,
                     extra_stopwords=None, keep_sentiment_words=True)
```
## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` | `pandas.DataFrame` | Required | DataFrame containing review data |
| `text_column` | `str` | `'review_body'` | Name of the column containing raw text to process |
| `new_column` | `str` | `'processed_text'` | Name of the new column storing processed tokens |
| `remove_stops` | `bool` | `True` | Whether to remove stopwords |
| `stem` | `bool` | `False` | Whether to apply stemming |
| `lemmatize` | `bool` | `True` | Whether to apply lemmatization |
| `extra_stopwords` | `list`/`set` or `None` | `None` | Additional custom stopwords |
| `keep_sentiment_words` | `bool` | `True` | Whether to preserve sentiment-bearing stopwords |
## Returns

- **Type:** `pandas.DataFrame`
- **Content:** Copy of the original DataFrame with two new columns:
  - `new_column`: list of processed tokens for each review
  - `'processed_text_string'`: space-joined string version of the tokens
## Processing Workflow

1. **DataFrame Duplication:** creates a copy to preserve the original data
2. **Batch Processing:** applies `preprocess_text()` to each row via `.apply()`
3. **Token Storage:** stores processed tokens as lists in the new column
4. **String Conversion:** creates a readable string version for easy inspection
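A minimal sketch of what this workflow might look like internally, assuming `preprocess_text()` accepts the same keyword arguments and returns a list of token strings (the actual implementation in the repository may differ):

```python
def preprocess_dataframe(df, text_column='review_body', new_column='processed_text',
                         remove_stops=True, stem=False, lemmatize=True,
                         extra_stopwords=None, keep_sentiment_words=True):
    # 1. Duplicate the DataFrame so the original data stays untouched
    df = df.copy()

    # 2./3. Apply preprocess_text() row by row; non-string entries (e.g. NaN)
    #       become empty token lists, stored as lists in the new column
    df[new_column] = df[text_column].apply(
        lambda text: preprocess_text(
            text,
            remove_stops=remove_stops, stem=stem, lemmatize=lemmatize,
            extra_stopwords=extra_stopwords,
            keep_sentiment_words=keep_sentiment_words,
        ) if isinstance(text, str) else []
    )

    # 4. Space-joined string version for easy inspection
    df['processed_text_string'] = df[new_column].apply(' '.join)
    return df
```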
## Data Structure Output

**Original DataFrame:**

| review_id | review_body | star_rating |
|---|---|---|
| 1 | "This product is amazing!" | 5 |
| 2 | "Not worth the money" | 2 |

**After Processing:**

| review_id | review_body | star_rating | processed_text | processed_text_string |
|---|---|---|---|---|
| 1 | "This product is amazing!" | 5 | ['product', 'amazing'] | "product amazing" |
| 2 | "Not worth the money" | 2 | ['not', 'worth', 'money'] | "not worth money" |
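The transformation above can be reproduced with a few lines (assuming `preprocess_dataframe` is importable from the project; the exact import path is not specified here):

```python
import pandas as pd

df = pd.DataFrame({
    'review_id': [1, 2],
    'review_body': ["This product is amazing!", "Not worth the money"],
    'star_rating': [5, 2],
})

df_processed = preprocess_dataframe(df)
print(df_processed[['review_body', 'processed_text', 'processed_text_string']])
```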
## Usage Examples

```python
# Basic usage on the standard column
df_processed = preprocess_dataframe(df)

# Custom column names
df_processed = preprocess_dataframe(
    df,
    text_column='review_headline',
    new_column='processed_headline',
)

# Domain-specific preprocessing
ecommerce_stops = ['amazon', 'product', 'item', 'seller', 'purchase']
df_processed = preprocess_dataframe(
    df,
    extra_stopwords=ecommerce_stops,
    keep_sentiment_words=True,
)

# Speed-optimized processing (stemming is faster than lemmatization)
df_processed = preprocess_dataframe(
    df,
    stem=True,
    lemmatize=False,
    remove_stops=True,
)

# Minimal preprocessing for special analysis
df_processed = preprocess_dataframe(
    df,
    remove_stops=False,
    lemmatize=False,
    new_column='minimal_processed',
)
```
## Memory and Performance Considerations

**Memory Usage:**
- Creates a full copy of the DataFrame
- Stores both token lists and string versions
- Memory requirement is roughly 3x the original DataFrame size

**Processing Time:**
- Scales linearly with DataFrame size
- Each row is processed independently
- Time complexity: O(n), where n is the number of rows
**Optimization Strategies:**

```python
import pandas as pd

# Process in chunks to bound peak memory on large datasets
def process_large_dataframe(df, chunk_size=1000):
    processed_chunks = []
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i + chunk_size]
        processed_chunk = preprocess_dataframe(chunk)
        processed_chunks.append(processed_chunk)
    return pd.concat(processed_chunks, ignore_index=True)
```
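Usage is a drop-in replacement for a single large call; `df_large` and the chunk size below are illustrative, not values from the source:

```python
df_large_processed = process_large_dataframe(df_large, chunk_size=5000)
```

Chunking bounds the extra working memory at any moment to one chunk's copy, although the processed results still accumulate in memory before the final `pd.concat`.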
## Error Handling and Robustness

**Input Validation:**
- Handles missing values in the text column gracefully
- Converts input to DataFrame format automatically
- Preserves non-text columns unchanged

**Processing Errors:**
- Individual row failures don't crash the entire operation (see the sketch below)
- Invalid text entries return empty token lists
- Maintains DataFrame structure integrity
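A minimal sketch of how this kind of per-row robustness can be implemented, assuming `preprocess_text()` may raise on malformed input; the `safe_preprocess` wrapper here is illustrative, not part of the library:

```python
def safe_preprocess(text, **kwargs):
    # Missing or non-string entries (e.g. NaN) yield an empty token list
    if not isinstance(text, str):
        return []
    try:
        return preprocess_text(text, **kwargs)
    except Exception:
        # A single bad row produces an empty list instead of crashing the batch
        return []

df['processed_text'] = df['review_body'].apply(safe_preprocess)
```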
## Quality Assurance Features

**Preview Functionality:**

```python
# Check preprocessing results on a small sample
df_sample = df.head(5)
processed_sample = preprocess_dataframe(df_sample)
print(processed_sample[['review_body', 'processed_text_string']])
```
**Validation Methods:**

```python
# Verify processing worked correctly
def validate_preprocessing(df_processed, original_col, processed_col):
    # Check for empty processed texts
    empty_count = df_processed[processed_col].apply(len).eq(0).sum()
    print(f"Empty processed texts: {empty_count}")
    # Show sample transformations
    sample = df_processed[[original_col, 'processed_text_string']].head()
    print(sample.to_string())
```
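For example:

```python
validate_preprocessing(df_processed, 'review_body', 'processed_text')
```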
## Integration with Analysis Pipeline

**Common Next Steps:**

```python
# 1. Text preprocessing
df_processed = preprocess_dataframe(df)

# 2. Feature extraction for ML
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(df_processed['processed_text_string'])

# 3. Sentiment analysis (sentiment_analyzer is a placeholder for your own function)
df_processed['sentiment'] = df_processed['processed_text'].apply(sentiment_analyzer)

# 4. Topic modeling (topic_classifier is a placeholder for your own function)
df_processed['topic'] = df_processed['processed_text'].apply(topic_classifier)
```
## Prerequisites

- All requirements from the `preprocess_text()` function
- pandas library for DataFrame operations
- Sufficient memory for DataFrame duplication
## Best Practices

**Column Naming:**
- Use descriptive names for processed columns
- Include processing parameters in column names for tracking
- Example: `'processed_text_lemma_no_stops'`
**Parameter Documentation:**

```python
# Document the preprocessing parameters used
processing_params = {
    'remove_stops': True,
    'stem': False,
    'lemmatize': True,
    'extra_stopwords': ['amazon', 'product'],
    'keep_sentiment_words': True,
}
# Store in DataFrame metadata or a separate file for reproducibility
```
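One minimal way to do this (a sketch, not a convention from the repository) is to attach the parameters to the DataFrame via pandas' `attrs` and write a JSON sidecar file:

```python
import json

# Attach to the DataFrame; note that attrs does not survive every pandas operation
df_processed.attrs['preprocessing'] = processing_params

# Persist alongside the processed data for reproducibility
with open('processing_params.json', 'w') as f:
    json.dump(processing_params, f, indent=2)
```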