[function] extract_sentiment_features() - P3chys/textmining GitHub Wiki
Function: extract_sentiment_features()
Purpose
Extracts six numerical features from review text: two sentiment scores from TextBlob (polarity and subjectivity) and four stylistic measures (exclamation mark count, question mark count, uppercase ratio, and average word length). The features supplement sentiment classification and customer behavior analysis.
Syntax
extract_sentiment_features(df)
Parameters
Parameter | Type | Default | Description |
df | pandas.DataFrame | Required | DataFrame containing review data with 'review_body' column |
Returns
- Type: pandas.DataFrame
- Content: DataFrame with 6 sentiment/stylistic feature columns
Dependencies
from textblob import TextBlob
import pandas as pd
import numpy as np
Output Feature Columns
Feature | Type | Range | Description |
polarity | float | [-1.0, 1.0] | TextBlob sentiment polarity (-1=negative, 0=neutral, 1=positive) |
subjectivity | float | [0.0, 1.0] | TextBlob subjectivity score (0=objective, 1=subjective) |
exclamation_count | int | [0, ∞) | Number of exclamation marks (!) in text |
question_count | int | [0, ∞) | Number of question marks (?) in text |
caps_ratio | float | [0.0, 1.0] | Ratio of uppercase characters to total characters |
avg_word_length | float | [1.0, ∞) | Average word length in characters |
Algorithm Details
1. TextBlob Sentiment Analysis
blob = TextBlob(text)
polarity = blob.sentiment.polarity # -1 to 1
subjectivity = blob.sentiment.subjectivity # 0 to 1
2. Punctuation Analysis
exclamation_count = text.count('!')
question_count = text.count('?')
3. Capitalization Analysis
caps_ratio = sum(1 for c in text if c.isupper()) / len(text)
4. Word Length Analysis
avg_word_length = np.mean([len(word) for word in text.split()])
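As a worked illustration of steps 2 to 4, the sketch below applies the same expressions to a made-up review string. The sample text is an assumption, and plain `sum`/`len` stands in for `np.mean` so the snippet needs no third-party packages. It also shows that `text.split()` keeps punctuation attached to words, so "product!!" counts as nine characters.

```python
# Hypothetical sample review, used only for illustration.
text = "GREAT product!! Would buy again?"

# Step 2: punctuation counts
exclamation_count = text.count('!')   # 2
question_count = text.count('?')      # 1

# Step 3: uppercase characters divided by all characters (spaces included)
upper_chars = sum(1 for c in text if c.isupper())  # G, R, E, A, T, W -> 6
caps_ratio = upper_chars / len(text)                # 6 / 32 = 0.1875

# Step 4: whitespace-split words keep their punctuation
word_lengths = [len(word) for word in text.split()]      # [5, 9, 5, 3, 6]
avg_word_length = sum(word_lengths) / len(word_lengths)  # 28 / 5 = 5.6

print(exclamation_count, question_count, caps_ratio, avg_word_length)
```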
Feature Interpretations
Polarity
- Positive values (0 to 1): Indicate positive sentiment language
- Negative values (-1 to 0): Indicate negative sentiment language
- Near zero: Neutral or mixed sentiment
- Use case: Baseline sentiment score for comparison with star ratings
Subjectivity
- High values (0.7-1.0): Personal opinions, emotions, judgments
- Low values (0.0-0.3): Factual information, objective statements
- Use case: Distinguish between emotional reviews and factual descriptions
Exclamation Count
- High counts: Expression of strong emotion (positive or negative)
- Zero counts: Calm, measured tone
- Use case: Identify passionate customers and emotional intensity
Question Count
- High counts: Confusion, uncertainty, rhetorical emphasis
- Zero counts: Direct statements
- Use case: Detect customer confusion or engagement patterns
Caps Ratio
- High ratios (>0.1): Shouting, emphasis, frustration
- Low ratios (<0.05): Normal writing style
- Use case: Identify angry customers or excitement
Average Word Length
- Long words (>6 chars): Formal writing, detailed descriptions
- Short words (<4 chars): Casual writing, basic language
- Use case: Assess review sophistication and thoroughness
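The thresholds quoted in these interpretations can be turned into simple rule-based tags. The following is a minimal sketch and not part of `extract_sentiment_features()`: the function name `tag_review` and the tag labels are hypothetical, and only the cut-offs stated above are used.

```python
def tag_review(features):
    """Return descriptive tags for one row of feature values.

    `features` is a mapping keyed by the feature column names.
    The function name and tag labels are illustrative assumptions.
    """
    tags = []
    # Subjectivity: high (0.7-1.0) vs. low (0.0-0.3)
    if features["subjectivity"] >= 0.7:
        tags.append("opinionated")
    elif features["subjectivity"] <= 0.3:
        tags.append("factual")
    # Caps ratio: above 0.1 suggests shouting or emphasis
    if features["caps_ratio"] > 0.1:
        tags.append("shouting")
    # Average word length: long (>6) vs. short (<4)
    if features["avg_word_length"] > 6:
        tags.append("formal")
    elif features["avg_word_length"] < 4:
        tags.append("casual")
    return tags


row = {"polarity": 0.8, "subjectivity": 0.9, "exclamation_count": 3,
       "question_count": 0, "caps_ratio": 0.05, "avg_word_length": 3.89}
print(tag_review(row))  # ['opinionated', 'casual']
```

Because the function only uses key lookups, it should also accept a pandas row, for example through `df.apply(tag_review, axis=1)` on a DataFrame that holds the feature columns.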
Data Preprocessing
Text Conversion
df['review_body'] = df['review_body'].astype(str)
- Converts all review text to string type
- Converts NaN values to the literal string "nan", so missing reviews are scored as the word "nan" rather than skipped
- Ensures consistent text processing
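If missing reviews should not be scored as the word "nan", one option is to map them to an empty string before feature extraction. The helper below is a sketch of that idea; `normalize_review_text` is a hypothetical name and is not part of the documented function.

```python
import math


def normalize_review_text(value):
    """Return review text as str, mapping None and float NaN to ''."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


print([normalize_review_text(v) for v in ["Great!", None, float("nan"), 42]])
# ['Great!', '', '', '42']
```

With pandas this could be applied as `df['review_body'] = df['review_body'].map(normalize_review_text)` in place of `astype(str)`. Empty strings then need the length check discussed under Error Handling.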
Error Handling
Division by Zero
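- Issue: Empty text strings cause division by zero in caps_ratio, because len(text) is 0
- Behavior: Raises ZeroDivisionError. For text with no words, np.mean receives an empty list in avg_word_length and returns nan with a RuntimeWarning instead of raising
- Mitigation: Consider adding a length check before the ratio calculation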
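A length check of the kind suggested above might look like the following sketch. The helper names and the choice of 0.0 as the fallback value are assumptions; returning `float('nan')` instead would be equally defensible if empty reviews should be excluded from averages. The second helper covers the analogous case for avg_word_length, where text without any words produces an empty list.

```python
def safe_caps_ratio(text):
    """Uppercase-to-total character ratio, or 0.0 for empty text."""
    if not text:
        return 0.0
    return sum(1 for c in text if c.isupper()) / len(text)


def safe_avg_word_length(text):
    """Mean length of whitespace-separated words, or 0.0 if there are none."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


print(safe_caps_ratio(""), safe_caps_ratio("OK"))                  # 0.0 1.0
print(safe_avg_word_length("   "), safe_avg_word_length("a bcd"))  # 0.0 2.0
```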
Missing Dependencies
- Issue: TextBlob may not be installed
- Behavior: ImportError on function call
- Mitigation: Install via
pip install textblob
Performance Considerations
- Time Complexity: O(n × m) where n=number of reviews, m=average text length
- Memory Usage: Creates complete feature matrix in memory
- TextBlob Overhead: Each TextBlob object creation adds processing time
- Optimization: Consider batch processing for very large datasets
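One way to batch the work is to process the DataFrame in fixed-size row slices and concatenate the results. The generator below is a sketch; `batch_bounds` and the batch size are hypothetical, and whether batching helps depends on where the time is actually spent.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, stop) row positions covering range(n_rows) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


print(list(batch_bounds(10, 4)))  # [(0, 4), (4, 8), (8, 10)]

# Possible use with pandas (not executed here):
# parts = [extract_sentiment_features(df.iloc[a:b])
#          for a, b in batch_bounds(len(df), 10_000)]
# sentiment_df = pd.concat(parts)
```

Slicing with `df.iloc` keeps each batch's original index, so the concatenated result lines up with the source rows.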
Usage Example
# Extract sentiment features
sentiment_df = extract_sentiment_features(df_processed)
# Preview features
print(sentiment_df.head())
# polarity subjectivity exclamation_count question_count caps_ratio avg_word_length
# 0 0.1500 0.4000 2 0 0.0234 4.5600
# 1 -0.2500 0.7500 0 1 0.0100 5.2300
# 2 0.8000 0.9000 3 0 0.0500 3.8900
# Combine with original DataFrame
df_enhanced = pd.concat([df_processed, sentiment_df], axis=1)
# Use for analysis
high_emotion_reviews = df_enhanced[df_enhanced['exclamation_count'] > 2]
confused_customers = df_enhanced[df_enhanced['question_count'] > 1]
Integration Notes
- Feature Engineering: Often combined with other features for ML models
- Correlation Analysis: Compare TextBlob polarity with star ratings
- Customer Segmentation: Use subjectivity and caps_ratio to identify customer types
- Quality Assessment: Use avg_word_length to filter detailed vs. brief reviews
Validation Recommendations
- Polarity Accuracy: Compare TextBlob polarity with manual sentiment labels
- Subjectivity Calibration: Verify subjectivity scores against review content
- Outlier Detection: Check for extreme values in punctuation and caps features
- Business Alignment: Ensure features correlate with business-relevant outcomes
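For the outlier check, a common rule of thumb flags values above Q3 + 1.5 × IQR. The sketch below applies it to a made-up list of exclamation counts using only the standard library; the function name, the sample data, and the 1.5 multiplier are assumptions rather than part of the documented pipeline.

```python
import statistics


def upper_outliers(values, k=1.5):
    """Return values above Q3 + k * IQR (Tukey's upper fence)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    fence = q3 + k * (q3 - q1)
    return [v for v in values if v > fence]


exclamation_counts = [0, 0, 1, 0, 2, 1, 0, 40]  # made-up sample
print(upper_outliers(exclamation_counts))  # [40]
```

The same function could be pointed at `caps_ratio` values; reviews it flags are worth reading manually before deciding whether to cap, drop, or keep them.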