[function] extract_sentiment_features() - P3chys/textmining GitHub Wiki

Function: `extract_sentiment_features()`

Purpose

Extracts numerical sentiment and stylistic features from review text using TextBlob and linguistic analysis, creating additional features for enhanced sentiment classification and customer behavior analysis.

Syntax

extract_sentiment_features(df)

Parameters

Parameter	Type	Default	Description
`df`	pandas.DataFrame	Required	DataFrame containing review data with a `review_body` column

Returns

Type: pandas.DataFrame
Content: DataFrame with 6 sentiment/stylistic feature columns

Dependencies

from textblob import TextBlob
import pandas as pd
import numpy as np

Output Feature Columns

Feature	Type	Range	Description
`polarity`	float	[-1.0, 1.0]	TextBlob sentiment polarity (-1=negative, 0=neutral, 1=positive)
`subjectivity`	float	[0.0, 1.0]	TextBlob subjectivity score (0=objective, 1=subjective)
`exclamation_count`	int	[0, ∞)	Number of exclamation marks (!) in text
`question_count`	int	[0, ∞)	Number of question marks (?) in text
`caps_ratio`	float	[0.0, 1.0]	Ratio of uppercase characters to total characters
`avg_word_length`	float	[1.0, ∞)	Average length, in characters, of whitespace-separated words (NaN when the text contains no words)

Algorithm Details

1. TextBlob Sentiment Analysis

blob = TextBlob(text)
polarity = blob.sentiment.polarity      # -1 to 1
subjectivity = blob.sentiment.subjectivity  # 0 to 1

2. Punctuation Analysis

exclamation_count = text.count('!')
question_count = text.count('?')

3. Capitalization Analysis

caps_ratio = sum(1 for c in text if c.isupper()) / len(text)

4. Word Length Analysis

avg_word_length = np.mean([len(word) for word in text.split()])
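To make steps 2 to 4 concrete, here is a minimal worked example on a single string. The helper name `stylistic_features` and the sample text are illustrative assumptions, not part of the wiki's code, and the mean is computed in plain Python rather than with `np.mean` to keep the sketch dependency-free. TextBlob scoring (step 1) is left out.

```python
def stylistic_features(text: str) -> dict:
    """Apply the punctuation, capitalization, and word-length formulas to one string.

    Hypothetical helper for illustration; assumes `text` contains at least one word.
    """
    words = text.split()  # whitespace split: punctuation stays attached to words
    return {
        "exclamation_count": text.count("!"),
        "question_count": text.count("?"),
        "caps_ratio": sum(1 for c in text if c.isupper()) / len(text),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }


sample = "GREAT product!! Why so cheap?"
features = stylistic_features(sample)
print(features)
```

The sample has 6 uppercase letters in 29 characters, so `caps_ratio` is 6/29. Because `split()` keeps punctuation attached, `product!!` counts as 9 characters and the five words average 5.0 characters.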

Feature Interpretations

Polarity

Positive values (0 to 1): Indicate positive sentiment language
Negative values (-1 to 0): Indicate negative sentiment language
Near zero: Neutral or mixed sentiment
Use case: Baseline sentiment score for comparison with star ratings

Subjectivity

High values (0.7-1.0): Personal opinions, emotions, judgments
Low values (0.0-0.3): Factual information, objective statements
Use case: Distinguish between emotional reviews and factual descriptions

Exclamation Count

High counts: Expression of strong emotion (positive or negative)
Zero counts: Calm, measured tone
Use case: Identify passionate customers and emotional intensity

Question Count

High counts: Confusion, uncertainty, rhetorical emphasis
Zero counts: Direct statements
Use case: Detect customer confusion or engagement patterns

Caps Ratio

High ratios (>0.1): Shouting, emphasis, frustration
Low ratios (<0.05): Normal writing style
Use case: Identify angry customers or excitement

Average Word Length

High average (>6 chars): Formal writing, detailed descriptions
Low average (<4 chars): Casual writing, basic language
Use case: Assess review sophistication and thoroughness
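The threshold bands listed in this section can be turned into coarse labels. The sketch below is one possible encoding, not part of the documented function: the helper names `band` and `describe_style` and the "medium" label for values between two thresholds are assumptions made for illustration.

```python
def band(value: float, low: float, high: float, inclusive: bool = False) -> str:
    """Return "high", "low", or "medium" for a value against two thresholds."""
    if value > high or (inclusive and value == high):
        return "high"
    if value < low or (inclusive and value == low):
        return "low"
    return "medium"


def describe_style(features: dict) -> dict:
    """Label one row of features using the bands from the interpretation notes."""
    return {
        # Subjectivity bands are closed ranges (0.7-1.0 and 0.0-0.3).
        "subjectivity": band(features["subjectivity"], 0.3, 0.7, inclusive=True),
        # Caps ratio and word length use strict comparisons (>0.1, <0.05; >6, <4).
        "caps_ratio": band(features["caps_ratio"], 0.05, 0.1),
        "avg_word_length": band(features["avg_word_length"], 4, 6),
    }


row = {"subjectivity": 0.9, "caps_ratio": 0.02, "avg_word_length": 5.2}
labels = describe_style(row)
print(labels)
```

The thresholds themselves are rules of thumb and are worth checking against a sample of real reviews before using the labels for segmentation.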

Data Preprocessing

Text Conversion

df['review_body'] = df['review_body'].astype(str)

Converts all review text to string type
Converts NaN values to the literal string "nan", which is then scored like any other review text
Ensures consistent text processing
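Because a missing review ends up as the text "nan", its features describe that three-letter word, not an absent review. One alternative, sketched here in plain Python, is to map missing values to an empty string before feature extraction. The helper name `clean_review_text` is hypothetical, and empty strings then need the guard discussed under Error Handling.

```python
import math


def clean_review_text(value) -> str:
    """Return review text as a string, mapping missing values to ""."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


# Plain str() keeps the literal text 'nan' for a missing float value.
print(repr(str(float("nan"))))
# The helper maps the same value to an empty string instead.
print(repr(clean_review_text(float("nan"))))
print(repr(clean_review_text("Great product")))
```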

Error Handling

Division by Zero

Issue: An empty text string has length 0, so the caps_ratio division fails
Behavior: Raises ZeroDivisionError
Mitigation: Consider adding a length check before the ratio calculation
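A minimal sketch of such a length check follows. The function names are hypothetical, and returning 0.0 for empty input is an assumption; returning NaN would be an equally defensible choice. The same guard is applied to the word-length average, since text with no words leaves nothing to average.

```python
def safe_caps_ratio(text: str) -> float:
    """Uppercase-to-total character ratio; 0.0 for empty text (assumed default)."""
    if not text:
        return 0.0
    return sum(1 for c in text if c.isupper()) / len(text)


def safe_avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated words; 0.0 when there are no words."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


print(safe_caps_ratio(""))           # empty input no longer raises
print(safe_caps_ratio("AbCd"))       # 2 uppercase out of 4 characters
print(safe_avg_word_length("   "))   # whitespace only: no words
```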

Missing Dependencies

Issue: TextBlob may not be installed
Behavior: ImportError (ModuleNotFoundError) when `textblob` is imported
Mitigation: Install via `pip install textblob`
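If a clearer message than a raw traceback is wanted, the dependency can be checked up front. This is a small sketch using the standard library; the helper name `textblob_available` is an assumption, not something the project defines.

```python
import importlib.util


def textblob_available() -> bool:
    """Return True if the `textblob` package can be found in this environment."""
    return importlib.util.find_spec("textblob") is not None


if not textblob_available():
    print("textblob is not installed; run: pip install textblob")
```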

Performance Considerations

Time Complexity: O(n × m) where n=number of reviews, m=average text length
Memory Usage: Creates complete feature matrix in memory
TextBlob Overhead: Each TextBlob object creation adds processing time
Optimization: Consider batch processing for very large datasets
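One way to approach batch processing is to slice the reviews into fixed-size chunks and handle each chunk before moving on. The sketch below is an illustration under assumptions: `iter_batches` is a hypothetical helper, and the per-review work is a stand-in for the real feature extraction. Batching does not change the O(n × m) cost; it only bounds how much intermediate data is held at once when each batch's results are written out.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["a!", "b", "c!!", "d", "e"]
batch_totals: List[int] = []
for batch in iter_batches(texts, batch_size=2):
    # Stand-in for real feature extraction: count exclamation marks per batch.
    batch_totals.append(sum(t.count("!") for t in batch))

print(batch_totals)
```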

Usage Example

# Extract sentiment features
sentiment_df = extract_sentiment_features(df_processed)

# Preview features
print(sentiment_df.head())
#      polarity  subjectivity  exclamation_count  question_count  caps_ratio  avg_word_length
# 0    0.1500    0.4000        2                  0               0.0234      4.5600
# 1   -0.2500    0.7500        0                  1               0.0100      5.2300
# 2    0.8000    0.9000        3                  0               0.0500      3.8900

# Combine with original DataFrame (axis=1 aligns rows by index, so both frames must share the same index)
df_enhanced = pd.concat([df_processed, sentiment_df], axis=1)

# Use for analysis
high_emotion_reviews = df_enhanced[df_enhanced['exclamation_count'] > 2]
confused_customers = df_enhanced[df_enhanced['question_count'] > 1]

Integration Notes

Feature Engineering: Often combined with other features for ML models
Correlation Analysis: Compare TextBlob polarity with star ratings
Customer Segmentation: Use subjectivity and caps_ratio to identify customer types
Quality Assessment: Use avg_word_length to filter detailed vs. brief reviews
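For the correlation analysis, a Pearson coefficient between polarity and star rating is a reasonable starting point. The sketch below computes it in plain Python. The function name `pearson_r` and the three sample values are made up for illustration and are not results from the project's data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: polarity rises in step with the star rating.
polarity = [-0.5, 0.0, 0.5]
stars = [1, 3, 5]
r = pearson_r(polarity, stars)
print(r)
```

With a DataFrame, pandas' `Series.corr` gives the same coefficient, assuming the ratings live in a column of the combined frame.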

Validation Recommendations

Polarity Accuracy: Compare TextBlob polarity with manual sentiment labels
Subjectivity Calibration: Verify subjectivity scores against review content
Outlier Detection: Check for extreme values in punctuation and caps features
Business Alignment: Ensure features correlate with business-relevant outcomes
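Two of these checks are easy to sketch in plain Python: agreement between polarity and manual labels, and an interquartile-range check for extreme counts. All names, the neutral band of ±0.1, the 1.5 × IQR rule, and the sample data are assumptions chosen for illustration.

```python
import statistics
from typing import List, Sequence


def polarity_label(polarity: float, neutral_band: float = 0.1) -> str:
    """Convert a polarity score to a label; the +/-0.1 neutral band is assumed."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def agreement_rate(polarities: Sequence[float], manual_labels: Sequence[str]) -> float:
    """Fraction of reviews where the polarity-derived label matches the manual one."""
    if len(polarities) != len(manual_labels) or not manual_labels:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(polarity_label(p) == m for p, m in zip(polarities, manual_labels))
    return matches / len(manual_labels)


def flag_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


rate = agreement_rate([0.6, -0.4, 0.05, 0.3],
                      ["positive", "negative", "neutral", "negative"])
extreme = flag_outliers([0, 1, 0, 2, 1, 0, 1, 40])  # e.g. exclamation counts
print(rate, extreme)
```

Here three of the four sample labels agree, and only the count of 40 is flagged as extreme.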