[function] extract_sentiment_features() - P3chys/textmining GitHub Wiki

Function: extract_sentiment_features()

Purpose

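Extracts numerical sentiment and stylistic features from review text. TextBlob supplies polarity and subjectivity scores, and simple string statistics (punctuation counts, capitalization ratio, average word length) supply the stylistic features. The resulting columns are intended as additional inputs for sentiment classification and customer behavior analysis.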

Syntax

extract_sentiment_features(df)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| df | pandas.DataFrame | Required | DataFrame containing review data with a 'review_body' column |

Returns

  • Type: pandas.DataFrame
  • Content: DataFrame with the 6 sentiment/stylistic feature columns listed under Output Feature Columns

Dependencies

from textblob import TextBlob
import pandas as pd
import numpy as np

Output Feature Columns

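| Feature | Type | Range | Description |
|---------|------|-------|-------------|
| polarity | float | [-1.0, 1.0] | TextBlob sentiment polarity (-1=negative, 0=neutral, 1=positive) |
| subjectivity | float | [0.0, 1.0] | TextBlob subjectivity score (0=objective, 1=subjective) |
| exclamation_count | int | [0, ∞) | Number of exclamation marks (!) in text |
| question_count | int | [0, ∞) | Number of question marks (?) in text |
| caps_ratio | float | [0.0, 1.0] | Ratio of uppercase characters to total characters (spaces and punctuation included in the total) |
| avg_word_length | float | [1.0, ∞) | Average length in characters of whitespace-separated words; NaN when the text contains no words (e.g. whitespace only) |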

Algorithm Details

1. TextBlob Sentiment Analysis

blob = TextBlob(text)
polarity = blob.sentiment.polarity      # -1 to 1
subjectivity = blob.sentiment.subjectivity  # 0 to 1

2. Punctuation Analysis

exclamation_count = text.count('!')
question_count = text.count('?')

3. Capitalization Analysis

caps_ratio = sum(1 for c in text if c.isupper()) / len(text)

4. Word Length Analysis

avg_word_length = np.mean([len(word) for word in text.split()])
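
As a worked example of steps 2 to 4, the snippet below applies the same expressions to a single review string. The sample text is invented for illustration, and plain arithmetic stands in for np.mean so the snippet has no dependencies. Note that text.split() leaves punctuation attached to each token, so "it!" counts as three characters.

```python
# Hypothetical review text, chosen only to illustrate the calculations.
text = "LOVE it! Works great, why wait?"

# Step 2: punctuation counts
exclamation_count = text.count("!")
question_count = text.count("?")

# Step 3: uppercase characters divided by ALL characters (spaces included)
uppercase_chars = sum(1 for c in text if c.isupper())  # L, O, V, E, W
caps_ratio = uppercase_chars / len(text)               # 5 / 31

# Step 4: mean length of whitespace-separated tokens
words = text.split()  # ['LOVE', 'it!', 'Works', 'great,', 'why', 'wait?']
avg_word_length = sum(len(w) for w in words) / len(words)  # 26 / 6

print(exclamation_count, question_count)  # 1 1
print(round(caps_ratio, 4))               # 0.1613
print(round(avg_word_length, 4))          # 4.3333
```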

Feature Interpretations

Polarity

  • Positive values (0 to 1): Indicate positive sentiment language
  • Negative values (-1 to 0): Indicate negative sentiment language
  • Near zero: Neutral or mixed sentiment
  • Use case: Baseline sentiment score for comparison with star ratings

Subjectivity

  • High values (0.7-1.0): Personal opinions, emotions, judgments
  • Low values (0.0-0.3): Factual information, objective statements
  • Use case: Distinguish between emotional reviews and factual descriptions

Exclamation Count

  • High counts: Expression of strong emotion (positive or negative)
  • Zero counts: Calm, measured tone
  • Use case: Identify passionate customers and emotional intensity

Question Count

  • High counts: Confusion, uncertainty, rhetorical emphasis
  • Zero counts: Direct statements
  • Use case: Detect customer confusion or engagement patterns

Caps Ratio

  • High ratios (>0.1): Shouting, emphasis, frustration
  • Low ratios (<0.05): Normal writing style
  • Use case: Identify angry customers or excitement

Average Word Length

  • Long words (>6 chars): Formal writing, detailed descriptions
  • Short words (<4 chars): Casual writing, basic language
  • Use case: Assess review sophistication and thoroughness
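
The thresholds listed above can be turned into boolean flags for quick filtering. The following is a minimal sketch, assuming the feature DataFrame has the column names from Output Feature Columns; the function name and the flag column names (is_subjective, is_shouting, and so on) are hypothetical, and the cut-offs are the rule-of-thumb values from this page rather than calibrated limits.

```python
import pandas as pd

def add_interpretation_flags(features: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `features` with heuristic boolean flag columns added."""
    out = features.copy()
    out["is_subjective"] = out["subjectivity"] >= 0.7   # personal opinions
    out["is_objective"] = out["subjectivity"] <= 0.3    # factual statements
    out["is_shouting"] = out["caps_ratio"] > 0.1        # heavy capitalization
    out["is_formal"] = out["avg_word_length"] > 6       # long words
    out["is_casual"] = out["avg_word_length"] < 4       # short words
    return out

# Illustrative values only, not real review data.
example = pd.DataFrame({
    "subjectivity": [0.9, 0.2],
    "caps_ratio": [0.25, 0.01],
    "avg_word_length": [3.5, 6.8],
})
print(add_interpretation_flags(example)[["is_subjective", "is_shouting", "is_formal"]])
```

Rows with NaN in a feature compare as False, so they receive no flag.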

Data Preprocessing

Text Conversion

df['review_body'] = df['review_body'].astype(str)
  • Converts all review text to string type
  • Converts NaN values to the literal string "nan", which is then scored like any other review text
  • Ensures consistent text processing
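
If missing reviews should not be scored as the word "nan", they can be handled before the string conversion. The sketch below shows two possible approaches on a small made-up DataFrame; neither is part of the documented function.

```python
import numpy as np
import pandas as pd

# Made-up data: one normal review, one missing value, one non-string value.
df = pd.DataFrame({"review_body": ["Great product!", np.nan, 42]})

# Option A: replace missing reviews with an empty string, then cast.
filled = df["review_body"].fillna("").astype(str)
print(filled.tolist())  # ['Great product!', '', '42']

# Option B: drop rows that have no review text at all.
non_missing = df.dropna(subset=["review_body"])
print(len(non_missing))  # 2
```

Option A produces empty strings, which need the length check described under Error Handling before caps_ratio is computed.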

Error Handling

Division by Zero

  • Issue: An empty text string makes len(text) zero, so the caps_ratio division fails
  • Behavior: Raises ZeroDivisionError
  • Mitigation: Check that the text is non-empty before calculating the ratio
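
A minimal sketch of such a length check follows. The helper names are hypothetical, and returning 0.0 for empty input is an assumption; returning NaN instead would be equally defensible if empty reviews should stay distinguishable. The same guard is applied to the word-length average, where np.mean of an empty list would otherwise yield NaN with a warning.

```python
def safe_caps_ratio(text: str) -> float:
    """Uppercase-to-total character ratio; 0.0 for empty text (assumed default)."""
    if not text:
        return 0.0
    return sum(1 for c in text if c.isupper()) / len(text)

def safe_avg_word_length(text: str) -> float:
    """Mean token length; 0.0 when the text has no tokens (assumed default)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(safe_caps_ratio(""))           # 0.0
print(safe_caps_ratio("AB cd"))      # 0.4
print(safe_avg_word_length("   "))   # 0.0
```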

Missing Dependencies

  • Issue: TextBlob may not be installed
  • Behavior: ImportError when the from textblob import TextBlob statement runs
  • Mitigation: Install via pip install textblob
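
To report a missing dependency before any processing starts, the installed packages can be checked up front. This is a small sketch using only the standard library; the helper name missing_packages is hypothetical.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(["textblob", "pandas", "numpy"])
if missing:
    print("Missing packages. Install with: pip install " + " ".join(missing))
```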

Performance Considerations

  • Time Complexity: O(n × m) where n=number of reviews, m=average text length
  • Memory Usage: Creates complete feature matrix in memory
  • TextBlob Overhead: Each TextBlob object creation adds processing time
  • Optimization: Consider batch processing for very large datasets
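
One way to approach batch processing is to feed the DataFrame to the extractor in slices and handle each result as it arrives, for example by writing it to disk. The generator below is a sketch under that assumption; iter_feature_batches and its parameters are hypothetical names, and extract_fn stands for any function with the same call shape as extract_sentiment_features(df).

```python
import pandas as pd

def iter_feature_batches(df: pd.DataFrame, extract_fn, batch_size: int = 10_000):
    """Yield extract_fn(chunk) for consecutive row slices of df."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(df), batch_size):
        yield extract_fn(df.iloc[start:start + batch_size])

# Stand-in extractor for demonstration; the real call would pass
# extract_sentiment_features instead.
def count_exclamations(chunk: pd.DataFrame) -> pd.DataFrame:
    counts = [text.count("!") for text in chunk["review_body"]]
    return pd.DataFrame({"exclamation_count": counts}, index=chunk.index)

reviews = pd.DataFrame({"review_body": ["a!", "b", "c!!", "d", "e!"]})
for batch in iter_feature_batches(reviews, count_exclamations, batch_size=2):
    print(len(batch))  # 2, then 2, then 1
```

Batching does not reduce total TextBlob time, but it bounds how much of the feature matrix is held in memory at once when each batch is saved and discarded.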

Usage Example

# Extract sentiment features
sentiment_df = extract_sentiment_features(df_processed)

# Preview features
print(sentiment_df.head())
#      polarity  subjectivity  exclamation_count  question_count  caps_ratio  avg_word_length
# 0    0.1500    0.4000        2                  0               0.0234      4.5600
# 1   -0.2500    0.7500        0                  1               0.0100      5.2300
# 2    0.8000    0.9000        3                  0               0.0500      3.8900

# Combine with original DataFrame
df_enhanced = pd.concat([df_processed, sentiment_df], axis=1)

# Use for analysis
high_emotion_reviews = df_enhanced[df_enhanced['exclamation_count'] > 2]
confused_customers = df_enhanced[df_enhanced['question_count'] > 1]
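
# Note on combining: pd.concat(..., axis=1) aligns rows by index label, not by position.

Whether extract_sentiment_features() preserves the input index is not stated here, so the sketch below assumes the worst case: a filtered df_processed with non-consecutive labels and a feature DataFrame with a fresh 0..n-1 index. The data is made up for illustration.

```python
import pandas as pd

# Filtered input keeps its original labels (10, 20); features start at 0.
df_processed = pd.DataFrame({"review_body": ["Good!", "Bad?"]}, index=[10, 20])
sentiment_df = pd.DataFrame({"exclamation_count": [1, 0]})

# Labels do not match, so concat produces 4 rows padded with NaN.
misaligned = pd.concat([df_processed, sentiment_df], axis=1)
print(len(misaligned))  # 4

# Resetting both indexes aligns the rows by position instead.
aligned = pd.concat(
    [df_processed.reset_index(drop=True), sentiment_df.reset_index(drop=True)],
    axis=1,
)
print(len(aligned))  # 2
```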

Integration Notes

  • Feature Engineering: Often combined with other features for ML models
  • Correlation Analysis: Compare TextBlob polarity with star ratings
  • Customer Segmentation: Use subjectivity and caps_ratio to identify customer types
  • Quality Assessment: Use avg_word_length to filter detailed vs. brief reviews
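
The correlation analysis mentioned above might look like the following sketch. The star_rating column name is an assumption about the review data, and the values are illustrative. Because star ratings are ordinal, a rank (Spearman) correlation is used, computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Illustrative values; a real run would use the combined df_enhanced.
df_enhanced = pd.DataFrame({
    "polarity": [-0.5, 0.0, 0.3, 0.8],
    "star_rating": [1, 3, 4, 5],
})

# Spearman correlation = Pearson correlation of the rank-transformed columns.
rho = df_enhanced["polarity"].rank().corr(df_enhanced["star_rating"].rank())
print(round(rho, 3))  # 1.0 (polarity and rating rise together in this toy data)
```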

Validation Recommendations

  • Polarity Accuracy: Compare TextBlob polarity with manual sentiment labels
  • Subjectivity Calibration: Verify subjectivity scores against review content
  • Outlier Detection: Check for extreme values in punctuation and caps features
  • Business Alignment: Ensure features correlate with business-relevant outcomes
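
A polarity accuracy check could be sketched as follows: map each polarity score to a coarse label and measure agreement with manually assigned labels. The function names, the neutral band of ±0.1, and the sample labels are all assumptions made for illustration; the band in particular should be tuned on real data.

```python
def polarity_to_label(polarity: float, neutral_band: float = 0.1) -> str:
    """Map a polarity score to 'positive', 'negative', or 'neutral'."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

def agreement_rate(polarities, manual_labels) -> float:
    """Fraction of reviews where the mapped label matches the manual label."""
    if len(polarities) != len(manual_labels):
        raise ValueError("inputs must have the same length")
    if not polarities:
        return 0.0
    matches = sum(
        polarity_to_label(p) == label
        for p, label in zip(polarities, manual_labels)
    )
    return matches / len(polarities)

# Made-up scores and labels: the last pair disagrees.
scores = [0.8, -0.6, 0.05, 0.4]
labels = ["positive", "negative", "neutral", "negative"]
print(agreement_rate(scores, labels))  # 0.75
```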