[function] create_binary_sentiment() - P3chys/textmining GitHub Wiki
create_binary_sentiment()
Function Purpose
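Creates sentiment labels from numerical star ratings, converting each rating into a categorical class ('positive', 'negative', or 'neutral') for classification tasks. Despite the function name, the default thresholds yield three classes; the output is binary only when the thresholds leave no gap between them or when neutral rows are filtered out afterwards.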
Syntax
create_binary_sentiment(df, rating_column='star_rating', threshold_positive=4, threshold_negative=2)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pandas.DataFrame | Required | DataFrame containing review data with star ratings |
| rating_column | str | 'star_rating' | Name of column containing numerical ratings |
| threshold_positive | int | 4 | Minimum rating (inclusive) for positive sentiment |
| threshold_negative | int | 2 | Maximum rating (inclusive) for negative sentiment |
Returns
- Type: pandas.DataFrame
- Content: Copy of input DataFrame with additional 'sentiment' column
Classification Logic
The function creates three sentiment categories based on rating thresholds:
Default Thresholds (1-5 star scale):
- Positive: Ratings ≥ 4 (4-5 stars)
- Negative: Ratings ≤ 2 (1-2 stars)
- Neutral: Ratings = 3 (3 stars)
Sentiment Mapping:
# Rating → Sentiment
1 star → 'negative'
2 stars → 'negative'
3 stars → 'neutral'
4 stars → 'positive'
5 stars → 'positive'
Required Import
import numpy as np
Implementation Details
Uses numpy.select() for efficient conditional assignment:
- Condition Evaluation: Checks rating thresholds sequentially
- Label Assignment: Applies corresponding sentiment labels
- Default Handling: Assigns 'neutral' to ratings between thresholds
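The three steps above can be sketched as follows. This is a hypothetical re-implementation written from the description on this page, not the project's actual source: the name create_binary_sentiment_sketch and the order of the two conditions are assumptions.

```python
import numpy as np
import pandas as pd


def create_binary_sentiment_sketch(df, rating_column='star_rating',
                                   threshold_positive=4, threshold_negative=2):
    """Hypothetical re-implementation; the project's real function may differ."""
    result = df.copy()  # leave the caller's DataFrame untouched
    ratings = result[rating_column]

    # np.select evaluates the conditions in order; the first match wins.
    conditions = [
        ratings >= threshold_positive,
        ratings <= threshold_negative,
    ]
    labels = ['positive', 'negative']

    # Ratings matching neither condition fall through to the default label.
    result['sentiment'] = np.select(conditions, labels, default='neutral')
    return result


demo = pd.DataFrame({'star_rating': [1, 2, 3, 4, 5]})
labeled = create_binary_sentiment_sketch(demo)
print(labeled['sentiment'].tolist())
```

Because the first matching condition wins, overlapping thresholds (threshold_positive <= threshold_negative) would label the overlap 'positive' in this sketch; the real function may order or validate its thresholds differently.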
Usage Examples
# Basic usage with default thresholds
df_with_sentiment = create_binary_sentiment(df)
# Custom thresholds for different rating scales
# Example: 1-10 scale
df_10_scale = create_binary_sentiment(
    df,
    threshold_positive=7,  # 7-10 = positive
    threshold_negative=4   # 1-4 = negative
)
# Strict positive/negative classification (no neutral)
df_strict = create_binary_sentiment(
    df,
    threshold_positive=4,  # 4-5 = positive
    threshold_negative=3   # 1-3 = negative
)
# Conservative classification (only extreme ratings)
df_conservative = create_binary_sentiment(
    df,
    threshold_positive=5,  # Only 5 = positive
    threshold_negative=1   # Only 1 = negative
)
# Custom rating column
df_headline_sentiment = create_binary_sentiment(
    df,
    rating_column='headline_rating',
    threshold_positive=4,
    threshold_negative=2
)
Data Validation and Quality
Input Validation:
# Check rating distribution before processing
print(df['star_rating'].value_counts().sort_index())
# Validate rating range
assert df['star_rating'].min() >= 1
assert df['star_rating'].max() <= 5
Output Analysis:
# Examine sentiment distribution
sentiment_dist = df_with_sentiment['sentiment'].value_counts()
print(sentiment_dist)
# .get() avoids a KeyError when a category is absent (e.g. no neutral rows)
print(f"Positive: {sentiment_dist.get('positive', 0)}")
print(f"Negative: {sentiment_dist.get('negative', 0)}")
print(f"Neutral: {sentiment_dist.get('neutral', 0)}")
Handling Edge Cases
Missing Ratings:
# Handle missing ratings before sentiment creation
df_clean = df.dropna(subset=['star_rating'])
df_with_sentiment = create_binary_sentiment(df_clean)
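Why dropping missing ratings matters can be shown with a small sketch. It assumes the function relies on numpy.select() with 'neutral' as the default label, as described under Implementation Details; the threshold values are the documented defaults.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([5, np.nan, 1], name='star_rating')

# NaN compares as False against any threshold, so neither condition matches
# and the missing rating silently receives the default label.
labels = np.select(
    [ratings >= 4, ratings <= 2],
    ['positive', 'negative'],
    default='neutral',
)
print(list(labels))

# Dropping missing ratings first keeps them out of the 'neutral' class.
clean = ratings.dropna()
print(len(clean))
```

Under that assumption, an unrated review would be counted as neutral, which inflates the neutral share unless such rows are removed beforehand.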
Non-Standard Rating Scales:
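# Convert 0-10 scale to 1-5 scale (0 → 1, 10 → 5)
df['star_rating_normalized'] = 1 + df['rating_0_10'] * 0.4
# Point the function at the normalized column; fractional values
# that fall between the thresholds (e.g. 2.6) are labeled 'neutral'
df_with_sentiment = create_binary_sentiment(df, rating_column='star_rating_normalized')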
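For other source scales, a general linear rescaling helper may be more convenient than a hard-coded formula. The helper below is a minimal sketch; rescale_ratings and its parameter names are hypothetical and not part of the project.

```python
import pandas as pd


def rescale_ratings(series, src_min, src_max, dst_min=1, dst_max=5):
    """Linearly map ratings from [src_min, src_max] onto [dst_min, dst_max]."""
    scale = (dst_max - dst_min) / (src_max - src_min)
    return dst_min + (series - src_min) * scale


raw = pd.Series([0, 5, 10])
scaled = rescale_ratings(raw, src_min=0, src_max=10)
print(scaled.tolist())
```

The rescaled values are generally fractional, so ratings that land strictly between threshold_negative and threshold_positive end up in the neutral class.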
Business Logic Considerations
Threshold Selection Strategy:
Conservative Approach (Default):
- Positive: 4-5 stars (clearly satisfied)
- Negative: 1-2 stars (clearly dissatisfied)
- Neutral: 3 stars (ambiguous)
Aggressive Approach:
- Positive: 4-5 stars
- Negative: 1-3 stars (treat neutral as negative)
- No neutral category
Ultra-Conservative Approach:
- Positive: 5 stars only (only highly satisfied)
- Negative: 1-2 stars (clearly dissatisfied)
- Neutral: 3-4 stars (everything else)
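The three strategies can be captured as named presets. In this sketch the preset names and the label_rating helper are hypothetical; only the threshold values come from the strategies listed above.

```python
# Hypothetical preset names; the threshold values mirror the strategies above.
THRESHOLD_PRESETS = {
    'conservative': {'threshold_positive': 4, 'threshold_negative': 2},
    'aggressive': {'threshold_positive': 4, 'threshold_negative': 3},
    'ultra_conservative': {'threshold_positive': 5, 'threshold_negative': 2},
}


def label_rating(rating, threshold_positive, threshold_negative):
    """Label one rating according to the documented inclusive thresholds."""
    if rating >= threshold_positive:
        return 'positive'
    if rating <= threshold_negative:
        return 'negative'
    return 'neutral'


# Show how each preset labels the ratings 1 through 5.
preset_labels = {
    name: [label_rating(r, **thresholds) for r in range(1, 6)]
    for name, thresholds in THRESHOLD_PRESETS.items()
}
for name, labels in preset_labels.items():
    print(name, labels)
```

A preset could then be passed straight to the function, for example create_binary_sentiment(df, **THRESHOLD_PRESETS['aggressive']).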
Integration with Machine Learning
Preparing for Classification:
# Create sentiment labels
df_sentiment = create_binary_sentiment(df)
# Filter for binary classification (remove neutral);
# .copy() avoids SettingWithCopyWarning when a column is added below
df_binary = df_sentiment[df_sentiment['sentiment'] != 'neutral'].copy()
# Encode labels for ML algorithms
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_binary['sentiment_encoded'] = le.fit_transform(df_binary['sentiment'])
# Result: 'positive' → 1, 'negative' → 0
Multi-class Classification:
# Keep all three categories
X = df_sentiment['processed_text_string']
y = df_sentiment['sentiment']
# Use with algorithms that support multi-class classification
Evaluation and Validation
Sentiment Distribution Analysis:
def analyze_sentiment_distribution(df):
    # Overall distribution
    overall = df['sentiment'].value_counts(normalize=True)
    # Distribution by rating
    by_rating = df.groupby('star_rating')['sentiment'].value_counts(normalize=True)
    return overall, by_rating
# Usage
overall_dist, rating_dist = analyze_sentiment_distribution(df_with_sentiment)
Threshold Sensitivity Analysis:
import pandas as pd

def threshold_sensitivity_analysis(df, rating_col='star_rating'):
    results = []
    for pos_thresh in [3, 4, 5]:
        for neg_thresh in [1, 2, 3]:
            # Skip overlapping thresholds, which would make the classes ambiguous
            if pos_thresh <= neg_thresh:
                continue
            df_temp = create_binary_sentiment(df, rating_col, pos_thresh, neg_thresh)
            dist = df_temp['sentiment'].value_counts(normalize=True)
            results.append({
                'pos_threshold': pos_thresh,
                'neg_threshold': neg_thresh,
                'positive_pct': dist.get('positive', 0),
                'negative_pct': dist.get('negative', 0),
                'neutral_pct': dist.get('neutral', 0)
            })
    return pd.DataFrame(results)
Common Use Cases
Sentiment Analysis Training:
- Use as ground truth labels for supervised learning
- Train models to predict sentiment from text
Business Metrics:
- Calculate customer satisfaction percentages
- Track sentiment trends over time
- Compare sentiment across product categories
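The business metrics above might be computed along these lines. This is a sketch on toy data: the product_category and review_date columns are hypothetical, and the sentiment column is assumed to have been produced by create_binary_sentiment().

```python
import pandas as pd

# Toy data; 'product_category' and 'review_date' are hypothetical column names.
reviews = pd.DataFrame({
    'product_category': ['books', 'books', 'toys', 'toys'],
    'review_date': pd.to_datetime(
        ['2024-01-05', '2024-02-10', '2024-01-20', '2024-02-15']
    ),
    'sentiment': ['positive', 'negative', 'positive', 'positive'],
})

is_positive = reviews['sentiment'].eq('positive')

# Share of positive reviews per category, as a percentage.
by_category = is_positive.groupby(reviews['product_category']).mean() * 100

# Share of positive reviews per calendar month, as a percentage.
by_month = is_positive.groupby(reviews['review_date'].dt.to_period('M')).mean() * 100

print(by_category)
print(by_month)
```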
Data Filtering:
- Focus analysis on clearly positive/negative reviews
- Exclude ambiguous neutral ratings from certain analyses
A/B Testing:
- Compare sentiment distributions between different groups
- Measure impact of changes on customer satisfaction
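For the A/B testing case, a row-normalized cross-tabulation gives each group's sentiment shares side by side. The sketch below uses toy data and a hypothetical variant column; it is purely descriptive and does not test statistical significance.

```python
import pandas as pd

# 'variant' is a hypothetical column identifying the experiment group.
labeled = pd.DataFrame({
    'variant': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'sentiment': ['positive', 'negative', 'neutral', 'negative',
                  'positive', 'positive', 'neutral', 'negative'],
})

# Row-normalized table: each row holds one group's sentiment shares and sums to 1.
shares = pd.crosstab(labeled['variant'], labeled['sentiment'], normalize='index')

# Difference in positive share between the two groups (B minus A).
lift = shares.loc['B', 'positive'] - shares.loc['A', 'positive']

print(shares)
print(lift)
```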