[function] create_binary_sentiment() - P3chys/textmining GitHub Wiki

Function: create_binary_sentiment()

Purpose

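Creates sentiment labels from numerical star ratings, mapping each rating to a categorical sentiment class for classification tasks. Despite the function's name, the output has three classes ('positive', 'negative', 'neutral'); the neutral rows can be filtered out afterwards to obtain a strictly binary label set.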

Syntax

create_binary_sentiment(df, rating_column='star_rating', threshold_positive=4, threshold_negative=2)

Parameters

Parameter           Type              Default        Description
df                  pandas.DataFrame  Required       DataFrame containing review data with star ratings
rating_column       str               'star_rating'  Name of column containing numerical ratings
threshold_positive  int               4              Minimum rating (inclusive) for positive sentiment
threshold_negative  int               2              Maximum rating (inclusive) for negative sentiment

Returns

  • Type: pandas.DataFrame
  • Content: Copy of input DataFrame with additional 'sentiment' column

Classification Logic

The function creates three sentiment categories based on rating thresholds:

Default Thresholds (1-5 star scale):

  • Positive: Ratings ≥ 4 (4-5 stars)
  • Negative: Ratings ≤ 2 (1-2 stars)
  • Neutral: Ratings = 3 (3 stars)

Sentiment Mapping:

# Rating → Sentiment
1 star  → 'negative'
2 stars → 'negative'
3 stars → 'neutral'
4 stars → 'positive'
5 stars → 'positive'

Required Import

import numpy as np

Implementation Details

Uses numpy.select() for efficient conditional assignment:

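  1. Condition Evaluation: Tests each rating against the threshold conditions in order; where more than one condition matches, the first one wins
  2. Label Assignment: Applies the 'positive' or 'negative' label to the rows that match the corresponding condition
  3. Default Handling: Assigns 'neutral' to every rating that matches neither condition (those strictly between the thresholds)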
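The three steps above can be sketched as follows. This is a minimal illustration built from the documented signature and behavior, not the project's actual source; the name create_binary_sentiment_sketch and the order of the two conditions are assumptions.

```python
import numpy as np
import pandas as pd


def create_binary_sentiment_sketch(df, rating_column='star_rating',
                                   threshold_positive=4, threshold_negative=2):
    """Illustrative stand-in for create_binary_sentiment (assumed logic)."""
    result = df.copy()  # leave the caller's DataFrame untouched
    ratings = result[rating_column]

    # Conditions are tested in list order; the first match wins.
    conditions = [
        ratings >= threshold_positive,
        ratings <= threshold_negative,
    ]
    labels = ['positive', 'negative']

    # Rows matching neither condition receive the default label.
    result['sentiment'] = np.select(conditions, labels, default='neutral')
    return result


reviews = pd.DataFrame({'star_rating': [1, 2, 3, 4, 5]})
labeled = create_binary_sentiment_sketch(reviews)
print(labeled['sentiment'].tolist())
# ['negative', 'negative', 'neutral', 'positive', 'positive']
```

Because numpy.select works on whole arrays, no Python-level loop over rows is needed, which is what makes this approach efficient on large review sets.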

Usage Examples

# Basic usage with default thresholds
df_with_sentiment = create_binary_sentiment(df)

# Custom thresholds for different rating scales
# Example: 1-10 scale
df_10_scale = create_binary_sentiment(
    df, 
    threshold_positive=7,  # 7-10 = positive
    threshold_negative=4   # 1-4 = negative
)

# Strict positive/negative classification (no neutral for integer ratings)
df_strict = create_binary_sentiment(
    df,
    threshold_positive=4,  # 4-5 = positive
    threshold_negative=3   # 1-3 = negative
)

# Extreme-only classification (ratings 2-4 become neutral)
df_conservative = create_binary_sentiment(
    df,
    threshold_positive=5,  # Only 5 = positive
    threshold_negative=1   # Only 1 = negative
)

# Custom rating column
df_headline_sentiment = create_binary_sentiment(
    df,
    rating_column='headline_rating',
    threshold_positive=4,
    threshold_negative=2
)

Data Validation and Quality

Input Validation:

# Check rating distribution before processing
print(df['star_rating'].value_counts().sort_index())

# Validate rating range
assert df['star_rating'].min() >= 1
assert df['star_rating'].max() <= 5

Output Analysis:

# Examine sentiment distribution
sentiment_dist = df_with_sentiment['sentiment'].value_counts()
print(sentiment_dist)
# .get() avoids a KeyError when a class is absent from the data
print(f"Positive: {sentiment_dist.get('positive', 0)}")
print(f"Negative: {sentiment_dist.get('negative', 0)}")
print(f"Neutral: {sentiment_dist.get('neutral', 0)}")

Handling Edge Cases

Missing Ratings:

# Handle missing ratings before sentiment creation
df_clean = df.dropna(subset=['star_rating'])
df_with_sentiment = create_binary_sentiment(df_clean)
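Why dropping missing ratings first matters: if the function relies on the numpy.select() default described under Implementation Details, a missing rating would likely be labeled 'neutral' rather than raise an error, because NaN compares False against any threshold. The snippet below demonstrates that behavior with numpy.select() directly; it assumes the default thresholds and does not call the project's function.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([5.0, np.nan, 1.0], name='star_rating')

# NaN >= 4 and NaN <= 2 are both False, so the NaN row matches neither condition
labels = np.select(
    [ratings >= 4, ratings <= 2],
    ['positive', 'negative'],
    default='neutral',
)
print(labels.tolist())  # ['positive', 'neutral', 'negative']

# Counting missing values up front shows how many rows dropna() would remove
missing_count = int(ratings.isna().sum())
print(f"Missing ratings: {missing_count}")  # Missing ratings: 1
```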

Non-Standard Rating Scales:

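# Convert a 0-10 scale to the 1-5 scale (0 → 1, 10 → 5)
df['star_rating_normalized'] = 1 + df['rating_0_10'] * 0.4
# Point the function at the normalized column; fractional values strictly
# between the thresholds (e.g. 3.4) are labeled 'neutral'
df_with_sentiment = create_binary_sentiment(df, rating_column='star_rating_normalized')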
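For source scales other than 0-10, a general linear rescale keeps both endpoints exact. The helper below is a hypothetical sketch (rescale_ratings is not part of the project); it maps old_min to new_min and old_max to new_max.

```python
import pandas as pd


def rescale_ratings(ratings, old_min, old_max, new_min=1.0, new_max=5.0):
    """Linearly map ratings from [old_min, old_max] onto [new_min, new_max]."""
    scale = (new_max - new_min) / (old_max - old_min)
    return new_min + (ratings - old_min) * scale


raw = pd.Series([0, 5, 10])
print(rescale_ratings(raw, old_min=0, old_max=10).tolist())  # [1.0, 3.0, 5.0]
```

Rescaled values are generally fractional, so it is worth checking how many of them land strictly between the two thresholds and therefore end up as 'neutral'.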

Business Logic Considerations

Threshold Selection Strategy:

Conservative Approach (Default):

  • Positive: 4-5 stars (clearly satisfied)
  • Negative: 1-2 stars (clearly dissatisfied)
  • Neutral: 3 stars (ambiguous)

Aggressive Approach:

  • Positive: 4-5 stars
  • Negative: 1-3 stars (treat neutral as negative)
  • No neutral category

Ultra-Conservative Approach:

  • Positive: 5 stars only (only highly satisfied)
  • Negative: 1-2 stars (clearly dissatisfied)
  • Neutral: 3-4 stars (everything else)
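The three strategies differ only in their threshold pair, so they can be compared side by side. The sketch below applies the documented rule (≥ positive threshold, ≤ negative threshold, otherwise neutral) with numpy.select() directly; the preset names in STRATEGIES are hypothetical and chosen only for this illustration.

```python
import numpy as np

# Hypothetical preset names; the threshold values come from the lists above
STRATEGIES = {
    'conservative': {'threshold_positive': 4, 'threshold_negative': 2},
    'aggressive': {'threshold_positive': 4, 'threshold_negative': 3},
    'ultra_conservative': {'threshold_positive': 5, 'threshold_negative': 2},
}

ratings = np.array([1, 2, 3, 4, 5])
labels_by_strategy = {}
for name, thresholds in STRATEGIES.items():
    labels = np.select(
        [ratings >= thresholds['threshold_positive'],
         ratings <= thresholds['threshold_negative']],
        ['positive', 'negative'],
        default='neutral',
    )
    labels_by_strategy[name] = labels.tolist()
    print(f"{name}: {labels_by_strategy[name]}")
```

Since the keys match the function's parameter names, a preset could be passed as create_binary_sentiment(df, **STRATEGIES['aggressive']).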

Integration with Machine Learning

Preparing for Classification:

# Create sentiment labels
df_sentiment = create_binary_sentiment(df)

# Filter for binary classification (remove neutral)
# .copy() avoids pandas' SettingWithCopyWarning when a column is added below
df_binary = df_sentiment[df_sentiment['sentiment'] != 'neutral'].copy()

# Encode labels for ML algorithms
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_binary['sentiment_encoded'] = le.fit_transform(df_binary['sentiment'])
# Result: 'negative' → 0, 'positive' → 1 (LabelEncoder sorts classes alphabetically)
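An explicit mapping is a possible alternative when the label-to-integer assignment should be visible in the code rather than derived from alphabetical order. This is a sketch on a small made-up DataFrame; SENTIMENT_TO_ID is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the filtered DataFrame (neutral rows already removed)
df_binary = pd.DataFrame({'sentiment': ['positive', 'negative', 'positive']})

SENTIMENT_TO_ID = {'negative': 0, 'positive': 1}
df_binary['sentiment_encoded'] = df_binary['sentiment'].map(SENTIMENT_TO_ID)

# Labels missing from the mapping become NaN, so confirm none slipped through
assert df_binary['sentiment_encoded'].notna().all()
print(df_binary['sentiment_encoded'].tolist())  # [1, 0, 1]
```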

Multi-class Classification:

# Keep all three categories
X = df_sentiment['processed_text_string']
y = df_sentiment['sentiment']
# Use with algorithms that support multi-class classification

Evaluation and Validation

Sentiment Distribution Analysis:

def analyze_sentiment_distribution(df):
    # Overall distribution
    overall = df['sentiment'].value_counts(normalize=True)
    
    # Distribution by rating
    by_rating = df.groupby('star_rating')['sentiment'].value_counts(normalize=True)
    
    return overall, by_rating

# Usage
overall_dist, rating_dist = analyze_sentiment_distribution(df_with_sentiment)

Threshold Sensitivity Analysis:

import pandas as pd

def threshold_sensitivity_analysis(df, rating_col='star_rating'):
    results = []
    
    for pos_thresh in [3, 4, 5]:
        for neg_thresh in [1, 2, 3]:
            if pos_thresh <= neg_thresh:
                continue
            
            df_temp = create_binary_sentiment(df, rating_col, pos_thresh, neg_thresh)
            dist = df_temp['sentiment'].value_counts(normalize=True)
            
            results.append({
                'pos_threshold': pos_thresh,
                'neg_threshold': neg_thresh,
                'positive_pct': dist.get('positive', 0),
                'negative_pct': dist.get('negative', 0),
                'neutral_pct': dist.get('neutral', 0)
            })
    
    return pd.DataFrame(results)

Common Use Cases

Sentiment Analysis Training:

  • Use as ground truth labels for supervised learning
  • Train models to predict sentiment from text

Business Metrics:

  • Calculate customer satisfaction percentages
  • Track sentiment trends over time
  • Compare sentiment across product categories
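As an illustration of tracking sentiment over time, the share of positive reviews can be computed per month. The sketch assumes a 'review_date' column, which this page does not define, and uses made-up data.

```python
import pandas as pd

# Made-up reviews; 'review_date' is an assumed column name
reviews = pd.DataFrame({
    'review_date': pd.to_datetime(
        ['2024-01-05', '2024-01-20', '2024-02-03', '2024-02-14']),
    'sentiment': ['positive', 'negative', 'positive', 'positive'],
})

# The mean of a boolean column is the fraction of True values
monthly_positive_share = (
    reviews
    .assign(is_positive=reviews['sentiment'].eq('positive'))
    .groupby(reviews['review_date'].dt.to_period('M'))['is_positive']
    .mean()
)
print(monthly_positive_share)
# January: 0.5, February: 1.0
```

Grouping by a product category column instead of the month would give the per-category comparison in the same way.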

Data Filtering:

  • Focus analysis on clearly positive/negative reviews
  • Exclude ambiguous neutral ratings from certain analyses

A/B Testing:

  • Compare sentiment distributions between different groups
  • Measure impact of changes on customer satisfaction