[process] visualization of data - P3chys/textmining GitHub Wiki
Overview
A collection of functions for creating standardized visualizations of Amazon reviews data analysis, providing insights into rating patterns, review characteristics, and sentiment distributions.
plot_rating_distribution()
Function: Purpose
Creates a count plot visualization showing the distribution of star ratings across the dataset.
Syntax
plot_rating_distribution(df, rating_column='star_rating')
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
df | pandas.DataFrame | Required | DataFrame containing review data |
rating_column | str | 'star_rating' | Column name containing star ratings (1-5) |
Output
- File: Saves plot as 'rating_distribution.png'
- Format: Count plot with viridis color palette
- Size: 10x6 inches
Visualization Details
- Plot Type: Seaborn countplot
- X-axis: Star ratings (1, 2, 3, 4, 5)
- Y-axis: Count of reviews for each rating
- Color Scheme: Viridis (sequential color palette, sampled once per rating)
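The details above can be sketched as follows. This is a minimal re-creation under stated assumptions, not the project's actual code: the function name `sketch_rating_distribution`, the `output_path` parameter, and the axis labels are hypothetical and added only so the example is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns


def sketch_rating_distribution(df, rating_column="star_rating",
                               output_path="rating_distribution.png"):
    """Hypothetical sketch of the documented behaviour (output_path is an assumption)."""
    fig, ax = plt.subplots(figsize=(10, 6))  # 10x6 inches, as documented
    # hue mirrors x so the viridis palette is applied per rating;
    # dodge=False keeps one full-width bar per rating.
    sns.countplot(data=df, x=rating_column, hue=rating_column,
                  palette="viridis", dodge=False, ax=ax)
    # Some seaborn versions add a legend when hue is set; it is redundant here.
    if ax.get_legend() is not None:
        ax.get_legend().remove()
    ax.set_xlabel("Star rating")        # assumed label text
    ax.set_ylabel("Number of reviews")  # assumed label text
    fig.savefig(output_path)
    plt.close(fig)  # release the figure once it is on disk
    return output_path
```

Passing `hue` explicitly is a design choice for the sketch: it lets the palette apply cleanly across seaborn versions, at the cost of having to drop the redundant legend.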
plot_review_length_distribution()
Function: Purpose
Creates a dual-panel visualization showing review length patterns and their relationship with star ratings.
Syntax
plot_review_length_distribution(df, length_column='review_length')
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
df | pandas.DataFrame | Required | DataFrame containing review data |
length_column | str | 'review_length' | Column name containing review length in words |
Output
- File: Saves plot as 'review_length_distribution.png'
- Format: Two-panel subplot layout
- Size: 12x6 inches
Visualization Details
Panel 1 (Left): Length Distribution
- Plot Type: Seaborn histogram with KDE overlay
- Bins: 50 bins for granular distribution
- X-axis: Number of words in review
- Y-axis: Frequency count
Panel 2 (Right): Length by Rating
- Plot Type: Seaborn boxplot
- X-axis: Star rating (1-5)
- Y-axis: Number of words in review
- Shows: Median, quartiles, outliers for each rating
Layout
- Uses plt.tight_layout() for optimal spacing
- Automatic subplot arrangement (1 row, 2 columns)
plot_sentiment_distribution()
Function: Purpose
Creates a count plot showing the distribution of binary sentiment classifications.
Syntax
plot_sentiment_distribution(df, sentiment_column='sentiment')
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
df | pandas.DataFrame | Required | DataFrame containing review data |
sentiment_column |  str | 'sentiment' | Column name containing sentiment labels |
Output
- File: Saves plot as 'sentiment_distribution.png'
- Format: Count plot with Set2 color palette
- Size: 8x6 inches
Visualization Details
- Plot Type: Seaborn countplot
- X-axis: Sentiment labels (typically 'positive', 'negative')
- Y-axis: Count of reviews for each sentiment
- Color Scheme: Set2 (categorical color palette)
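The bar heights in a count plot are simply per-label counts, so the same numbers can be checked without rendering anything. The helper name `sentiment_counts` and the sample data in this sketch are made up for illustration.

```python
import pandas as pd


def sentiment_counts(df, sentiment_column="sentiment"):
    """Return the per-label counts that a count plot of this column would draw."""
    # sort_index() gives a stable, alphabetical label order.
    return df[sentiment_column].value_counts().sort_index()


# Tiny illustrative dataset (not real review data).
reviews = pd.DataFrame(
    {"sentiment": ["positive", "negative", "positive", "positive"]}
)
counts = sentiment_counts(reviews)
print(counts.to_dict())  # {'negative': 1, 'positive': 3}
```

Checking these counts first is a quick way to spot class imbalance before relying on the plot.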
Common Features
File Management
- All functions use plt.close() to prevent memory leaks
- Automatic file saving to current directory
- PNG format for high-quality output
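The reason closing matters: pyplot keeps a reference to every figure created through it until that figure is closed, so repeated plotting calls would otherwise accumulate open figures. A small sketch of that effect, where the function name `figures_open_around_close` is hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def figures_open_around_close():
    """Count pyplot-managed figures just before and just after closing one."""
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3])
    before = len(plt.get_fignums())  # includes the figure created above
    plt.close(fig)                   # drop pyplot's reference to it
    after = len(plt.get_fignums())
    return before, after


before, after = figures_open_around_close()
print(before - after)  # 1
```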
Dependencies
import matplotlib.pyplot as plt
import seaborn as sns
Error Handling
- Functions assume input DataFrames contain specified columns
- No explicit error handling implemented (caller responsibility)
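Since validation is left to the caller, a small guard can be run before any plotting call. The helper `require_columns` is a hypothetical addition for illustration and is not part of the documented module.

```python
import pandas as pd


def require_columns(df, *columns):
    """Raise KeyError naming every expected column that is missing from df."""
    missing = [name for name in columns if name not in df.columns]
    if missing:
        raise KeyError(f"DataFrame is missing required column(s): {missing}")
    return df  # returned unchanged so the call can be chained


# Illustrative data: has ratings but no review_length column yet.
reviews = pd.DataFrame({"star_rating": [5, 4, 1]})
require_columns(reviews, "star_rating")  # passes silently
try:
    require_columns(reviews, "star_rating", "review_length")
except KeyError as exc:
    print(exc)  # reports that 'review_length' is missing
```

Failing early with the list of missing columns tends to be easier to diagnose than an error raised from inside seaborn.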
Usage Example
# Create all visualizations for a dataset
df = load_data('amazon_reviews.csv', sample_size=10000)
df = explore_data(df) # Adds review_length column
# Generate rating distribution
plot_rating_distribution(df)
# Generate review length analysis
plot_review_length_distribution(df)
# Generate sentiment distribution (requires sentiment analysis)
df['sentiment'] = classify_sentiment(df['review_body'])
plot_sentiment_distribution(df)
Integration Notes
- Preprocessing: plot_review_length_distribution() requires review_length column (created by explore_data())
- Sentiment Analysis: plot_sentiment_distribution() requires prior sentiment classification
- File Organization: Consider organizing output files in designated visualization directory
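One way to follow the file-organization suggestion, given that the functions write to the current directory, is to move the PNGs afterwards. The sketch uses only the standard library. The helper `collect_plots` and the directory name `visualizations` are assumptions, while the three file names come from the Output sections above.

```python
import shutil
from pathlib import Path

# Output file names documented for the three plotting functions.
PLOT_FILES = (
    "rating_distribution.png",
    "review_length_distribution.png",
    "sentiment_distribution.png",
)


def collect_plots(output_dir="visualizations", source_dir="."):
    """Move any generated plot files from source_dir into output_dir."""
    target = Path(output_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    moved = []
    for name in PLOT_FILES:
        src = Path(source_dir) / name
        if src.exists():  # skip plots that were not generated in this run
            dest = target / name
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved
```

Calling `collect_plots()` after the plotting functions leaves the working directory clean and returns the new paths of the files it moved.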