[process] visualization of data - P3chys/textmining GitHub Wiki

Overview

A collection of functions that create standardized visualizations of Amazon review data, covering rating patterns, review length, and sentiment distribution.


Function: plot_rating_distribution()

Purpose

Creates a count plot visualization showing the distribution of star ratings across the dataset.

Syntax

plot_rating_distribution(df, rating_column='star_rating')

Parameters

Parameter | Type | Default | Description
df | pandas.DataFrame | Required | DataFrame containing review data
rating_column | str | 'star_rating' | Column name containing star ratings (1-5)

Output

  • File: Saves plot as 'rating_distribution.png'
  • Format: Count plot with viridis color palette
  • Size: 10x6 inches

Visualization Details

  • Plot Type: Seaborn countplot
  • X-axis: Star ratings (1, 2, 3, 4, 5)
  • Y-axis: Count of reviews for each rating
  • Color Scheme: Viridis (sequential palette, one color sampled per rating)
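The details above could be implemented roughly as follows. This is a minimal sketch, not the repository's actual code: the axis labels, the Agg backend, and the choice to pass the rating column as hue (so each bar gets its own viridis color) are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, the plot is only written to disk
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_rating_distribution(df, rating_column="star_rating"):
    """Save a count plot of star ratings as 'rating_distribution.png'."""
    fig, ax = plt.subplots(figsize=(10, 6))
    # Assumption: hue mirrors x so every rating bar is colored from viridis.
    sns.countplot(data=df, x=rating_column, hue=rating_column,
                  palette="viridis", dodge=False, ax=ax)
    if ax.get_legend() is not None:
        ax.get_legend().remove()  # a legend would only repeat the x-axis labels
    ax.set_xlabel("Star rating")        # assumed label text
    ax.set_ylabel("Number of reviews")  # assumed label text
    fig.savefig("rating_distribution.png")
    plt.close(fig)  # release the figure so repeated calls do not accumulate


# Usage with a tiny made-up sample
sample = pd.DataFrame({"star_rating": [5, 4, 5, 1, 3, 5, 2, 4]})
plot_rating_distribution(sample)
```

Passing the same column as both x and hue keeps the per-bar coloring without relying on a bare palette argument, which recent seaborn releases warn about.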

Function: plot_review_length_distribution()

Purpose

Creates a dual-panel visualization showing review length patterns and their relationship with star ratings.

Syntax

plot_review_length_distribution(df, length_column='review_length')

Parameters

Parameter | Type | Default | Description
df | pandas.DataFrame | Required | DataFrame containing review data; the right-hand panel also needs a star rating column
length_column | str | 'review_length' | Column name containing review length in words

Output

  • File: Saves plot as 'review_length_distribution.png'
  • Format: Two-panel subplot layout
  • Size: 12x6 inches

Visualization Details

Panel 1 (Left): Length Distribution

  • Plot Type: Seaborn histogram with KDE overlay
  • Bins: 50 bins for granular distribution
  • X-axis: Number of words in review
  • Y-axis: Frequency count

Panel 2 (Right): Length by Rating

  • Plot Type: Seaborn boxplot
  • X-axis: Star rating (1-5)
  • Y-axis: Number of words in review
  • Shows: Median, quartiles, outliers for each rating

Layout

  • Uses plt.tight_layout() so the axis labels and titles of the two panels do not overlap
  • Fixed subplot grid (1 row, 2 columns)

Function: plot_sentiment_distribution()

Purpose

Creates a count plot showing the distribution of binary sentiment classifications.

Syntax

plot_sentiment_distribution(df, sentiment_column='sentiment')

Parameters

Parameter | Type | Default | Description
df | pandas.DataFrame | Required | DataFrame containing review data
sentiment_column | str | 'sentiment' | Column name containing sentiment labels

Output

  • File: Saves plot as 'sentiment_distribution.png'
  • Format: Count plot with Set2 color palette
  • Size: 8x6 inches

Visualization Details

  • Plot Type: Seaborn countplot
  • X-axis: Sentiment labels (typically 'positive', 'negative')
  • Y-axis: Count of reviews for each sentiment
  • Color Scheme: Set2 (categorical color palette)
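For illustration, the sentiment plot might look like the sketch below. It is an assumption-based approximation, not the project's code: the axis labels, the Agg backend, and the use of hue to apply the Set2 colors per label are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, the plot is only written to disk
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_sentiment_distribution(df, sentiment_column="sentiment"):
    """Save a count plot of sentiment labels as 'sentiment_distribution.png'."""
    fig, ax = plt.subplots(figsize=(8, 6))
    # Assumption: hue mirrors x so each label gets its own Set2 color.
    sns.countplot(data=df, x=sentiment_column, hue=sentiment_column,
                  palette="Set2", dodge=False, ax=ax)
    if ax.get_legend() is not None:
        ax.get_legend().remove()  # a legend would only repeat the x-axis labels
    ax.set_xlabel("Sentiment")          # assumed label text
    ax.set_ylabel("Number of reviews")  # assumed label text
    fig.savefig("sentiment_distribution.png")
    plt.close(fig)  # release the figure after saving


# Usage with made-up labels
sample = pd.DataFrame(
    {"sentiment": ["positive", "negative", "positive", "positive", "negative"]}
)
plot_sentiment_distribution(sample)
```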

Common Features

File Management

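  • All functions call plt.close() after saving, so figures do not accumulate in memory across repeated calls
  • Each plot is saved automatically to the current working directory under a fixed file name
  • PNG format (lossless raster) for output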

Dependencies

import matplotlib.pyplot as plt
import seaborn as sns

Error Handling

  • Functions assume input DataFrames contain specified columns
  • No explicit error handling implemented (caller responsibility)
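Since validation is left to the caller, a small guard can be run before plotting. The helper below is hypothetical (require_columns is not part of the documented functions) and only sketches one way to fail early with a readable message.

```python
import pandas as pd


def require_columns(df, columns):
    """Hypothetical helper: raise KeyError naming any columns missing from df."""
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"DataFrame is missing required column(s): {missing}")
    return df


# Usage: check the columns the plotting functions expect by default
reviews = pd.DataFrame({"star_rating": [5, 3], "review_length": [42, 17]})
require_columns(reviews, ["star_rating", "review_length"])  # passes silently

try:
    require_columns(reviews, ["sentiment"])
except KeyError as err:
    print(err)  # reports that 'sentiment' is missing
```

Returning the DataFrame lets the check be chained directly in front of a plotting call.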

Usage Example

# Create all visualizations for a dataset
df = load_data('amazon_reviews.csv', sample_size=10000)
df = explore_data(df)  # Adds review_length column

# Generate rating distribution
plot_rating_distribution(df)

# Generate review length analysis
plot_review_length_distribution(df)

# Generate sentiment distribution (requires sentiment analysis)
df['sentiment'] = classify_sentiment(df['review_body'])
plot_sentiment_distribution(df)

Integration Notes

  • Preprocessing: plot_review_length_distribution() requires a review_length column (created by explore_data())
  • Sentiment Analysis: plot_sentiment_distribution() requires prior sentiment classification
  • File Organization: Consider moving output files into a dedicated visualization directory
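Because the functions write to the current directory under fixed names, one way to follow the file organization note is to move the PNGs afterwards. The helper name collect_plots and the 'visualizations' directory below are assumptions made for this sketch.

```python
from pathlib import Path
import shutil

# File names documented for the three plotting functions
PLOT_FILES = (
    "rating_distribution.png",
    "review_length_distribution.png",
    "sentiment_distribution.png",
)


def collect_plots(output_dir="visualizations", source_dir="."):
    """Hypothetical helper: move any generated plot files into output_dir.

    Returns the list of destination paths for the files that were moved.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    moved = []
    for name in PLOT_FILES:
        src = Path(source_dir) / name
        if src.exists():  # skip plots that were not generated in this run
            dest = out / name
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved
```

Calling collect_plots() after the plotting functions have run leaves the working directory clean and gathers the images in one place.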