[function] explore_data() - P3chys/textmining GitHub Wiki

Function: explore_data()

Purpose

Performs exploratory data analysis on an Amazon reviews dataset and returns summary statistics for a first look at the data: dataset shape, missing-value counts, review-length statistics, and the verified-purchase distribution.

Syntax

explore_data(df)

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pandas.DataFrame | Required | DataFrame containing Amazon reviews data (output from load_data()) |

Returns

  • Type: dict
  • Content: Dictionary of exploratory statistics with the four keys listed under Return Dictionary Keys

Return Dictionary Keys

| Key | Type | Description |
|---|---|---|
| 'shape' | tuple | Dataset dimensions (rows, columns) |
| 'missing_values' | pandas.Series | Count of missing values per column |
| 'review_length_stats' | pandas.Series | Descriptive statistics of review length in words |
| 'verified_purchase_dist' | pandas.Series | Distribution of verified vs unverified purchases |

Data Processing Operations

  1. Shape Analysis: Determines dataset size (rows × columns)
  2. Missing Value Detection: Counts null/NaN values in each column
  3. Text Length Calculation:
    • Creates new column review_length
    • Counts words in each review using len(str(x).split())
    • Generates descriptive statistics (mean, std, min, max, quartiles)
  4. Purchase Verification Analysis: Counts verified vs unverified purchase reviews
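
The four operations above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name explore_data_sketch and the column names review_body and verified_purchase are assumptions, since this page does not name the columns the function reads.

```python
import pandas as pd

# Assumed column names; the page does not state which columns are used.
TEXT_COL = "review_body"
VERIFIED_COL = "verified_purchase"


def explore_data_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical re-implementation of the four documented steps."""
    results = {}

    # 1. Shape analysis: (rows, columns)
    results["shape"] = df.shape

    # 2. Missing value detection: null/NaN count per column
    results["missing_values"] = df.isnull().sum()

    # 3. Text length calculation: word count per review,
    #    stored in a new column on the input DataFrame
    df["review_length"] = df[TEXT_COL].apply(lambda x: len(str(x).split()))
    results["review_length_stats"] = df["review_length"].describe()

    # 4. Purchase verification analysis: count of each distinct value
    results["verified_purchase_dist"] = df[VERIFIED_COL].value_counts()

    return results


sample = pd.DataFrame({
    TEXT_COL: ["Great product, works well", "Bad", None],
    VERIFIED_COL: ["Y", "N", "Y"],
})
stats = explore_data_sketch(sample)
print(stats["shape"])
print(stats["missing_values"])
print(sample["review_length"].tolist())
```

One consequence of len(str(x).split()) is visible in the sample: a missing review is converted to a one-word string by str(), so it is counted as one word rather than zero.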

Side Effects

  • Modifies the input DataFrame in place: adds a review_length column (word count per review)
  • The column persists after the function returns; pass df.copy() if the original DataFrame must stay unchanged
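
The side effect can be demonstrated with a small stand-in. The helper add_review_length and the column name review_body are hypothetical; they only mimic the in-place column assignment described above, not the real explore_data() body.

```python
import pandas as pd


def add_review_length(df: pd.DataFrame, text_col: str = "review_body") -> None:
    """Hypothetical stand-in for the column assignment inside explore_data()."""
    df["review_length"] = df[text_col].apply(lambda x: len(str(x).split()))


original = pd.DataFrame({"review_body": ["Works as described", "Too small"]})

# Passing a copy: the new column lands on the copy, not on `original`.
add_review_length(original.copy())
untouched = "review_length" not in original.columns
print(untouched)  # True

# Passing the DataFrame itself: the column is added to `original`.
add_review_length(original)
mutated = "review_length" in original.columns
print(mutated)  # True
```

The same pattern applies to the real function: calling explore_data(df.copy()) should keep df free of the extra column, at the cost of duplicating the data in memory.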

Statistical Output Details

  • review_length_stats: Includes count, mean, std, min, 25%, 50%, 75%, max
  • verified_purchase_dist: Shows counts for 'Y' (verified) and 'N' (unverified)
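
The two outputs above have the same layout as the results of pandas describe() and value_counts(). That these are the calls used internally is an assumption; the sketch below runs them on made-up values to show what the returned Series look like.

```python
import pandas as pd

# Made-up word counts, standing in for the review_length column.
lengths = pd.Series([4, 12, 7, 30, 2], name="review_length")
length_stats = lengths.describe()
print(list(length_stats.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(length_stats["50%"])  # median of the five values: 7.0

# Made-up verification flags, using the 'Y'/'N' coding described above.
verified = pd.Series(["Y", "Y", "N", "Y", "N"], name="verified_purchase")
dist = verified.value_counts()
print(dist["Y"], dist["N"])  # 3 2

# Proportions instead of raw counts, if a share is easier to read.
share = verified.value_counts(normalize=True)
print(share["Y"])  # 0.6
```

Individual statistics are read from the returned Series by label, for example exploration_results['review_length_stats']['50%'] for the median review length.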

Usage Example

# Load and explore data
df = load_data('amazon_reviews.csv', sample_size=5000)
exploration_results = explore_data(df)

# Access specific statistics
print(f"Dataset shape: {exploration_results['shape']}")
print(f"Average review length: {exploration_results['review_length_stats']['mean']:.2f} words")
print(f"Missing values:\n{exploration_results['missing_values']}")