[function] explore_data() - P3chys/textmining GitHub Wiki
Function: explore_data()
Purpose
Performs exploratory data analysis on an Amazon reviews dataset and returns summary statistics: dataset shape, missing-value counts, review-length statistics, and the verified-purchase distribution.
Syntax
explore_data(df)
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df | pandas.DataFrame | Required | DataFrame containing Amazon reviews data (output from load_data()) |
Returns
- Type: dict
- Content: Dictionary of exploratory statistics, one entry per key listed under Return Dictionary Keys
Return Dictionary Keys
| Key | Type | Description |
| --- | --- | --- |
| 'shape' | tuple | Dataset dimensions (rows, columns) |
| 'missing_values' | pandas.Series | Count of missing values per column |
| 'review_length_stats' | pandas.Series | Descriptive statistics of review length in words |
| 'verified_purchase_dist' | pandas.Series | Distribution of verified vs unverified purchases |
Data Processing Operations
- Shape Analysis: Determines dataset size (rows × columns)
- Missing Value Detection: Counts null/NaN values in each column
- Text Length Calculation:
  - Creates new column `review_length`
  - Counts words in each review using `len(str(x).split())`
  - Generates descriptive statistics (mean, std, min, max, quartiles)
- Purchase Verification Analysis: Counts verified vs unverified purchase reviews
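
The four operations above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's actual implementation: the column names `review_body` and `verified_purchase` are assumed (the page does not name them), and the function is called `explore_data_sketch` to mark it as hypothetical.

```python
import pandas as pd


def explore_data_sketch(df: pd.DataFrame,
                        text_col: str = "review_body",            # assumed column name
                        verified_col: str = "verified_purchase"   # assumed column name
                        ) -> dict:
    """Hypothetical sketch of the operations listed above."""
    results = {}
    # Shape analysis: (rows, columns)
    results["shape"] = df.shape
    # Missing value detection: null/NaN count per column
    results["missing_values"] = df.isnull().sum()
    # Text length calculation: word count per review (adds a column in place)
    df["review_length"] = df[text_col].apply(lambda x: len(str(x).split()))
    results["review_length_stats"] = df["review_length"].describe()
    # Purchase verification analysis: number of reviews per flag value
    results["verified_purchase_dist"] = df[verified_col].value_counts()
    return results


# Toy data for illustration only
toy = pd.DataFrame({
    "review_body": ["Great product", "Did not work at all", None],
    "verified_purchase": ["Y", "N", "Y"],
})
stats = explore_data_sketch(toy)
print(stats["shape"])
print(stats["review_length_stats"]["mean"])
```

One consequence of `len(str(x).split())` is that a missing review is still counted as one word, because the missing value is converted to a string before splitting. Dropping or filling missing reviews first avoids skewing the length statistics.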
Side Effects
- Modifies DataFrame: Adds `review_length` column to input DataFrame
- This column persists after function execution
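
Because the column is added in place, a caller who needs the original DataFrame unchanged can pass a copy instead. The sketch below uses a hypothetical stand-in, `add_review_length`, and an assumed text column `review_body` to show the difference; it is not the function's actual code.

```python
import pandas as pd


def add_review_length(df: pd.DataFrame) -> None:
    # Hypothetical stand-in for the in-place step inside explore_data()
    df["review_length"] = df["review_body"].apply(lambda x: len(str(x).split()))


# Toy data; "review_body" is an assumed column name
original = pd.DataFrame({"review_body": ["Works well", "Too expensive for me"]})

# Passing a copy leaves the caller's DataFrame untouched
add_review_length(original.copy())
has_col_after_copy = "review_length" in original.columns
print(has_col_after_copy)  # False

# Passing the DataFrame itself adds the column to it
add_review_length(original)
print("review_length" in original.columns)  # True
```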
Statistical Output Details
- review_length_stats: Includes count, mean, std, min, 25%, 50%, 75%, max
- verified_purchase_dist: Shows counts for 'Y' (verified) and 'N' (unverified)
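
These labels match what pandas produces from `Series.describe()` on numeric data and `Series.value_counts()` on categorical data, which is presumably how the two Series are built. A small illustration with made-up values:

```python
import pandas as pd

# Made-up word counts
lengths = pd.Series([2, 5, 1, 8], name="review_length")
length_stats = lengths.describe()
print(list(length_stats.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Made-up verification flags
verified = pd.Series(["Y", "N", "Y", "Y"], name="verified_purchase")
dist = verified.value_counts()
print(dist["Y"], dist["N"])  # 3 1

# normalize=True returns proportions instead of counts
proportions = verified.value_counts(normalize=True)
print(proportions["Y"])  # 0.75
```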
Usage Example
# Load and explore data
df = load_data('amazon_reviews.csv', sample_size=5000)
exploration_results = explore_data(df)
# Access specific statistics
print(f"Dataset shape: {exploration_results['shape']}")
print(f"Average review length: {exploration_results['review_length_stats']['mean']:.2f} words")
print(f"Missing values:\n{exploration_results['missing_values']}")