[function] load_data() - P3chys/textmining GitHub Wiki
load_data()
Function: Purpose
Loads the Amazon reviews dataset from a CSV file that uses tabs as the field separator, then performs initial data cleaning.
Syntax
load_data(file_path, sample_size=None)
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
file_path | str | Required | Path to the CSV file containing Amazon reviews data |
sample_size | int or None | None | Number of rows to randomly sample from the dataset. If None, loads entire dataset |
Returns
- Type: pandas.DataFrame
- Content: Cleaned dataset with 15 columns, with star_rating converted to integer
Column Structure
The function expects data with these 15 columns (tab-separated):
- marketplace - Amazon marketplace identifier
- customer_id - Unique customer identifier
- review_id - Unique review identifier
- product_id - Product identifier
- product_parent - Parent product identifier
- product_title - Product name/title
- product_category - Product category
- star_rating - Rating (1-5 stars)
- helpful_votes - Number of helpful votes
- total_votes - Total votes received
- vine - Vine program participation flag
- verified_purchase - Purchase verification status
- review_headline - Review title/headline
- review_body - Full review text
- review_date - Date of review
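As a quick sanity check before loading, the column list above can be written down as a constant and compared against a DataFrame. This is only a sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration and are not part of `load_data()`.

```python
import pandas as pd

# The 15 column names listed above, in the same order.
EXPECTED_COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id",
    "product_parent", "product_title", "product_category", "star_rating",
    "helpful_votes", "total_votes", "vine", "verified_purchase",
    "review_headline", "review_body", "review_date",
]


def missing_columns(df):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# An empty frame with the full schema has nothing missing.
schema_only = pd.DataFrame(columns=EXPECTED_COLUMNS)
print(missing_columns(schema_only))
```

An empty result means every expected column is present; any names returned point to a file that does not match the layout described here.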
Data Cleaning Operations
- Rating Conversion: Converts star_rating to numeric format using pd.to_numeric() with errors='coerce'
- Missing Data Handling: Drops rows where star rating conversion failed (NaN values)
- Type Casting: Converts cleaned star ratings to integer type
- Sampling: If sample_size is specified, randomly samples data with random_state=42 for reproducibility
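The four steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the repository's actual code: the name `load_data_sketch` is hypothetical, and it assumes the file has a header row and is read with `pd.read_csv(..., sep="\t")`.

```python
import pandas as pd


def load_data_sketch(file_path, sample_size=None):
    """Illustrative re-implementation of the documented steps (assumptions noted inline)."""
    # Assumption: tab-separated file whose first row holds the column names.
    df = pd.read_csv(file_path, sep="\t")

    # Rating conversion: values that cannot be parsed become NaN.
    df["star_rating"] = pd.to_numeric(df["star_rating"], errors="coerce")

    # Missing data handling: drop rows whose rating failed to convert.
    df = df.dropna(subset=["star_rating"]).copy()

    # Type casting: safe now that no NaN values remain in the column.
    df["star_rating"] = df["star_rating"].astype(int)

    # Sampling: a fixed seed makes repeated calls return the same rows.
    if sample_size is not None:
        df = df.sample(n=sample_size, random_state=42)

    return df
```

Note that `DataFrame.sample` raises a `ValueError` when `n` exceeds the number of available rows, so in this sketch `sample_size` has to be no larger than the cleaned dataset.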
Error Handling
- Non-numeric values in star_rating column are converted to NaN, then rows are dropped
- Missing or malformed ratings are automatically removed from dataset
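Because bad ratings are removed silently, it can help to see how the coercion behaves on a small example. The snippet below is a standalone illustration using made-up values; the row count it prints is something a caller could compute separately, not something the documentation says `load_data()` reports.

```python
import pandas as pd

# Made-up ratings: three valid values, one missing, two non-numeric strings.
raw = pd.DataFrame({"star_rating": ["5", "4", "N/A", None, "three", "1"]})

# errors='coerce' turns anything unparseable into NaN instead of raising.
ratings = pd.to_numeric(raw["star_rating"], errors="coerce")
dropped = int(ratings.isna().sum())

# Keep only rows with a valid rating and store it as an integer.
clean = raw.loc[ratings.notna()].assign(
    star_rating=ratings.dropna().astype(int)
)

print(f"dropped {dropped} of {len(raw)} rows")
print(clean["star_rating"].tolist())
```

Comparing the row count before and after cleaning in this way is one option for spotting a file with an unexpectedly high share of malformed ratings.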
Usage Example
# Load entire dataset
full_data = load_data('amazon_reviews.csv')
# Load sample of 10,000 reviews
sample_data = load_data('amazon_reviews.csv', sample_size=10000)