[function] load_data() - P3chys/textmining GitHub Wiki
Function: load_data()
Purpose
Loads an Amazon reviews dataset from a tab-separated file (the examples use a `.csv` extension) and performs initial data cleaning on the `star_rating` column.
Syntax
load_data(file_path, sample_size=None)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `file_path` | str | Required | Path to the CSV file containing Amazon reviews data |
| `sample_size` | int or None | None | Number of rows to randomly sample from the dataset. If None, loads entire dataset |
Returns
- Type: pandas.DataFrame
- Content: Cleaned dataset with 15 columns and processed star ratings
Column Structure
The function expects data with these 15 columns (tab-separated):
- `marketplace` - Amazon marketplace identifier
- `customer_id` - Unique customer identifier
- `review_id` - Unique review identifier
- `product_id` - Product identifier
- `product_parent` - Parent product identifier
- `product_title` - Product name/title
- `product_category` - Product category
- `star_rating` - Rating (1-5 stars)
- `helpful_votes` - Number of helpful votes
- `total_votes` - Total votes received
- `vine` - Vine program participation flag
- `verified_purchase` - Purchase verification status
- `review_headline` - Review title/headline
- `review_body` - Full review text
- `review_date` - Date of review
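The expected schema can be written down as a constant and checked against a loaded DataFrame. This is an illustrative sketch only: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, not part of `load_data()`.

```python
# Column names taken from the list above, in the documented order.
EXPECTED_COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id",
    "product_parent", "product_title", "product_category",
    "star_rating", "helpful_votes", "total_votes", "vine",
    "verified_purchase", "review_headline", "review_body",
    "review_date",
]


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`.

    Hypothetical helper for illustration; `columns` can be any iterable
    of names, such as `df.columns` from a pandas DataFrame.
    """
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

Calling `missing_columns(df.columns)` on a loaded DataFrame returns an empty list when all 15 columns are present.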
Data Cleaning Operations
- Rating Conversion: Converts `star_rating` to numeric format using `pd.to_numeric()` with `errors='coerce'`
- Missing Data Handling: Drops rows where star rating conversion failed (NaN values)
- Type Casting: Converts cleaned star ratings to integer type
- Sampling: If `sample_size` is specified, randomly samples data with `random_state=42` for reproducibility
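The four operations above can be sketched as follows. This is a minimal reconstruction from the description, not the repository's actual source: the name `load_data_sketch` is hypothetical, and it assumes the file has a header row containing the column names and that sampling happens after cleaning.

```python
import pandas as pd


def load_data_sketch(file_path, sample_size=None):
    """Approximate the documented behavior of load_data() (assumed, not verified)."""
    # Assumption: tab-separated input whose first row holds the column names.
    df = pd.read_csv(file_path, sep="\t")

    # Rating conversion: unparseable values become NaN instead of raising.
    df["star_rating"] = pd.to_numeric(df["star_rating"], errors="coerce")

    # Missing data handling and type casting: drop NaN ratings, then cast to int.
    df = df.dropna(subset=["star_rating"]).astype({"star_rating": int})

    # Sampling: a fixed seed makes the sample reproducible. pandas raises
    # ValueError if sample_size exceeds the number of remaining rows.
    if sample_size is not None:
        df = df.sample(n=sample_size, random_state=42)

    return df
```

Because `pd.read_csv` also accepts file-like objects, the sketch can be tried on an in-memory `io.StringIO` buffer without a real dataset on disk.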
Error Handling
- Non-numeric values in the `star_rating` column are converted to NaN, then the affected rows are dropped
- Missing or malformed ratings are automatically removed from the dataset
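To see which values survive this handling, the same pandas calls can be applied to a small hand-made Series. The sample values below are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Invented sample: two valid ratings, one junk string, one missing value,
# and one numeric string with a decimal point.
ratings = pd.Series(["5", "3", "N/A", None, "4.0"])

# errors='coerce' turns anything unparseable (and None) into NaN.
numeric = pd.to_numeric(ratings, errors="coerce")

# Dropping NaN and casting to int leaves only the usable ratings.
cleaned = numeric.dropna().astype(int)
dropped = len(ratings) - len(cleaned)

print(cleaned.tolist())  # [5, 3, 4]
print(dropped)           # 2
```

Rows are removed silently, so comparing row counts before and after cleaning, as `dropped` does here, is one way to notice how much data was discarded.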
Usage Example
# Load entire dataset
full_data = load_data('amazon_reviews.csv')
# Load sample of 10,000 reviews
sample_data = load_data('amazon_reviews.csv', sample_size=10000)