
Function: load_data()

Purpose

Loads the Amazon reviews dataset from a tab-separated CSV file and performs initial data cleaning.

Syntax

load_data(file_path, sample_size=None)

Parameters

Parameter    | Type        | Default  | Description
file_path    | str         | required | Path to the CSV file containing the Amazon reviews data
sample_size  | int or None | None     | Number of rows to randomly sample from the dataset; if None, the entire dataset is loaded

Returns

  • Type: pandas.DataFrame
  • Content: Cleaned dataset with the 15 columns listed below and star ratings converted to integers

Column Structure

The function expects data with these 15 columns (tab-separated):

  1. marketplace - Amazon marketplace identifier
  2. customer_id - Unique customer identifier
  3. review_id - Unique review identifier
  4. product_id - Product identifier
  5. product_parent - Parent product identifier
  6. product_title - Product name/title
  7. product_category - Product category
  8. star_rating - Rating (1-5 stars)
  9. helpful_votes - Number of helpful votes
  10. total_votes - Total votes received
  11. vine - Vine program participation flag
  12. verified_purchase - Purchase verification status
  13. review_headline - Review title/headline
  14. review_body - Full review text
  15. review_date - Date of review
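
When reading the file yourself, the same 15 names can be passed to pandas. A minimal sketch, assuming a tab-separated file without a header row (the project's actual read options may differ; if the file does carry a header, drop header= and names=):

import pandas as pd

# The 15 documented columns, in file order.
AMAZON_REVIEW_COLUMNS = [
    'marketplace', 'customer_id', 'review_id', 'product_id',
    'product_parent', 'product_title', 'product_category',
    'star_rating', 'helpful_votes', 'total_votes', 'vine',
    'verified_purchase', 'review_headline', 'review_body', 'review_date',
]

# header=None tells pandas the file has no header line of its own.
raw = pd.read_csv('amazon_reviews.csv', sep='\t',
                  header=None, names=AMAZON_REVIEW_COLUMNS)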

Data Cleaning Operations

  1. Rating Conversion: Converts star_rating to numeric format using pd.to_numeric() with errors='coerce'
  2. Missing Data Handling: Drops rows where star rating conversion failed (NaN values)
  3. Type Casting: Converts cleaned star ratings to integer type
  4. Sampling: If sample_size is specified, randomly samples the data with random_state=42 for reproducibility
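
Put together, these four steps suggest an implementation along the following lines. This is a sketch reconstructed from the documentation above, not the repository's exact code:

import pandas as pd

def load_data(file_path, sample_size=None):
    # Read the tab-separated reviews file (15 columns, see above).
    df = pd.read_csv(file_path, sep='\t')

    # 1. Rating conversion: non-numeric values become NaN.
    df['star_rating'] = pd.to_numeric(df['star_rating'], errors='coerce')

    # 2. Missing-data handling: drop rows whose rating failed to convert.
    df = df.dropna(subset=['star_rating'])

    # 3. Type casting: ratings are whole stars, so store them as integers.
    df['star_rating'] = df['star_rating'].astype(int)

    # 4. Sampling: a fixed seed makes repeated runs reproducible.
    if sample_size is not None:
        df = df.sample(n=sample_size, random_state=42)

    return df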

Error Handling

  • Non-numeric values in the star_rating column are converted to NaN, and the affected rows are dropped
  • Missing or malformed ratings are therefore removed from the dataset automatically
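
For instance, errors='coerce' maps anything unparseable to NaN, and the subsequent dropna() discards those rows:

import pandas as pd

ratings = pd.Series(['5', '3', 'five', None])
numeric = pd.to_numeric(ratings, errors='coerce')  # 5.0, 3.0, NaN, NaN
cleaned = numeric.dropna().astype(int)             # 5, 3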

Usage Example

# Load entire dataset
full_data = load_data('amazon_reviews.csv')

# Load sample of 10,000 reviews
sample_data = load_data('amazon_reviews.csv', sample_size=10000)