
Function: load_data()

Purpose

Loads the Amazon reviews dataset from a tab-separated CSV file and performs initial data cleaning.

Syntax

load_data(file_path, sample_size=None)

Parameters

Parameter    | Type        | Default  | Description
file_path    | str         | required | Path to the CSV file containing the Amazon reviews data
sample_size  | int or None | None     | Number of rows to randomly sample from the dataset; if None, the entire dataset is loaded

Returns

  • Type: pandas.DataFrame
  • Content: Cleaned dataset with the 15 columns listed below and star ratings converted to integers

Column Structure

The function expects data with these 15 columns (tab-separated):

  1. marketplace - Amazon marketplace identifier
  2. customer_id - Unique customer identifier
  3. review_id - Unique review identifier
  4. product_id - Product identifier
  5. product_parent - Parent product identifier
  6. product_title - Product name/title
  7. product_category - Product category
  8. star_rating - Rating (1-5 stars)
  9. helpful_votes - Number of helpful votes
  10. total_votes - Total votes received
  11. vine - Vine program participation flag
  12. verified_purchase - Purchase verification status
  13. review_headline - Review title/headline
  14. review_body - Full review text
  15. review_date - Date of review
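
When reading the file yourself, the same 15 names can be passed to pandas. A minimal sketch, assuming a tab-separated file without a header row (the project's actual read options may differ; if the file does carry a header, drop header= and names=):

import pandas as pd

# The 15 documented columns, in file order.
AMAZON_REVIEW_COLUMNS = [
    'marketplace', 'customer_id', 'review_id', 'product_id',
    'product_parent', 'product_title', 'product_category',
    'star_rating', 'helpful_votes', 'total_votes', 'vine',
    'verified_purchase', 'review_headline', 'review_body', 'review_date',
]

# header=None tells pandas the file has no header line of its own.
raw = pd.read_csv('amazon_reviews.csv', sep='\t',
                  header=None, names=AMAZON_REVIEW_COLUMNS)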

Data Cleaning Operations

  1. Rating Conversion: Converts star_rating to numeric format using pd.to_numeric() with errors='coerce'
  2. Missing Data Handling: Drops rows where star rating conversion failed (NaN values)
  3. Type Casting: Converts cleaned star ratings to integer type
  4. Sampling: If sample_size is specified, randomly samples the data with random_state=42 for reproducibility
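
Put together, these four steps suggest an implementation along the following lines. This is a sketch reconstructed from the documentation above, not the repository's exact code:

import pandas as pd

def load_data(file_path, sample_size=None):
    # Read the tab-separated reviews file (15 columns, see above).
    df = pd.read_csv(file_path, sep='\t')

    # 1. Rating conversion: non-numeric values become NaN.
    df['star_rating'] = pd.to_numeric(df['star_rating'], errors='coerce')

    # 2. Missing-data handling: drop rows whose rating failed to convert.
    df = df.dropna(subset=['star_rating'])

    # 3. Type casting: ratings are whole stars, so store them as integers.
    df['star_rating'] = df['star_rating'].astype(int)

    # 4. Sampling: a fixed seed makes repeated runs reproducible.
    if sample_size is not None:
        df = df.sample(n=sample_size, random_state=42)

    return df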

Error Handling

  • Non-numeric values in the star_rating column are converted to NaN, and the affected rows are dropped
  • Missing or malformed ratings are therefore removed from the dataset automatically
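
For instance, errors='coerce' maps anything unparseable to NaN, and the subsequent dropna() discards those rows:

import pandas as pd

ratings = pd.Series(['5', '3', 'five', None])
numeric = pd.to_numeric(ratings, errors='coerce')  # 5.0, 3.0, NaN, NaN
cleaned = numeric.dropna().astype(int)             # 5, 3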

Usage Example

# Load entire dataset
full_data = load_data('amazon_reviews.csv')

# Load sample of 10,000 reviews
sample_data = load_data('amazon_reviews.csv', sample_size=10000)