3: Data Understanding - 7-teens/7-teens-DSA3101-2410-Project GitHub Wiki

Data Sources and Collection Methods

To conduct a comprehensive analysis on customer behavior, sales patterns, and pricing strategies, raw datasets were acquired from reputable sources on Kaggle. The datasets provide a rich variety of features that capture key aspects of the Shopee platform’s operations, customer interactions, and sales transactions.

  1. Shopee Sales Data Apr - May 2023
    This dataset includes sales records from April to May 2023 on the Shopee platform, capturing a recent snapshot of customer transactions and product performance. It offers valuable insights into:

    • Sales transactions: Details of each purchase, including order date and product information.
    • Product attributes: Data such as product category, price, and discounts, which are essential for analyzing customer purchasing patterns, demand fluctuations, and pricing effectiveness.
    • Customer engagement: Information on user interactions with the platform, including purchase frequency and product ratings, which help assess customer satisfaction and loyalty.

    By analyzing this dataset, we can gain insights into short-term trends in customer behavior and identify patterns in sales performance, demand, and product popularity. This dataset also serves as a benchmark to compare monthly fluctuations in sales and demand across different product categories.

  2. Shopee Code League 2020
    This dataset was sourced from the Shopee Code League, a data competition hosted by Shopee in 2020. It includes historical data that enables long-term analysis and trend identification. The dataset comprises:

    • Customer demographics and purchase behavior: Data on customer profiles and past purchasing activity, essential for customer segmentation and lifetime value (CLV) calculations.
    • Product-level data: Detailed information on various product categories, subcategories, and pricing, which is crucial for demand forecasting, pricing optimization, and competition analysis.
    • Sales and promotions: Information on discounts and promotional campaigns, allowing for evaluation of the effectiveness of different marketing channels and promotional activities on sales uplift.

    The combination of both datasets allows for a more holistic understanding of Shopee’s customer base, sales trends, and the effectiveness of different marketing and pricing strategies. While the 2023 dataset provides recent transactional data, the 2020 dataset offers historical insights that enable comparisons over time, revealing changes in customer preferences, pricing trends, and marketing strategy outcomes.

By leveraging these datasets, we can perform advanced analyses on customer purchasing behavior, forecast demand, and develop dynamic pricing models. Additionally, these data sources support customer segmentation and churn analysis, which are essential for building targeted retention and marketing strategies.


Data Dictionary

customer_data.csv

Column Name Description Type Range/Values
index Index of the record Integer Sequential
customer_id Unique identifier for each customer Integer Unique
subject_line_length Length of the email subject line Integer 0 - 100
last_open_day Last day the customer opened an email String Date format (e.g., "YYYY-MM-DD")
last_login_day Last day the customer logged in String Date format
last_checkout_day Last day the customer checked out String Date format
open_count_last_10_days Number of times the customer opened emails in the last 10 days Integer 0 - 10+
open_count_last_30_days Number of times the customer opened emails in the last 30 days Integer 0 - 30+
open_count_last_60_days Number of times the customer opened emails in the last 60 days Integer 0 - 60+
login_count_last_10_days Login count in the last 10 days Integer 0 - 10+
login_count_last_30_days Login count in the last 30 days Integer 0 - 30+
login_count_last_60_days Login count in the last 60 days Integer 0 - 60+
checkout_count_last_10_days Number of checkouts in the last 10 days Integer 0 - 10+
checkout_count_last_30_days Number of checkouts in the last 30 days Integer 0 - 30+
checkout_count_last_60_days Number of checkouts in the last 60 days Integer 0 - 60+
open_flag Indicator if customer has opened an email recently Integer 0 (No), 1 (Yes)
row_id Unique row identifier Integer Unique
attr_1 Attribute 1 for customer data Float Varies
attr_2 Attribute 2 for customer data Float Varies
attr_3 Attribute 3 for customer data Float Varies
age Age of the customer Float 0 - 100+
domain Customer's email domain String Varies (e.g., "gmail.com")
gender Customer's gender String "Male", "Female", etc.
checkout_count_last_6_months Number of checkouts in the last 6 months Integer 0 - 180+
checkout_count_last_1_year Number of checkouts in the last year Integer 0 - 365+

orders_data.csv

Column Name Description Type Range/Values
order_id Unique identifier for each order Integer Unique
order_time Date and time of the order String Date and time format
product_id Product identifier for the ordered item Integer Unique
customer_id Identifier for the customer who placed the order Integer Unique
is_campaign Whether the order was part of a campaign Boolean TRUE, FALSE
additional_discount Additional discount applied to the order Float 0 - 100+
campaign_name Name of the associated campaign String Varies

products_data.csv

Column Name Description Type Range/Values
price_ori Original price of the product Float 0 - 10000+
item_category_detail Detailed category of the product String Varies
title Title of the product String Varies
item_rating Customer rating for the product String Varies (rating format)
seller_name Name of the seller String Varies
price_actual Actual price after discounts Float 0 - 10000+
total_rating Total rating score Float 0 - 5
total_sold Number of units sold Float 0 - 1000+
favorite Number of times marked as favorite Float 0 - 1000+
desc Description of the product String Varies
Stock Available stock for the product Integer 0 - 1000+
Ships_From Location from which the product is shipped String Varies
product_id Unique identifier for the product Integer Unique
main_category Main category of the product String Varies
mid_category Mid-level category String Varies
sub_category Sub-category String Varies

shopee_sales_data.csv

Column Name Description Type Range/Values
order_id Identifier for the order Integer Unique
product_id Identifier for the product Integer Unique
session_id Session identifier for the transaction Integer Unique
is_campaign Whether the order was part of a campaign Integer 0 (No), 1 (Yes)
campaign Campaign associated with the sale String Varies
main_category Main category of the product String Varies
price Actual price of the product Float 0 - 10000+
total_sold Total units sold for the product Float 0 - 1000+
title Title of the product String Varies
Stock Available stock for the product Integer 0 - 1000+
customer_id Customer identifier Integer Unique
promotional_discount_type Type of promotional discount String Varies
gender Gender of the customer String "Male", "Female", etc.
price_range Price range category String Varies
order_time Date and time of the order String Date and time format
discount Discount amount Integer 0 - 100+

campaign_data.csv

Column Name Description Type Range/Values
campaign_id Unique identifier for each campaign Integer Unique
campaign_type Type of campaign String Varies (e.g., "Email", "SMS")
campaign_cost Cost of running the campaign Integer 0 - 10000+
session_id Session identifier Integer Unique
start_date Start date of the campaign String Date format
end_date End date of the campaign String Date format
marketing_channel Marketing channel used for the campaign String Varies
promotional_discount_type Type of discount offered String Varies
promotional_discount_value Discount value Float 0 - 100+
gender Target gender for the campaign String "Male", "Female", etc.
age_group Age group targeted by the campaign String Varies (e.g., "18-24")
is_click Whether the campaign was clicked Integer 0 (No), 1 (Yes)
customer_id Targeted customer identifier Integer Unique
target_audience Description of the target audience String Varies

marketing_data.csv

Column Name Description Type Range/Values
date Date of record String Date format
channel_name Name of the marketing channel String Varies (e.g., "Facebook", "Google")
channel_expenditure Expenditure on the marketing channel Float 0 - 10000+
channel_visitors Number of visitors via the marketing channel Float 0 - 100000+
channel_sales Number of sales from the channel Float 0 - 1000+
conversion_rate Conversion rate of the channel Float 0 - 1
average_order_value Average order value from the channel Float 0 - 10000+
total_revenue Total revenue generated from the channel Float 0 - 100000+
click_through_rate Click-through rate for advertisements Float 0 - 1
bounce_rate Bounce rate of visitors Float 0 - 1
day Day of the week String e.g., "Monday", "Tuesday"

Entity Relationship Diagram (ERD)