3: Data Understanding - 7-teens/7-teens-DSA3101-2410-Project GitHub Wiki
Data Sources and Collection Methods
To conduct a comprehensive analysis on customer behavior, sales patterns, and pricing strategies, raw datasets were acquired from reputable sources on Kaggle. The datasets provide a rich variety of features that capture key aspects of the Shopee platform’s operations, customer interactions, and sales transactions.
-
Shopee Sales Data Apr - May 2023
This dataset includes sales records from April to May 2023 on the Shopee platform, capturing a recent snapshot of customer transactions and product performance. It offers valuable insights into:- Sales transactions: Details of each purchase, including order date and product information.
- Product attributes: Data such as product category, price, and discounts, which are essential for analyzing customer purchasing patterns, demand fluctuations, and pricing effectiveness.
- Customer engagement: Information on user interactions with the platform, including purchase frequency and product ratings, which help assess customer satisfaction and loyalty.
By analyzing this dataset, we can gain insights into short-term trends in customer behavior and identify patterns in sales performance, demand, and product popularity. This dataset also serves as a benchmark to compare monthly fluctuations in sales and demand across different product categories.
-
Shopee Code League 2020
This dataset was sourced from the Shopee Code League, a data competition hosted by Shopee in 2020. It includes historical data that enables long-term analysis and trend identification. The dataset comprises:- Customer demographics and purchase behavior: Data on customer profiles and past purchasing activity, essential for customer segmentation and lifetime value (CLV) calculations.
- Product-level data: Detailed information on various product categories, subcategories, and pricing, which is crucial for demand forecasting, pricing optimization, and competition analysis.
- Sales and promotions: Information on discounts and promotional campaigns, allowing for evaluation of the effectiveness of different marketing channels and promotional activities on sales uplift.
The combination of both datasets allows for a more holistic understanding of Shopee’s customer base, sales trends, and the effectiveness of different marketing and pricing strategies. While the 2023 dataset provides recent transactional data, the 2020 dataset offers historical insights that enable comparisons over time, revealing changes in customer preferences, pricing trends, and marketing strategy outcomes.
By leveraging these datasets, we can perform advanced analyses on customer purchasing behavior, forecast demand, and develop dynamic pricing models. Additionally, these data sources support customer segmentation and churn analysis, which are essential for building targeted retention and marketing strategies.
Data Dictionary
customer_data.csv
Column Name | Description | Type | Range/Values |
---|---|---|---|
index | Index of the record | Integer | Sequential |
customer_id | Unique identifier for each customer | Integer | Unique |
subject_line_length | Length of the email subject line | Integer | 0 - 100 |
last_open_day | Last day the customer opened an email | String | Date format (e.g., "YYYY-MM-DD") |
last_login_day | Last day the customer logged in | String | Date format |
last_checkout_day | Last day the customer checked out | String | Date format |
open_count_last_10_days | Number of times the customer opened emails in the last 10 days | Integer | 0 - 10+ |
open_count_last_30_days | Number of times the customer opened emails in the last 30 days | Integer | 0 - 30+ |
open_count_last_60_days | Number of times the customer opened emails in the last 60 days | Integer | 0 - 60+ |
login_count_last_10_days | Login count in the last 10 days | Integer | 0 - 10+ |
login_count_last_30_days | Login count in the last 30 days | Integer | 0 - 30+ |
login_count_last_60_days | Login count in the last 60 days | Integer | 0 - 60+ |
checkout_count_last_10_days | Number of checkouts in the last 10 days | Integer | 0 - 10+ |
checkout_count_last_30_days | Number of checkouts in the last 30 days | Integer | 0 - 30+ |
checkout_count_last_60_days | Number of checkouts in the last 60 days | Integer | 0 - 60+ |
open_flag | Indicator if customer has opened an email recently | Integer | 0 (No), 1 (Yes) |
row_id | Unique row identifier | Integer | Unique |
attr_1 | Attribute 1 for customer data | Float | Varies |
attr_2 | Attribute 2 for customer data | Float | Varies |
attr_3 | Attribute 3 for customer data | Float | Varies |
age | Age of the customer | Float | 0 - 100+ |
domain | Customer's email domain | String | Varies (e.g., "gmail.com") |
gender | Customer's gender | String | "Male", "Female", etc. |
checkout_count_last_6_months | Number of checkouts in the last 6 months | Integer | 0 - 180+ |
checkout_count_last_1_year | Number of checkouts in the last year | Integer | 0 - 365+ |
orders_data.csv
Column Name | Description | Type | Range/Values |
---|---|---|---|
order_id | Unique identifier for each order | Integer | Unique |
order_time | Date and time of the order | String | Date and time format |
product_id | Product identifier for the ordered item | Integer | Unique |
customer_id | Identifier for the customer who placed the order | Integer | Unique |
is_campaign | Whether the order was part of a campaign | Boolean | TRUE, FALSE |
additional_discount | Additional discount applied to the order | Float | 0 - 100+ |
campaign_name | Name of the associated campaign | String | Varies |
products_data.csv
Column Name | Description | Type | Range/Values |
---|---|---|---|
price_ori | Original price of the product | Float | 0 - 10000+ |
item_category_detail | Detailed category of the product | String | Varies |
title | Title of the product | String | Varies |
item_rating | Customer rating for the product | String | Varies (rating format) |
seller_name | Name of the seller | String | Varies |
price_actual | Actual price after discounts | Float | 0 - 10000+ |
total_rating | Total rating score | Float | 0 - 5 |
total_sold | Number of units sold | Float | 0 - 1000+ |
favorite | Number of times marked as favorite | Float | 0 - 1000+ |
desc | Description of the product | String | Varies |
Stock | Available stock for the product | Integer | 0 - 1000+ |
Ships_From | Location from which the product is shipped | String | Varies |
product_id | Unique identifier for the product | Integer | Unique |
main_category | Main category of the product | String | Varies |
mid_category | Mid-level category | String | Varies |
sub_category | Sub-category | String | Varies |
shopee_sales_data.csv
Column Name | Description | Type | Range/Values |
---|---|---|---|
order_id | Identifier for the order | Integer | Unique |
product_id | Identifier for the product | Integer | Unique |
session_id | Session identifier for the transaction | Integer | Unique |
is_campaign | Whether the order was part of a campaign | Integer | 0 (No), 1 (Yes) |
campaign | Campaign associated with the sale | String | Varies |
main_category | Main category of the product | String | Varies |
price | Actual price of the product | Float | 0 - 10000+ |
total_sold | Total units sold for the product | Float | 0 - 1000+ |
title | Title of the product | String | Varies |
Stock | Available stock for the product | Integer | 0 - 1000+ |
customer_id | Customer identifier | Integer | Unique |
promotional_discount_type | Type of promotional discount | String | Varies |
gender | Gender of the customer | String | "Male", "Female", etc. |
price_range | Price range category | String | Varies |
order_time | Date and time of the order | String | Date and time format |
discount | Discount amount | Integer | 0 - 100+ |
campaign_data.csv
Column Name | Description | Type | Range/Values |
---|---|---|---|
campaign_id | Unique identifier for each campaign | Integer | Unique |
campaign_type | Type of campaign | String | Varies (e.g., "Email", "SMS") |
campaign_cost | Cost of running the campaign | Integer | 0 - 10000+ |
session_id | Session identifier | Integer | Unique |
start_date | Start date of the campaign | String | Date format |
end_date | End date of the campaign | String | Date format |
marketing_channel | Marketing channel used for the campaign | String | Varies |
promotional_discount_type | Type of discount offered | String | Varies |
promotional_discount_value | Discount value | Float | 0 - 100+ |
gender | Target gender for the campaign | String | "Male", "Female", etc. |
age_group | Age group targeted by the campaign | String | Varies (e.g., "18-24") |
is_click | Whether the campaign was clicked | Integer | 0 (No), 1 (Yes) |
customer_id | Targeted customer identifier | Integer | Unique |
target_audience | Description of the target audience | String | Varies |
marketing_data.csv
Column Name | Description | Type | Range/Values |
---|---|---|---|
date | Date of record | String | Date format |
channel_name | Name of the marketing channel | String | Varies (e.g., "Facebook", "Google") |
channel_expenditure | Expenditure on the marketing channel | Float | 0 - 10000+ |
channel_visitors | Number of visitors via the marketing channel | Float | 0 - 100000+ |
channel_sales | Number of sales from the channel | Float | 0 - 1000+ |
conversion_rate | Conversion rate of the channel | Float | 0 - 1 |
average_order_value | Average order value from the channel | Float | 0 - 10000+ |
total_revenue | Total revenue generated from the channel | Float | 0 - 100000+ |
click_through_rate | Click-through rate for advertisements | Float | 0 - 1 |
bounce_rate | Bounce rate of visitors | Float | 0 - 1 |
day | Day of the week | String | e.g., "Monday", "Tuesday" |