4: Data Preparation - 7-teens/7-teens-DSA3101-2410-Project GitHub Wiki

Data Cleaning Processes

This section focuses on the essential steps for preparing the dataset for analysis. It includes data loading, cleaning, transformation, and feature engineering processes to ensure the dataset is ready for modeling and analysis. Key tasks covered:

Data Collection & Integration
Data Inspection and Cleaning
Synthetic Data Generation

1. Data Collection & Integration

customer and customer_age dataframe: Obtained from Shopee Code League
products: Obtained from scraped Shopee platform Data

customer = pd.read_csv("train.csv") # main user data file
customer_age = pd.read_csv("users.csv") # age domain etc
products = pd.read_csv("products.csv")

2. Data Inspection and Cleaning

Key Tasks Covered

Analyze data types and structures: Verify data types, formats, and structure consistency.
Identify missing or incomplete data: Locate, quantify, and assess patterns of missing data.
Detect outliers and anomalies: Identify and investigate unusual or extreme values.
Assess data distribution and patterns: Understand trends, distributions, and relationships in the data.

Data Inspection for Customer Table

Left Join Two Customer Tables: Merge tables to enrich data while keeping all records from the first table.
Filter for Country Code 6 (Singapore): Focus analysis on a specific region.
Drop Unnecessary Columns: Remove irrelevant or redundant fields to streamline the dataset.
Rename Columns: Use clear, descriptive names for better readability and consistency.

customer = pd.merge(customer, customer_age, how='left', on='user_id')
SG_customer = customer[customer.country_code == 6].reset_index()
SG_customer = SG_customer.drop(columns = ['index', 'country_code', 'grass_date'])
SG_customer = SG_customer.rename(columns={'user_id': 'customer_id'})

# Assuming customer['last_checkout_day'] contains non-integer values
SG_customer['last_checkout_day'] = pd.to_numeric(SG_customer['last_checkout_day'], errors='coerce').astype('Int64')
SG_customer
SG_customer[SG_customer['last_checkout_day'] < 60]
SG_customer.head(5)

#### Checking For Duplicates
SG_customer.customer_id.value_counts()[SG_customer.customer_id.value_counts() >= 2]
SG_customer = SG_customer.drop_duplicates(subset='customer_id', keep='first')
SG_customer = SG_customer.reset_index()

### Data Cleaning For Products

Data Inspection for Products Dataset

We are working with data scraped from Shopee Malaysia. Given the similarities in pricing and delivery between Shopee Malaysia and Singapore, we take the following steps:

Change Data Types and Currency:
- Convert necessary columns from string to integer.
- Change currency from MYR to SGD using today's exchange rate.
Extract Information in Specifications:
- Extract details such as delivery location and other relevant attributes from the specifications column.
Categorization:
- Apply a function to extract main category, middle category, and subcategory for better organization.
Drop Unnecessary Columns:
- Remove irrelevant or redundant columns to simplify the dataset.

# Use regular expressions to extract 'Stock' and 'Ships From' values
products['Stock'] = products['specification'].str.extract(r'Stock\s(\d+)').fillna(0).astype(int)
products['Ships_From'] = products['specification'].str.extract(r'Ships From\s(.+)$')

# MYR to SGD
products['price_ori'] = products['price_ori']/3.21118 
products['price_actual'] = products['price_actual']/3.21118 
def convert_rating(value):
    try:
        if 'k' in value:
            # Convert values like '8.1k' to 8100
            return int(float(value.replace('k', '')) * 1000)
        else:
            # Convert normal string numbers to integers
            return int(value)
    except:
        return np.nan

# Apply the conversion function to the 'total_rating' column
products['total_rating'] = products['total_rating'].apply(convert_rating)
products['total_sold'] = products['total_sold'].apply(convert_rating)
def convert_favorite(value):
    # Remove 'Favorite (' and convert the result to string
    try:
        value = value.replace('Favorite (', '').replace(')', '')
    except:
        return np.nan
    return(convert_rating(value))

products['favorite'] = products['favorite'].apply(convert_favorite)
products= products.drop(['delivery', 'specification', 'w_date', 'link_ori', 'idElastic', 'sitename', 'idHash', 'id', 'pict_link', 'timestamp'], axis=1)
products['product_id'] = products.index + 1
location_to_country_non_malaysia = {
    "Mainland China": "China",
    "Indonesia": "Indonesia",
    "Korea": "South Korea",
    "Vietnam": "Vietnam",
    "Taiwan": "Taiwan",
    "Thailand": "Thailand",
    "KOTA JAKARTA SELATAN - KEBAYORAN BARU, DKI JAKARTA, ID": "Indonesia",
    "KAB. MOJOKERTO - MOJOSARI, JAWA TIMUR, ID": "Indonesia",
    "KOTA JAKARTA UTARA - KELAPA GADING, DKI JAKARTA, ID": "Indonesia",
    "KOTA JAKARTA PUSAT - TANAH ABANG, DKI JAKARTA, ID": "Indonesia",
    "KOTA JAKARTA BARAT - KEBON JERUK, DKI JAKARTA, ID": "Indonesia"
}

# Apply mapping to 'Ship_From' column
products['Ships_From'] = products['Ships_From'].apply(lambda x: location_to_country_non_malaysia.get(x, "Malaysia"))
products['main_category', 'mid_category', 'sub_category'](/7-teens/7-teens-DSA3101-2410-Project/wiki/'main_category',-'mid_category',-'sub_category') = products['item_category_detail'].str.split('|').apply(lambda x: [x[1].strip(), x[-2].strip(), x[-1].strip()]).apply(pd.Series)

# Function to determine Ships_From based on main_category
def assign_ships_from(row):
    if row['main_category'] == 'Electronics':
        return 'Singapore' if np.random.rand() < 0.8 else row['Ships_From']
    else:
        return 'Singapore' if np.random.rand() < 0.4 else row['Ships_From']

# Apply function to 'Ships_From' column based on 'main_category'
products['Ships_From'] = products.apply(assign_ships_from, axis=1)

This code processes and cleans a dataset of products, focusing on extracting key attributes, normalizing values, and deriving additional insights.

Extracting 'Stock' and 'Ships From':
- Use regular expressions to extract the Stock quantity and the Ships From location from the specification column.
- Missing Stock values are filled with 0 and converted to integers.
Currency Conversion (MYR to SGD):
- Convert price_ori and price_actual from Malaysian Ringgit (MYR) to Singapore Dollars (SGD) using the exchange rate (3.21118).
Rating and Sold Count Conversion:
- Convert total_rating and total_sold values, handling formats like '8.1k' (representing 8100) and other numeric strings.
Favorite Count Conversion:
- Extract and clean favorite counts from strings like 'Favorite (123)' using a helper function and the same conversion logic as ratings.
Dropping Irrelevant Columns:
- Remove unnecessary columns (e.g., delivery, specification, idHash) to simplify the dataset.
Adding Product IDs:
- Create a unique product_id for each product by indexing the DataFrame.
Mapping Shipping Locations:
- Standardize Ships_From values by mapping specific locations to their respective countries, defaulting to "Malaysia" for unmapped locations.
Splitting Categories:
- Split the item_category_detail column into main_category, mid_category, and sub_category based on a | delimiter.
Assigning 'Ships_From' Based on Category:
- Apply a function to determine the Ships_From location probabilistically, depending on the product's main_category:
  - For Electronics, 80% of products are assumed to ship from Singapore.
  - For other categories, 40% are assumed to ship from Singapore.

Removing Irrelevant Columns

Automotive (e.g., Cars):
- Exclude categories like cars and other automotive items that cannot be sold on Shopee Singapore.
Muslim Fashion:
- Remove this category if it's not relevant to the target market or analysis.
Free Gifts:
- Eliminate free gifts that create outliers in price data.

products = products[~((products.mid_category == 'Automotive') & (products.price_actual > 1000))]
products = products[~(products.main_category == 'Muslim Fashion')]
products = products[~(products.price_actual > 10000)]

filter_phrases = ['do not purchase', 'free gift with purchase', 'not for sale', 'free gift', 'not for sell', 'gwp', 'gift with purchase']

for i in filter_phrases:
    products = products[~(products.title.str.contains(i, case=False, na=False))]

products = products[~((products.price_actual > 1000) & (products.title.str.contains('gift', case=False, na=False)))]
products[(products.price_actual > 1000) & (products.title.str.contains('gift', case=False, na=False))]
products.head(5)
products.main_category.value_counts()
products.isna().sum()

# -1 -> No rating

# Fill price_ori with price_actual if price_ori is NA and vice versa
products.loc[products['price_ori'].isna() & products['price_actual'].notna(), 'price_ori'] = products['price_actual']
products.loc[products['price_actual'].isna() & products['price_ori'].notna(), 'price_actual'] = products['price_ori']

# Drop rows where both price_ori and price_actual are NA
products.dropna(subset=['price_ori', 'price_actual'], how='all', inplace=True)

# Fill total_rating with 0 if NA, and item_rating with -1 if NA
products['total_rating'].fillna(0, inplace=True)
products['item_rating'].fillna(-1, inplace=True)

# Fill total_sold and favorite with 0 if NA
products['total_sold'].fillna(0, inplace=True)
products['favorite'].fillna(0, inplace=True)

3. Synthetic Data Generation

Due to limited data availability (checkout count without order details), synthetic data is generated as follows:

Synthetic Data for Age:
- Simulate customer ages assuming independence and normal distribution across all customers.
- Age groups are derived from a Statista report on cross-border online shoppers in Singapore:
  - 18–24: 13%
  - 25–34: 20%
  - 35–44: 22%
  - 45–54: 22%
  - 55–64: 23%
Synthetic Data for Checkout Count:
- Generate synthetic checkout data for the last 6 months and 1 year.
Synthetic Data for Orders:
- Create order data based on synthetic checkout data and other relevant patterns.
Matching Last Checkout Date:
- Align the last checkout date between the customer dataset and the orders dataset.
Matching Last Login Date:
- Ensure the last login date corresponds to or precedes the last checkout date.
Synthetic Data for Prices:
- Generate data for items priced below 25 cents to address pricing gaps.

np.random.seed(42)

# Define the age ranges as tuples
age_ranges = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64)]
age_probabilities = [0.13, 0.20, 0.22, 0.22, 0.23]

# Identify missing age values
missing_age_mask = SG_customer['age'].isna()

# Randomly select indices based on probabilities, then use these indices to select the range
chosen_indices = np.random.choice(len(age_ranges), size=missing_age_mask.sum(), p=age_probabilities)
random_ages = [np.random.randint(age_ranges[i][0], age_ranges[i][1] + 1) for i in chosen_indices]

# Fill missing age values with the generated random ages
SG_customer.loc[missing_age_mask, 'age'] = random_ages
SG_customer['gender'] = np.random.choice(['Male', 'Female'], size=len(SG_customer))

Making Synthetic Data for Checkout Count in Last 6 Months and 1 Year

To estimate checkout counts for the last 6 months and 1 year, we employ the following approach:

Linear Regression Model:
- Utilize checkout count data for the last 10, 30, and 60 days as input features.
- Fit a linear regression model for each customer to predict their checkout count over extended periods.
Extrapolation:
- Use the regression model to project checkout counts for 6-month and 1-year intervals.

This approach leverages existing short-term data trends to generate realistic long-term synthetic data for checkout counts.

from sklearn.linear_model import LinearRegression

# Prepare the data in the required format
# Create the feature array (time points: 10, 30, 60 days)
time_points = np.array([10, 30, 60]).reshape(-1, 1)

# For each customer, we will fit a linear regression model
def predict_checkout_counts(row):
    # The target array (checkout counts corresponding to time points)
    checkout_counts = np.array([
        row['checkout_count_last_10_days'], 
        row['checkout_count_last_30_days'], 
        row['checkout_count_last_60_days']
    ])
    
    # Fit the linear regression model
    model = LinearRegression()
    model.fit(time_points, checkout_counts)
    
    # Predict checkout counts for 180 days (6 months) and 365 days (1 year)
    checkout_6_months = model.predict(np.array([180](/7-teens/7-teens-DSA3101-2410-Project/wiki/180)))[0]
    checkout_1_year = model.predict(np.array([365](/7-teens/7-teens-DSA3101-2410-Project/wiki/365)))[0]
    
    # Add some random noise for variability
    checkout_6_months += np.random.normal(0, 5)  # Adjust the noise level as needed
    checkout_1_year += np.random.normal(0, 10)   # Adjust the noise level as needed

    checkout_6_months = max(checkout_6_months, 0)
    checkout_1_year = max(checkout_1_year, 0)

    return pd.Series([checkout_6_months, checkout_1_year])

# Apply the function to each row in the DataFrame
SG_customer['checkout_count_last_6_months', 'checkout_count_last_1_year'](/7-teens/7-teens-DSA3101-2410-Project/wiki/'checkout_count_last_6_months',-'checkout_count_last_1_year') = SG_customer.apply(predict_checkout_counts, axis=1)

# Ensure that checkout_count_last_1_year is always greater than checkout_count_last_6_months
SG_customer['checkout_count_last_1_year'] = SG_customer.apply(
    lambda row: max(row['checkout_count_last_1_year'], row['checkout_count_last_6_months'] + np.random.uniform(0, 5)),
    axis=1
)

print(SG_customer['checkout_count_last_10_days', 'checkout_count_last_30_days', 'checkout_count_last_60_days', 'checkout_count_last_6_months', 'checkout_count_last_1_year'](/7-teens/7-teens-DSA3101-2410-Project/wiki/'checkout_count_last_10_days',-'checkout_count_last_30_days',-'checkout_count_last_60_days',-'checkout_count_last_6_months',-'checkout_count_last_1_year').head())

# Apply rounding to ensure values are integers
SG_customer['checkout_count_last_6_months'] = SG_customer['checkout_count_last_6_months'].round().astype(int)
SG_customer['checkout_count_last_1_year'] = SG_customer['checkout_count_last_1_year'].round().astype(int)
SG_customer.head(5)
sum(SG_customer.checkout_count_last_1_year)

Making Synthetic Data for Orders

Synthetic order data generation considers three key aspects:

Order Time (Morning/Noon/Night):
- Assume order time is independent for each customer.
- Apply a random distribution with specified probabilities for morning, noon, and night order times.
Order Date (Weekdays/Weekends/Paydays):
- Correlate order dates with checkout count to reflect realistic purchasing behavior.
- Assign equal probability to weekdays except for weekends.
- Increase probability for specific dates like public holidays and promotional events (e.g., 9.9, 10.10, 11.11, 12.12, Black Friday).
Product Categories:
- Assign product categories randomly to each order, reflecting the diversity of customer preferences.

from datetime import datetime, timedelta

# Assume 29000 unique customers
num_customers = len(SG_customer) + 100

# Define proportions for each preference group
high_preference_ratio = 0.3  # 50% of customers prefer discounts highly
moderate_preference_ratio = 0.15  # 30% are moderately responsive
low_preference_ratio = 0.2    # 20% are indifferent
a_preference_ratio = 0.2 
b_preference_ratio = 0.15 

# Calculate group sizes
high_preference_size = int(num_customers * high_preference_ratio)
moderate_preference_size = int(num_customers * moderate_preference_ratio)
low_preference_size = int(num_customers * low_preference_ratio)
a_preference_size = int(num_customers * a_preference_ratio)
b_preference_size = num_customers - high_preference_size - moderate_preference_size - low_preference_size - a_preference_size

# Define weight arrays for each preference group
high_preference_weights = [8, 2, 1, 5]   # Strong discount preference
moderate_preference_weights = [4, 2, 1, 3]  # Moderate discount preference
low_preference_weights = [2, 2, 1, 2]    # Indifferent to discount
a_preference_weights = [2, 2, 1, 5]  # a
b_preference_weights = [6, 2, 1, 6]  # b

# Assign weights to each customer
weights = (
    [high_preference_weights] * high_preference_size +
    [moderate_preference_weights] * moderate_preference_size +
    [low_preference_weights] * low_preference_size + 
    [a_preference_weights] * a_preference_size +
    [b_preference_weights] * b_preference_size
)

# Shuffle to randomize the distribution among customers
np.random.shuffle(weights)

# Create a DataFrame to store each customer's weights
customer_weights = pd.DataFrame({
    'SG_customer': [f'SG_customer_{i}' for i in range(1, num_customers + 1)],
    'weights': weights
})

# Define all special dates
high_priority_mega_sales_dates = ['2019-11-11', '2019-09-09', '2019-10-10', '2019-12-12']
additional_mega_sales_dates = [
    '2019-03-03', '2019-04-04', '2019-05-05',
    '2019-06-06', '2019-07-07', '2019-08-08'
]

other_seasonal_sales_dates = [
    datetime(2019, 2, 5),   # Chinese New Year Sale
    datetime(2019, 5, 1),   # Labor Day Sale
    datetime(2019, 6, 5),   # Hari Raya Puasa Sale (Eid al-Fitr)
    datetime(2019, 8, 9),   # Singapore National Day Sale
    datetime(2019, 8, 11),  # Hari Raya Haji Sale (Eid al-Adha)
    datetime(2019, 10, 27), # Deepavali Sale (Diwali)
    datetime(2019, 12, 2),  # Cyber Monday Sale
    datetime(2019, 12, 25), # Christmas Sale
    datetime(2019, 12, 31)  # New Year’s Eve Sale
]

high_priority_seasonal_sales_dates = ['2019-11-29']

# Combine all dates into one list
high_priority_mega_sales_dates = [datetime.strptime(date, '%Y-%m-%d') for date in high_priority_mega_sales_dates]
additional_mega_sales_dates = [datetime.strptime(date, '%Y-%m-%d') for date in additional_mega_sales_dates]
high_priority_seasonal_sales_dates = [datetime.strptime(date, '%Y-%m-%d') for date in high_priority_seasonal_sales_dates]

# Function to determine if a date is a special date
def is_more_special_date(date):
    return date in high_priority_mega_sales_dates or date in high_priority_seasonal_sales_dates 

def is_less_special_date(date):
    return date in additional_mega_sales_dates or date in other_seasonal_sales_dates

# Updated day_weight function to select weights based on date and index
def day_weight(date, index):
    weights = customer_weights.iloc[index]['weights']
    if is_more_special_date(date):
        return weights[0]  # Discount weight
    elif is_less_special_date(date):
        return weights[3]  # Discount weight
    elif date.weekday() >= 4:  # Weekends (Friday, Saturday, Sunday)
        return weights[1]  # Weekend weight
    else:
        return weights[2]  # Normal day weight

# Function to assign weights to time periods
def time_weight(hour):
    if 7 <= hour <= 9 or 18 <= hour <= 23:  # High-probability times: 8-11AM, 6-11PM
        return 2.5
    elif 11 <= hour <= 15:
        return 2.0
    else:
        return 0.5

# Generate probability distribution for dates in 2019
def generate_date_distribution(start_date, end_date, index):
    dates = pd.date_range(start_date, end_date)
    weights = np.array([day_weight(date, index) for date in dates])
    probabilities = weights / weights.sum()  # Normalize weights to sum to 1
    return dates, probabilities

# Generate probability distribution for time ranges in a day
def generate_time_distribution():
    # Define time ranges and assign weights
    time_ranges = [
        (7, 9),    # Morning
        (11, 15),  # Lunch
        (18, 23),  # Evening
        (0, 7),    # Midnight to early morning
        (15, 18)   # Afternoon
    ]
    weights = np.array([time_weight((start + end) // 2) for start, end in time_ranges])
    probabilities = weights / weights.sum()  # Normalize weights to sum to 1
    return time_ranges, probabilities

# Function to generate order times based on the weighted probability distributions
def generate_order_times(num_orders, start_date, end_date, idx):
    dates, date_probabilities = generate_date_distribution(start_date, end_date, idx)
    time_ranges, time_probabilities = generate_time_distribution()

    orders = []
    for _ in range(num_orders):
        # Choose a date based on the probability distribution
        chosen_date = np.random.choice(dates, p=date_probabilities)

        # Convert numpy datetime64 to pandas Timestamp, then to Python datetime
        chosen_date = pd.Timestamp(chosen_date).to_pydatetime()

        # Choose a time range based on the probability distribution
        chosen_time_range = np.random.choice(len(time_ranges), p=time_probabilities)
        start_hour, end_hour = time_ranges[chosen_time_range]
        
        # Randomly pick an hour and minute within the chosen time range
        chosen_hour = np.random.randint(start_hour, end_hour + 1)
        chosen_minute = np.random.randint(0, 60)
        
        # Combine the chosen date and time into a datetime object
        order_datetime = datetime(
            chosen_date.year, chosen_date.month, chosen_date.day,
            chosen_hour, chosen_minute
        )

        is_campaign = False
        additional_discount = 0
        if is_more_special_date(chosen_date) or is_less_special_date(chosen_date):
            is_campaign = True
            additional_discount = np.random.randint(4, 16) * 5

        orders.append([order_datetime, is_campaign, additional_discount])

    return orders

from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

## Function to generate random order times within a specific time range
def generate_order_times(num_orders, start_date, end_date):
    return [start_date + (end_date - start_date) * np.random.random() for _ in range(num_orders)]

# Function to generate orders for each customer
def generate_orders_for_customer(customer_row, product_ids, idx):
    orders = []

     # Generate order times for the last 10 days
    total_orders_last_10_days = customer_row['checkout_count_last_10_days']
    if total_orders_last_10_days > 0:
        start_10_days = datetime(2019, 12, 31) - timedelta(days=10)
        end_10_days = datetime(2019, 12, 31)
        order_times_last_10_days = generate_order_times(total_orders_last_10_days, start_10_days, end_10_days, idx)
        orders.extend([(order_time, np.random.choice(product_ids), customer_row['customer_id']) for order_time in order_times_last_10_days])

    # Generate order times for the last 30 days
    total_orders_last_30_days = customer_row['checkout_count_last_30_days'] - customer_row['checkout_count_last_10_days']
    if total_orders_last_30_days > 0:
        start_30_days = datetime(2019, 12, 31) - timedelta(days=30)
        end_30_days = datetime(2019, 12, 31) - timedelta(days=11)
        order_times_last_30_days = generate_order_times(total_orders_last_30_days, start_30_days, end_30_days, idx)
        orders.extend([(order_time, np.random.choice(product_ids), customer_row['customer_id']) for order_time in order_times_last_30_days])
    
    # Generate order times for the last 60 days (excluding last 30 days)
    total_orders_last_60_days = customer_row['checkout_count_last_60_days'] - customer_row['checkout_count_last_30_days']
    if total_orders_last_60_days > 0:
        start_60_days = datetime(2019, 12, 31) - timedelta(days=60)
        end_60_days = datetime(2019, 12, 31) - timedelta(days=31)
        order_times_last_60_days = generate_order_times(total_orders_last_60_days, start_60_days, end_60_days, idx)
        orders.extend([(order_time, np.random.choice(product_ids), customer_row['customer_id']) for order_time in order_times_last_60_days])
    
    # Generate order times for the last 180 days (excluding last 30 days)
    total_orders_last_6_months = customer_row['checkout_count_last_6_months'] - customer_row['checkout_count_last_60_days']
    if total_orders_last_6_months> 0:
        start_60_days = datetime(2019, 12, 31) - relativedelta(months=6)
        end_60_days = datetime(2019, 12, 31) - timedelta(days=61)
        order_times_last_6_months = generate_order_times(total_orders_last_6_months, start_60_days, end_60_days, idx)
        orders.extend([(order_time, np.random.choice(product_ids), customer_row['customer_id']) for order_time in order_times_last_6_months])

    # Generate order times for the rest of the year (excluding last 60 days)
    total_orders_last_1_year = customer_row['checkout_count_last_1_year'] - customer_row['checkout_count_last_6_months'] 
    if total_orders_last_1_year > 0:
        start_1_year = datetime(2019, 1, 1)
        end_1_year = datetime(2019, 12, 31) - relativedelta(months=6) - timedelta(days=1)
        order_times_last_1_year = generate_order_times(total_orders_last_1_year, start_1_year, end_1_year, idx)
        orders.extend([(order_time, np.random.choice(product_ids), customer_row['customer_id']) for order_time in order_times_last_1_year])

    # Generate order IDs and construct order records
    order_ids = [i for i in range(len(order_dataset) + 1, len(order_dataset) + len(orders) + 1)]
    
    orders_df = pd.DataFrame({
        'order_id': order_ids,
        'order_time': [order[0][0] for order in orders],
        'is_campaign': [order[0][1] for order in orders],
        'additional_discount': [order[0][2] for order in orders],
        'product_id': [order[1] for order in orders],
        'customer_id': [order[2] for order in orders]
    })

    return orders_df

# Initialize the order dataset
order_dataset = pd.DataFrame(columns=['order_id', 'order_time', 'product_id', 'customer_id'])

# Get product IDs from the product dataset
product_ids = products['product_id'].values

# Generate orders for each customer
for idx, customer_row in SG_customer.iterrows():
    customer_orders = generate_orders_for_customer(customer_row, product_ids, idx)
    order_dataset = pd.concat([order_dataset, customer_orders], ignore_index=True)
print(idx)

# Function to assign campaign names based on the order date
def get_campaign_name(order_time):
    order_date = order_time.date()
    if order_date in [d.date() for d in other_seasonal_sales_dates]:
        return "Other Seasonal Sales Date"
    elif order_date in [d.date() for d in high_priority_mega_sales_dates]:
        return "High Priority Mega Sales Date"
    elif order_date in [d.date() for d in additional_mega_sales_dates]:
        return "Additional Mega Sales Date"
    elif order_date in [d.date() for d in high_priority_seasonal_sales_dates]:
        return "High Priority Seasonal Sales Date"
    else:
        return np.nan  # If no match, return None

# Apply the function to create the 'campaign_name' column
order_dataset['campaign_name'] = order_dataset['order_time'].apply(get_campaign_name)
order_dataset
SG_customer
from datetime import date

# Ensure order_time is in datetime format
order_dataset['order_time'] = pd.to_datetime(order_dataset['order_time'], errors='coerce')

# Function to apply the logic based on 'last_checkout_day'
def process_checkout(row):
    min_order_time = order_dataset[order_dataset.customer_id == row.customer_id]['order_time'].max()
    if pd.notnull(min_order_time):
        row['last_checkout_day'] = (date(2019, 12, 31) - min_order_time.date()).days
    else:
        row['last_checkout_day'] = 'Never Checkout'  # or any default value if needed
    return row

# Apply the function to the DataFrame
SG_customer = SG_customer.apply(process_checkout, axis=1)
SG_customer
sum(SG_customer.checkout_count_last_1_year)

Code Explanation

This code simulates a detailed order generation process for customers over a year, considering shopping behaviors, special sales events, and customer preferences.

1. Define Customer Preferences

Calculate the number of customers based on their preference for discounts (high, moderate, low, a, b) using predefined ratios.
Assign discount weight arrays to each group, representing their likelihood to shop during special events, weekends, or regular days.
Shuffle weights to randomize the customer distribution.

2. Special Sales Dates

Define key sales dates such as mega sales, seasonal events, and high-priority sales.
Functions is_more_special_date and is_less_special_date identify whether a date corresponds to these events.

3. Generate Weighted Probabilities

Date Weights: Use the day_weight function to assign weights based on special dates, weekends, or regular days.
Time Weights: Define time slots with different probabilities for shopping during morning, afternoon, or evening.

4. Generate Orders

Generate orders for customers based on their checkout_count data over:
- Last 10 days
- Last 30 days
- Last 60 days
- Last 6 months
- The entire year (excluding the last 6 months)
Assign order times by sampling from weighted date and time distributions.
Associate each order with a product ID and customer ID.

5. Assign Campaign Information

A get_campaign_name function maps each order to a campaign based on the order date (e.g., Mega Sales or Seasonal Sales).

6. Add `last_checkout_day` to Customers

For each customer, determine the most recent order date from the order dataset.
Calculate the number of days since their last checkout or set a default value if no orders exist.

7. Output and Summary

The final order_dataset contains:
- order_id, order_time, product_id, customer_id, campaign_name
The SG_customer dataset includes:
- Updated last_checkout_day and total checkout counts.

This simulation captures customer purchasing behaviors while aligning with realistic sales trends and preferences.

Making Synthetic Data for Prices Less Than 10 Cents

To address suspiciously low prices for some products, we follow this approach:

Identify Distribution:
- Analyze the distribution of products priced below 10 cents to understand their characteristics and frequency.
Adjust Pricing:
- Apply a function to align these prices with actual pricing trends observed in the Singapore marketplace.
- Ensure adjustments are consistent with marketplace norms and supplier pricing.

products[products.price_ori < 1].head()

WARNING: Need More Research

Current approach is based on observation only, requiring further research to ensure accuracy.
For now, prices below 10 cents will be adjusted by multiplying them by a scalar factor of 4.

# Apply different multipliers based on the conditions
products.loc[products['price_ori'] < 0.01, ['price_ori', 'price_actual']] = products.loc[products['price_ori'] < 0.01, ['price_ori', 'price_actual']].apply(lambda x: x * np.random.randint(10, 15), axis=1)
products.loc[products['price_ori'] < 0.1, ['price_ori', 'price_actual']] = products.loc[products['price_ori'] < 0.1, ['price_ori', 'price_actual']].apply(lambda x: x * np.random.randint(8,11), axis=1)
products.loc[products['price_ori'] < 0.5, ['price_ori', 'price_actual']] = products.loc[products['price_ori'] < 0.5, ['price_ori', 'price_actual']].apply(lambda x: x * np.random.randint(1, 4), axis=1)
products.loc[products['price_ori'] < 1, ['price_ori', 'price_actual']] = products.loc[products['price_ori'] < 1, ['price_ori', 'price_actual']].apply(lambda x: x * np.random.randint(1, 3), axis=1)
order_dataset.to_csv('orders.csv', index=False)
SG_customer.to_csv('SG_customers.csv', index=False)
products.to_csv('products_cleaned.csv', index=False)

Feature Engineering Techniques

Data simulation

This section details the process of generating synthetic data to simulate the performance of different marketing channels, campaigns and sales for Shopee. The aim is to attempt to create a realistic dataset that can be used for our analysis, enabling us to have a better understanding of the effectiveness of various marketing strategies and campaigns.

Step 1: Outlier detection and treatment

Data Cleaning for products dataset: We want to clean the products dataset so that we can make use of the products sold on Shopee to generate synthetic data for our sales dataset.

# Define the list of phrases to filter out in the title column
filter_phrases = ['do not purchase', 'free gift with purchase', 'not for sale', 'free gift', 'not for sell', 'gwp', 'gift with purchase']
pattern = '|'.join(filter_phrases)

# Filter the DataFrame to exclude rows with titles containing specified phrases and total_sold == 0
# Filter the DataFrame to exclude rows with specified phrases in 'title', 'total_sold' equal to 0, and 'price_actual' outside the 1 to 10000 range
products = products[
    ~products['title'].str.contains(pattern, case=False, na=False) &
    (products['total_sold'] != 0) &
    (products['price_actual'] >= 1) &
    (products['price_actual'] <= 10000)
]

Removed products with total_sold == 0 : Ensure that products in sales dataset are sold before.
Removed products with ‘free gift’, ‘not for sale’, ‘gwp’, ‘do not purchase’, ‘gift with purchase’ in their title: Ensure that products used in our sales dataset are actual products purchased by customers. Based on observations, this approach could also eliminate products with extreme pricing, as sellers often set the prices of these free gifts either excessively high or excessively low.
We also observe that there are two products with unusually high prices (> 0,000), but the title does not match the filtered phrases. So we manually removed them so that those values will not skew the data.

Data preparation for customers_sg dataset: This is done to connect the customers_sg dataset to other datasets like the synthetic_campaign_data dataset.

# Remove duplicated customer_id entries, keeping the first occurrence
customers_df = customers_df.drop_duplicates(subset='customer_id', keep='first').copy()

# Generate a column to determine if it is new or returning customers in the customer_df
customers_df.loc[:, 'target_audience'] = customers_df['last_checkout_day'].apply(
    lambda x: 'new customer' if x == 'Never checkout' else 'returning customer'
)

customers_df.to_csv('customer_data.csv', index=False)

Adding ‘target_audience’ column in customer_df.
Removed duplicate customer_id entries, keeping the first occurrence.
Used ‘last_checkout_day’ to generate the ‘target_audience’ column: To see if the customer is a new customer or a returning customer.

Step 2: Preparing synthetic dataset for marketing channels

Marketing Channel Data frame:

np.random.seed(42)
num_rows_marketing = 100000
marketing_channels = ['Email', 'Social Media', 'In-App', 'Website', 'SMS', 'KOL']

The simulation begins by setting a random seed to ensure reproducibility. A total of 100,000 rows is created for the marketing dataset. Six marketing channels are considered: Email, Social Media, In-App, Website, SMS and KOL (Key Opinion Leaders).

1. Defining Sales Dates

mega_sales_dates = pd.to_datetime(['2019-01-01', '2019-02-02', '2019-03-03', '2019-04-04', '2019-05-05',
                                    '2019-06-06', '2019-07-07', '2019-08-08', '2019-09-09', '2019-10-10',
                                    '2019-11-11', '2019-12-12'])

# Christmas, new year, cyber monday, black friday sales dates etc
seasonal_sales_dates = pd.to_datetime(['2019-01-01', '2019-02-05', '2019-04-19', '2019-05-01', '2019-06-05',
                                       '2019-06-07', '2019-08-09', '2019-08-11', '2019-10-06', '2019-10-27', '2019-11-29', '2019-12-02',
                                       '2019-12-25', '2019-12-31'])

high_sales = pd.to_datetime(['2019-11-11', '2019-12-12', '2019-11-29'])
medium_sales = pd.to_datetime(['2019-06-06', '2019-09-09'])

Specific sales dates, including Mega Sales and Seasonal Sales, are identified to simulate increased marketing activities around these periods. Shopee is known for its double digit campaigns (Mega Sales Eg. 10.10, 11.11, 12.12) and additional dates such as Christmas, New Year, Cyber Monday, and Black Friday are included under Seasonal Sales.
Specific dates are categorised as high-sales events (eg. 11.11 Singles' Day and Christmas) and medium-sales events are identified.

2. Pre-Sale Marketing Build-Up

# Build up marketing starting 7 days before sales, with increasing priority as the sale day approaches
days_before_sales = []
for sale_date in high_sales:
    days_before_sales.extend([sale_date - pd.DateOffset(days=i) for i in range(1, 8)])

We made the marketing dataset to simulate marketing activity building up to sales events to emphasise Shopee's marketing efforts for the sale day. Build up marketing efforts start seven days before sales date, with increasing priority as the sale day approaches.
This loop calculates the 7 days leading up to each high-sales date, creating a list (days_before_sales) of dates that will be prioritised for marketing efforts. These dates are derived by subtracting a range of days (from 1 to 7) from each high-sales date. This results in higher marketing exposure on days immediately before sales days.

3. Define Mid-Month Days

                                ['2019-' + str(month).zfill(2) + '-25' for month in range(1, 13)])

This block creates a list of the 15th and 25th days of each month in 2019, which may be considered relevant dates for mid-month promotions or marketing activities.

4. Generate All Dates in 2019

dates = pd.date_range(start='2019-01-01', end='2019-12-31', freq='D').to_list()

This line generates a complete list of all dates in 2019, using a daily frequency, to establish a full timeline for analysis.

5. Define Base Probabilities

base_prob = 1 / len(dates)
probabilities = np.full(len(dates), base_prob)

The base probability is set for regular days, calculated as the reciprocal of the total number of dates. An array of probabilities is created, where each date starts with an equal likelihood of being selected.

6. Adjust probabilities for sale day periods

# Adjust probabilities for sale days and periods
for i, date in enumerate(dates):
    if date in high_sales:
        # Higher priority for sales dates but slightly lower than pre-sale days
        probabilities[i] = 0.04
    elif date in days_before_sales:
        # Pre-sale marketing build-up for high-priority sales (1-7 days before)
        days_diff = (high_sales - date).days.min()
        if days_diff == 1:
            probabilities[i] = 0.06
        elif days_diff in [2, 3]:
            probabilities[i] = 0.04
        elif days_diff in [4, 5]:
            probabilities[i] = 0.03
        elif days_diff in [6, 7]:
            probabilities[i] = 0.015
    elif date in mid_month_days:
        probabilities[i] = 0.015
    elif date in seasonal_sales_dates:
        probabilities[i] = 0.05
    else:
        probabilities[i] = base_prob

Probability Distribution for Dates: Probabilities for marketing activity on different dates are established based on their significance:
- Higher probabilities are assigned to high sales days and the days leading up to them.
- Moderate probabilities are assigned to mid-month days and seasonal sale dates.
- Regular dates have lower probabilities, attempting to reflect standard marketing efforts.

7. Normalise Probabilities & Sampling Marketing Dates

probabilities /= probabilities.sum()
marketing_dates = np.random.choice(dates, size=num_rows_marketing, p=probabilities)

The probabilities are normalized to ensure that they sum to 1. This step is critical for using the probabilities in a random sampling process.
Finally, the code samples a set number of marketing dates (num_rows_marketing) from the complete list of dates. The sampling uses the adjusted probabilities, ensuring that marketing activities are more likely to occur on important dates as defined earlier.

8. Channel Expenditure Simulation Based on Shopee Revenue and Usage Statistics, Shopee spent an estimated 500 million USD on marketing in Southeast Asia (Curry D., 2024). This expenditure is then roughly distributed across different SEA countries using an estimated proportion. We then obtained an estimated amount Shopee Singapore spent on marketing.

# Given Shopee spent an estimated US$500 million on marketing spend in SEA
total_shopee_spend = 500000000
sg_share = 0.04
sg_shopee_spend = total_shopee_spend * sg_share

Total Marketing Spend: The total estimated marketing spend for Shopee in Southeast Asia is set at US$500 million.
Singapore Market Share: A market share of 4% (sg_share) is defined, representing Shopee Singapore’s portion of the total marketing budget.
Singapore Marketing Spend: The code calculates Shopee’s marketing spend specifically for Singapore by multiplying the total spend by the market share, resulting in sg_shopee_spend.

9. Channel Proportions and Expenditure

# Simulate Shopee's marketing channel proportions
channel_proportions = {
    'Email': 0.12,
    'Social Media': 0.22,
    'In-App': 0.24,
    'KOL': 0.20,
    'Website': 0.17,
    'SMS': 0.05,
}

# We calculate total expenditure per channel based on proportions
total_expenditures = {channel: sg_shopee_spend * proportion for channel, proportion in channel_proportions.items()}

Marketing Channel Proportions: A dictionary defines the proportion of the marketing budget allocated to various channels, such as Email, Social Media, In-App notifications, KOLs (Key Opinion Leaders), Website promotions, and SMS. The expenditure is then distributed across marketing channels based on predetermined proportions. For example, In-App marketing receives a larger share while Email and SMS marketing have a smaller share. Each channel's proportion reflects its importance in Shopee's marketing strategy.
Total Expenditures Calculation: This line calculates the total expenditure for each marketing channel by multiplying Shopee’s spend in Singapore by the proportion allocated to each channel. The result is stored in a dictionary called total_expenditures.

10. Defining Marketing Weights for Different Sales Period

# Simulate for pre-sale days - higher for social media and KOL as these channels are likely to have more marketing pre-sale
pre_sale_channel_weights = {
    'Email': 0.08,
    'Social Media': 0.35,
    'In-App': 0.15,
    'Website': 0.12,
    'SMS': 0.05,
    'KOL': 0.25
}

# Simulate for sale days - higher for in-app
sale_day_channel_weights = {
    'Email': 0.07,
    'Social Media': 0.20,
    'In-App': 0.35,
    'Website': 0.15,
    'SMS': 0.05,
    'KOL': 0.18
}

Different channel weights are applied for pre-sale and sale days:

Pre-sale days are simulated to have higher focus on Social Media and KOL marketing channels.
Similar to pre-sale, this defines the channel weights specifically for the actual sale days, with a higher emphasis on In-App marketing.

11. Channel Selection for Marketing Dates

channels = []
for date in marketing_dates:
    if date in high_sales:
        channel_weights = list(sale_day_channel_weights.values())
        chosen_channel = np.random.choice(marketing_channels, p=channel_weights)
    elif date in days_before_sales:
        channel_weights = list(pre_sale_channel_weights.values())
        chosen_channel = np.random.choice(marketing_channels, p=channel_weights)
    else:
        channel_weights = list(channel_proportions.values())
        chosen_channel = np.random.choice(marketing_channels, p=channel_weights)
    channels.append(chosen_channel)

Channel Selection Process: This loop iterates over the marketing dates to select a marketing channel based on the defined weights:

High Sales Dates: If the date is a high sales date, it uses the sale day weights.
Days Before Sales: For days leading up to sales, it uses pre-sale weights.
Regular Days: On other dates, it defaults to the standard channel proportions. The selected channels are appended to the channels list for further analysis.

12. Creating the Marketing DataFrame

marketing_df = pd.DataFrame({'channel_name': channels})

A Pandas DataFrame (marketing_df) is created with a single column containing the selected marketing channels for each date, setting the stage for further calculations.

13. Distributing Expenditure with Random Variation

# We want to calculate the total expenditure per channel
channel_counts = marketing_df['channel_name'].value_counts().to_dict()

# We want to spread out the expenditure per channel and add in some random variation for realism
def spread_expenditure(channel_name):
    if channel_name in channel_counts:
        current_expenditure = total_expenditures[channel_name] / channel_counts[channel_name]
        random_variation = np.random.uniform(-0.2, 0.2)  # Variation between -20% and +20%
        return round(current_expenditure * (1 + random_variation), 2)
    return 0

marketing_df['channel_expenditure'] = marketing_df['channel_name'].apply(spread_expenditure)

The spread_expenditure function is defined to calculate and allocate expenditure for each marketing channel while introducing randomness in an attempt to make the dataset more realistic.
In this function, the current expenditure for each channel is determined by dividing the total expenditure by the number of times that channel appears in the dataset.
A random variation between -20% and +20% is applied to this expenditure to simulate realistic fluctuations in spending.

14. Visitor and Conversion Rate Simulation

visitor_multipliers = {
    'Email': np.random.uniform(2, 4),
    'SMS': np.random.uniform(1.5, 3),
    'KOL': np.random.uniform(2, 6),
    'In-App': np.random.uniform(2, 5.5),
    'Social Media': np.random.uniform(1.2, 7),
    'Website': np.random.uniform(2, 4.2)
}

visitors = [
    np.round(marketing_df['channel_expenditure'][i] * visitor_multipliers[channels[i]], 0)
    for i in range(num_rows_marketing)
]

conversion_rates = {
    'Email': np.random.uniform(0.025, 0.04),
    'SMS': np.random.uniform(0.03, 0.035),
    'KOL': np.random.uniform(0.05, 0.08),
    'In-App': np.random.uniform(0.03, 0.06),
    'Social Media': np.random.uniform(0.02, 0.05),
    'Website': np.random.uniform(0.02, 0.04)
}

conversion_rate_list = np.array([
    conversion_rates[channels[i]] for i in range(num_rows_marketing)
])

Visitor Multipliers: A dictionary defines multipliers for estimating the number of visitors generated by each marketing channel. These multipliers are randomly assigned within specified ranges, reflecting the varying effectiveness of each channel in attracting visitors.
Visitor Calculation: This list comprehension calculates the number of visitors generated for each marketing activity by multiplying the channel expenditure by its corresponding visitor multiplier, rounding the results for a realistic count.
Conversion Rates: The code sets up a dictionary with simulated conversion rates for each channel within specific ranges, reflecting how effectively each channel converts visitors into customers. While exact Shopee conversion rate figures for their marketing channels are not publicly available, we refer to average e-commerce conversion rates by marketing channels (Invesp.) to simulate specific scenarios that Shopee may experience.

15. Sales Calculation

Calculating sales generated based on visitors and conversion rates:

sales_generated = np.round(visitors * conversion_rate_list, 0)
sales_generated = np.where(sales_generated == 0, 1, sales_generated)
conversion_rate_calc = (sales_generated / visitors) * 100

Sales Calculation: The code calculates sales generated by multiplying the number of visitors by their respective conversion rates. The results are rounded to ensure they are whole numbers. If the calculated sales are zero, it sets the sales to one to avoid issues with zero sales. Conversion rate is then calculated again to ensure consistency.

16. AOV by Channel

# Average Order Value - overall for Shopee is around $15-$25
aov_list = [
    np.random.uniform(8, 25) if channel == 'Email' else
    np.random.uniform(6, 10) if channel == 'SMS' else
    np.random.uniform(15, 30) if channel == 'KOL' else
    np.random.uniform(10, 22) if channel == 'In-App' else
    np.random.uniform(10, 16) if channel == 'Social Media' else
    np.random.uniform(8, 20)
    for channel in channels
]

AOV Simulation: This block generates an AOV for each channel, using different ranges for each channel type. AOV reflects the average amount spent per transaction and is crucial for calculating total revenue.

17. Total Revenue Calculation

# Calculate Total Revenue based on sales and AOV, but ensure that revenue is > expenditure for realism
total_revenue = []

for i in range(num_rows_marketing):
    expenditure = marketing_df['channel_expenditure'].iloc[i]
    revenue = sales_generated[i] * aov_list[i]
    if revenue <= expenditure:
        revenue = expenditure * np.random.uniform(1.1, 1.3)

    total_revenue.append(round(revenue, 2))

Revenue Initialisation: An empty list total_revenue is created to store the calculated revenue for each marketing activity.
Iterate Over Marketing Rows: The loop iterates over the number of marketing activities (num_rows_marketing).
Expenditure Retrieval: For each activity, the expenditure for the respective channel is retrieved from the DataFrame.
Revenue Calculation: Revenue is initially calculated by multiplying the number of sales generated by the AOV for that activity. This gives an initial estimate of revenue based on actual sales performance.
Revenue Adjustment for Realism: If the calculated revenue is less than or equal to the expenditure, the code ensures that the revenue reflects a more realistic scenario by setting it to a value between 10% and 30% above the expenditure. This adjustment maintains a realistic business model where revenue should generally exceed expenditure.
Append Revenue: The final revenue amount is rounded to two decimal places and added to the total_revenue list for later analysis.

18. Click-Through Rate (CTR) Simulation

# We want to simulate the CTR for each marketing channel based on average E-commerce CTR scenarios
click_through_rate = [
    np.random.uniform(0.02, 0.03) if channel == 'Email' else  # 2% to 3%
    np.random.uniform(0.01, 0.02) if channel == 'SMS' else    # 1% to 2%
    np.random.uniform(0.02, 0.04) if channel == 'KOL' else    # 2% to 4%
    np.random.uniform(0.005, 0.015) if channel == 'In-App' else  # 0.5% to 1.5%
    np.random.uniform(0.01, 0.04) if channel == 'Social Media' else  # 1% to 4%
    np.random.uniform(0.005, 0.01)  # 0.5% to 1% for Website Ads
    for channel in channels
]

CTR Initialization: The click_through_rate list is created to store simulated click-through rates for each marketing channel.
CTR Values Based on Channel Type: The code assigns random values within specified ranges for each channel type:
- Email: 2% to 3%
- SMS: 1% to 2%
- KOL: 2% to 4%
- In-App: 0.5% to 1.5%
- Social Media: 1% to 4%
- Website Ads: 0.5% to 1% This simulation is made to reflect average e-commerce CTR scenarios, providing a realistic measure of how effectively each channel engages users. While exact Shopee CTR figures are not publicly available, we refer to average e-commerce CTR in 2022 (Converted.) to simulate specific scenarios that Shopee may experience.

19. Bounce Rates Simulation

# We want to simulate the Bounce rates for each marketing channel based on average E-commerce bounce rates scenarios
bounce_rate = [
    np.random.uniform(30, 40) if channel == 'Email' else       # 30-40%
    np.random.uniform(40, 50) if channel == 'SMS' else         # 40-50%
    np.random.uniform(35, 45) if channel == 'KOL' else         # 35-45%
    np.random.uniform(50, 60) if channel == 'In-App' else      # 50-60%
    np.random.uniform(50, 60) if channel == 'Social Media' else # 50-60%
    np.random.uniform(55, 65)  # 55-65% for Website ads
    for channel in channels
]

Bounce Rate Initialization: The bounce_rate list is created to store simulated bounce rates for each marketing channel.
Bounce Rate Values Based on Channel Type: Similar to CTR, the code assigns random bounce rate values within specific ranges:
- Email: 30% to 40%
- SMS: 40% to 50%
- KOL: 35% to 45%
- In-App: 50% to 60%
- Social Media: 50% to 60%
- Website Ads: 55% to 65% The bounce rate is an important metric in e-commerce, indicating the percentage of visitors who leave a page without engaging. This simulation is made to reflect average e-commerce bounce rates scenarios. Bounce rates help to provide insights into potential customer retention and engagement challenges for each marketing channel. While exact Shopee bounce rate figures are not publicly available, we refer to average e-commerce bounce rates by marketing channels (Convertcart.) to simulate specific scenarios that Shopee may experience. Email and KOL marketing have lower bounce rates as they are often more targeted, and users are more engaged. Social Media marketing have higher bounce rates as users typically view the content and return to the app. Website marketing has the highest bounce rates due to their perception as clutter and intrusiveness. (Convertcart.)

20. Creation of final Marketing DataFrame

# Now we join the above columns to create our synthetic dataframe
marketing_df = pd.DataFrame({
    'date': marketing_dates,
    'channel_name': channels,
    'channel_expenditure': marketing_df['channel_expenditure'],
    'channel_visitors': visitors,
    'channel_sales': sales_generated,
    'conversion_rate': conversion_rate_list,
    'average_order_value': aov_list,
    'total_revenue': total_revenue,
    'click_through_rate': click_through_rate,
    'bounce_rate': bounce_rate,
    #'customer_retention_rate': customer_retention_rate
})

Data frame Initialization: A new data frame named marketing_df is created using the pd.DataFrame() constructor from the Pandas library. This data frame will contain several columns that represent different aspects of the marketing simulation.
Column Assignments:
- date: The list of marketing dates generated earlier.
- channel_name: The selected marketing channels for each date.
- channel_expenditure: The expenditures for each channel on the corresponding marketing date.
- channel_visitors: The number of visitors attracted to Shopee through each channel.
- channel_sales: The number of sales generated from the visitors of each channel.
- conversion_rate: The conversion rates associated with each marketing channel.
- average_order_value: The average order values calculated for each channel.
- total_revenue: The total revenue generated from sales for each channel.
- click_through_rate: The click-through rates for each marketing channel.
- bounce_rate: The bounce rates associated with each marketing channel.

print(marketing_df.head())
marketing_df['day'] = marketing_df['date'].dt.day_name()

Day Name Extraction: A new column named day is added to the data frame, which extracts the day of the week from the date column using the dt.day_name() function. This can provide additional insights into the impact of marketing activities based on the day of the week.

Campaign Data Frame

np.random.seed(42)
num_rows = 100000

The simulation begins by setting a random seed to ensure reproducibility. A total of 100,000 rows is created for the marketing dataset.

1. Generating Campaign Data

campaign_ids = [np.random.randint(1000000, 9999999) for _ in range(num_rows)]
campaign_types = ['Flash Sale', 'Seasonal Sales', 'Bundle promotions', 'Mega Sales', 'Livestream Exclusive', 'Next Day Delivery']
campaign_costs = np.random.randint(1000, 10000, size=num_rows)

Campaign IDs: A list of random integers is generated for campaign IDs, simulating unique identifiers for each campaign.
Campaign Types: A predefined list of possible campaign types is created, which includes different promotional strategies.
- Six types of campaigns are predefined: Flash Sale, Seasonal Sales, Bundle Promotions, Mega Sales, Livestream Exclusive and Next Day Delivery.
Campaign Costs: Random integers representing campaign costs are generated, ranging from $1,000 to $10,000. Campaign cost is made to attempt to simulate real life campaign costs.

2. Session ID and Customer ID Generation

# Unique session ids
unique_session_ids = np.random.choice(range(1, 999999), size=num_rows, replace=False)
session_ids = np.random.choice(unique_session_ids, size=num_rows, replace=False)

Unique session IDs are generated to track user interactions.

customer_ids = np.random.choice(customers_df['customer_id'], size=num_rows, replace=True)

Customer IDs are sampled from the existing customers_df data frame, ensuring that datasets used for our project are connected. This simulates the allocation of campaigns to customers, allowing for repeat customers in the dataset.

3. Gender and Age Group Simulation

# Generate gender & age columns for the customer
gender = np.random.choice(['Male', 'Female'], size=num_rows)
years = ['0-17 years', '18-24 years', '25-34 years', '35-44 years', '45-54 years', '55-64 years', '65 years and older']
age_group_probabilities = [0.04, 0.20, 0.33, 0.25, 0.12, 0.06, 0.00]  #increase probability for '25-34 years'
age_group = np.random.choice(years, size=num_rows, p=age_group_probabilities)

Gender Distribution: Randomly assigns gender to each campaign entry.
Age Group Generation: A list of age groups is defined, and the age_group_probabilities array assigns varying probabilities for each age category, with a higher probability for the 25-34 years age group. The probability of age groups are defined to simulate actual age group distribution of Singapore online shoppers. Based on an article on Singapore's Online Shoppers 2019, more than a third of Singapore's online shoppers are aged between 25 and 34 (34.3%). This is followed by the 33-44 age group. them the 18-24 age group (Lim A., 2019).

4. Marketing Channel probabilities

marketing_channel_probabilities = {
    'In-App': [0.2, 0.80],
    'Social Media': [0.1, 0.9],
    'KOL': [0.08, 0.92],
    'Email': [0.05, 0.95],
    'Website': [0.05, 0.95],
    'SMS': [0.02, 0.98]
}

# Simulate campaign data with marketing channels
channels = np.random.choice(list(marketing_channel_probabilities.keys()), size=num_rows)

# Generate is_click based on the probabilities for each channel
is_click = []
for channel in channels:
    click_prob = marketing_channel_probabilities[channel]
    is_click.append(np.random.choice([1, 0], p=click_prob))

is_click = np.array(is_click)

CTR Probabilities: A dictionary defines click-through rates for various marketing channels. Each channel has a probability of being clicked (1) or not clicked (0). For instance, ‘In-App’ has a 20% chance of a click and an 80% chance of no click.
Channel Selection: Randomly selects marketing channels for each campaign.
Simulating Clicks: A loop generates a binary array (is_click) indicating whether a click occurred for each channel based on its defined probabilities.

Incorporating click probabilities into the synthetic dataset helps simulate the complexities of real marketing conditions, where not every interaction leads to a click. This realism enhances the validity of the dataset and the subsequent analyses, providing a more accurate picture of potential marketing performance.

5. Defining Key Sales Dates

mega_sales_dates = pd.to_datetime(['2019-01-01', '2019-02-02', '2019-03-03', '2019-04-04', '2019-05-05',
                                   '2019-06-06', '2019-07-07', '2019-08-08', '2019-09-09', '2019-10-10',
                                   '2019-11-11', '2019-12-12'])

seasonal_sales_dates = pd.to_datetime(['2019-01-01', '2019-02-05', '2019-04-19', '2019-05-01', '2019-06-05',
                                       '2019-06-07', '2019-08-09', '2019-08-11', '2019-10-06', '2019-10-27', '2019-11-29', '2019-12-02',
                                       '2019-12-25', '2019-12-31'])

6. Campaign Start and End Dates Generations

# Initialize lists for start and end dates
start_dates = []
end_dates = []
campaign_type_choices = []

# Define available dates excluding mega_sales_dates and seasonal_sales_dates
excluded_dates = list(mega_sales_dates) + list(seasonal_sales_dates)
available_dates = pd.date_range('2019-01-01', '2019-12-31', freq='D').difference(excluded_dates)

# Adjust the loop to select from available_dates for non-special campaigns
for _ in range(num_rows):
    campaign_type = np.random.choice(campaign_types, p=[0.1, 0.1, 0.1, 0.4, 0.15, 0.15])  # Higher probability for Mega Sales
    campaign_type_choices.append(campaign_type)

    if campaign_type == 'Mega Sales':
        if np.random.rand() < 0.5:
            date = pd.Timestamp('2019-11-11')
        else:
            date = np.random.choice(mega_sales_dates)
        start_dates.append(date - pd.Timedelta(days=1))
        end_dates.append(date + pd.Timedelta(days=1))
    elif campaign_type == 'Seasonal Sales':
        index = np.random.randint(len(seasonal_sales_dates))
        start_dates.append(seasonal_sales_dates[index] - pd.Timedelta(days=1))
        end_dates.append(seasonal_sales_dates[index] + pd.Timedelta(days=1))
    else:
        # Select a random date from available_dates for other campaign types
        random_start = pd.Timestamp(np.random.choice(available_dates))
        start_dates.append(random_start)
        end_dates.append(random_start + pd.Timedelta(days=3))

Start and End Dates: This block initializes empty lists for campaign start and end dates.
Available Dates: It calculates available dates that do not coincide with major sales dates to allow for regular campaigns.
Campaign Duration: For each row, it randomly selects a campaign type, assigns start and end dates based on whether it's a Mega Sale or another campaign type. The duration for regular campaigns is set to three days.

# Marketing channel, target audience, promotional type, and discount columns
marketing_channels = ['In-App', 'Social Media', 'KOL', 'Email', 'Website', 'SMS']
marketing_channel_prob = [0.4, 0.3, 0.2, 0.03, 0.05, 0.02]  # Higher probabilities for In-App, Social Media, and KOL

promotional_discount_types = np.random.choice(['Percentage', 'Fixed Amount'], size=num_rows)
promotional_discount_values = np.zeros(num_rows)

promotional_discount_values[promotional_discount_types == 'Percentage'] = np.random.randint(20, 61, size=(promotional_discount_types == 'Percentage').sum())
promotional_discount_values[promotional_discount_types == 'Fixed Amount'] = np.random.randint(2, 21, size=(promotional_discount_types == 'Fixed Amount').sum())

The probabilities assigned to each channel are based on expected user engagement levels:

Higher engagement channels: Channels like 'In-App' and 'KOL' have higher click probabilities, to reflect their effectiveness in capturing user attention and driving interactions. This aligns with the common understanding that users are more likely to engage with content presented through these channels.
Lower Engagement Channels: Channels like 'SMS' and 'Email' have lower click probabilities, indicating that while they can still drive traffic, they may not be as effective as other channels in terms of engagement.

Promotional Discounts: Two types of promotional discounts (percentage and fixed amount) are randomly assigned for each campaign. The corresponding discount values are generated based on the type of discount, with percentages ranging from 20% to 60% and fixed amounts between $2 and $20.

7. Creating the final Campaign DataFrame

campaign_df = pd.DataFrame({
    'campaign_id': campaign_ids,
    'campaign_type': campaign_type_choices,
    'campaign_cost': campaign_costs,
    'session_id': session_ids,
    'start_date': start_dates,
    'end_date': end_dates,
    'marketing_channel': np.random.choice(marketing_channels, size=num_rows, p=marketing_channel_prob),
    # 'target_audience': target_audience_choices,
    'promotional_discount_type': promotional_discount_types,
    'promotional_discount_value': promotional_discount_values,
    'gender': gender,
    'age_group': age_group,
    'is_click': is_click,
    'customer_id': customer_ids
})

DataFrame Initialization: A new DataFrame named campaign_df is created using the pd.DataFrame() constructor from the Pandas library. This DataFrame will encapsulate all the attributes associated with each simulated marketing campaign.
Column Assignments:
- campaign_id: Unique IDs generated for each campaign.
- campaign_type: The type of campaign (e.g., Mega Sales, Flash Sale).
- campaign_cost: The cost associated with each campaign.
- session_id: Unique session IDs for tracking campaign interactions.
- start_date and end_date: Dates indicating the duration of each campaign.
- marketing_channel: Randomly selected marketing channels for the campaign, weighted by predefined probabilities.
- promotional_discount_type and promotional_discount_value: Types and values of discounts applied in the campaigns.
- gender and age_group: Demographic information of the customers targeted by the campaigns.
- is_click: Indicates whether a click occurred on the campaign (binary value).
- customer_id: Customer IDs associated with each campaign entry.

8. Merging to get Target Audience Column

# Merge to get target_audience column
campaign_df = campaign_df.merge(customers_df['customer_id', 'target_audience'](/7-teens/7-teens-DSA3101-2410-Project/wiki/'customer_id',-'target_audience'), on='customer_id', how='left')
campaign_df['target_audience'].fillna('new customer', inplace=True)

Merging Data: The campaign data frame is merged with an existing data frame customers_df to include a target_audience column. This column specifies whether the customer is a new or returning customer.
The merge is performed on the customer_id column, and it uses a left join to ensure all campaign entries are retained.
Handling Missing Values: After the merge, any missing values in the target_audience column are filled with the string 'new customer', indicating that customers without a specified audience type are considered new.

Shopee Sales Dataset

np.random.seed(42)
num_rows = 100000

The simulation begins by setting a random seed to ensure reproducibility. A total of 100,000 rows is created for the marketing dataset.

1. Sales Data Initialisation

target_categories = ['Home & Living', 'Health & Beauty', 'Mobile & Accessories']
category_sales_prob = products_cleaned['main_category'].value_counts(normalize=True)
category_sales_prob.update(category_sales_prob[target_categories] * 2)
category_sales_prob = category_sales_prob / category_sales_prob.sum()
product_category_prob = products_cleaned['main_category'].map(category_sales_prob)
product_category_prob = product_category_prob / product_category_prob.sum()

Target Categories: A list of target product categories is defined for emphasis in sales. With reference to 'Top Selling Product Categories in Shopee in Southeast Asiaarticle,Home & Living`, 'Health & Beauty' and 'Electronics' are the top three most popular categories for Shopee Singapore (Involve Asia,2024).
Category Sales Probability: The probabilities for product categories are calculated based on the existing product dataset (products_cleaned), emphasizing the target categories by doubling their probabilities.
Normalized Probabilities: The probabilities are normalized to ensure they sum to one, allowing for proper sampling later.

2. Generating Order IDs and Session IDs

order_id = [np.random.randint(100000, 10000000) for _ in range(num_rows)]
unique_session_ids = campaign_df['session_id'].unique()
np.random.shuffle(unique_session_ids)
extended_session_ids = unique_session_ids[:100000]

Order IDs: Random integers are generated to represent unique order IDs for each sales transaction.
Session IDs: Unique session IDs are extracted from the campaign_df data frame, shuffled, and limited to the first 100,000 entries for use in the sales dataset.

3. Product and Customer ID Generation

duplicated_product_ids = np.random.choice(products_cleaned['product_id'], size=num_rows, replace=True, p=product_category_prob)
duplicated_customer_ids = np.random.choice(campaign_df['customer_id'], size=num_rows, replace=True)
all_campaign_types = campaign_df['campaign_type'].unique().tolist() + ['Regular day']

Duplicated Product IDs: Randomly selects product IDs from the existing products_cleaned dataset with probabilities based on the earlier calculated product category probabilities.
Duplicated Customer IDs: Randomly selects customer IDs from the campaign_df data frame, allowing for potential repeat customers in the sales dataset. This selection allows the simulation of multiple sales per customer, reflecting the reality that many customers will participate in multiple campaigns over time.

4. Campaign Type Assignment

# Define probabilities for each campaign type, including 'Regular day'
probabilities = [0.075, 0.075, 0.075, 0.4, 0.15, 0.15, 0.075]  # Ensure this matches the number of campaign types

Defining Probabilities: This line establishes a list of probabilities that correspond to different campaign types. The probabilities are intentionally crafted to reflect the expected frequency of each campaign type being run during the sales period.
Campaign Types and Their Weights:
- Flash Sale: 7.5%
- Seasonal Sales: 7.5%
- Bundle Promotions: 7.5%
- Mega Sales: 40% (highest likelihood)
- Livestream Exclusive: 15%
- Next Day Delivery: 15%
- Regular Day: 7.5% (the least likely to appear as a campaign)
Weighted Distribution: By assigning a higher probability to "Mega Sales," the simulation reflects the real-world marketing strategy where major events are emphasised due to their potential to drive significant sales. This distribution is key for accurately modelling the marketing landscape, as it highlights the campaigns that are likely to be prioritised by Shopee during high-traffic huge sales periods.

5. Randomised Selection of Campaign Types

duplicated_campaign_types = np.random.choice(all_campaign_types, size=num_rows, p=probabilities)# Set is_campaign based on the campaign type

Campaign Type Assignment: This line uses NumPy's random.choice() function to randomly assign a campaign type to each of the num_rows (100,000) entries in the dataset. The p=probabilities argument ensures that the selection adheres to the predefined probabilities.
Impact on the Dataset:
- The output duplicated_campaign_types is an array containing the campaign type assigned to each order in the sales dataset.
- This approach allows the dataset to realistically reflect the distribution of campaign types that Shopee might run, providing a foundation for analyzing the effectiveness of each campaign type later on.

6. Generating the is_campaign flag

is_campaign = np.where(duplicated_campaign_types == 'Regular day', 0, 1)

Binary Flag Creation: The is_campaign variable is created using the np.where() function, which generates a binary flag for each order based on whether it is part of a promotional campaign.

Flag Assignment:
- If the campaign type is "Regular day," the flag is set to 0, indicating that there is no promotional campaign associated with this order.
- For any other campaign type (e.g., Mega Sales, Flash Sale), the flag is set to 1, indicating that the order is part of a promotional campaign.

Significance of the Is Campaign Flag: This binary flag is crucial for subsequent analyses as it allows for the differentiation between promotional and non-promotional sales. * * Analysts can easily filter the dataset to evaluate:

The effectiveness of various campaigns in generating sales.
The impact of campaigns on overall revenue and customer engagement.

7. Merging Campaign and Product Data

campaign_data = pd.DataFrame({
    'campaign_type': duplicated_campaign_types,
    'is_campaign': is_campaign
})

Campaign Data Frame Creation: A new data frame named campaign_data is created to store campaign_type and 'is_campaign'.

campaign_type: This column contains the types of campaigns assigned to each order (e.g., Mega Sales, Flash Sale).
is_campaign: This column is a binary flag indicating whether an entry is part of a campaign (1) or not (0), providing a quick reference for filtering and analysis later.

product_df = products_cleaned['product_id', 'main_category', 'price_actual', 'total_sold', 'title', 'Stock'](/7-teens/7-teens-DSA3101-2410-Project/wiki/'product_id',-'main_category',-'price_actual',-'total_sold',-'title',-'Stock').rename(columns={'price_actual': 'price'})

Columns selected from the products_cleaned dataset:

product_id: Unique identifier for each product.
main_category: Category to which the product belongs.
price_actual: The actual price of the product (renamed to price for clarity).
total_sold: The total quantity sold for the product.
title: The product title.
Stock: The number of products available for sale.

8. Creating the final Orders Data Frame

orders = pd.DataFrame({
    'order_id': order_id,
    'product_id': duplicated_product_ids,
    'session_id': extended_session_ids,
    'is_campaign': is_campaign,
    'campaign': duplicated_campaign_types
})

Orders Data Frame Creation: A new Data Frame named orders is created to store details of each sales transaction:

order_id: Unique identifier for each order, ensuring that every transaction can be referenced distinctly. * product_id: The ID of the product associated with the order, linking it to the product information.
session_id: The session ID under which the order was placed, allowing tracking of user interactions.
is_campaign: The binary flag indicating whether the order is part of a promotional campaign.
campaign: The type of campaign associated with the order.

9. Merging with (product_df) Product Information

orders = orders.merge(product_df, on='product_id', how='left')

Merging with Product Data: The orders data frame is merged with the product_df data frame based on the product_id column.

Left Join: The how='left' argument ensures that all entries in the orders data frame are retained, while only matching product details from product_df are added. This means that even if an order references a product that does not exist in product_df, it will still be included in the orders data frame, but without product details.
Result of Merge: After this operation, the orders data frame now includes additional columns from product_df, such as main_category, price, total_sold, title, and Stock, enriching the dataset with product-related information.

10. Merging with campaign_df

orders = orders.merge(campaign_df['session_id','customer_id', 'campaign_type', 'promotional_discount_type', 'gender'](/7-teens/7-teens-DSA3101-2410-Project/wiki/'session_id','customer_id',-'campaign_type',-'promotional_discount_type',-'gender'), on='session_id', how='left')
orders['campaign'] = orders['campaign_type']
orders.drop(columns=['campaign_type'], inplace=True)

Merging with Campaign Data: The orders data frame is merged with the campaign_df to incorporate campaign-related information.

Columns Selected from campaign_df:
- session_id: Used as the key for merging.
- customer_id: The ID of the customer who made the order.
- campaign_type: The type of campaign associated with the order.
- promotional_discount_type: The type of discount applicable to the campaign.
- gender: The gender of the customer who placed the order.
Left Join: Similar to the previous merge, how='left' is used to ensure that all orders are preserved, and only matching campaign details are added.

After joining with campaign_df, 'campaign' and 'campaign_type' columns contain the same information, hence 'campaign_type' is dropped.

Updating the Campaign Column: The campaign column in orders is populated with the values from campaign_type for clarity.
Dropping the Campaign Type Column: The original campaign_type column is dropped from the orders DataFrame, as it is redundant now that its information has been transferred to the campaign column.

11. Defining Price Bins and Labels

price_bins = [0, 20, 50, 100, 500, float('inf')]
price_labels = ['Low', 'Mid-Low', 'Mid', 'Mid-High', 'High']
orders['price_range'] = pd.cut(orders['price'], bins=price_bins, labels=price_labels)

Price Bins: The price_bins list defines the boundaries for categorizing product prices:
0 to 20: Represents low-priced items.
20 to 50: Represents mid-low priced items.
50 to 100: Represents mid-priced items.
100 to 500: Represents mid-high priced items.
500 and above: Represents high-priced items, indicated by float('inf'), which stands for infinity. This allows any price greater than 500 to fall into this category.
The price_labels list provides human-readable names, so that classification of prices is understandable.
Price Range Categorisation:
- The pd.cut() function from the Pandas library is used to categorise the price column in the orders DataFrame into discrete price ranges defined by price_bins.

Importance of Price Range Categorisation:

Enhanced Analysis:

Categorizing prices into ranges allows for more straightforward analysis of sales performance across different segments. Analysts can quickly identify which price ranges are most popular among customers and which contribute most significantly to revenue.

Targeted Marketing Strategies:

By understanding how products are distributed across price ranges, Shopee can tailor its marketing strategies. For example, if a significant number of sales come from the "Mid" price range, Shopee might focus on promotional campaigns for products in that segment.

12. Defining Sales Dates Defining High-Priority Mega Sales Dates:

high_priority_mega_sales_dates = ['2019-11-11', '2019-09-09', '2019-10-10', '2019-12-12']

High-Priority Dates: This list specifies dates of significant promotional events, known as mega sales, that are expected to drive a large volume of sales.

Based on personal observation, these dates are major shoppign events for Shopee Singapore where there would be exclusive deals and promotions.

Defining Additional Mega Sales Dates:

additional_mega_sales_dates = [
    '2019-01-01', '2019-02-02', '2019-03-03', '2019-04-04', '2019-05-05',
    '2019-06-06', '2019-07-07', '2019-08-08'
]

Other Mega Sales date that may not be as high-priority but still play a role in sales throughout the year.

Defining Seasonal Sales Dates:

other_seasonal_sales_dates = ['2019-01-01', '2019-02-05', '2019-04-19', '2019-05-01', '2019-06-05',
                                       '2019-06-07', '2019-08-09', '2019-08-11', '2019-10-06', '2019-10-27', '2019-12-02',
                                       '2019-12-25', '2019-12-31']

Seasonal Dates: This list includes various seasonal sales dates that are traditionally associated with increased shopping activity, such as holidays and festivals.

Defining Additional Priority Dates:

high_priority_seasonal_sales_dates = ['2019-11-29']  # Black Friday
early_december_dates = pd.date_range('2019-12-01', '2019-12-24').to_list()

Black Friday and dates leading up to Christmas may have the potential to drive sales as it is a period where people tend to spend more and purchase gifts for their loved ones. We prioritise these dates in order to make our dataset more realistic.

14. Consolidating Dates and their Probabilities

mega_sales_dates = pd.to_datetime(high_priority_mega_sales_dates + additional_mega_sales_dates)
seasonal_sales_dates = pd.to_datetime(high_priority_seasonal_sales_dates + early_december_dates)

exclude = list(mega_sales_dates) + list(seasonal_sales_dates)
regular_dates = pd.date_range('2019-01-01', '2019-12-31', freq='D').difference(exclude)

Converting to Datetime: The sales date lists are combined and converted into datetime objects using pd.to_datetime().
Exclusion (exclude) List: A list of all mega and seasonal sales dates is created to use for filtering.
Regular Dates Identification: A complete date range for the year 2019 is generated, and the difference() method is used to exclude the previously defined sales dates, resulting in a list of regular days when no special campaigns are active.

Assigning Probabilities to Mega Sales Dates:

mega_priority_probs = [0.5 if date == '2019-11-11' else 0.2 if date in ['2019-09-09', '2019-10-10', '2019-12-12'] else 0.01 for date in mega_sales_dates]

11.11: Assigned a higher probability (0.5) to reflect its importance.
Other Important Dates: Dates like 9.9, 10.10, and 12.12 are given a moderate probability (0.2).
Less Significant Dates: Remaining dates receive a lower probability (0.01), indicating they are less critical in terms of expected sales volume.

Assigning Probabilities to Seasonal Sales Dates:

seasonal_priority_probs = [0.3 if date == '2019-11-29' else 0.02 if date in other_seasonal_sales_dates else 0.4 / len(early_december_dates) for date in seasonal_sales_dates]

Black Friday: Assigned a higher probability (0.3).
Other Seasonal Dates: Some dates are given a low probability (0.02), while the early December dates receive a uniform probability calculated by dividing a set value (0.4) by the number of dates in that range.

Normalising Probabilities for Mega and Seasonal Sales:

total_mega_prob = sum(mega_priority_probs)
mega_priority_probs = [p / total_mega_prob * 0.8 for p in mega_priority_probs]
total_seasonal_prob = sum(seasonal_priority_probs)
seasonal_priority_probs = [p / total_seasonal_prob * 0.2 for p in seasonal_priority_probs]

Total Probability Calculation: The total probabilities for mega and seasonal dates are calculated, which allows for normalisation.
Normalisation: The probabilities are adjusted so that they represent a proportion of the overall distribution (0.8 for mega sales and 0.2 for seasonal sales). This ensures that the overall sampling distribution is clear and consistent.

Combining All Dates and Their Probabilities

all_dates = list(mega_sales_dates) + list(seasonal_sales_dates) + list(regular_dates)
combined_probs = mega_priority_probs + seasonal_priority_probs + [0.01 / len(regular_dates)] * len(regular_dates)  # Small probability for regular dates

combined_probs = np.array(combined_probs) / sum(combined_probs)

Combining Dates: All mega sales dates, seasonal sales dates, and regular dates are combined into a single list.
Assigning Small Probability to Regular Dates: Regular dates are assigned a very low probability (0.01 divided by the number of regular dates), indicating their lower likelihood of being selected compared to promotional dates.
Probability Normalisation: The combined probabilities are normalised to ensure that the total sums to 1.

By capturing high-priority mega sales, seasonal sales, and regular days, the dataset is structured to reflect real-world e-commerce dynamics, allowing for meaningful analysis of sales patterns and marketing effectiveness.

13. Assigning Order Times

order_times = np.random.choice(all_dates, size=num_rows, p=combined_probs)
# Assign generated order times to the 'order_time' column in the 'orders' DataFrame
orders['order_time'] = order_times

The np.random.choice() function is used to randomly select order times from the combined list of mega sales dates, seasonal sales dates, and regular dates.
The size=num_rows parameter specifies that 100,000 order times should be generated.
The p=combined_probs argument ensures that the selection is influenced by the defined probabilities, making it more likely for orders to be assigned to promotional days (like mega sales) compared to regular days.

Updating the data frame:

The generated order times are then assigned to a new column called order_time in the orders DataFrame. Each order will now have a corresponding date that reflects when it is believed to have been placed, mirroring real-world shopping behavior during sales events.

15. Assigning Discounts

orders['discount'] = 0

# Apply discounts for 'Percentage' type based on campaign type and order time
percentage_discount_conditions = (
    (orders['promotional_discount_type'] == 'Percentage') &
    (orders['is_campaign'] == 1)
)

for i in range(len(orders)):
    # Check if the row meets the percentage discount conditions
    if percentage_discount_conditions[i]:

        # Apply discount based on specific campaign conditions
        if orders.loc[i, 'order_time'] == '2019-11-11' and orders.loc[i, 'campaign'] == 'Mega Sales':
            orders.loc[i, 'discount'] = np.random.randint(20, 51)

        elif orders.loc[i, 'campaign'] == 'Flash Sale':
            orders.loc[i, 'discount'] = np.random.randint(15, 31)

        elif orders.loc[i, 'campaign'] == 'Seasonal Sales':
            orders.loc[i, 'discount'] = np.random.randint(20, 41)

        elif orders.loc[i, 'campaign'] == 'Bundle promotions':
            orders.loc[i, 'discount'] = np.random.randint(5, 16)

        elif orders.loc[i, 'campaign'] == 'Livestream Exclusive':
            orders.loc[i, 'discount'] = np.random.randint(25, 41)

        elif orders.loc[i, 'campaign'] == 'Next Day Delivery':
            orders.loc[i, 'discount'] = np.random.randint(0, 11)
        else:
            orders.loc[i, 'discount'] = 0
    else:
        # Set discount to 0 if no percentage discount condition is met
        orders.loc[i, 'discount'] = 0



# Apply discounts for 'Fixed Amount' type, ensuring discount is less than the product price
fixed_amount_discount_conditions = (
    (orders['promotional_discount_type'] == 'Fixed Amount') &
    (orders['is_campaign'] == 1)
)

# Generate discount values for Fixed Amount and ensure they do not exceed the product price
for idx in orders[fixed_amount_discount_conditions].index:
    max_discount = max(1, orders.loc[idx, 'price'] - 1)  # Ensure a minimum of 1 and max less than price
    orders.loc[idx, 'discount'] = np.random.randint(1, max_discount + 1)

Discount Initialisation:

A new column named discount is created in the orders DataFrame and initialised to zero for all entries. This sets up a placeholder for future discount values that will be applied based on specific conditions.

Defining Conditions for Percentage Discounts

The promotional_discount_type must be equal to 'Percentage', indicating that a percentage-based discount is applicable.
The is_campaign flag must be 1, meaning the order is part of a promotional campaign.

Specific Discounts are Applied Based on Campaign Conditions

We specified different discount range for each campaign type to simulate the types of discounts typically offered during Shopee Singapore promotional campaigns. For instance, major events like Mega Sales (such as 11.11 Singles' Day) are among the biggest and most significant campaigns for Shopee, often featuring substantial discounts and deals. As such, we have assigned the highest discount ranges to these campaigns, reflecting the substantial discounts Shopee offers during these high-traffic periods.
This approach aims to mirror real-life practices, ensuring that the synthetic dataset accurately reflects the promotional dynamics Shopee experiences during key sales events.
Campaign-Specific Percentage Discounts:
- Mega Sales: If the order time is November 11 and the campaign is "Mega Sales," a discount between 20% and 50% is applied.
- Flash Sale: A discount between 15% and 30% is applied for orders associated with Flash Sales.
- Seasonal Sales: A discount between 20% and 40% is assigned for Seasonal Sales.
- Bundle Promotions: A discount between 5% and 15% is applied.
- Livestream Exclusive: A discount between 25% and 40% is given.
- Next Day Delivery: A minimal discount between 0% and 10% is assigned. The discounts are generated randomly within the specified ranges using np.random.randint(), which produces integer values.If none of the conditions are met, the discount remains as zero, or the code explicitly sets it to zero if the order does not qualify for a percentage discount.
Fixed Amount Discounts:
- A similar approach is taken to identify fixed amount discounts. The condition checks whether the promotional_discount_type is set to "Fixed Amount" and whether the order is part of a campaign.
- For each order that qualifies, the maximum discount is calculated as the product's price minus one (to ensure the discount does not exceed the product price), and a random discount value between 1 and this maximum is assigned.

SQL Qeries Used for Data Transformation (with explanations)

SQL Queries for Marketing Channels and Campaigns dataset

Using SQL for EDA on the Marketing DataFrame

We want to conduct simple Exploratory Data Analysis to get a grasp of key patterns in our dataset.

We query on the ROI of each marketing channel to determine ROI in terms of cost effectiveness
We query on the total visitors and sales generated from each marketing channel to derive visitor-to-sales ratio to determine the effectiveness of each marketing channel in generating sales
We query on the engagement rates, bounce rates, and CTR to determine engagement of each marketing channel

# We create an SQL database and save the DataFrame to an SQL table

conn = sqlite3.connect('marketing_analysis.db')
cursor = conn.cursor()
marketing_df.to_sql('marketing_data', conn, if_exists='replace', index=False)

Database Connection:
- sqlite3.connect('marketing_analysis.db') establishes a connection to a SQLite database file called marketing_analysis.db. If this file doesn't exist, it will be created.
- The cursor object is used to interact with the database, executing SQL queries.
Saving DataFrame to SQL Table:
- The marketing_df (a Pandas DataFrame containing marketing campaign data) is saved into an SQL table named marketing_data.
- if_exists='replace' ensures that if the table already exists, it will be replaced with the new data.
- index=False prevents the index from being saved as an extra column in the SQL table.

First Query: Calculating ROI (Return on Investment) Based on Channel Expenditure and Revenue

SELECT
    channel_name,
    SUM(channel_expenditure) AS total_expenditure,
    SUM(total_revenue) AS total_revenue,
    (SUM(total_revenue) - SUM(channel_expenditure)) / SUM(channel_expenditure) * 100 AS ROI
FROM
    marketing_data
GROUP BY
    channel_name
ORDER BY
    ROI DESC;

# Execute the query and load results into a Data Frame
roi_results = pd.read_sql_query(first_query, conn)
print('ROI based on channel expenditure and revenue')
print(roi_results)
print("\n")

Purpose: This query calculates the Return on Investment (ROI) for each marketing channel based on the expenditure and revenue data in the marketing_data table.

Breakdown of the Query:

SUM(channel_expenditure) AS total_expenditure: Calculates the total expenditure for each marketing channel.
SUM(total_revenue) AS total_revenue: Calculates the total revenue generated for each marketing channel.
(SUM(total_revenue) - SUM(channel_expenditure)) / SUM(channel_expenditure) * 100 AS ROI: The ROI is calculated as the difference between total revenue and total expenditure, divided by total expenditure, and then multiplied by 100 to express it as a percentage.
GROUP BY channel_name: Groups the data by marketing channel to calculate these values for each channel.
ORDER BY ROI DESC: Orders the channels by ROI in descending order, so the channels with the highest ROI appear first.

Result: This query provides the ROI for each marketing channel, which helps to assess the efficiency of marketing expenditures.

Second Query: Analysis of Visitors and Sales

SELECT
    channel_name,
    SUM(channel_visitors) AS total_visitors,
    SUM(channel_sales) AS total_sales,
    COALESCE(SUM(channel_sales), 0) * 1.0 / NULLIF(SUM(channel_visitors), 0) *100 AS visitor_to_sales_rate
FROM
    marketing_data
GROUP BY
    channel_name
ORDER BY
    total_sales DESC;

visitor_sales_results = pd.read_sql_query(second_query, conn)
print('Visitors and sales')
print(visitor_sales_results)
print("\n")

Purpose: This query calculates the relationship between visitors and sales for each marketing channel, as well as the visitor-to-sales conversion rate.

Breakdown of the Query:

SUM(channel_visitors) AS total_visitors: Sums the number of visitors for each marketing channel.
SUM(channel_sales) AS total_sales: Sums the total sales for each channel.
COALESCE(SUM(channel_sales), 0) * 1.0 / NULLIF(SUM(channel_visitors), 0) * 100 AS visitor_to_sales_rate: Calculates the conversion rate (i.e., how many visitors lead to sales).
COALESCE(SUM(channel_sales), 0): Ensures that if there are no sales (i.e., NULL), it will be counted as zero.
NULLIF(SUM(channel_visitors), 0): Prevents division by zero if there are no visitors.
* 100: Converts the result to a percentage.
GROUP BY channel_name: Groups the data by marketing channel to calculate these values for each channel.
ORDER BY total_sales DESC: Orders the channels by total sales in descending order, with the channels generating the highest sales appearing first.

Result: This query provides insights into the number of visitors and the conversion rate for each marketing channel, helping assess how effectively different channels turn traffic into sales.

Third Query: Analysis of Engagement Metrics (Conversion Rate, Bounce Rate, Click-Through Rate)

SELECT
    channel_name,
    AVG(conversion_rate) AS avg_conversion_rate,
    AVG(bounce_rate) AS avg_bounce_rate,
    AVG(click_through_rate) AS avg_ctr
FROM
    marketing_data
GROUP BY
    channel_name
ORDER BY
    avg_conversion_rate DESC;

engagement_results = pd.read_sql_query(third_query, conn)
print('Average engagement rates')
print(engagement_results)
print("\n")

Definitions:

Conversion Rate: The percentage of visitors who take a desired action on a website or app. In an e-commerce context, this typically refers to the percentage of visitors who make a purchase after visiting the website or clicking on a product.
- Conversion rate helps evaluate the effectiveness of marketing campaigns and strategies. A higher conversion rate indicates that more visitors are successfully completing the desired action, like making a purchase, which means the campaign is effectively targeting the right audience.
Bounce Rate (BR): the percentage of visitors who land on a website and leave without interacting further (i.e., without visiting another page or completing any desired action like a purchase, sign-up, etc.).
- A high bounce rate typically indicates that visitors did not find what they were looking for or that the website's content, design, or user experience is not engaging. A low bounce rate means visitors are exploring more pages or staying longer, which is generally a positive indicator.
- Understanding the bounce rate helps businesses improve their website’s layout, content, and usability. Reducing the bounce rate often involves optimising the website to be more engaging and user-friendly.
Click-Through Rate (CTR): The percentage of people who click on a specific link, ad, or call-to-action (CTA) after being exposed to it. In the context of digital marketing, it often measures the effectiveness of online ads, emails, or product listings.
- CTR is commonly used to measure the performance of display ads, email campaigns, or social media promotions. A high CTR suggests that the ad or campaign is relevant and engaging, and that the audience finds the message compelling enough to take action (i.e., click).

Purpose: This query calculates the average engagement metrics for each marketing channel, including the conversion rate, bounce rate, and click-through rate (CTR).

Breakdown of the Query:

AVG(conversion_rate) AS avg_conversion_rate: Calculates the average conversion rate for each marketing channel. The conversion rate typically represents the percentage of visitors who make a purchase.
AVG(bounce_rate) AS avg_bounce_rate: Calculates the average bounce rate, which represents the percentage of visitors who leave a webpage without interacting further.
AVG(click_through_rate) AS avg_ctr: Calculates the average click-through rate (CTR), which represents the percentage of users who click on a link, ad, or other digital content.
GROUP BY channel_name: Groups the data by marketing channel to calculate these metrics for each channel.
ORDER BY avg_conversion_rate DESC: Orders the channels by the average conversion rate in descending order, highlighting the channels with the best conversion performance.

Result: This query helps to assess how effective each marketing channel is at converting visitors into customers, keeping them engaged, and encouraging interactions.

Putting together a channel summary

Calculate total metrics per channel

channel_summary = marketing_df.groupby('channel_name').agg({
    'channel_expenditure': 'sum',
    'total_revenue': 'sum',
    'channel_visitors': 'sum',
    'channel_sales': 'sum',
    'average_order_value' : 'mean'
}).reset_index()

Purpose: This step aggregates the total metrics by marketing channel, calculating the following:

channel_expenditure: Total spending per channel.
total_revenue: Total revenue generated per channel.
channel_visitors: Total number of visitors from each channel.
channel_sales: Total sales per channel.
average_order_value: Average order value per channel.

Calculate ROI, Cost per Visitor and Cost per Sale

channel_summary['ROI'] = ((channel_summary['total_revenue'] - channel_summary['channel_expenditure']) /
                          channel_summary['channel_expenditure']) * 100
channel_summary['cost_per_visitor'] = channel_summary['channel_expenditure'] / channel_summary['channel_visitors']
channel_summary['cost_per_sale'] = channel_summary['channel_expenditure'] / channel_summary['channel_sales']

Purpose: For each channel, we calculate the:

ROI: Measures the return on investment for each channel as a percentage.
Cost per Visitor (CPV): The average cost of acquiring each visitor from the channel.
Cost per Sale (CPS): The average cost incurred for each sale made through the channel.

Calculate the mean engagement and conversion metrics by channel

engagement_metrics = marketing_df.groupby('channel_name').agg({
    'conversion_rate': 'mean',
    'click_through_rate': 'mean',
    'bounce_rate': 'mean'
}).reset_index()

Purpose: We compute engagement metrics for each channel:

Conversion Rate: The average rate at which visitors convert into customers.
Click Through Rate (CTR): The average rate at which users clicked on ads or links per channel.
Bounce Rate: The average percentage of visitors who leave without further interaction.

Using SQL for EDA on Campaign DataFrame

# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')
# Write DataFrames to SQL tables
sales_df.to_sql('sales', conn, index=False, if_exists='replace')
campaign_df.to_sql('campaigns', conn, index=False, if_exists='replace')

Writing DataFrames to SQL Tables:

sales_df.to_sql('sales', conn, index=False, if_exists='replace'):
- This command takes the sales_df DataFrame and writes it into a table named sales in the SQLite database (conn).
- index=False: This parameter ensures that the DataFrame’s index is not written to the database as a separate column. If index=True, it would include the DataFrame's index as a column in the SQL table.
- if_exists='replace': This specifies that if a table named sales already exists in the database, it will be replaced with the new sales_df data.
campaign_df.to_sql('campaigns', conn, index=False, if_exists='replace'):
- Similarly, this command writes the campaign_df DataFrame to a table named campaigns in the SQLite database.
- The same parameters are used here (index=False and if_exists='replace'), ensuring that the table is overwritten if it already exists.

First Query: Calculating AOV of each campaign

Campaign Sales CTE

WITH campaign_sales AS (
    SELECT
        s.order_id,
        s.customer_id,
        c.campaign_id,
        c.campaign_type AS campaign,
        c.marketing_channel,
        s.price,
        s.discount,
        s.order_time
    FROM
        sales s
    JOIN
        campaigns c ON s.session_id = c.session_id
    WHERE
        s.is_campaign = 1
        AND s.order_time BETWEEN c.start_date AND c.end_date
),

Purpose: This part of the query creates a temporary view (CTE) named campaign_sales that gathers relevant sales data linked to campaigns.

Join Condition: The sales table (s) and the campaigns table (c) are joined on the session_id. This ensures that each sale is matched to the appropriate campaign.

Filters Applied:

s.is_campaign = 1: Ensures that only sales associated with campaigns are included.
s.order_time BETWEEN c.start_date AND c.end_date: Filters sales to only those that occurred within the campaign's active period (from the campaign start date to its end date).

Columns Selected: order_id, customer_id, campaign_id, campaign_type (campaign name), marketing_channel (platform/channel used), price (price of the product), discount (any discounts applied), and order_time.

This CTE serves the foundation for analysing the sales data related to each campaign

Campaign Summary CTE

campaign_summary AS (
    SELECT
        campaign,
        marketing_channel,
        COUNT(DISTINCT order_id) AS total_orders,
        SUM(price) AS total_sales,
        (SUM(price) / COUNT(DISTINCT order_id)) AS average_order_value,
        AVG(discount) AS average_discount,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM
        campaign_sales
    GROUP BY
        campaign
)

Purpose: The second CTE, campaign_summary, aggregates the data from campaign_sales to calculate key metrics for each campaign. Metrics Calculated:

COUNT(DISTINCT order_id) AS total_orders: Calculates the total number of unique orders for each campaign.
SUM(price) AS total_sales: Sums the total sales (revenue) generated by each campaign.
(SUM(price) / COUNT(DISTINCT order_id)) AS average_order_value: Calculates the Average Order Value (AOV) by dividing total sales by the number of distinct orders. This gives the average revenue generated per order for each campaign.
AVG(discount) AS average_discount: Calculates the average discount applied across all orders in the campaign.
COUNT(DISTINCT customer_id) AS unique_customers: Counts the number of unique customers who made purchases in the campaign. Grouping: The data is grouped by campaign, which means that the metrics will be calculated for each individual campaign.

This CTE aggregates the sales and customer data for each campaign, making it easier to evaluate the performance of each campaign in terms of total sales and AOV.

Final Select and Ordering

SELECT
    campaign,
    marketing_channel,
    total_orders,
    total_sales,
    average_order_value,
    average_discount,
    unique_customers

FROM
    campaign_summary
ORDER BY
    campaign;

aov_result_df = pd.read_sql_query(query_aov, conn)

Purpose: This final SELECT statement retrieves the summarized metrics from the campaign_summary CTE. Columns Selected:

campaign: The name/type of the campaign (e.g., Mega Sale, Flash Sale).
marketing_channel: The platform/channel through which the campaign was run (e.g., In-App, Social Media).
total_orders: The total number of unique orders in the campaign.
total_sales: The total revenue generated by the campaign.
average_order_value: The average order value (AOV) for the campaign.
average_discount: The average discount applied during the campaign.
unique_customers: The number of unique customers who made purchases in the campaign. Ordering: ORDER BY campaign: This orders the final result by the campaign column alphabetically, making it easier to view the metrics campaign by campaign.

Query Outcome:

AOV: Calculated as the total sales divided by the total number of orders. This helps assess how much customers are spending on average during each campaign.
Total Orders and Sales: Measures the overall volume and revenue generated by the campaign.
Average Discount: Reflects how much of a discount was given on average across all orders in the campaign.
Unique Customers: Provides insight into how many distinct customers participated in the campaign.

Second Query: Evaluate total sales of each campaign by category

Individually Filtered Sales Data by Category (indiv_category CTE)

WITH indiv_category AS (
  SELECT
      s.order_id,
      s.customer_id,
      s.main_category AS category,
      c.campaign_id,
      c.campaign_type AS campaign,
      c.target_audience,
      s.price,
      s.discount,
      s.order_time
  FROM
      sales s
  JOIN
      campaigns c on s.session_id = c.session_id
  WHERE
      s.is_campaign = 1
      AND s.order_time BETWEEN c.start_date AND c.end_date
)

Purpose: This Common Table Expression (CTE) creates a temporary table indiv_category, which aggregates sales data by individual product categories and campaign details. Join Condition: The sales table (s) is joined with the campaigns table (c) on the session_id, which links each sale to a specific campaign. Filters Applied:

s.is_campaign = 1: This ensures that only sales that are part of a campaign are included.
s.order_time BETWEEN c.start_date AND c.end_date: This condition filters out sales that occurred outside the campaign period, ensuring that only the sales made during the active campaign timeframe are considered.
Columns Selected:
- order_id: Unique identifier for each order.
- customer_id: Unique identifier for each customer.
- main_category AS category: The product category for each order, renamed as category for clarity.
- campaign_id, campaign_type AS campaign: The campaign ID and type (name) associated with each sale.
- target_audience: The target audience for the campaign (e.g., new or returning customers).
- price, discount, order_time: The price of the product, any discounts applied, and the time when the order was placed.

Aggregating Sales Data by Campaign and Category (category_summary CTE)

,category_summary AS(
  SELECT
      campaign,
      category,
      COUNT(DISTINCT order_id) AS total_orders,
      SUM(price) AS total_sales,
      COUNT(DISTINCT customer_id) AS unique_customers,
      (SUM(price) / COUNT(DISTINCT order_id)) AS AOV

  FROM indiv_category
  GROUP BY campaign, category
)

Purpose: The category_summary CTE aggregates the sales data by campaign and product category, providing key metrics for each combination of campaign and category. Metrics Calculated:

COUNT(DISTINCT order_id) AS total_orders: The total number of unique orders within each category and campaign combination.
SUM(price) AS total_sales: The total revenue generated from all orders in each category and campaign combination.
COUNT(DISTINCT customer_id) AS unique_customers: The number of unique customers who made purchases in each campaign and category.
(SUM(price) / COUNT(DISTINCT order_id)) AS AOV: The Average Order Value (AOV) for each category and campaign combination, calculated as the total sales divided by the total number of distinct orders. Grouping: The results are grouped by both campaign and category to calculate the metrics for each product category within each campaign.

Sorting Campaigns by AOV (aov_campaign CTE)

,aov_campaign AS(
  SELECT
    campaign,
    unique_customers,
    total_orders,
    total_sales,
    AOV
  FROM category_summary
  ORDER BY AOV DESC
)

Purpose: The aov_campaign CTE is used to sort the campaigns based on AOV in descending order. This helps identify which campaigns have the highest AOV, providing insight into which campaigns generate more revenue per order. Columns Selected: campaign, unique_customers, total_orders, total_sales, AOV: The key metrics for each campaign and category combination. Sorting: The ORDER BY AOV DESC sorts the results by AOV in descending order, so the campaigns with the highest AOV come first.

Final Selection and Ordering

SELECT
    campaign,
    category,
    unique_customers,
    total_orders,
    total_sales,
    AOV
FROM category_summary
ORDER BY campaign;

Purpose: The final SELECT statement retrieves the summarized results from the category_summary CTE. Columns Selected:

campaign, category: The campaign name and product category.
unique_customers: The number of unique customers who made purchases.
total_orders: The total number of orders placed.
total_sales: The total sales revenue generated.
AOV: The Average Order Value (AOV) for each category and campaign. Ordering: The results are ordered by campaign, which arranges the output by the campaign name alphabetically, making it easier to view campaign performance for each category.

Purpose of Query:

Total Orders: The total number of orders for each campaign and category combination.
Total Sales: The total revenue generated from those orders.
Unique Customers: The number of distinct customers who made purchases in each campaign and category.
AOV (Average Order Value): The average revenue generated per order in each category for each campaign.

Third Query: Breakdown of AOV by campaign and marketing channel

Filtered Campaign Sales Data (campaign_sales CTE)

WITH campaign_sales AS (
    SELECT
        s.order_id,
        s.customer_id,
        c.campaign_id,
        c.campaign_type AS campaign,
        c.marketing_channel,
        c.target_audience,
        s.price,
        s.discount,
        s.order_time
    FROM
        sales s
    JOIN
        campaigns c ON s.session_id = c.session_id
    WHERE
        s.is_campaign = 1
        AND s.order_time BETWEEN c.start_date AND c.end_date
)

Purpose: The campaign_sales CTE filters sales data for transactions associated with campaigns and within the specified campaign dates, linking sales to specific campaigns, marketing channels, and target audiences.

Join Condition: Joins the sales (s) and campaigns (c) tables on session_id to link each sale with its associated campaign.

Filters Applied: s.is_campaign = 1: Ensures only campaign-related sales are included.

s.order_time BETWEEN c.start_date AND c.end_date: Includes sales within each campaign’s active period.

Columns Selected:

order_id, customer_id: Unique identifiers for each order and customer.
campaign_id, campaign, marketing_channel: Campaign details, including type and channel.
target_audience: The intended customer audience (new or returning).
price, discount, order_time: The price and discount on each sale and the order timestamp.

Aggregating Sales Data by Campaign and Marketing Channel (campaign_summary CTE)

campaign_summary AS (
    SELECT
        campaign,
        marketing_channel,
        target_audience,
        COUNT(DISTINCT order_id) AS total_orders,
        SUM(price) AS total_sales,
        (SUM(price) / COUNT(DISTINCT order_id)) AS average_order_value,
        AVG(discount) AS average_discount,
        COUNT(DISTINCT customer_id) AS unique_customers,
        COUNT(DISTINCT CASE WHEN target_audience = 'new customer' THEN customer_id END) AS new_customers,
        COUNT(DISTINCT CASE WHEN target_audience = 'returning customer' THEN customer_id END) AS returning_customers
    FROM
        campaign_sales
    GROUP BY
        campaign, marketing_channel, target_audience
)

Purpose: The campaign_summary CTE aggregates key metrics for each campaign, marketing channel, and target audience combination.

Metrics calculated :

total_orders: The count of unique orders per campaign and channel.
total_sales: The sum of prices for all orders in each campaign-channel combination.
average_order_value: AOV for each campaign and channel, calculated as total_sales / total_orders.
average_discount: The average discount applied in each campaign-channel.
unique_customers: The count of distinct customers per campaign and channel.
new_customers and returning_customers: Distinct counts for new and returning customers.

Grouping: Results are grouped by campaign, marketing_channel, and target_audience.

Final selection and ordering

SELECT
    campaign,
    marketing_channel,
    target_audience,
    total_orders,
    total_sales,
    average_order_value,
    average_discount,
    unique_customers,
    new_customers,
    returning_customers

FROM
    campaign_summary
ORDER BY
    campaign, marketing_channel, target_audience;

Purpose: The final SELECT retrieves and displays the summarized campaign data, providing an ordered view of key metrics for each campaign across channels and target audiences.

Columns Selected:

campaign, marketing_channel, target_audience: Identifiers for each campaign, its marketing channel, and target audience.
total_orders, total_sales: Metrics indicating campaign sales and revenue.
average_order_value, average_discount: Insights into the average value of each order and discounts.
unique_customers, new_customers, returning_customers: Customer metrics to understand customer reach and retention.

Ordering: Results are ordered by campaign, marketing_channel, and target_audience, providing a structured output for easy comparison.

Purpose of query This query is to generate detailed insights on: Total Orders: The volume of orders for each campaign-channel combination. Total Sales: The revenue generated. Average Order Value (AOV): The revenue per order. Customer Metrics: The number of new vs. returning customers engaged in each campaign.

We will later use this SQL query as a tool to visualize AOV trends across the various marketing channels and campaigns.

4: Data Preparation - 7-teens/7-teens-DSA3101-2410-Project GitHub Wiki

Data Cleaning Processes

1. Data Collection & Integration

2. Data Inspection and Cleaning

Key Tasks Covered

Data Inspection for Customer Table

Data Inspection for Products Dataset

Removing Irrelevant Columns

3. Synthetic Data Generation

Making Synthetic Data for Checkout Count in Last 6 Months and 1 Year

Making Synthetic Data for Orders

Code Explanation

1. Define Customer Preferences

2. Special Sales Dates

3. Generate Weighted Probabilities

4. Generate Orders

5. Assign Campaign Information

6. Add last_checkout_day to Customers

7. Output and Summary

Making Synthetic Data for Prices Less Than 10 Cents

WARNING: Need More Research

Feature Engineering Techniques

Data simulation

Step 1: Outlier detection and treatment

Step 2: Preparing synthetic dataset for marketing channels

Marketing Channel Data frame:

Campaign Data Frame

Shopee Sales Dataset

SQL Qeries Used for Data Transformation (with explanations)

SQL Queries for Marketing Channels and Campaigns dataset

Using SQL for EDA on the Marketing DataFrame

First Query: Calculating ROI (Return on Investment) Based on Channel Expenditure and Revenue

Second Query: Analysis of Visitors and Sales

Third Query: Analysis of Engagement Metrics (Conversion Rate, Bounce Rate, Click-Through Rate)

Using SQL for EDA on Campaign DataFrame

First Query: Calculating AOV of each campaign

Second Query: Evaluate total sales of each campaign by category

Third Query: Breakdown of AOV by campaign and marketing channel

6. Add `last_checkout_day` to Customers