RFM CLV Segmentation - grishasen/proof_of_value GitHub Wiki

RFM Segmentation Overview

RFM stands for Recency, Frequency, and Monetary value — three core dimensions used to segment customers based on their purchase behavior:

Recency (R) – How recently a customer made a purchase.
Frequency (F) – How often a customer makes a purchase.
Monetary Value (M) – How much money a customer spends on average.

RFM segmentation is a widely used, robust, and interpretable marketing technique that helps businesses understand and categorize customers, enabling targeted marketing, improved retention strategies, and better Customer Lifetime Value (CLV) predictions. This implementation supports multi-domain configuration, automatic quartile calculation, and dynamic segment labeling. Segmentation results (segment, score) can be used to tag customers in the CDH CIC dataset and then used in engagement policies and scheduling segments.

How RFM Segmentation Works in This Codebase

1. Generate RFM Scores

Each customer is assigned a 3-digit RFM score like '344', where:

First digit = Recency quartile (1=oldest, 4=newest)
Second digit = Frequency quartile (1=least frequent, 4=most frequent)
Third digit = Monetary quartile (1=lowest spenders, 4=highest spenders)

2. Define Segments by Domain

Segments are customized for various industries. Each domain groups RFM scores into meaningful customer segments:

🏦 Retail Banking

"Wealth Champions" : High RFM across all – loyal and high-value
"Premier Clients" : Moderately high value, loyal
"Growth Investors" : High frequency and monetary, moderate recency
"At-Risk Clients" : Low recency, at risk of churn
"Dormant Accounts" : All other clients not covered above

📞 Telco

"Platinum Subscribers" : High RFM – most engaged
"Engaged Subscribers" : Active but not top-tier
"Potential Upsell Group" : Good potential to grow
"Churn Risk" : Low recency, medium value
"Dormant Lines" : Inactive users

🛒 E-commerce

"Champions" : Recently purchased, frequently, and high spenders
"Repeat Buyers" : Loyal and moderately valuable
"High-Potential Buyers" : Good spending behavior but slightly lower loyalty
"At-Risk Shoppers" : Haven’t purchased in a while
"Inactive Shoppers" : No recurring engagement

For general use, a domain-agnostic segment mapping is also provided:

    _default_rfm_segment_config = {
        "Premium Customer": ["334", "443", "444", "344", "434", "433", "343", "333"],
        "Repeat Customer": ["244", "234", "232", "332", "143", "233", "243", "242"],
        "Top Spender": [
            "424", "414", "144", "314", "324", "124", "224",
            "423", "413", "133", "323", "313", "134",
        ],
        "At Risk Customer": [
            "422", "223", "212", "122", "222", "132",
            "322", "312", "412", "123", "214",
        ],
        "Inactive Customer": ["411", "111", "113", "114", "112", "211", "311"],
    }
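A mapping of this shape (segment name to list of scores) is convenient to write but awkward to query by score. The sketch below inverts a small excerpt of it into a score-to-segment lookup. The helper names `build_score_lookup` and `segment_for` and the `"Unclassified"` fallback label are hypothetical and not taken from the codebase. The default mapping as listed does not cover every possible score (for example, "421" is absent), so the sketch assumes some fallback is wanted.

```python
# Excerpt of the default mapping above (two scores per segment, for brevity).
segment_config_excerpt = {
    "Premium Customer": ["444", "344"],
    "Repeat Customer": ["244", "234"],
    "Inactive Customer": ["111", "112"],
}


def build_score_lookup(segment_config):
    """Invert {segment: [scores]} into {score: segment}."""
    lookup = {}
    for segment, scores in segment_config.items():
        for score in scores:
            lookup[score] = segment
    return lookup


def segment_for(score, lookup, fallback="Unclassified"):
    """Return the segment for a score, or a fallback label if it is unmapped."""
    # "Unclassified" is an assumption of this sketch, not the codebase's behavior.
    return lookup.get(score, fallback)


lookup = build_score_lookup(segment_config_excerpt)
print(segment_for("344", lookup))
print(segment_for("421", lookup))
```

Building the lookup once keeps per-customer labeling to a single dictionary access.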

🔍 The `rfm_summary()` Function

This function computes the full RFM segmentation pipeline from an aggregated customer dataset:

Key Steps:

Aggregate customer transactions by ID.
Calculate RFM metrics:

Frequency = Number of purchases (minus 1)
Recency = Days since last purchase
Tenure = Time from first to last purchase
Monetary Value = Avg. value per unique holding
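The metric definitions above can be illustrated with a small standard-library sketch. This is not the `rfm_summary()` implementation. The function name `rfm_metrics`, the `(customer_id, purchase_date, amount)` row layout, and the `observation_date` parameter are assumptions, and monetary value is approximated here as the mean amount per transaction rather than per unique holding.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def rfm_metrics(transactions, observation_date):
    """Aggregate (customer_id, purchase_date, amount) rows into per-customer metrics."""
    by_customer = defaultdict(list)
    for customer_id, purchase_date, amount in transactions:
        by_customer[customer_id].append((purchase_date, amount))

    metrics = {}
    for customer_id, rows in by_customer.items():
        dates = [d for d, _ in rows]
        first, last = min(dates), max(dates)
        metrics[customer_id] = {
            "frequency": len(rows) - 1,                 # number of purchases minus 1
            "recency": (observation_date - last).days,  # days since last purchase
            "tenure": (last - first).days,              # first to last purchase, in days
            "monetary_value": mean(a for _, a in rows), # assumption: mean per transaction
        }
    return metrics


transactions = [
    ("c1", date(2024, 1, 5), 100.0),
    ("c1", date(2024, 3, 5), 50.0),
    ("c2", date(2024, 2, 1), 20.0),
]
summary = rfm_metrics(transactions, observation_date=date(2024, 4, 1))
print(summary["c1"])
```

A customer with a single purchase, such as `c2`, ends up with a frequency and tenure of 0.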

Assign Quartiles:

Quartile values (1–4) are assigned to each RFM metric. Recency quartiles are reversed (4 = most recent).
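One way to picture the quartile step, including the reversed recency scale, is the standard-library sketch below. The function `quartile_labels` is hypothetical. It relies on `statistics.quantiles` with its default method and places values equal to a cut point in the lower bin, and the codebase may compute cut points and handle ties differently.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_labels(values, reverse=False):
    """Return a quartile label (1-4) per value; 4 = highest value unless reversed."""
    cuts = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    labels = []
    for v in values:
        q = bisect_left(cuts, v) + 1  # 1..4; ties with a cut point go to the lower bin
        labels.append(5 - q if reverse else q)
    return labels


# Recency in days: a small number of days means a recent purchase,
# so the scale is reversed and the most recent buyers receive 4.
recency_days = [3, 10, 25, 40, 90, 120, 200, 365]
r_quartiles = quartile_labels(recency_days, reverse=True)
print(r_quartiles)
```

Frequency and monetary value would use the same helper with `reverse=False`, so that larger raw values map to higher quartiles.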

Compute RFM Score:

Concatenation of r_quartile, f_quartile, and m_quartile, e.g., '344'.
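As a minimal illustration of the concatenation, assuming the three quartiles are already integers from 1 to 4 (the helper name `rfm_score` is hypothetical):

```python
def rfm_score(r_quartile, f_quartile, m_quartile):
    """Concatenate three quartile labels (each 1-4) into a score string such as '344'."""
    for q in (r_quartile, f_quartile, m_quartile):
        if q not in (1, 2, 3, 4):
            raise ValueError(f"quartile must be 1-4, got {q!r}")
    return f"{r_quartile}{f_quartile}{m_quartile}"


print(rfm_score(3, 4, 4))
```

Keeping the score as a string rather than an integer makes it usable directly as a key in the segment mapping.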

Map to Segment:

Final label is derived using the selected RFM segment configuration.

Applications

RFM segmentation allows organizations to:

Target marketing campaigns based on behavior
Predict churn with at-risk segments
Upsell/cross-sell to high-potential or loyal customers
Tailor messaging to match engagement levels
Improve CLV models by segmenting value contributors

Configurations & Customization

Segment mappings can be overridden by passing a custom config dictionary with an rfm_segment_config key. The default behavior uses _default_rfm_segment_config. Recency and frequency values are scaled using time_scaler. Customers are encouraged to use their own segmentation schemes, though this requires customizing the Python code.
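When writing a custom mapping by hand, it is easy to mistype a score or list the same score under two segments. The sketch below shows one possible sanity check to run before passing the config on. Only the `rfm_segment_config` key name comes from this page. The function `validate_segment_config` and the segment names in the example are hypothetical.

```python
from itertools import product

# All 64 possible scores: three digits, each from 1 to 4.
VALID_SCORES = {"".join(digits) for digits in product("1234", repeat=3)}


def validate_segment_config(segment_config):
    """Return a list of problems found in a {segment: [scores]} mapping."""
    problems = []
    seen = {}  # score -> first segment that claimed it
    for segment, scores in segment_config.items():
        for score in scores:
            if score not in VALID_SCORES:
                problems.append(f"{segment}: invalid score {score!r}")
            elif score in seen:
                problems.append(f"{score!r} is in both {seen[score]!r} and {segment!r}")
            else:
                seen[score] = segment
    return problems


config = {
    "rfm_segment_config": {
        "VIP": ["444", "443"],
        "Lapsed": ["111", "112", "443"],  # "443" duplicated on purpose
        "Typo": ["45"],                   # not a valid 3-digit score
    }
}
problems = validate_segment_config(config["rfm_segment_config"])
for problem in problems:
    print(problem)
```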

Why Use Quartiles in RFM?

Normalization Across Metrics: Recency, Frequency, and Monetary Value are on different scales (e.g., days vs. count vs. dollars). Quartiles help normalize these values into a common scale (1–4), enabling fair comparison and combination into RFM scores.
Rank-Based (Non-Parametric): Quartiles are based on rank, not raw values. They are robust to outliers and skewed distributions, with no assumptions needed about the data distribution.
Simplicity and Interpretability: Quartiles are easy to explain to stakeholders, e.g., "Top 25% of spenders" or "Most recent 25% of buyers". They produce intuitive 3-digit scores (e.g., 344), which can be grouped into behavioral segments.
Even Distribution: Ensures that each R, F, or M group contains roughly the same number of customers. This helps avoid over-concentration in a single segment and enables better modeling and targeting.

🧩 Alternatives to Quartiles

Here are some other methods to consider based on the business needs or data characteristics:

Fixed Thresholds: Define manual cutoffs (e.g., Recency < 30 days = high).

✅ Good for domain experts who want control.
❌ Can become arbitrary or brittle with changing data.
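A fixed-threshold scheme might look like the sketch below. Only the 30-day cutoff comes from the example above. The 90-day cutoff, the band names other than "high", and the function name `recency_band` are hypothetical.

```python
def recency_band(days_since_last_purchase):
    """Map days since last purchase to a band using fixed, hand-picked cutoffs."""
    if days_since_last_purchase < 30:   # cutoff from the example above
        return "high"
    if days_since_last_purchase < 90:   # hypothetical cutoff
        return "medium"
    return "low"


bands = [recency_band(d) for d in (5, 30, 45, 200)]
print(bands)
```

Unlike quartiles, these cutoffs stay the same when the customer base shifts, which is the source of both the control and the brittleness noted above.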

Deciles (or Tertiles, Quintiles, etc.): Use 10 (or another number of) equal groups instead of 4.

✅ Finer granularity.
❌ More complexity, harder to interpret and group meaningfully.

Z-Scores / Standardization: Convert each metric to a z-score, (value - mean) / std.

✅ Good for modeling (e.g., clustering, regression).
❌ Less interpretable, sensitive to outliers.
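The z-score formula can be sketched with the standard library as follows. The formula above does not say which standard deviation is meant, so this sketch assumes the sample standard deviation (`statistics.stdev`). The function name `z_scores` is hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values as (value - mean) / std, using the sample std."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - mu) / sigma for v in values]


monetary = [20.0, 35.0, 50.0, 65.0, 80.0]
scores = z_scores(monetary)
print([round(s, 3) for s in scores])
```

Because both the mean and the standard deviation are pulled toward extreme values, a single very large spender shifts every customer's z-score, which is the outlier sensitivity noted above.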

Clustering (e.g., K-Means): Use unsupervised learning to group customers.

✅ Data-driven, flexible, non-linear.
❌ Requires more tuning and validation; less transparent.

Decision Trees or Rule-Based Binning: Automatically create bins using CART-like methods.

✅ Tailored to maximize separation.
❌ Less consistent; may need retraining often.

Quartiles are often the best trade-off between simplicity, interpretability, and robustness — especially in exploratory or customer lifecycle analysis. For more advanced needs, machine learning or statistical methods may offer more precision at the cost of complexity.