A repository dedicated to the analysis of customer segmentation. This project aims to implement and evaluate various segmentation methodologies, drawing inspiration and techniques from current research in the field.
- A shopping mall aims to improve its marketing strategies and customer engagement.
- The mall currently lacks a deep understanding of its customers, including their diverse profiles and purchasing habits.
- There's also a need to assess the effectiveness of previous marketing campaigns.
- This lack of data-driven insights makes it difficult for the mall to make informed decisions about resource allocation and targeted promotions.
- Stakeholders expect actionable recommendations derived from the data analysis that can be directly implemented to improve marketing strategies and customer engagement.
- They anticipate measurable outcomes resulting from the project, such as increased revenue, improved customer loyalty, and a more efficient allocation of marketing resources.
The Mall Customers Dataset provides data on 200 individuals who visit a mall, including demographic information, annual income, and spending habits.
- CustomerID: A unique identifier for each customer (integer).
- Genre: The gender of the customer (Male/Female).
- Age: The age of the customer (integer).
- Annual Income (k$): Annual income of the customer in thousands of dollars (integer).
- Spending Score (1-100): A score assigned by the mall based on customer behavior and spending patterns (integer).
Here is the Full Design
The consolidated grouped bar plot above depicts the following:
| Features | Central Tendency | Measure of Dispersion |
|---|---|---|
| Age | Cluster-1: Demography is the oldest (Median ≈ 66 years)<br>Cluster-7: Demography is the youngest (Median ≈ 23.5 years) | Cluster-4, Cluster-6: Show high fluctuation in age distribution (twice as much as the rest) |
| Income | Cluster-1, 3, 5: Exhibit Middle-Class behavioural patterns<br>Cluster-2, 4: Exhibit Business-Class behavioural patterns<br>Cluster-6, 7: Exhibit Economy-Class behavioural patterns | Cluster-2, 3, 4: Exhibit high fluctuations in income<br>Cluster-1, 6, 7: Have relatively lower within-group fluctuations |
| Spending | Cluster-1 (lowest), Cluster-3, Cluster-5: These customer segments have moderate-spending habits<br>Cluster-4, 6: Spending habit is the lowest (one-third of moderate cluster)<br>Cluster-2 (highest), Cluster-7: Spending habit is very high | Cluster-1, 3, 5: Exhibit comparatively lower fluctuation in spending habits<br>Cluster-4, 6: Lowest spending with highest fluctuation (C.V. of 82.14%) |
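The dispersion column relies on per-cluster spread measures such as the coefficient of variation (C.V.), i.e. the standard deviation divided by the mean. Here is a minimal sketch of how the median and C.V. could be computed for each cluster. The cluster names and scores are made up for illustration and are not project data, and using the sample standard deviation is an assumption.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Return the median and the coefficient of variation (in percent)."""
    avg = mean(values)
    return {
        "median": median(values),
        # C.V. = sample standard deviation relative to the mean
        "cv_percent": 100 * stdev(values) / avg,
    }

# Made-up spending scores for two illustrative clusters (not project data).
spending_by_cluster = {
    "steady": [48, 50, 52, 50],
    "volatile": [5, 40, 10, 25],
}

summary = {name: summarize(vals) for name, vals in spending_by_cluster.items()}

for name, stats in summary.items():
    print(f"{name}: median={stats['median']}, C.V.={stats['cv_percent']:.1f}%")
```

A higher C.V. marks a cluster whose members differ more from one another relative to their average, which is the sense in which the table calls a cluster "fluctuating".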
The consolidated grouped bar plot above depicts the following:
| Clusters → | Direct (+ve) | Indirect (−ve) | Special Observation |
|---|---|---|---|
| Income vs. Age | Cluster-4 | Cluster-3, 5, 7 | Cluster-3 [−0.25] |
| Spending vs. Age | Cluster-2, 4, 5 | Cluster-6 | Cluster-6 [−0.28] |
| Spending vs. Income | Cluster-4, 6 | Cluster-5, 7 | Cluster-4 [0.47] |
Note: Correlation coefficients with an absolute value below 0.1 are treated as zero.
- Cluster-1 is mostly uncorrelated with all existing features. Presumably, these customers will continue to generate business without any intervention.
- Cluster-6 (lowest income group but the steadiest): Spending decreases as Age increases, while Spending increases as Income increases. Income and Age appear to be uncorrelated.
- Cluster-4 (business class with fluctuating income): Income, Age, and Spending increase together. The correlation between Income and Spending is the highest.
- Cluster-5, 7 (−ve income effect): Show an inverse relationship between Income and Spending, i.e., spending decreases as income increases.
- Cluster-2: Spending increases with Age.
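The direct/indirect labels above can be reproduced by computing a correlation coefficient within each cluster and treating small values as zero. The sketch below assumes Pearson's coefficient and the 0.1 cutoff from the note. The helper names and sample values are hypothetical and do not come from the project data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def classify(r, threshold=0.1):
    """Label a coefficient, treating |r| below the threshold as zero."""
    if abs(r) < threshold:
        return "uncorrelated"
    return "direct (+ve)" if r > 0 else "indirect (-ve)"

# Made-up values for the members of one illustrative cluster.
age = [20, 30, 40, 50, 60]
income = [50, 60, 50, 60, 50]
spending = [80, 70, 65, 50, 45]

print("Spending vs. Age:", classify(pearson(age, spending)))
print("Income vs. Age:", classify(pearson(age, income)))
```

Running the same two functions on each cluster's rows, for each feature pair, would fill in a table of the shape shown above.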
The log-log regression model suggests the following:
- Overall model adequacy (`Prob (F-statistic) < 0.05`): the null hypothesis (all coefficients are zero) is rejected.
- The log-log model (logarithmic transformation of the features) explains 16.5% of the total variation in spending score (`Adj. R-squared = 0.165`).
- `Log(Annual Income)` is found to be a significant predictor (`P>|t| = 0.013`).
- Final thought: a 1% increase in `Annual Income` may lead to a 2.42% increase in `Avg. Spending Score`.
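In a log-log model the slope acts as an elasticity: if log(Spending) = a + b·log(Income), a 1% rise in income multiplies spending by 1.01^b, which is roughly b% for small changes. The sketch below fits that slope by ordinary least squares on synthetic, noise-free data generated with b = 2.42. It is not the project's fitted model, and the function name `fit_loglog` is hypothetical.

```python
from math import log

def fit_loglog(x, y):
    """Least-squares fit of log(y) = intercept + slope * log(x)."""
    log_x = [log(v) for v in x]
    log_y = [log(v) for v in y]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic incomes (k$) and spending values from an exact power law.
income = [20, 40, 60, 80, 100, 120]
spending = [0.0005 * v ** 2.42 for v in income]

intercept, slope = fit_loglog(income, spending)

# Exact percentage effect of a 1% income increase implied by the slope.
pct_change = (1.01 ** slope - 1) * 100
print(f"elasticity = {slope:.2f}, 1% income rise -> {pct_change:.2f}% spending rise")
```

With real data the points will not lie exactly on a power law, so the slope should be read together with its p-value and the adjusted R-squared reported above.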
| Cluster | Demographic & Profile | Insights | Business Recommendation |
|---|---|---|---|
| Cluster-1 | Oldest (Median Age = 66), Middle-Class Income, Lowest Spending | Uncorrelated with all features, Stable | Minimal intervention; focus on retention and elder care offerings |
| Cluster-2 | Younger age, Business-Class Income, Very High Spending | Spending ↑ with Age; Positive correlation with Age | Premium/luxury product offerings; ideal for aspirational targeting |
| Cluster-3 | Middle-Class Income, Moderate Spending, Younger demographic | Income vs Age shows indirect (−ve) correlation | Offer mid-range product bundles; observe for future upscaling |
| Cluster-4 | Older age, Business-Class Income, High Spending, High Variability in Age & Income | Strongest +ve correlation between Income & Spending | Priority segment for loyalty programs, exclusives, and high-value campaigns |
| Cluster-5 | Middle-Class Income, Moderate Spending | Income vs Spending shows negative (−ve) correlation | Introduce value-based promotions and financial advisory services |
| Cluster-6 | Economy-Class Income, Lowest Spending, High Fluctuations in Age & Spending | Age ↑ → Spending ↓; Income ↑ → Spending ↑; Income & Age uncorrelated | Low-cost offerings; monitor for cost sensitivity; stabilize fluctuations |
| Cluster-7 | Youngest (Median Age = 23.5), Economy-Class Income, Very High Spending | Income vs Spending shows negative (−ve) correlation | Target for youth-focused campaigns; leverage impulse behavior via digital channels |
| Strategy Area | Recommendation |
|---|---|
| Segmented Marketing | Customize messages per cluster (e.g., luxury to Cluster-4, budget to Cluster-6) |
| Product Offering | Use subscription models, exclusive bundles for high-spenders (Cluster-2, 4, 7) |
| Customer Lifecycle | Engage Cluster-7/2 early for long-term loyalty; retain Cluster-1 with age-appropriate services |
| Risk Minimization | Monitor high fluctuation clusters (4, 6) for churn or income-spending shifts |
| Data-Driven Personalization | Use log-log regression insight: 1% income ↑ → 2.42% spending ↑; micro-target income tiers |
Note: All recommendations are grounded in the observed descriptive statistical and correlation-based cluster insights. As far as hypothesis testing is concerned, only data related to Cluster-4 is found to be statistically significant.
- Suppose we have a new observation: Age = 25, Annual Income = 60k$, Spending Score = 55.
- How can we assign this new customer to a cluster?
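One common answer is to assign the new customer to the nearest cluster centroid. This assumes the segments came from a centroid-based method such as k-means; the text here does not say which algorithm produced the seven clusters. The centroids below are placeholders, not the project's fitted values, and in practice the same feature scaling used during training would need to be applied first.

```python
from math import dist

# Hypothetical centroids as (Age, Annual Income k$, Spending Score).
# Placeholder numbers only; replace with the centroids of the fitted model.
centroids = {
    "A": (66.0, 55.0, 45.0),
    "B": (32.0, 85.0, 82.0),
    "C": (24.0, 25.0, 78.0),
}

def assign_cluster(observation, centroids):
    """Return the label of the centroid closest to the observation (Euclidean)."""
    return min(centroids, key=lambda label: dist(observation, centroids[label]))

new_customer = (25, 60, 55)  # Age, Annual Income (k$), Spending Score
assigned = assign_cluster(new_customer, centroids)
print("Assigned cluster:", assigned)
```

Because income and spending sit on different scales than age, unscaled Euclidean distance lets the wider-ranging features dominate, which is why matching the training-time preprocessing matters.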
Try It (Minimal User Interface)
Clone this repository to your local machine:

    git clone https://github.com/pb319/Segment-Stream.git

Set up a virtual environment.

On Linux/macOS:

    python3 -m venv env
    source env/bin/activate

On Windows:

    python -m venv env
    env\Scripts\activate

Install dependencies and start Jupyter Notebook:

    pip install jupyter notebook
    jupyter notebook

- With the data in hand (200 training examples), apply other clustering algorithms such as DBSCAN, Hierarchical Clustering, and Gaussian Mixture Models, and compare them using clustering evaluation metrics (Dunn Index, Silhouette Coefficient, etc.).
- Running anomaly detection techniques (Isolation Forest, rule-based methods) could surface further insights.
- Consider a higher-dimensional dataset; dimensionality reduction techniques (PCA, t-SNE) would help visualize it.
- Extend the current cross-sectional analysis to longitudinal/panel data using time series analysis.
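To compare algorithms as the first item suggests, each candidate labeling can be scored with the Silhouette Coefficient, which contrasts a point's mean distance to its own cluster with its mean distance to the nearest other cluster. The sketch below is a plain-Python version run on toy 2-D points (not project data). In practice a library routine such as scikit-learn's `silhouette_score` would normally be used instead.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

good_score = silhouette(points, good_labels)
bad_score = silhouette(points, bad_labels)
print(f"good labeling: {good_score:.3f}, bad labeling: {bad_score:.3f}")
```

Scores near 1 indicate compact, well-separated clusters, so the labeling from whichever algorithm scores highest on the same data is the better-separated one by this metric.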
- Alves Gomes, M., & Meisen, T. (2023). A review on customer segmentation methods for personalized customer targeting in e-commerce use cases. Information Systems and e-Business Management, 21(3), 527-570.
- Kim, S. Y., Jung, T. S., Suh, E. H., & Hwang, H. S. (2006). Customer segmentation and strategy development based on customer lifetime value: A case study. Expert Systems with Applications, 31(1), 101-107.

