# 💰 Indian Savings Predictor
Predicting savings behavior from personal finance data: a full ML pipeline covering regression, classification, clustering, and evaluation on 20,000 Indian household records.
## 📊 Dataset Overview
Source: Indian Personal Finance and Spending Habits (Kaggle)
Colab Notebook: https://colab.research.google.com/drive/1i6PWAJB2ElNH2T7cooUOY5yoYNgbF_qo?usp=sharing
Size: 20,000 individuals | License: MIT
| Feature Group | Columns |
|---|---|
| Demographics | Age, Occupation, City_Tier, Dependents |
| Income | Monthly_Income |
| Expenses (11) | Rent, Groceries, Transport, Eating_Out, Entertainment, Utilities, Healthcare, Education, Clothing, Loan_Repayment, Savings |
| Target (Regression) | Desired_Savings_Percentage (5–25%, continuous) |
| Target (Classification) | Low / Mid / High saver (quantile binning) |
## 🔍 Part 2: Exploratory Data Analysis
### Research Questions (answered with visuals)
We framed 6 questions before touching the data:
- Does savings ambition differ by occupation?
- How strongly does income predict savings goals?
- Are expense columns informative or redundant?
- Do city tier and occupation interact?
- Are there extreme outliers that need removal?
- Does age moderate the income–savings relationship?
### Target Distribution
The target (Desired_Savings_Percentage) is bounded 5–25%, with mean 9.8%, median 8.9%, and a moderate right skew of 1.42. The bimodal shape (a large mass near 7–9% and a smaller peak around 15–18%) hints at two distinct income regimes that later drive our clustering strategy.
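These distribution claims are easy to sanity-check numerically. A minimal sketch on hypothetical stand-in data (two Gaussian regimes clipped to the 5–25% range; the notebook runs the same checks on the real column):

```python
import numpy as np
from scipy.stats import skew

# Hypothetical stand-in for Desired_Savings_Percentage: a bounded,
# right-skewed, bimodal mixture mimicking two income regimes.
rng = np.random.default_rng(0)
low_regime = rng.normal(8, 1.0, 15_000)    # large mass near 7-9%
high_regime = rng.normal(16, 1.5, 5_000)   # smaller peak near 15-18%
target = np.clip(np.concatenate([low_regime, high_regime]), 5, 25)

# Right skew pulls the mean above the median.
print(f"mean={target.mean():.1f}  median={np.median(target):.1f}  skew={skew(target):.2f}")
```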
### Occupation vs Savings (Ridgeline)
Verdict: no difference. All four occupations (Business, Salaried, Self-Employed, Student) share near-identical means (~9.8%) and nearly indistinguishable distributions. Occupation alone does not explain savings ambition, a striking non-finding that redirects attention toward income structure.
### Feature Correlation Heatmap
Two insights from the hierarchical heatmap:
- Income dominates: Pearson r = 0.78 with the target
- All expense columns are near-synonyms of income: they correlate ~0.95+ with each other, making them collectively redundant unless transformed into ratios (expense / income)
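The ratio remedy is worth a quick demonstration. In the synthetic sketch below (all numbers hypothetical), two expense columns that both scale with income are strongly correlated in raw form but nearly uncorrelated once divided by income:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5_000
# Hypothetical columns: both expenses scale with income, so raw values are collinear.
income = rng.lognormal(10.5, 0.5, n)
df = pd.DataFrame({
    "Monthly_Income": income,
    "Rent": income * rng.uniform(0.20, 0.30, n),
    "Groceries": income * rng.uniform(0.10, 0.15, n),
})

raw_corr = df["Rent"].corr(df["Groceries"])   # both track income -> high
ratio_corr = (df["Rent"] / df["Monthly_Income"]).corr(
    df["Groceries"] / df["Monthly_Income"])   # income scaled out -> near zero
print(f"raw r={raw_corr:.2f}  ratio r={ratio_corr:.2f}")
```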
### Income vs Savings (Hexbin)
The hexbin plot reveals a stepped, non-linear relationship: income and savings jump discretely rather than linearly. This motivates the log-income transform and income-tier clustering used in Part 4. Pearson r(log-income, target) = 0.734.
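A small illustration of why the log transform pays off: when the target tracks log-income, the Pearson correlation against log-income exceeds the one against raw income. The data and coefficients below are synthetic, not the notebook's:

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(10.5, 0.6, 10_000)
# Hypothetical: the savings target rises with log-income plus noise.
target = 2.5 * np.log(income) + rng.normal(0, 1.0, 10_000)

r_raw = np.corrcoef(income, target)[0, 1]
r_log = np.corrcoef(np.log(income), target)[0, 1]
print(f"r(income)={r_raw:.3f}  r(log income)={r_log:.3f}")
```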
### Occupation × City Tier Heatmap
No meaningful interaction between occupation and city tier on savings ambition. All cells hover near 9.8%, confirming that neither variable is a useful feature on its own.
### Expense Column Distributions
Every expense column has a pronounced right tail (skew 3.8–5.4). Under the IQR rule, thousands of rows would be flagged, but these are real high earners, not data errors. Decision: keep all outliers.
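The scale of IQR flagging on a heavy right tail is easy to reproduce. On a hypothetical lognormal expense column of 20,000 rows, the standard 1.5×IQR fence flags well over a thousand values that are simply large, not wrong:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical right-skewed expense column (lognormal, like Rent or Groceries).
expense = rng.lognormal(9, 0.9, 20_000)

q1, q3 = np.percentile(expense, [25, 75])
fence = q3 + 1.5 * (q3 - q1)          # standard upper IQR fence
flagged = int((expense > fence).sum())
print(f"rows flagged by the IQR rule: {flagged}")
```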
Age Γ Income Tier vs Savings
Age shows no meaningful moderation effect within income tiers. The low/mid/high income groups each remain flat across age bands. Once you condition on income tier, age adds nothing β confirming that the stepped structure is purely income-driven.
## ⚙️ Part 3: Baseline Model
Goal: Predict Desired_Savings_Percentage from raw features.
### Baseline: Linear Regression (raw features)
| Metric | Train | Test |
|---|---|---|
| MAE | 1.79 pp | 1.79 pp |
| RMSE | 2.41 pp | 2.63 pp |
| RΒ² | 0.621 | 0.541 |
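A minimal sketch of the baseline setup, with two hypothetical raw columns standing in for the real 25-column matrix (the metrics it prints are illustrative, not the table's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
income = rng.lognormal(10.5, 0.6, 5_000)
rent = income * rng.uniform(0.2, 0.3, 5_000)
# Hypothetical target with the log/step structure the baseline misses.
y = 2.5 * np.log(income) + rng.normal(0, 1.0, 5_000)
X = np.column_stack([income, rent])          # raw features only, no transforms

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
mae = mean_absolute_error(y_te, pred)
r2 = r2_score(y_te, pred)
print(f"MAE={mae:.2f}  R2={r2:.3f}")
```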
### Residual Diagnostics
The residual vs. predicted plot shows two horizontal bands: the model is systematically underpredicting high savers and overpredicting low ones. This is the signature of a missed non-linearity (the income step) that feature engineering will resolve.
### Feature Importance (Baseline)
Income dominates with a standardized coefficient of 3.1, while most expense columns cancel each other out due to collinearity. The baseline sees the right direction but can't capture the stepped structure.
## 🛠️ Part 4: Feature Engineering
Six layers of transformation map 25 raw columns to 52 engineered features:
| Layer | Tool | What it solved |
|---|---|---|
| Binary flags | hand-crafted | Zero-inflation (has_loan, has_education) |
| Log transform | hand-crafted | Linearized incomeβtarget relationship |
| Expense ratios | hand-crafted | Removed income-scale collinearity |
| Polynomial (degree-2) | sklearn | Captured incomeΓexpense interactions |
| PCA (15 components) | sklearn | Compressed redundant expense space |
| K-Means clusters | sklearn | Added savings-persona label as feature |
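The six layers can be strung together in a few lines of scikit-learn. Everything below (column counts, the flag rule, the random data) is illustrative rather than the notebook's exact pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(3)
n = 2_000
income = rng.lognormal(10.5, 0.5, n)
expenses = income[:, None] * rng.uniform(0.05, 0.30, (n, 5))  # 5 hypothetical expense columns

log_income = np.log(income)[:, None]                           # layer 2: log transform
ratios = expenses / income[:, None]                            # layer 3: expense ratios
has_loan = (expenses[:, 0] > np.median(expenses[:, 0])).astype(float)[:, None]  # layer 1: flag (hypothetical rule)
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(
    np.hstack([log_income, ratios]))                           # layer 4: interactions
pca = PCA(n_components=3).fit_transform(
    StandardScaler().fit_transform(expenses))                  # layer 5: compress expense space
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    np.hstack([log_income, ratios]))[:, None]                  # layer 6: persona label

X_fe = np.hstack([has_loan, log_income, ratios, poly, pca, clusters])
print(X_fe.shape)   # (2000, 38) here; the notebook's 25 raw columns yield 52
```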
### Clustering: 4 Savings Personas
After testing k=2 (silhouette=0.234, Δ target = 0.02 pp, i.e. useless) and iterating to k=4 (silhouette=0.168, Δ target = 5.87 pp), we found 4 behaviorally meaningful savings personas:
| Cluster | Profile | Mean Savings Target |
|---|---|---|
| 0 | Low-income, few dependents | ~6.1% |
| 1 | Mid-income, moderate spenders | ~8.5% |
| 2 | High-income, heavy spenders | ~14.7% |
| 3 | High-income, lean spenders | ~11.9% |
We intentionally traded silhouette score for a dramatically larger target spread (5.87 pp vs 0.02 pp), a better business signal for the downstream regression.
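The k-selection trade-off can be replayed with a short sweep. The synthetic data below produces its own numbers, but the pattern (judge k by target spread as well as silhouette) is the same:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
X = rng.normal(0, 1, (1500, 4))
X[500:1000, 0] += 3        # hypothetical latent groups along one axis
X[1000:, 0] += 6
target = X[:, 0] + rng.normal(0, 0.5, 1500)   # stand-in savings target

results = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_score(X, labels)
    means = [target[labels == c].mean() for c in range(k)]
    results[k] = (sil, max(means) - min(means))   # (silhouette, target spread)
    print(f"k={k}: silhouette={results[k][0]:.3f}  target spread={results[k][1]:.2f}")
```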
## 📈 Part 5: Three Improved Models
All three models were trained on the 52-feature engineered matrix with the same preprocessing pipeline and train/test split.
### Model Comparison
| Model | Test R² | Test RMSE | Δ R² vs Baseline |
|---|---|---|---|
| Baseline Linear Reg. | 0.541 | 2.63 pp | – |
| Linear Reg. (FE) | 0.711 | 2.18 pp | +31% relative |
| Random Forest | 0.793 | 1.85 pp | +47% relative |
| Gradient Boosting | 0.832 | 1.67 pp | +54% relative |
Feature engineering alone (same algorithm) gave a 31% relative R² gain. Switching to Gradient Boosting added another 17%.
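The gap between the linear and boosted models comes from the stepped income structure, which a sketch reproduces (one hypothetical income feature with a tier jump; the R² values are synthetic, not the table's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
income = rng.lognormal(10.5, 0.6, 4_000)
# Hypothetical stepped target: savings ambition jumps for the top income tier.
y = np.where(income > np.quantile(income, 0.75), 16.0, 8.0) + rng.normal(0, 1.5, 4_000)
X = income[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
r2_lin = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
r2_gb = r2_score(y_te, GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te))
print(f"linear R2={r2_lin:.3f}  gradient boosting R2={r2_gb:.3f}")
```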
### Winning Model: Gradient Boosting Residuals
The residual plot is now structureless: no banding, no funnel. The two-regime structure exposed at baseline has been fully resolved. Residuals are symmetric and nearly homoscedastic across the full prediction range.
### Feature Importance (after Engineering)
After engineering, log_income is the dominant feature (as expected), but the cluster label, expense ratios, and PCA components all contribute meaningfully, validating that the six-layer engineering pipeline added real signal.
## 🏷️ Part 7: Regression → Classification
Quantile binning yields 3 balanced classes:
| Class | Range | Interpretation |
|---|---|---|
| Low | 5–8 pp | Conservative saver |
| Mid | 8–12 pp | Moderate saver |
| High | 12–25 pp | Ambitious saver |
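Quantile binning into three balanced classes is a single pandas call; a sketch on a hypothetical right-skewed target:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical stand-in for Desired_Savings_Percentage, clipped to 5-25.
target = pd.Series(np.clip(rng.gamma(4, 2.5, 9_000), 5, 25))

# Tercile cut -> three classes of roughly equal size.
labels = pd.qcut(target, q=3, labels=["Low", "Mid", "High"])
shares = labels.value_counts(normalize=True)
print(shares.round(3))
```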
### Class Balance
Each tier holds roughly 33% of the data across full/train/test splits. No imbalance handling needed.
Precision on the High class matters more. The business use case is a fintech recommending premium investment products to predicted high savers. A False Positive (predicting "High" for a low saver) damages both the user and the provider. A False Negative is a missed opportunity β less critical.
## 🤖 Part 8: Classification Models
### Confusion Matrices: Three Models
All three classifiers show the same healthy pattern: errors are boundary mistakes (Low↔Mid, Mid↔High), never extreme jumps. This is the signature of a well-behaved ordinal classifier.
| Model | Test Accuracy | High Precision | High F1 |
|---|---|---|---|
| Logistic Regression | ~72% | ~0.78 | ~0.72 |
| Random Forest | ~86% | ~0.91 | ~0.87 |
| Gradient Boosting | ~89% | ~0.93 | ~0.90 |
Winner: Gradient Boosting, with the highest precision on the High class and the best overall accuracy.
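A compact sketch of the classification stage, with hypothetical tiers derived from a single noisy income feature (the accuracy and precision it prints are synthetic, not the table's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
income = rng.lognormal(10.5, 0.6, 6_000)
# Hypothetical tiers: 0=Low, 1=Mid, 2=High by log-income terciles.
y = np.digitize(np.log(income), np.quantile(np.log(income), [1 / 3, 2 / 3]))
# One noisy income feature plus one pure-noise column.
X = np.column_stack([income + rng.normal(0, income * 0.1), rng.normal(0, 1, 6_000)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
pred = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
acc = accuracy_score(y_te, pred)
prec_high = precision_score(y_te, pred, labels=[2], average=None)[0]  # precision on the High class
print(f"accuracy={acc:.3f}  High precision={prec_high:.3f}")
```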
## 🔑 Key Takeaways
- Income step structure: log transform + clustering resolved the core non-linearity
- Feature engineering > model choice: FE alone gave +31% relative R²; the best model added +17% more
- Expense ratios > raw expenses: dividing by income removes scale dependence
- Clustering as a feature: k=4 personas added 5.87 pp target spread vs a useless k=2
## 📦 Repository Contents
| File | Description |
|---|---|
| `*.ipynb` | Full annotated notebook |
| `regression_model.pkl` | Gradient Boosting regression pipeline (52 features) |
| `classification_model.pkl` | Gradient Boosting classification pipeline (3 classes) |
Assignment #2: Classification, Regression, Clustering, Evaluation | April 2026