πŸ’° Indian Savings Predictor

Predicting savings behavior from personal finance data β€” a full ML pipeline covering regression, classification, clustering, and evaluation on 20,000 Indian household records.


πŸ“‹ Dataset Overview

Source: Indian Personal Finance and Spending Habits (Kaggle)
Colab Notebook: https://colab.research.google.com/drive/1i6PWAJB2ElNH2T7cooUOY5yoYNgbF_qo?usp=sharing
Size: 20,000 individuals | License: MIT

| Feature Group | Columns |
|---|---|
| Demographics | Age, Occupation, City_Tier, Dependents |
| Income | Monthly_Income |
| Expenses (11) | Rent, Groceries, Transport, Eating_Out, Entertainment, Utilities, Healthcare, Education, Clothing, Loan_Repayment, Savings |
| Target (Regression) | Desired_Savings_Percentage (5–25%, continuous) |
| Target (Classification) | Low / Mid / High saver (quantile binning) |

πŸ” Part 2: Exploratory Data Analysis

Research Questions (answered with visuals)

We framed 6 questions before touching the data:

  1. Does savings ambition differ by occupation?
  2. How strongly does income predict savings goals?
  3. Are expense columns informative or redundant?
  4. Do city tier and occupation interact?
  5. Are there extreme outliers that need removal?
  6. Does age moderate the income–savings relationship?

Target Distribution

image

The target (Desired_Savings_Percentage) is bounded 5–25%, with mean 9.8%, median 8.9%, and a moderate right skew of 1.42. The bimodal shape β€” a large mass near 7–9% and a smaller peak around 15–18% β€” hints at two distinct income regimes that later drive our clustering strategy.


Occupation vs Savings (Ridgeline)

image

Verdict: No difference. All four occupations (Business, Salaried, Self-Employed, Student) share near-identical means (~9.8%) and nearly indistinguishable distributions. Occupation alone does not explain savings ambition β€” a striking non-finding that redirects attention toward income structure.


Feature Correlation Heatmap

image

Two insights from the hierarchical heatmap:

  • Income dominates β€” Pearson r = 0.78 with the target
  • All expense columns are near-synonyms of income β€” they correlate ~0.95+ with each other, making them collectively redundant unless transformed into ratios (expense / income)
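The ratio trick can be demonstrated on synthetic data. This is a minimal sketch, not the dataset itself: the column names mirror the real ones but every value is generated, so the exact correlations will differ from the heatmap's.

```python
# Sketch: why expense ratios decorrelate features that all scale with income.
# Synthetic stand-in data only -- values are made up, names mirror the dataset.
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=5000)     # Monthly_Income stand-in
rent = income * rng.uniform(0.20, 0.30, size=5000)        # scales with income
groceries = income * rng.uniform(0.10, 0.15, size=5000)   # also scales with income

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

raw_r = pearson(rent, groceries)                      # near 1: both track income
ratio_r = pearson(rent / income, groceries / income)  # collapses toward 0
print(f"raw r = {raw_r:.2f}, ratio r = {ratio_r:.2f}")
```

Dividing each expense by income strips out the shared income scale, which is exactly what the Part 4 ratio layer does.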

Income vs Savings (Hexbin)

image

The hexbin plot reveals a stepped, non-linear relationship β€” income and savings jump discretely rather than linearly. This motivates the log-income transform and income-tier clustering used in Part 4. Pearson r(log-income, target) = 0.734.
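A small sketch of why the log transform helps, on synthetic stand-in data (the printed r values will not match the notebook's 0.734):

```python
# Sketch: a log transform linearizes a multiplicative income-target relationship.
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(10, 0.6, 4000)
# Target grows with log(income) plus noise -- a stand-in for the stepped pattern.
target = 2.0 * np.log(income) + rng.normal(0, 1.0, 4000)

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

r_raw = pearson(income, target)
r_log = pearson(np.log(income), target)
print(f"r(income, target)     = {r_raw:.3f}")
print(f"r(log income, target) = {r_log:.3f}")
```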


Occupation Γ— City Tier Heatmap

image

No meaningful interaction between occupation and city tier on savings ambition. All cells hover near 9.8%, confirming that neither variable is a useful feature on its own.


Expense Column Distributions

image

Every expense column has a pronounced right tail (skew 3.8–5.4). Under the IQR rule, thousands of rows would be flagged β€” but these are real high earners, not data errors. Decision: keep all outliers.
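The IQR screen described above can be sketched on a synthetic heavy-tailed column; the flag count is illustrative, not the dataset's:

```python
# Sketch of the IQR outlier rule on a right-skewed expense-like column.
import numpy as np

rng = np.random.default_rng(2)
rent = rng.lognormal(9, 1.0, 20000)  # heavy right tail, like the expense columns

q1, q3 = np.percentile(rent, [25, 75])
upper = q3 + 1.5 * (q3 - q1)          # standard upper IQR fence
n_flagged = int((rent > upper).sum())
print(f"IQR rule would flag {n_flagged} of {rent.size} rows")
```

On skewed-but-legitimate data like this, the rule flags thousands of real high earners, which is why the decision above is to keep them.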


Age Γ— Income Tier vs Savings

image

Age shows no meaningful moderation effect within income tiers. The low/mid/high income groups each remain flat across age bands. Once you condition on income tier, age adds nothing β€” confirming that the stepped structure is purely income-driven.


βš™οΈ Part 3: Baseline Model

Goal: Predict Desired_Savings_Percentage from raw features.

Baseline: Linear Regression (raw features)

| Metric | Train | Test |
|---|---|---|
| MAE | 1.79 pp | 1.79 pp |
| RMSE | 2.41 | 2.63 |
| R² | 0.621 | 0.541 |

Residual Diagnostics

image

The residual vs. predicted plot shows two horizontal bands β€” the model is systematically underpredicting high savers and overpredicting low ones. This is the signature of a missed non-linearity (the income step) that feature engineering will resolve.

Feature Importance (Baseline)

image

Income dominates with a standardized coefficient of 3.1, while most expense columns cancel each other out due to collinearity. The baseline sees the right direction but can't capture the stepped structure.
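Standardized coefficients and the collinearity cancellation can be reproduced on toy data. This sketch uses two hypothetical features, income and rent, where rent is nearly a copy of income:

```python
# Sketch: scale features first so coefficients are comparable; a collinear
# column ends up with a near-zero coefficient while the true driver keeps its weight.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
n = 2000
income = rng.normal(0, 1, n)
rent = 0.9 * income + rng.normal(0, 0.2, n)   # near-collinear with income
y = 3.0 * income + rng.normal(0, 1.0, n)      # only income truly drives the target

X = StandardScaler().fit_transform(np.column_stack([income, rent]))
coefs = LinearRegression().fit(X, y).coef_
print({name: round(c, 2) for name, c in zip(["income", "rent"], coefs)})
```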


πŸ› οΈ Part 4: Feature Engineering

Five layers of transformation applied to 25 raw columns β†’ 52 engineered features:

| Layer | Tool | What it solved |
|---|---|---|
| Binary flags | hand-crafted | Zero-inflation (has_loan, has_education) |
| Log transform | hand-crafted | Linearized income–target relationship |
| Expense ratios | hand-crafted | Removed income-scale collinearity |
| Polynomial (degree-2) | sklearn | Captured income×expense interactions |
| PCA (15 components) | sklearn | Compressed redundant expense space |
| K-Means clusters | sklearn | Added savings-persona label as feature |
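The layers above can be condensed into one sketch. Everything below is illustrative: the synthetic columns, the 3-component PCA (the notebook uses 15), and the resulting feature count all differ from the real pipeline.

```python
# Condensed sketch of the engineering layers on synthetic stand-in data.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
n = 1000
income = rng.lognormal(10, 0.5, n)
expenses = income[:, None] * rng.uniform(0.05, 0.3, (n, 5))  # 5 expense columns
loan = expenses[:, 0] * (rng.random(n) < 0.4)                # zero-inflated column

has_loan = (loan > 0).astype(float)                          # layer 1: binary flag
log_income = np.log1p(income)                                # layer 2: log transform
ratios = expenses / income[:, None]                          # layer 3: expense ratios
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(
    np.column_stack([log_income, ratios[:, 0]]))             # layer 4: interactions
pca = PCA(n_components=3).fit_transform(
    StandardScaler().fit_transform(expenses))                # layer 5a: PCA
cluster = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    np.column_stack([log_income, ratios]))                   # layer 5b: persona label

X_fe = np.column_stack([has_loan[:, None], log_income[:, None],
                        ratios, poly, pca, cluster[:, None]])
print("engineered matrix shape:", X_fe.shape)
```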

Clustering: 4 Savings Personas

image

After testing k=2 (silhouette=0.234, Ξ” target = 0.02 pp β€” useless) and iterating to k=4 (silhouette=0.168, Ξ” target = 5.87 pp), we found 4 behaviorally meaningful savings personas:

| Cluster | Profile | Mean Savings Target |
|---|---|---|
| 0 | Low-income, few dependents | ~6.1% |
| 1 | Mid-income, moderate spenders | ~8.5% |
| 2 | High-income, heavy spenders | ~14.7% |
| 3 | High-income, lean spenders | ~11.9% |

We intentionally traded silhouette score for a far larger target spread (5.87 pp at k=4 vs 0.02 pp at k=2), a better business signal for the downstream regression.
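The k-selection trade-off can be reproduced on toy personas. The silhouette and spread values printed here are illustrative, not the notebook's 0.168 / 5.87 pp:

```python
# Sketch: comparing k=2 vs k=4 by silhouette and by target spread across clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
n = 1200
persona = rng.integers(0, 4, n)                         # four latent personas
X = (persona + rng.normal(0, 0.25, n)).reshape(-1, 1)   # 1-D behavioural feature
target = np.array([6.1, 8.5, 14.7, 11.9])[persona] + rng.normal(0, 0.5, n)

results = {}
for k in (2, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    spread = np.ptp([target[labels == c].mean() for c in range(k)])
    results[k] = (silhouette_score(X, labels), spread)
    print(f"k={k}: silhouette={results[k][0]:.3f}, target spread={results[k][1]:.2f} pp")
```

When the latent structure has four groups, k=4 widens the per-cluster target spread even if k=2 scores a tidier silhouette.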


πŸ“Š Part 5: Three Improved Models

All three models trained on the 52-feature engineered matrix with the same preprocessing pipeline and train/test split.

Model Comparison

image

Model Test RΒ² Test RMSE Ξ” RΒ² vs Baseline
Baseline Linear Reg. 0.541 2.63 pp β€”
Linear Reg. (FE) 0.711 2.18 pp +31% relative
Random Forest 0.793 1.85 pp +47% relative
Gradient Boosting 0.832 1.67 pp +54% relative

Feature engineering alone (same algorithm) gave a 31% relative RΒ² gain. Switching to Gradient Boosting added another 17%.
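A sketch of why gradient boosting beats the linear baseline on a stepped target. The income thresholds and R² values here are synthetic and will not match the table:

```python
# Sketch: linear regression vs gradient boosting on a step-shaped target.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(6)
income = rng.lognormal(10, 0.6, 3000)
# Stepped target: ambition jumps at income thresholds, echoing the hexbin plot.
y = np.select([income < 2e4, income < 6e4], [7.0, 10.0], default=16.0)
y = y + rng.normal(0, 1.0, 3000)
X = income.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
scores = {}
for name, model in [("linear", LinearRegression()),
                    ("gbm", GradientBoostingRegressor(random_state=0))]:
    scores[name] = r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: test R^2 = {scores[name]:.3f}")
```

Tree ensembles carve the input space at the thresholds directly, which a single linear fit cannot do.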

Winning Model β€” Gradient Boosting Residuals

image

The residual plot is now structureless β€” no banding, no funnel. The two-regime structure exposed at baseline has been fully resolved. Residuals are symmetric and nearly homoscedastic across the full prediction range.

Feature Importance (after Engineering)

image

After engineering, log_income is the dominant feature (as expected), but the cluster label, expense ratios, and PCA components all contribute meaningfully β€” validating that the 5-layer engineering pipeline added real signal.


🏷️ Part 7: Regression β†’ Classification

Quantile binning β†’ 3 balanced classes:

| Class | Range | Interpretation |
|---|---|---|
| Low | 5–8 pp | Conservative saver |
| Mid | 8–12 pp | Moderate saver |
| High | 12–25 pp | Ambitious saver |
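The binning step, assuming pandas' qcut (the notebook's exact call may differ); uniform stand-in data gives perfectly balanced tiers:

```python
# Sketch: quantile binning the continuous target into three balanced classes.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
savings_pct = rng.uniform(5, 25, 9000)   # stand-in for Desired_Savings_Percentage

tiers = pd.qcut(savings_pct, q=3, labels=["Low", "Mid", "High"])
counts = pd.Series(tiers).value_counts()
print(counts)
```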

Class Balance

image

Each tier holds roughly a third of the data across the full, train, and test splits, so no imbalance handling is needed.

Precision on the High class matters more. The business use case is a fintech recommending premium investment products to predicted high savers. A False Positive (predicting "High" for a low saver) damages both the user and the provider. A False Negative is a missed opportunity β€” less critical.
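Per-class precision on High can be computed with sklearn's precision_score; the labels below are a hypothetical toy example, not model output:

```python
# Sketch: precision on the "High" class only -- the business-critical metric here.
from sklearn.metrics import precision_score

y_true = ["High", "High", "Mid", "Low", "High", "Mid"]
y_pred = ["High", "Mid",  "Mid", "Low", "High", "High"]

# Of the 3 "High" predictions, 2 are truly High -> precision 0.67.
prec_high = precision_score(y_true, y_pred, labels=["High"], average=None)[0]
print(f"High precision = {prec_high:.2f}")
```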


πŸ€– Part 8: Classification Models

Confusion Matrices β€” Three Models

image

All three classifiers show the same healthy pattern: errors are boundary mistakes (Low↔Mid, Mid↔High), never extreme jumps. This is the signature of a well-behaved ordinal classifier.

| Model | Test Accuracy | High Precision | High F1 |
|---|---|---|---|
| Logistic Regression | ~72% | ~0.78 | ~0.72 |
| Random Forest | ~86% | ~0.91 | ~0.87 |
| Gradient Boosting | ~89% | ~0.93 | ~0.90 |

Winner: Gradient Boosting β€” highest precision on the High class and best overall accuracy.


πŸ“Œ Key Takeaways

  1. Income step structure β€” log transform + clustering resolved the core non-linearity
  2. Feature engineering > model choice β€” FE alone gave +31% RΒ²; best model added +17% more
  3. Expense ratios > raw expenses β€” dividing by income removes scale dependence
  4. Clustering as a feature β€” k=4 personas added 5.87 pp target spread vs useless k=2

πŸ“¦ Repository Contents

| File | Description |
|---|---|
| *.ipynb | Full annotated notebook |
| regression_model.pkl | Gradient Boosting regression pipeline (52 features) |
| classification_model.pkl | Gradient Boosting classification pipeline (3 classes) |

Assignment #2 β€” Classification, Regression, Clustering, Evaluation | April 2026
