πŸ’° Indian Savings Predictor

Predicting savings behavior from personal finance data β€” a full ML pipeline covering regression, classification, clustering, and evaluation on 20,000 Indian household records.


πŸ“‹ Dataset Overview

Source: Indian Personal Finance and Spending Habits (Kaggle)
Colab Notebook: https://colab.research.google.com/drive/1i6PWAJB2ElNH2T7cooUOY5yoYNgbF_qo?usp=sharing
Size: 20,000 individuals | License: MIT

| Feature Group | Columns |
|---|---|
| Demographics | Age, Occupation, City_Tier, Dependents |
| Income | Monthly_Income |
| Expenses (11) | Rent, Groceries, Transport, Eating_Out, Entertainment, Utilities, Healthcare, Education, Clothing, Loan_Repayment, Savings |
| Target (Regression) | Desired_Savings_Percentage (5–25%, continuous) |
| Target (Classification) | Low / Mid / High saver (quantile binning) |

πŸ” Part 2: Exploratory Data Analysis

Research Questions (answered with visuals)

We framed 6 questions before touching the data:

  1. Does savings ambition differ by occupation?
  2. How strongly does income predict savings goals?
  3. Are expense columns informative or redundant?
  4. Do city tier and occupation interact?
  5. Are there extreme outliers that need removal?
  6. Does age moderate the income–savings relationship?

Target Distribution

image

The target (Desired_Savings_Percentage) is bounded 5–25%, with mean 9.8%, median 8.9%, and a moderate right skew of 1.42. The bimodal shape β€” a large mass near 7–9% and a smaller peak around 15–18% β€” hints at two distinct income regimes that later drive our clustering strategy.


Occupation vs Savings (Ridgeline)

image

Verdict: No difference. All four occupations (Business, Salaried, Self-Employed, Student) share near-identical means (~9.8%) and nearly indistinguishable distributions. Occupation alone does not explain savings ambition β€” a striking non-finding that redirects attention toward income structure.


Feature Correlation Heatmap

image

Two insights from the hierarchical heatmap:

  • Income dominates β€” Pearson r = 0.78 with the target
  • All expense columns are near-synonyms of income β€” they correlate ~0.95+ with each other, making them collectively redundant unless transformed into ratios (expense / income)
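The ratio trick can be demonstrated on synthetic data. This is a minimal sketch, not the dataset itself: the column names mirror the real ones but every value is generated, so the exact correlations will differ from the heatmap's.

```python
# Sketch: why expense ratios decorrelate features that all scale with income.
# Synthetic stand-in data only -- values are made up, names mirror the dataset.
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=5000)     # Monthly_Income stand-in
rent = income * rng.uniform(0.20, 0.30, size=5000)        # scales with income
groceries = income * rng.uniform(0.10, 0.15, size=5000)   # also scales with income

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

raw_r = pearson(rent, groceries)                      # near 1: both track income
ratio_r = pearson(rent / income, groceries / income)  # collapses toward 0
print(f"raw r = {raw_r:.2f}, ratio r = {ratio_r:.2f}")
```

Dividing each expense by income strips out the shared income scale, which is exactly what the Part 4 ratio layer does.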

Income vs Savings (Hexbin)

image

The hexbin plot reveals a stepped, non-linear relationship β€” income and savings jump discretely rather than linearly. This motivates the log-income transform and income-tier clustering used in Part 4. Pearson r(log-income, target) = 0.734.
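A small sketch of why the log transform helps, on synthetic stand-in data (the printed r values will not match the notebook's 0.734):

```python
# Sketch: a log transform linearizes a multiplicative income-target relationship.
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(10, 0.6, 4000)
# Target grows with log(income) plus noise -- a stand-in for the stepped pattern.
target = 2.0 * np.log(income) + rng.normal(0, 1.0, 4000)

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

r_raw = pearson(income, target)
r_log = pearson(np.log(income), target)
print(f"r(income, target)     = {r_raw:.3f}")
print(f"r(log income, target) = {r_log:.3f}")
```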


Occupation Γ— City Tier Heatmap

image

No meaningful interaction between occupation and city tier on savings ambition. All cells hover near 9.8%, confirming that neither variable is a useful feature on its own.


Expense Column Distributions

image

Every expense column has a pronounced right tail (skew 3.8–5.4). Under the IQR rule, thousands of rows would be flagged β€” but these are real high earners, not data errors. Decision: keep all outliers.
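The IQR screen described above can be sketched on a synthetic heavy-tailed column; the flag count is illustrative, not the dataset's:

```python
# Sketch of the IQR outlier rule on a right-skewed expense-like column.
import numpy as np

rng = np.random.default_rng(2)
rent = rng.lognormal(9, 1.0, 20000)  # heavy right tail, like the expense columns

q1, q3 = np.percentile(rent, [25, 75])
upper = q3 + 1.5 * (q3 - q1)          # standard upper IQR fence
n_flagged = int((rent > upper).sum())
print(f"IQR rule would flag {n_flagged} of {rent.size} rows")
```

On skewed-but-legitimate data like this, the rule flags thousands of real high earners, which is why the decision above is to keep them.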


Age Γ— Income Tier vs Savings

image

Age shows no meaningful moderation effect within income tiers. The low/mid/high income groups each remain flat across age bands. Once you condition on income tier, age adds nothing β€” confirming that the stepped structure is purely income-driven.


βš™οΈ Part 3: Baseline Model

Goal: Predict Desired_Savings_Percentage from raw features.

Baseline: Linear Regression (raw features)

| Metric | Train | Test |
|---|---|---|
| MAE | 1.79 pp | 1.79 pp |
| RMSE | 2.41 | 2.63 |
| R² | 0.621 | 0.541 |

Residual Diagnostics

image

The residual vs. predicted plot shows two horizontal bands β€” the model is systematically underpredicting high savers and overpredicting low ones. This is the signature of a missed non-linearity (the income step) that feature engineering will resolve.

Feature Importance (Baseline)

image

Income dominates with a standardized coefficient of 3.1, while most expense columns cancel each other out due to collinearity. The baseline sees the right direction but can't capture the stepped structure.
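Standardized coefficients and the collinearity cancellation can be reproduced on toy data. This sketch uses two hypothetical features, income and rent, where rent is nearly a copy of income:

```python
# Sketch: scale features first so coefficients are comparable; a collinear
# column ends up with a near-zero coefficient while the true driver keeps its weight.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
n = 2000
income = rng.normal(0, 1, n)
rent = 0.9 * income + rng.normal(0, 0.2, n)   # near-collinear with income
y = 3.0 * income + rng.normal(0, 1.0, n)      # only income truly drives the target

X = StandardScaler().fit_transform(np.column_stack([income, rent]))
coefs = LinearRegression().fit(X, y).coef_
print({name: round(c, 2) for name, c in zip(["income", "rent"], coefs)})
```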


πŸ› οΈ Part 4: Feature Engineering

Five layers of transformation applied to 25 raw columns β†’ 52 engineered features:

| Layer | Tool | What it solved |
|---|---|---|
| Binary flags | hand-crafted | Zero-inflation (has_loan, has_education) |
| Log transform | hand-crafted | Linearized income–target relationship |
| Expense ratios | hand-crafted | Removed income-scale collinearity |
| Polynomial (degree-2) | sklearn | Captured income×expense interactions |
| PCA (15 components) | sklearn | Compressed redundant expense space |
| K-Means clusters | sklearn | Added savings-persona label as feature |
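The layers above can be condensed into one sketch. Everything below is illustrative: the synthetic columns, the 3-component PCA (the notebook uses 15), and the resulting feature count all differ from the real pipeline.

```python
# Condensed sketch of the engineering layers on synthetic stand-in data.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
n = 1000
income = rng.lognormal(10, 0.5, n)
expenses = income[:, None] * rng.uniform(0.05, 0.3, (n, 5))  # 5 expense columns
loan = expenses[:, 0] * (rng.random(n) < 0.4)                # zero-inflated column

has_loan = (loan > 0).astype(float)                          # layer 1: binary flag
log_income = np.log1p(income)                                # layer 2: log transform
ratios = expenses / income[:, None]                          # layer 3: expense ratios
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(
    np.column_stack([log_income, ratios[:, 0]]))             # layer 4: interactions
pca = PCA(n_components=3).fit_transform(
    StandardScaler().fit_transform(expenses))                # layer 5a: PCA
cluster = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    np.column_stack([log_income, ratios]))                   # layer 5b: persona label

X_fe = np.column_stack([has_loan[:, None], log_income[:, None],
                        ratios, poly, pca, cluster[:, None]])
print("engineered matrix shape:", X_fe.shape)
```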

Clustering: 4 Savings Personas

image

After testing k=2 (silhouette=0.234, Ξ” target = 0.02 pp β€” useless) and iterating to k=4 (silhouette=0.168, Ξ” target = 5.87 pp), we found 4 behaviorally meaningful savings personas:

| Cluster | Profile | Mean Savings Target |
|---|---|---|
| 0 | Low-income, few dependents | ~6.1% |
| 1 | Mid-income, moderate spenders | ~8.5% |
| 2 | High-income, heavy spenders | ~14.7% |
| 3 | High-income, lean spenders | ~11.9% |

We intentionally traded silhouette score for a far larger target spread (5.87 pp at k=4 vs 0.02 pp at k=2), a better business signal for the downstream regression.
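The k-selection trade-off can be reproduced on toy personas. The silhouette and spread values printed here are illustrative, not the notebook's 0.168 / 5.87 pp:

```python
# Sketch: comparing k=2 vs k=4 by silhouette and by target spread across clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
n = 1200
persona = rng.integers(0, 4, n)                         # four latent personas
X = (persona + rng.normal(0, 0.25, n)).reshape(-1, 1)   # 1-D behavioural feature
target = np.array([6.1, 8.5, 14.7, 11.9])[persona] + rng.normal(0, 0.5, n)

results = {}
for k in (2, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    spread = np.ptp([target[labels == c].mean() for c in range(k)])
    results[k] = (silhouette_score(X, labels), spread)
    print(f"k={k}: silhouette={results[k][0]:.3f}, target spread={results[k][1]:.2f} pp")
```

When the latent structure has four groups, k=4 widens the per-cluster target spread even if k=2 scores a tidier silhouette.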


πŸ“Š Part 5: Three Improved Models

All three models trained on the 52-feature engineered matrix with the same preprocessing pipeline and train/test split.

Model Comparison

image

Model Test RΒ² Test RMSE Ξ” RΒ² vs Baseline
Baseline Linear Reg. 0.541 2.63 pp β€”
Linear Reg. (FE) 0.711 2.18 pp +31% relative
Random Forest 0.793 1.85 pp +47% relative
Gradient Boosting 0.832 1.67 pp +54% relative

Feature engineering alone (same algorithm) gave a 31% relative RΒ² gain. Switching to Gradient Boosting added another 17%.
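A sketch of why gradient boosting beats the linear baseline on a stepped target. The income thresholds and R² values here are synthetic and will not match the table:

```python
# Sketch: linear regression vs gradient boosting on a step-shaped target.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(6)
income = rng.lognormal(10, 0.6, 3000)
# Stepped target: ambition jumps at income thresholds, echoing the hexbin plot.
y = np.select([income < 2e4, income < 6e4], [7.0, 10.0], default=16.0)
y = y + rng.normal(0, 1.0, 3000)
X = income.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
scores = {}
for name, model in [("linear", LinearRegression()),
                    ("gbm", GradientBoostingRegressor(random_state=0))]:
    scores[name] = r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: test R^2 = {scores[name]:.3f}")
```

Tree ensembles carve the input space at the thresholds directly, which a single linear fit cannot do.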

Winning Model β€” Gradient Boosting Residuals

image

The residual plot is now structureless β€” no banding, no funnel. The two-regime structure exposed at baseline has been fully resolved. Residuals are symmetric and nearly homoscedastic across the full prediction range.

Feature Importance (after Engineering)

image

After engineering, log_income is the dominant feature (as expected), but the cluster label, expense ratios, and PCA components all contribute meaningfully β€” validating that the 5-layer engineering pipeline added real signal.


🏷️ Part 7: Regression β†’ Classification

Quantile binning β†’ 3 balanced classes:

| Class | Range | Interpretation |
|---|---|---|
| Low | 5–8 pp | Conservative saver |
| Mid | 8–12 pp | Moderate saver |
| High | 12–25 pp | Ambitious saver |
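The binning step, assuming pandas' qcut (the notebook's exact call may differ); uniform stand-in data gives perfectly balanced tiers:

```python
# Sketch: quantile binning the continuous target into three balanced classes.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
savings_pct = rng.uniform(5, 25, 9000)   # stand-in for Desired_Savings_Percentage

tiers = pd.qcut(savings_pct, q=3, labels=["Low", "Mid", "High"])
counts = pd.Series(tiers).value_counts()
print(counts)
```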

Class Balance

image

Each tier holds roughly a third of the data across the full, train, and test splits, so no imbalance handling is needed.

Precision on the High class matters more. The business use case is a fintech recommending premium investment products to predicted high savers. A False Positive (predicting "High" for a low saver) damages both the user and the provider. A False Negative is a missed opportunity β€” less critical.
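Per-class precision on High can be computed with sklearn's precision_score; the labels below are a hypothetical toy example, not model output:

```python
# Sketch: precision on the "High" class only -- the business-critical metric here.
from sklearn.metrics import precision_score

y_true = ["High", "High", "Mid", "Low", "High", "Mid"]
y_pred = ["High", "Mid",  "Mid", "Low", "High", "High"]

# Of the 3 "High" predictions, 2 are truly High -> precision 0.67.
prec_high = precision_score(y_true, y_pred, labels=["High"], average=None)[0]
print(f"High precision = {prec_high:.2f}")
```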


πŸ€– Part 8: Classification Models

Confusion Matrices β€” Three Models

image

All three classifiers show the same healthy pattern: errors are boundary mistakes (Low↔Mid, Mid↔High), never extreme jumps. This is the signature of a well-behaved ordinal classifier.

| Model | Test Accuracy | High Precision | High F1 |
|---|---|---|---|
| Logistic Regression | ~72% | ~0.78 | ~0.72 |
| Random Forest | ~86% | ~0.91 | ~0.87 |
| Gradient Boosting | ~89% | ~0.93 | ~0.90 |

Winner: Gradient Boosting β€” highest precision on the High class and best overall accuracy.


πŸ“Œ Key Takeaways

  1. Income step structure β€” log transform + clustering resolved the core non-linearity
  2. Feature engineering > model choice β€” FE alone gave +31% RΒ²; best model added +17% more
  3. Expense ratios > raw expenses β€” dividing by income removes scale dependence
  4. Clustering as a feature β€” k=4 personas added 5.87 pp target spread vs useless k=2

πŸ“¦ Repository Contents

| File | Description |
|---|---|
| *.ipynb | Full annotated notebook |
| regression_model.pkl | Gradient Boosting regression pipeline (52 features) |
| classification_model.pkl | Gradient Boosting classification pipeline (3 classes) |

Assignment #2 β€” Classification, Regression, Clustering, Evaluation | April 2026
