Model Card for Telecom Churn Prediction Model
This model predicts whether high-value telecom customers are likely to churn (cancel their service) based on their usage patterns over a 3-month period.
Model Details
Model Description
This is an ensemble of scikit-learn classifiers trained to predict customer churn in the telecom industry. The model focuses specifically on high-value customers (those at or above the 70th percentile of recharge amount, i.e. roughly the top 30%) and uses behavioral data from months 6, 7, and 8 to predict churn in month 9.
- Developed by: MLOps Pipeline Team
- Shared by: Sam2120
- Model type: Binary classification ensemble (Logistic Regression, Random Forest, Gradient Boosting)
- Language(s) (NLP): en
- License: mit
- Finetuned from model: Not applicable - trained from scratch
Model Sources
- Repository: https://github.com/Sam2120/telecom-churn-mlops
- Paper: Not applicable
- Demo: Hugging Face Spaces integration available
Uses
Direct Use
This model is designed for:
- Customer retention teams to identify at-risk high-value customers
- Marketing teams to target retention campaigns
- Business analysts to understand churn patterns and drivers
- Real-time churn risk scoring via the provided FastAPI endpoint
The model accepts tabular customer data with features like call usage, recharge history, and internet usage, and outputs a churn probability (0-1) and binary prediction.
Downstream Use
- Integration with CRM systems for automated retention workflows
- Customer lifetime value (CLV) modeling by combining churn predictions with revenue data
- Campaign optimization to prioritize high-risk, high-value customers
- Fraud detection for identifying unusual usage patterns
Out-of-Scope Use
This model should NOT be used for:
- Predicting churn for low-value customers (model trained only on high-value segment)
- Long-term churn prediction beyond the 1-month horizon it was trained for
- Different telecom markets without retraining (model behavior may not generalize)
- Individual customer harassment or punitive actions based solely on model predictions
- Credit scoring or loan decisions (not designed for financial risk assessment)
Bias, Risks, and Limitations
Technical Limitations
- Class imbalance: The dataset has a natural churn rate of ~8-15%, which creates challenges in recall optimization
- Temporal leakage risk: Features must be strictly from the "good phase" (months 6-8) to avoid data leakage
- Feature drift: Telecom usage patterns change over time; model performance degrades without periodic retraining
- Missing value sensitivity: Model requires complete feature sets; imputation strategies may introduce bias
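The temporal-leakage constraint above can be checked mechanically before training. The following is a minimal sketch, assuming feature columns carry a month suffix such as `_m6`, as in the Quick Start example. The helper name `find_leaky_columns` and the sample column names are hypothetical, and derived features without a month suffix are not inspected by this check.

```python
import re

# Months allowed as model inputs (the "good phase"); month 9 defines the label.
ALLOWED_MONTHS = {6, 7, 8}

# Assumed naming convention: a column ends in "_m<month>", e.g. "total_calls_m6".
MONTH_SUFFIX = re.compile(r"_m(\d+)$")


def find_leaky_columns(columns):
    """Return columns whose month suffix falls outside the allowed months."""
    leaky = []
    for name in columns:
        match = MONTH_SUFFIX.search(name)
        if match and int(match.group(1)) not in ALLOWED_MONTHS:
            leaky.append(name)
    return leaky


# Hypothetical column names for illustration only.
candidate_columns = ["total_calls_m6", "total_calls_m8", "data_mb_m9", "avg_recharge"]
leaky_columns = find_leaky_columns(candidate_columns)
print(leaky_columns)
```

A check like this could run as part of data validation so that a month-9 column never reaches the feature matrix unnoticed.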
Sociotechnical Considerations
- Demographic bias: If training data contains demographic information correlated with protected attributes, the model may inadvertently learn biased patterns
- Surveillance concerns: Predictive models for customer behavior raise privacy considerations
- Feedback loops: Aggressive retention campaigns targeting predicted churners may create self-fulfilling prophecies
Recommendations
- Regular model retraining (monthly/quarterly) to account for changing customer behaviors
- A/B testing of retention interventions to measure true causal impact
- Monitoring for demographic disparities in predictions across customer segments
- Combining model predictions with human judgment for final retention decisions
- Transparent communication with customers about data usage for service improvement
How to Get Started with the Model
Installation
pip install -r requirements.txt
Quick Start
import pickle

import pandas as pd

# Load the trained model
with open("models/churn_model.pkl", "rb") as f:
    model = pickle.load(f)

# Prepare features (see src/feature_engineering.py for expected format)
features = pd.DataFrame({
    "total_calls_m6": [100],
    "total_calls_m7": [95],
    "total_calls_m8": [20],
    # ... other features
})

# Predict the binary label and the churn probability (positive class)
prediction = model.predict(features)
probability = model.predict_proba(features)[:, 1]
print(f"Churn Risk: {probability[0]:.2%}")
API Usage
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"features": {"total_calls_m6": 100, ...}}'
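The same request can be built from Python with only the standard library. This is a sketch under assumptions: the URL and the `{"features": {...}}` payload shape are taken from the curl example, the helper name `build_predict_request` is hypothetical, and the three features shown are likely only a subset of what the endpoint expects. The response schema is not documented here, so the sending step is left commented out and treats the body as generic JSON.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "http://localhost:8000/predict"


def build_predict_request(features: dict) -> urllib.request.Request:
    """Build a POST request carrying the features as a JSON body."""
    payload = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative subset of features; the real endpoint may require more.
request = build_predict_request(
    {"total_calls_m6": 100, "total_calls_m7": 95, "total_calls_m8": 20}
)
print(request.get_method(), request.full_url)

# Sending requires the FastAPI server to be running:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(json.loads(response.read()))
```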
Training Details
Training Data
Data Source: Telecom customer usage data spanning 4 months (months 6-9)
Dataset Characteristics:
- Format: Tabular CSV with wide format (one row per customer, columns per month)
- Size: ~30,000 high-value customers (at or above the 70th percentile of recharge amount, i.e. roughly the top 30%)
- Features: 30+ features including:
  - Usage features: Total calls, duration, incoming/outgoing minutes (months 6-8)
  - Recharge features: Recharge amounts and counts (months 6-8)
  - Internet features: Data usage and sessions (months 6-8)
  - Derived features: Average recharge, usage trends, month-over-month changes
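The derived features listed above can be illustrated with a short pandas sketch. It assumes the `total_calls_m6` to `total_calls_m8` columns from the Quick Start example. The output column names and the exact formulas are hypothetical and may differ from what `src/feature_engineering.py` actually computes.

```python
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add average, month-over-month change, and trend features for total calls."""
    out = df.copy()
    months = ["total_calls_m6", "total_calls_m7", "total_calls_m8"]

    # Average over the three good-phase months.
    out["total_calls_avg"] = out[months].mean(axis=1)

    # Month-over-month change between months 7 and 8.
    out["total_calls_change_m7_m8"] = out["total_calls_m8"] - out["total_calls_m7"]

    # Month 8 relative to the month 6-7 average; the average is clipped
    # at 1 to avoid division by zero for inactive customers.
    baseline = out[["total_calls_m6", "total_calls_m7"]].mean(axis=1).clip(lower=1)
    out["total_calls_m8_vs_baseline"] = out["total_calls_m8"] / baseline
    return out


example = pd.DataFrame(
    {"total_calls_m6": [100], "total_calls_m7": [95], "total_calls_m8": [20]}
)
derived = add_derived_features(example)
print(derived.iloc[0].to_dict())
```

For the Quick Start customer, calls fall by 75 between months 7 and 8, which is the kind of steep decline the model is meant to pick up.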
Target Definition:
- Churn = 1 if customer has zero incoming calls AND zero outgoing calls AND zero mobile internet usage in month 9
- Churn = 0 otherwise
- Typical churn rate: ~8-15% among high-value customers
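The target definition translates directly into a row-wise check. The sketch below assumes three month-9 columns with hypothetical names; the real schema may use different names or several columns per usage type.

```python
import pandas as pd

# Hypothetical month-9 column names; the real schema may differ.
MONTH9_USAGE_COLUMNS = ["incoming_minutes_m9", "outgoing_minutes_m9", "data_mb_m9"]


def label_churn(df: pd.DataFrame) -> pd.Series:
    """Return 1 where every month-9 usage column is zero, else 0."""
    no_usage = (df[MONTH9_USAGE_COLUMNS] == 0).all(axis=1)
    return no_usage.astype(int)


customers = pd.DataFrame({
    "incoming_minutes_m9": [0.0, 12.5, 0.0],
    "outgoing_minutes_m9": [0.0, 0.0, 0.0],
    "data_mb_m9": [0.0, 0.0, 30.0],
})
customers["churn"] = label_churn(customers)
print(customers["churn"].tolist())
```

Note that a missing value compares unequal to zero here, so a customer with a null usage field is labeled as not churned. Whether that is the desired behavior depends on the null-handling step. The month-9 columns should be dropped once the label is computed so they cannot leak into the features.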
Preprocessing Steps:
- Data validation (column checks, null handling)
- High-value customer identification (70th percentile threshold)
- Churn label definition based on month 9 usage
- Feature engineering (usage ratios, trends, averages)
- Train/test split: 80/20 with stratification by churn label
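The high-value filter and the stratified split from the steps above can be sketched as follows. The data is synthetic and the column names `avg_recharge_amount` and `churn` are assumptions. The `random_state` values are added for reproducibility and are not taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data; column names are hypothetical.
customers = pd.DataFrame({
    "avg_recharge_amount": rng.gamma(shape=2.0, scale=200.0, size=1000),
    "churn": rng.binomial(1, 0.1, size=1000),
})

# Keep customers at or above the 70th percentile of average recharge amount.
threshold = customers["avg_recharge_amount"].quantile(0.70)
high_value = customers[customers["avg_recharge_amount"] >= threshold]

# 80/20 split, stratified so both parts keep a similar churn rate.
train_df, test_df = train_test_split(
    high_value, test_size=0.20, stratify=high_value["churn"], random_state=42
)
print(len(high_value), len(train_df), len(test_df))
print(train_df["churn"].mean(), test_df["churn"].mean())
```

Stratifying on the label matters here because churners are a small minority; an unstratified split could leave the test set with too few positives for stable metrics.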
Training Procedure
Preprocessing
- Feature Engineering: src/feature_engineering.py creates rolling averages, month-over-month ratios, and interaction features
- SMOTE: Applied for class balancing (sampling_strategy=0.5)
- PCA: Optional dimensionality reduction (95% variance threshold)
- Scaling: StandardScaler for Logistic Regression (embedded in pipeline)
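The scaling and PCA steps can be embedded in a single scikit-learn pipeline, which keeps them fitted on training data only. The sketch below uses synthetic data and is not the project's actual pipeline. The SMOTE step is left out to keep the example to scikit-learn alone; if it is used, it needs an imbalanced-learn pipeline so that resampling happens only during fitting.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the engineered features.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=8, weights=[0.9, 0.1], random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # A float n_components keeps enough components to explain 95% of the variance.
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    # The L2 penalty and the lbfgs solver are scikit-learn defaults.
    ("clf", LogisticRegression(C=1.0, max_iter=1000, class_weight="balanced")),
])
pipeline.fit(X, y)

n_components = pipeline.named_steps["pca"].n_components_
churn_probability = pipeline.predict_proba(X)[:, 1]
print(n_components, churn_probability[:3])
```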
Training Hyperparameters
Logistic Regression:
- C: 1.0
- Penalty: l2
- Solver: lbfgs
- Max iterations: 1000
- Class weight: balanced
Random Forest:
- N estimators: 100
- Max depth: 10
- Min samples split: 5
- Min samples leaf: 2
- Class weight: balanced
Gradient Boosting:
- N estimators: 100
- Max depth: 5
- Learning rate: 0.1
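For reference, the hyperparameters above map onto scikit-learn constructors roughly as follows. The dictionary keys and the `random_state` values are assumptions added for illustration; they are not listed in the configuration above.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# random_state is an assumption added for reproducibility.
models = {
    # The L2 penalty is scikit-learn's default for LogisticRegression.
    "logistic_regression": LogisticRegression(
        C=1.0, solver="lbfgs", max_iter=1000, class_weight="balanced"
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight="balanced",
        random_state=42,
    ),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42
    ),
}

for name, model in models.items():
    print(name, type(model).__name__)
```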
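Training regime: Standard scikit-learn CPU arithmetic (float64 by default for Logistic Regression; tree-based models cast inputs to float32 internally); no mixed-precision training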
Speeds, Sizes, Times
- Training time: ~2-5 minutes on standard CPU
- Inference time: <10ms per prediction
- Model size: ~5-15 MB per model (pickled)
- Memory usage: ~500MB during training (30K samples)
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Test set: 20% stratified holdout from original data
- Size: ~6,000 samples (typical)
- Churn rate: Matched to training distribution (~8-15%)
Factors
Evaluation is performed across:
- Overall performance (all high-value customers)
- Cross-validation (5-fold stratified) for robustness
- Individual model comparison (Logistic Regression vs Random Forest vs Gradient Boosting)
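The cross-validated model comparison can be sketched with scikit-learn's `StratifiedKFold` and `cross_val_score`. The data below is synthetic, the candidate settings are reduced for speed, and the selection rule (highest mean AUC) is an assumption about how the comparison is turned into a choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the high-value customer set.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, weights=[0.9, 0.1], random_state=0
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, class_weight="balanced")
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=10, class_weight="balanced", random_state=42
    ),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the churn rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mean_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

best_model_name = max(mean_auc, key=mean_auc.get)
print(mean_auc, best_model_name)
```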
Metrics
- AUC-ROC: Primary metric for ranking predictions
- F1-Score: Balance between precision and recall
- Precision: Minimize false alarms (don't waste retention budget)
- Recall: Capture actual churners (don't miss at-risk customers)
- Average Precision: Summarizes the precision-recall curve; more informative than AUC-ROC when churners are a small minority
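All five metrics are available in `sklearn.metrics`. The sketch below computes them on a small hand-made example. The labels, scores, and the 0.5 decision threshold are illustrative assumptions, not results from this model.

```python
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Illustrative labels and predicted churn probabilities (not real results).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
churn_probability = [0.05, 0.10, 0.20, 0.30, 0.60, 0.15, 0.80, 0.40, 0.90, 0.25]

# Assumed decision threshold for the binary prediction.
threshold = 0.5
y_pred = [int(p >= threshold) for p in churn_probability]

metrics = {
    # Ranking metrics use the probabilities directly.
    "auc_roc": roc_auc_score(y_true, churn_probability),
    "average_precision": average_precision_score(y_true, churn_probability),
    # Threshold metrics use the binary predictions.
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Lowering the threshold trades precision for recall, so the threshold is worth choosing against the cost of a retention offer versus the cost of a missed churner.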
Results
Performance varies by model architecture. The ensemble selects the best model based on validation AUC.
Typical Performance (Random Forest - usually best):
- AUC: ~0.85-0.92
- F1: ~0.70-0.80
- Precision: ~0.65-0.75
- Recall: ~0.75-0.85
Logistic Regression (most interpretable):
- AUC: ~0.80-0.88
- F1: ~0.65-0.75
Gradient Boosting:
- AUC: ~0.83-0.90
- F1: ~0.68-0.78
Summary
The best-performing model typically reaches an AUC of roughly 0.85-0.92, which indicates strong discriminative performance for ranking customers by churn risk. Random Forest typically performs best, while Logistic Regression (AUC roughly 0.80-0.88) provides the most interpretable results for business stakeholders.
Model Examination
Feature Importance
Top predictive features typically include:
- Month-over-month usage decline (steep drops in calls/data)
- Average recharge amount (lower recharge = higher risk)
- Recent usage patterns (month 8 vs month 6-7 averages)
- Data usage consistency (frequent data users who stop are high risk)
Interpretability
- Logistic Regression: Coefficients indicate feature direction and magnitude
- Random Forest: Feature importance scores available via sklearn
- SHAP/Partial Dependence: Can be computed post-hoc for individual explanations
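The first two interpretability routes can be sketched with scikit-learn's `coef_` and `feature_importances_` attributes. The data is synthetic and the feature names are placeholders, so the ranking shown says nothing about the real model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Coefficients are comparable across features only after scaling.
X_scaled = StandardScaler().fit_transform(X)
log_reg = LogisticRegression(max_iter=1000).fit(X_scaled, y)
coefficients = pd.Series(log_reg.coef_[0], index=feature_names)

# Impurity-based importances from the fitted forest; they sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(coefficients.round(3).to_dict())
print(importances.round(3).to_dict())
```

The sign of a coefficient gives the direction of the effect on churn risk, while the forest importances give magnitude only.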
Environmental Impact
Carbon emissions are minimal given the small dataset and efficient scikit-learn algorithms.
- Hardware Type: Standard CPU (no GPU required)
- Hours used: <0.1 hours per training run
- Cloud Provider: Local/AWS (configurable)
- Compute Region: us-east-1 (configurable)
- Carbon Emitted: <10g CO2eq per training run (estimated)
Technical Specifications
Model Architecture and Objective
Objective: Minimize binary cross-entropy loss for churn prediction
Architecture: Ensemble of three scikit-learn classifiers:
- Logistic Regression (linear baseline)
- Random Forest (tree-based, handles non-linearities)
- Gradient Boosting (sequential trees, high accuracy)
Pipeline:
Input Features → [Optional SMOTE] → Classifier → Churn Probability
Compute Infrastructure
Hardware
- Minimum: 2 CPU cores, 4GB RAM
- Recommended: 4 CPU cores, 8GB RAM
- GPU: Not required (CPU-only training)
Software
- Python: 3.10+
- Key dependencies:
  - scikit-learn 1.3+
  - pandas 2.0+
  - numpy 1.24+
  - imbalanced-learn (SMOTE)
  - mlflow (experiment tracking)
  - fastapi (serving)
Citation
BibTeX:
@software{telecom_churn_mlops,
  title = {Telecom Churn Prediction MLOps Pipeline},
  author = {Sam2120},
  year = {2024},
  url = {https://github.com/Sam2120/telecom-churn-mlops}
}
APA: Sam2120. (2024). Telecom Churn Prediction MLOps Pipeline [Software]. GitHub. https://github.com/Sam2120/telecom-churn-mlops
Glossary
- Churn: Customer cancellation of telecom service
- High-Value Customer: At or above the 70th percentile of average recharge amount in good phase months (roughly the top 30%)
- Good Phase: Months 6, 7, 8 (stable period before prediction)
- Action Month: Month 8 (last month with full data before prediction)
- SMOTE: Synthetic Minority Over-sampling Technique (for class balancing)
- AUC: Area Under the ROC Curve (discrimination metric)
- ARPU: Average Revenue Per User
- MOU: Minutes of Usage
More Information
- Full documentation: See docs/architecture.md
- API documentation: Available at /docs when running the FastAPI server
- MLflow UI: Run mlflow ui to see experiment tracking
- DVC pipeline: Run dvc repro to reproduce the full pipeline
Model Card Authors
- MLOps Pipeline Team
- Sam2120
Model Card Contact
For questions or issues, please open an issue on the GitHub repository: https://github.com/Sam2120/telecom-churn-mlops/issues