🏦 CreditRiskAI

End-to-end ML system to predict credit default probability — from raw data to a deployed REST API

Live Demo → | API Docs → | Notebook →

🎥 Demo Video

Demo.loan_prediction.mp4

Overview

Financial institutions lose billions annually to loan defaults. This project builds an end-to-end credit risk scoring system that estimates the probability a borrower will default, using gradient boosted trees trained on 32,000+ real loan records.

The system goes beyond a Jupyter notebook — it ships a FastAPI backend deployed on Railway and a Streamlit frontend on Streamlit Cloud, accepting live applicant data and returning a risk score in real time.

Dataset: Credit Risk Dataset – Kaggle | 32,581 records · 12 features · 21.8% default rate

Results

Model	Precision	Recall	F1 Score	ROC-AUC
Logistic Regression	0.73	0.56	0.64	0.867
Random Forest	0.91	0.72	0.82	0.933
XGBoost	0.93	0.72	0.83	0.942

XGBoost selected as final model. Optimised for recall — in credit risk, a missed default (false negative) is costlier than a false alarm.

System Architecture

┌─────────────────────────────────────────────────────────┐
│                     Streamlit Cloud                      │
│   User fills applicant form → POST /predict              │
└──────────────────────┬──────────────────────────────────┘
                       │  JSON payload (12 features)
                       ▼
┌─────────────────────────────────────────────────────────┐
│                  FastAPI  (Railway)                       │
│                                                          │
│   Raw Input → ColumnTransformer → XGBoost.predict_proba  │
│                                                          │
│   Returns: { risk_category, default_probability }        │
└─────────────────────────────────────────────────────────┘

ML Pipeline

1. Data Cleaning

Removed logical impossibilities: rows where person_emp_length > person_age
Dropped rows with simultaneous nulls in loan_int_rate and person_emp_length
Removed outlier: single row with loan_int_rate > 20 + missing employment (data error)

2. Feature Engineering

Feature	Type	Description
`person_emp_length_missing`	Binary flag	Missing employment length is itself predictive — borrowers with unknown employment default at higher rates
`loan_percent_income`	Ratio	`loan_amnt / person_income` — debt-to-income proxy

3. Imputation Strategy

person_emp_length nulls → filled with median
loan_int_rate nulls → filled with within-grade median (preserves grade-level interest rate signal)

4. Preprocessing

OneHotEncoding on 4 categorical columns: person_home_ownership, loan_intent, loan_grade, cb_person_default_on_file
StandardScaler applied only to Logistic Regression (tree models are scale-invariant)
Preprocessing saved as preprocessor.pkl via ColumnTransformer for consistent inference

5. Key EDA Findings

Loan grade is the strongest default predictor — Grade G default rate ~4× Grade A
Loan-to-income ratio separates defaulters cleanly (defaulters avg 0.38 vs 0.16)
RENT ownership defaults more than MORTGAGE or OWN
Debt consolidation and medical loan intents have highest default rates
Missing emp_length → higher default rate (informative missingness)

6. SHAP Explainability

Top features by SHAP importance:

loan_int_rate — higher rate = higher risk (also a proxy for perceived risk)
loan_percent_income — debt burden relative to income
loan_grade — lender's internal risk rating
person_home_ownership — financial stability signal
person_emp_length_missing — informative missingness flag

Project Workflow

flowchart TD
    A[Loan Application Data]
    --> B[Data Preprocessing]

    B --> C[Feature Engineering]

    C --> D[XGBoost Model]

    D --> E[Default Probability]

    E --> F[Risk Classification]

    F --> G[FastAPI Backend]

    G --> H[Streamlit Dashboard]

Project Structure

loan-default-prediction/
├── notebooks/
│   └── loan_default_prediction.ipynb   # EDA, feature engineering, training, SHAP
├── models/
│   ├── xgboost_model.pkl               # Trained XGBoost classifier
│   └── preprocessor.pkl                # Fitted ColumnTransformer
├── Fast_api_app.py                     # FastAPI prediction endpoint
├── streamlit_app.py                    # Streamlit frontend
└── requirements.txt

Project Features

XGBoost Machine Learning Model
Feature Engineering Pipeline
SHAP Explainable AI
FastAPI Backend
Streamlit Frontend
Probability-Based Risk Assessment
Real-Time Predictions
Railway Deployment
Production-Ready Inference Pipeline

API Reference

POST /predict — Returns default probability for a loan applicant

Sample Request:

{
  "person_age": 28,
  "person_income": 60000,
  "person_home_ownership": "RENT",
  "person_emp_length": 3.0,
  "loan_intent": "PERSONAL",
  "loan_grade": "B",
  "loan_amnt": 10000,
  "loan_int_rate": 12.5,
  "loan_percent_income": 0.17,
  "cb_person_default_on_file": "N",
  "cb_person_cred_hist_length": 4.0,
  "person_emp_length_missing": 0
}

Sample Response:

{
  "risk_category": "Low Risk",
  "default_probability": 0.1342
}

GET / — Health check

Tech Stack

Layer	Tool	Purpose
ML Model	XGBoost 3.2	Gradient boosted classifier
Preprocessing	scikit-learn ColumnTransformer	OHE + scaling pipeline
Explainability	SHAP TreeExplainer	Feature importance + waterfall plots
API	FastAPI + Uvicorn	REST inference endpoint
Frontend	Streamlit	Interactive prediction UI
API Deployment	Railway	Cloud hosting for FastAPI
UI Deployment	Streamlit Cloud	Cloud hosting for frontend
Serialization	joblib	Model + preprocessor persistence

Author

Vipul Singh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🏦 CreditRiskAI

End-to-end ML system to predict credit default probability — from raw data to a deployed REST API

🎥 Demo Video

Overview

Results

System Architecture

ML Pipeline

1. Data Cleaning

2. Feature Engineering

3. Imputation Strategy

4. Preprocessing

5. Key EDA Findings

6. SHAP Explainability

Project Workflow

Project Structure

Project Features

API Reference

Tech Stack

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
Dataset		Dataset
models		models
notebooks		notebooks
.gitignore		.gitignore
Fast_api_app.py		Fast_api_app.py
README.md		README.md
requirements.txt		requirements.txt
runtime.txt		runtime.txt
streamlit_app.py		streamlit_app.py

Folders and files

Latest commit

History

Repository files navigation

🏦 CreditRiskAI

End-to-end ML system to predict credit default probability — from raw data to a deployed REST API

🎥 Demo Video

Overview

Results

System Architecture

ML Pipeline

1. Data Cleaning

2. Feature Engineering

3. Imputation Strategy

4. Preprocessing

5. Key EDA Findings

6. SHAP Explainability

Project Workflow

Project Structure

Project Features

API Reference

Tech Stack

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages