
Understanding Loan Risk with Data Science

By Krishna Singhal · March 2026

Every time a bank approves a loan, it's making a bet: "Will this person pay us back?" Get it right, and the bank earns interest income. Get it wrong, and the loan defaults — the bank loses the principal, legal costs pile up, and the borrower's credit score tanks. It's a lose-lose.

Traditionally, loan officers made these decisions using simple scorecards and gut instinct. Today, data science allows us to build credit risk models that analyze hundreds of borrower attributes simultaneously, producing probability scores that are far more accurate than human judgment alone.

The Business Problem

Let's frame this precisely. We want to build a binary classification model that answers one question for each application: will this borrower default (1) or repay the loan in full (0)?

The stakes are asymmetric. Approving a loan that defaults (false negative) costs the bank thousands. Rejecting a good borrower (false positive) only costs the bank a small amount of potential interest. This asymmetry is critical — it affects how we choose our evaluation metrics and decision thresholds.

The Data: What Banks Look At

A typical loan application dataset contains features across these categories:

Borrower Demographics: annual income, employment length, home ownership status

Credit History: credit score, past delinquencies, length of credit history, revolving credit utilization, recent credit inquiries

Loan Characteristics: loan amount, interest rate, term, purpose
Key metric — DTI Ratio: Debt-to-Income is calculated as (Monthly Debt Payments / Monthly Gross Income) × 100. A DTI above 40% is generally considered high risk. This single feature is often the most predictive variable in loan default models.
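A quick sanity check of the formula in code (the borrower figures below are purely illustrative):

```python
def dti_ratio(monthly_debt_payments: float, monthly_gross_income: float) -> float:
    """Debt-to-Income ratio as a percentage."""
    return monthly_debt_payments / monthly_gross_income * 100

# A borrower paying $2,100/month in debt on $5,000/month gross income:
dti = dti_ratio(2100, 5000)
print(f'DTI = {dti:.1f}%')  # 42.0% — just above the ~40% high-risk line
```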

The ML Pipeline: Step by Step

1. Data Cleaning: handle missing values (median imputation for numerical features, mode for categorical), remove duplicates, fix data types. Expect 5-15% of real loan data to have missing fields.
2. Exploratory Data Analysis: visualize default rates across income brackets, credit scores, and DTI ranges. Look for non-linear relationships — default rates often spike sharply above certain thresholds rather than increasing gradually.
3. Feature Engineering: create new features such as income-to-loan ratio, credit utilization bins, and interaction terms (e.g., high DTI + short credit history). Encode categorical variables using one-hot or target encoding.
4. Handling Class Imbalance: loan datasets are heavily imbalanced (typically 85-95% non-default). Apply SMOTE (Synthetic Minority Oversampling Technique) or adjust class weights. Without this step, your model will just predict "non-default" for everything and claim 90% accuracy.
5. Model Training: train multiple models — Logistic Regression (baseline), Random Forest, XGBoost, LightGBM — and use stratified K-fold cross-validation so each fold has the same default ratio.
6. Evaluation & Threshold Tuning: use AUC-ROC (not accuracy!) as the primary metric, then tune the decision threshold using precision-recall curves based on business requirements.
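The feature engineering step can be sketched with pandas. The column names here (annual_income, loan_amount, home_ownership) are hypothetical stand-ins for whatever your dataset actually contains:

```python
import pandas as pd

df = pd.DataFrame({
    'annual_income':  [60000, 45000, 120000],
    'loan_amount':    [15000, 20000, 30000],
    'home_ownership': ['RENT', 'OWN', 'MORTGAGE'],
})

# Ratio feature: how large is the income relative to the loan?
df['income_to_loan'] = df['annual_income'] / df['loan_amount']

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=['home_ownership'], prefix='home')

print(df.columns.tolist())
```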

Why Accuracy is the Wrong Metric

This is the most common mistake in credit risk modeling. Consider a dataset where only 8% of loans default:

Dumb Model: Always predict "non-default"
Accuracy: 92% ✔ (looks great!)
Defaults caught: 0% ✘ (completely useless)

Smart Model: XGBoost with threshold tuning
Accuracy: 84% (looks worse)
Defaults caught: 78% ✔ (saves the bank millions)
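The accuracy trap is easy to reproduce on a toy dataset with an 8% default rate:

```python
import numpy as np

# 100 loans, 8 of which default (label 1)
y_true = np.array([1] * 8 + [0] * 92)

# "Dumb" model: always predict non-default
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall = (y_pred[y_true == 1] == 1).mean()  # share of defaults caught

print(f'Accuracy: {accuracy:.0%}')  # 92% — looks great
print(f'Recall:   {recall:.0%}')    # 0% — catches no defaults
```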

The right metrics for credit risk are:

AUC-ROC — the model's ability to distinguish defaulters from non-defaulters across all thresholds. A single number summarizing overall model quality (0.85+ is good).
Recall (Sensitivity) — the % of actual defaults correctly identified. Missing a default is expensive, so high recall is critical.
Precision — the % of predicted defaults that actually defaulted. Rejecting too many good borrowers hurts business growth.
F1-Score — the harmonic mean of precision and recall. Balances the trade-off when you can't optimize both.
KS Statistic — the maximum separation between the cumulative score distributions of defaults and non-defaults. An industry-standard metric used by banks (0.4+ is strong).
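The KS statistic is straightforward to compute from predicted scores: it is the maximum gap between the true positive rate and false positive rate across all score thresholds. A minimal NumPy sketch:

```python
import numpy as np

def ks_statistic(y_true, scores):
    """Max separation between TPR and FPR across all score thresholds."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    thresholds = np.unique(scores)
    tpr = np.array([(scores[y_true == 1] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[y_true == 0] >= t).mean() for t in thresholds])
    return float(np.max(tpr - fpr))

# Perfectly separated toy scores give the maximum KS of 1.0
print(ks_statistic([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
```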

Feature Importance: What Predicts Default?

After training an XGBoost model on typical loan data, here are the features ranked by importance:

  1. Credit Score — Single most predictive feature. Borrowers below 650 have 3-5x higher default rates.
  2. Debt-to-Income Ratio — DTI above 40% is a strong default signal.
  3. Revolving Credit Utilization — Using >80% of available credit indicates financial stress.
  4. Number of Past Delinquencies — Past behavior is the best predictor of future behavior.
  5. Loan Amount / Annual Income Ratio — Loans exceeding 30% of annual income are risky.
  6. Length of Credit History — Shorter histories have less data, more uncertainty.
  7. Employment Length — Job stability correlates with repayment ability.
  8. Number of Credit Inquiries — Multiple recent inquiries suggest desperate borrowing.

Insight: The top 3 features alone account for ~60% of the model's predictive power. This aligns with traditional banking wisdom but with a crucial difference: the ML model captures non-linear interactions between these features that simple scorecards miss.

From Model to Business Decision

A trained model outputs a probability (e.g., 0.23 = 23% chance of default). The bank must then decide: at what probability threshold do we reject the loan?

The optimal threshold depends on the cost ratio: how much a default costs vs. how much a rejected good loan costs in lost interest. Banks typically run simulations across different thresholds and choose the one that maximizes expected profit.
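A simple version of that simulation: scan candidate thresholds and, for each, add up the cost of missed defaults and the lost interest from rejected good loans. The cost figures here ($10,000 per default, $800 per rejected good loan) are purely illustrative assumptions:

```python
import numpy as np

def best_threshold(y_true, p_default, cost_default=10_000, cost_rejection=800):
    """Pick the rejection threshold that minimizes total expected cost."""
    y_true = np.asarray(y_true)
    p_default = np.asarray(p_default)
    best_t, best_cost = None, float('inf')
    for t in np.arange(0.05, 1.0, 0.05):
        approved = p_default < t
        missed_defaults = (approved & (y_true == 1)).sum()   # false negatives
        rejected_good = (~approved & (y_true == 0)).sum()    # false positives
        cost = missed_defaults * cost_default + rejected_good * cost_rejection
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy example: 4 good loans, 2 defaults, well-separated probabilities
t, cost = best_threshold([0, 0, 0, 0, 1, 1], [0.05, 0.1, 0.2, 0.3, 0.6, 0.9])
print(f'Best threshold: {t:.2f}, expected cost: {cost}')
```

In practice the scan runs over held-out validation predictions, and the two cost parameters come from the bank's own loss-given-default and pricing data.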

Practical Code Outline

import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

# Load and prepare data (categorical columns must be encoded first)
df = pd.read_csv('loan_data.csv')
X = df.drop('default', axis=1)
y = df['default']

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X, y):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    # Resample only the training fold: applying SMOTE before the split
    # would leak synthetic samples into validation and inflate the AUC
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    # SMOTE already balances the classes, so scale_pos_weight is not
    # needed here; use one imbalance correction or the other, not both
    model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05)
    model.fit(X_res, y_res)

    preds = model.predict_proba(X_val)[:, 1]
    print(f'Fold AUC: {roc_auc_score(y_val, preds):.4f}')

Ethical Considerations

Credit risk models must be built responsibly:

Fairness: protected attributes (race, gender, religion) must never be used as features, and models should be audited for proxy variables (such as ZIP code) that encode them indirectly.
Explainability: regulators and rejected applicants are entitled to know why a loan was denied, so individual predictions must be interpretable.
Monitoring: borrower behavior and economic conditions drift over time, so deployed models need regular revalidation against fresh outcomes.

Key Takeaways

Building a loan risk model is one of the most rewarding data science projects because it sits at the intersection of statistics, business strategy, and real-world impact. A well-built model doesn't just predict defaults — it helps banks serve more customers responsibly, price risk accurately, and allocate capital efficiently.

If you're getting started, I recommend the Lending Club dataset on Kaggle — it has 2M+ loan records with rich feature sets and is a popular benchmark for practicing credit risk modeling.

Tags: Data Science · Credit Risk · XGBoost · Python · SMOTE · Finance · Classification