
How to: Monitor a credit risk model

Use this guide when: you have a model in production and want to check whether the new population looks different from the training population, and whether the model is now predicting higher risk.

What you'll do:

  • Check whether deployment data looks different from training data
  • Find which features drive that difference
  • Check whether predicted default risk is higher in deployment
  • Use both results to decide what action to take

Before you start

This guide assumes you have completed both tutorials:


The scenario

You have trained a credit risk model to predict loan default. Your training data came from low-risk customers (good credit history). The model is now deployed and scoring a different population — higher-risk customers.

Two questions arise:

  1. Are the feature distributions different? If the new customers look nothing like the training data, the model may not be reliable on them.
  2. Has predicted risk shifted adversely? Even if features differ, the model might still generalise. The more relevant question is whether it is now assigning higher default risk to the deployment population.

We answer both questions with test_shift(...) (question 1) and test_adverse_shift(...) (question 2). If you want to monitor model confidence instead of predicted risk, continue to Monitor model confidence after completing this guide.
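If you only want to see the shape of those two calls before the full walkthrough, here is a minimal sketch on synthetic score arrays. It assumes the same signatures used later in this guide (source and target arrays of scores, plus a direction argument for the adverse test); the numbers are made up:

import numpy as np
from samesame import test_adverse_shift, test_shift

rng = np.random.default_rng(0)
scores_source = rng.normal(0.20, 0.05, size=1000)  # scores on the reference population
scores_target = rng.normal(0.30, 0.05, size=500)   # scores on the new population

shift = test_shift(source=scores_source, target=scores_target)
harm = test_adverse_shift(source=scores_source, target=scores_target, direction="higher-is-worse")
print(shift.statistic, shift.pvalue)
print(harm.statistic, harm.pvalue)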


Setup

We use the HELOC dataset (FICO Explainable AI Challenge), which contains credit bureau features for real customers. We simulate a production deployment scenario by splitting on ExternalRiskEstimate:

  • Training set (ExternalRiskEstimate > 63): 7,683 low-risk customers
  • Deployment set (ExternalRiskEstimate ≤ 63): 2,188 high-risk customers
import re
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from samesame import test_adverse_shift, test_shift

# Download the HELOC dataset (requires internet access on first run)
fico = fetch_openml(data_id=45554, as_frame=True)
X, y = fico.data, fico.target

# Split into training (low-risk) and deployment (high-risk) populations
re_obj = re.compile(r"external.*risk.*estimate", flags=re.I)
col_split = next((c for c in X.columns if re_obj.search(c)), None)
mask_train = X[col_split].astype(float) > 63  # higher ExternalRiskEstimate means lower risk

X_train = X[mask_train].reset_index(drop=True)
y_train = y[mask_train].reset_index(drop=True)
X_test  = X[~mask_train].reset_index(drop=True)
y_test  = y[~mask_train].reset_index(drop=True)

print(f"Training set:    {len(X_train)} samples")
print(f"Deployment set:  {len(X_test)} samples")

Output:

Training set:    7683 samples
Deployment set:  2188 samples

Step 1 — Detect dataset shift

Question: Are the feature distributions of the training and deployment sets different?

This is a classifier two-sample test: a Random Forest is trained to distinguish training from deployment samples, and AUC measures how easily they can be separated. OOB predictions ensure each row is scored by trees that did not train on it, preventing optimistic bias:

# Label the two populations: 0 = training, 1 = deployment
split = pd.Series([0] * len(X_train) + [1] * len(X_test))
X_concat = pd.concat([X_train, X_test], ignore_index=True)

# Train a classifier to distinguish training from deployment
rf_domain = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,
    random_state=12345,
    min_samples_leaf=10,
)
rf_domain.fit(X_concat, split)
oob_scores = rf_domain.oob_decision_function_[:, 1]  # P(deployment)

# Run the shift test
shift = test_shift(
    source=oob_scores[split.values == 0],
    target=oob_scores[split.values == 1],
)
print(f"AUC statistic: {shift.statistic:.4f}")
print(f"p-value:       {shift.pvalue:.4f}")

Output:

AUC statistic: 1.0000
p-value:       0.0002

AUC is a separation measure: an AUC of 1.0 means the classifier perfectly separates the two populations. The p-value of 0.0002 provides strong evidence against the null hypothesis of no shift.
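If you want to sanity-check the statistic, the same AUC can be computed directly from the OOB scores with scikit-learn. This assumes test_shift reports the AUC of the scores you pass in, which is how the statistic is described above:

from sklearn.metrics import roc_auc_score

# AUC of the domain classifier: how well OOB scores separate deployment (1) from training (0)
auc_direct = roc_auc_score(split, oob_scores)
print(f"Direct OOB AUC: {auc_direct:.4f}")  # should match shift.statistic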

Which features are driving the shift?

Feature importances from the same classifier tell you which features differ most between the two populations:

feat_imp = (
    pd.Series(rf_domain.feature_importances_, index=X_concat.columns)
    .sort_values(ascending=False)
)
print("Top 5 features driving the shift:")
print(feat_imp.head(5))

Output:

Top 5 features driving the shift:
ExternalRiskEstimate          0.642400
MSinceMostRecentDelq          0.069394
MaxDelq2PublicRecLast12M      0.064526
NetFractionRevolvingBurden    0.050656
PercentTradesNeverDelq        0.042478

ExternalRiskEstimate dominates because it was used to create the split — that is expected. Several other features (MSinceMostRecentDelq, MaxDelq2PublicRecLast12M) also differ between the groups, which suggests correlated structure beyond the variable used to define the split.
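To see how a flagged feature actually differs, a quick descriptive comparison of the top features in both populations is often enough. This is a supplementary check, not part of the test itself:

# Compare population means for the most important features
top_features = feat_imp.head(3).index
drift_summary = pd.DataFrame({
    "train_mean":  X_train[top_features].mean(),
    "deploy_mean": X_test[top_features].mean(),
})
print(drift_summary)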


Step 2 — Test for adverse risk shift

Question: Is the model assigning systematically higher default risk to the deployment population?

Even though the feature distributions are different, the model might still generalise. We now check whether the model's predicted default probabilities are higher (worse) for deployment samples than for training samples.

We train a credit risk model on the training set and compare its predictions on both populations. For the training set, we again use out-of-bag predictions so its scores are not optimistically inflated:

# Train a credit risk model to predict loan default
loan_status = y_train.map({'Good': 0, 'Bad': 1}).values
rf_bad = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,
    random_state=12345,
    min_samples_leaf=10,
)
rf_bad.fit(X_train, loan_status)

# OOB predictions for training (unbiased); standard predictions for deployment
bad_train = rf_bad.oob_decision_function_[:, 1].ravel()
bad_test  = rf_bad.predict_proba(X_test)[:, 1].ravel()

harm = test_adverse_shift(
    source=bad_train,
    target=bad_test,
    direction="higher-is-worse",
)
print(f"Statistic: {harm.statistic:.4f}")
print(f"p-value:   {harm.pvalue:.4f}")

Output:

Statistic: 0.2483
p-value:   0.0001

A higher statistic means that more of the largest values are concentrated in the deployment set.

p = 0.0001 — strong evidence of adverse shift. The model is assigning substantially higher default risk to deployment samples. This confirms not only that the data is different, but that the shift is adverse with respect to the score being monitored.
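A quick descriptive look at the two score distributions makes the size of the shift concrete. Again, this is a supplement to the test, not a replacement for it:

import numpy as np

# Compare quantiles of predicted default probability in each population
quantiles = [0.25, 0.50, 0.75, 0.90]
risk_summary = pd.DataFrame({
    "training":   np.quantile(bad_train, quantiles),
    "deployment": np.quantile(bad_test, quantiles),
}, index=[f"q{int(q * 100)}" for q in quantiles])
print(risk_summary)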

This is a good example of when the model output itself is already meaningful. A higher predicted default probability is directly interpretable as higher business risk, so it is a natural score to monitor. When a model output is not directly interpretable as "worse", you need a different score, such as a confidence score. See Monitor model confidence.

The important limitation is the reverse: a confidence score is not a substitute for business impact. A model can become more confident in its predictions while those predictions become more harmful to the business. When you already have a value with direct business meaning, such as default probability, that value should remain the primary monitoring signal.


Step 3 — Interpret the combined results

Running both tests together gives a richer picture than either test alone:

  • Both shift and adverse-shift significant: data and outcomes have shifted. Retrain or recalibrate the model.
  • Only shift significant: data looks different, but outcomes haven't shifted. Monitor closely.
  • Only adverse-shift significant: outcome shift without feature change (concept drift). Investigate root causes.
  • Neither significant: no evidence of a problem. Continue as normal.

In this example, both tests are significant — the deployment population is different and predicted risk is higher. The recommended action is to retrain or recalibrate the model for the new population.
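If you run these tests on a schedule, it can help to encode the decision logic above in a small helper. The function below is a hypothetical sketch, with a conventional 0.05 threshold that you should adjust to your own monitoring policy:

ALPHA = 0.05  # significance threshold; pick one that matches your monitoring policy

def recommend_action(p_shift: float, p_adverse: float, alpha: float = ALPHA) -> str:
    """Map the two p-values to the recommended actions listed above (illustrative only)."""
    shifted = p_shift < alpha
    adverse = p_adverse < alpha
    if shifted and adverse:
        return "Data and outcomes have shifted. Retrain or recalibrate the model."
    if shifted:
        return "Data looks different, but outcomes haven't shifted. Monitor closely."
    if adverse:
        return "Outcome shift without feature change (concept drift). Investigate root causes."
    return "No evidence of a problem. Continue as normal."

print(recommend_action(shift.pvalue, harm.pvalue))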


Summary

  • Shift testing detects whether feature distributions differ between training and deployment. Feature importances help identify which features are responsible.
  • Adverse-shift testing detects whether the model's predictions have shifted adversely. It does not require ground truth labels, making it practical for production monitoring before labels arrive.
  • Use both tests together for a complete picture: test_shift(...) tells you what changed, and test_adverse_shift(...) tells you whether it matters.
  • In this example, predicted risk increased, but in the companion how-to guide, model confidence did not worsen. Those are different signals and both are worth monitoring.
  • If labels are available for the test set, per-sample prediction errors (Brier score, log-loss) provide a direct measure of model accuracy; see Monitor prediction errors and the sketch after this list.
  • If your model output is not itself a meaningful risk value, use a confidence score instead; see Monitor model confidence.
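In this simulated scenario the deployment labels actually exist (we held them out when splitting the data), so a quick Brier-score comparison is possible. Treat this as a sketch of the fuller workflow covered in Monitor prediction errors:

# Per-sample squared error (Brier score) for each population
deploy_status = y_test.map({'Good': 0, 'Bad': 1}).values
brier_train  = (bad_train - loan_status) ** 2
brier_deploy = (bad_test - deploy_status) ** 2
print(f"Mean Brier score (training):   {brier_train.mean():.4f}")
print(f"Mean Brier score (deployment): {brier_deploy.mean():.4f}")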