How to: Monitor predicted credit risk

Use this guide when the model output already has clear business meaning and you want to know two things: whether deployment looks different from training, and whether predicted default risk is higher in deployment.

If you are new to samesame, work through the two tutorials first. This guide assumes basic familiarity with fitting a scikit-learn classifier and calling predict_proba(...).

Why this signal works well

Predicted default probability is already meaningful. Larger values are directly worse, so it is a natural signal for ss.shift.detect_harm(...).

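This guide uses the HELOC dataset and simulates deployment with a split on the external risk estimate column: customers scoring above 63 (lower risk) form the training set, and the remaining, higher-risk customers form the deployment set.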

Setup

import re

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

import samesame as ss

fico = fetch_openml(data_id=45554, as_frame=True)
X, y = fico.data, fico.target

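# Locate the external risk estimate column (case-insensitive match).
re_obj = re.compile(r"external.*risk.*estimate", flags=re.I)
col_split = next((c for c in X.columns if re_obj.search(c)), None)
if col_split is None:
    raise KeyError("No column matching 'external risk estimate' was found")
# Rows above the threshold are treated as the lower-risk (training) group.
mask_high = X[col_split].astype(float) > 63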

X_train = X[mask_high].reset_index(drop=True)
y_train = y[mask_high].reset_index(drop=True)
X_deployment = X[~mask_high].reset_index(drop=True)

print(f"Training set:   {len(X_train)} samples")
print(f"Deployment set: {len(X_deployment)} samples")

Step 1 - Check whether deployment looks different

Train a domain classifier to distinguish training rows from deployment rows. Use out-of-bag predictions so that each row in the combined dataset is scored only by trees that did not see it during fitting.

split = pd.Series([0] * len(X_train) + [1] * len(X_deployment))
X_concat = pd.concat([X_train, X_deployment], ignore_index=True)

rf_domain = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,
    random_state=12345,
    min_samples_leaf=10,
)
rf_domain.fit(X_concat, split)
domain_scores = rf_domain.oob_decision_function_[:, 1]

shift = ss.shift.detect_shift(
    source=domain_scores[split.values == 0],
    target=domain_scores[split.values == 1],
)

print(f"AUC statistic: {shift.statistic:.4f}")
print(f"p-value:       {shift.pvalue:.4f}")

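On this split, you should see an AUC close to 1.0 and a very small p-value, which means the deployment population looks clearly different from training. That is expected here: the column used to create the split is still among the features, so the domain classifier can separate the two groups almost perfectly.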
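To make the AUC reading concrete, the sketch below computes an AUC between two score samples as the probability that a randomly chosen target score exceeds a randomly chosen source score, with ties counted as half. It is a generic illustration of the statistic, not samesame's implementation; the function name auc_between and the example scores are hypothetical.

```python
from itertools import product


def auc_between(source, target):
    """Probability that a random target score exceeds a random source score.

    Ties count as half. This is the pairwise (Mann-Whitney) view of the AUC.
    """
    wins = 0.0
    for s, t in product(source, target):
        if t > s:
            wins += 1.0
        elif t == s:
            wins += 0.5
    return wins / (len(source) * len(target))


# Hypothetical scores, for illustration only.
overlapping = auc_between([0.2, 0.4, 0.6, 0.8], [0.2, 0.4, 0.6, 0.8])
separated = auc_between([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])
print(overlapping)  # 0.5: the two samples are indistinguishable
print(separated)    # 1.0: every target score exceeds every source score
```

An AUC near 0.5 means the domain classifier cannot tell the two populations apart, while a value near 1.0 means it separates them almost perfectly.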

If you want a quick diagnostic on what changed, inspect the same classifier's feature importances:

feature_importance = (
    pd.Series(rf_domain.feature_importances_, index=X_concat.columns)
    .sort_values(ascending=False)
)

print(feature_importance.head(5))

Step 2 - Check whether predicted risk moved up

Now train the actual credit model on the training set. Use out-of-bag predictions for training and standard predictions for deployment.

loan_status = y_train.map({"Good": 0, "Bad": 1}).values

rf_bad = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,
    random_state=12345,
    min_samples_leaf=10,
)
rf_bad.fit(X_train, loan_status)

train_risk = rf_bad.oob_decision_function_[:, 1].ravel()
deployment_risk = rf_bad.predict_proba(X_deployment)[:, 1].ravel()

harm = ss.shift.detect_harm(
    source=train_risk,
    target=deployment_risk,
    direction="higher-is-worse",
)

print(f"Statistic: {harm.statistic:.4f}")
print(f"p-value:   {harm.pvalue:.4f}")

On this split, you should again see a very small p-value. That means deployment is not only different, but also riskier according to the model.
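A small p-value says the move is in the harmful direction, but not how large it is. As a complementary check, a quick descriptive summary of the two risk samples can help. The sketch below uses only the standard library; the helper risk_summary and the example probabilities are hypothetical and not part of samesame.

```python
from statistics import mean, median


def risk_summary(source, target):
    """Compare the location of two samples of predicted default risk."""
    mean_source, mean_target = mean(source), mean(target)
    return {
        "mean_source": mean_source,
        "mean_target": mean_target,
        "median_source": median(source),
        "median_target": median(target),
        # Positive uplift means predicted risk is higher in the target sample.
        "mean_uplift": mean_target - mean_source,
    }


# Hypothetical predicted default probabilities, for illustration only.
summary = risk_summary([0.10, 0.20, 0.30], [0.40, 0.50, 0.90])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With the variables from this guide, one option is to pass plain lists, for example risk_summary(train_risk.tolist(), deployment_risk.tolist()).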

Step 3 - Decide what to do

Using both tests together gives a clearer picture than either test alone:

Here "small" and "large" refer to the p-value of each test.

| Result pattern (p-values) | What it usually means |
| --- | --- |
| shift small, harm small | the population changed and predicted risk worsened |
| shift small, harm large | the population changed, but not in a clearly harmful way |
| shift large, harm small | rare, but worth investigating as a direct outcome issue |
| shift large, harm large | no clear evidence of a problem |

In this HELOC example, both tests return very small p-values. That is a good signal to retrain, recalibrate, or otherwise revisit the deployment policy for the new population.
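The result patterns above can be sketched as a small helper that turns the two p-values into a reading. This is a minimal sketch: the function name interpret and the threshold alpha=0.05 are assumptions chosen for illustration, not samesame defaults.

```python
def interpret(shift_pvalue, harm_pvalue, alpha=0.05):
    """Map the two p-values to the reading suggested by the result patterns.

    alpha is an assumed significance threshold, not a samesame default.
    """
    shift_detected = shift_pvalue < alpha
    harm_detected = harm_pvalue < alpha
    if shift_detected and harm_detected:
        return "population changed and predicted risk worsened"
    if shift_detected:
        return "population changed, but not in a clearly harmful way"
    if harm_detected:
        return "rare; investigate as a direct outcome issue"
    return "no clear evidence of a problem"


# Hypothetical p-values, for illustration only.
print(interpret(0.001, 0.001))
print(interpret(0.001, 0.40))
```

With the objects from this guide, the call would be interpret(shift.pvalue, harm.pvalue). Pick the threshold to match your own tolerance for false alarms.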

If you want a separate confidence view, see Monitor model confidence. If labels are available, see Monitor prediction errors once labels arrive.