
How to: Monitor prediction errors once labels arrive

Use this guide when you have ground-truth labels for both groups and want a direct answer about whether the model is performing worse.

When labels are available, prediction error is often the cleanest signal you can compare.

Why this signal works well

Prediction errors turn model quality into a numeric signal:

  • Brier score measures squared error on the predicted probability
  • Log-loss penalizes confident mistakes more heavily

For both, larger values mean worse predictions, so they map directly onto the direction="higher-is-worse" setting of ss.shift.detect_harm(...).
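To make the difference between the two signals concrete, here is a minimal sketch that scores three predictions for a single positive label. The helper names brier and log_loss are illustrative and are not part of samesame.

```python
import math


def brier(y, p):
    """Squared error between the label (0 or 1) and the predicted probability."""
    return (y - p) ** 2


def log_loss(y, p, eps=1e-10):
    """Negative log-likelihood of the label under the predicted probability."""
    p = min(max(p, eps), 1 - eps)  # keep the logarithm finite
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


y_true = 1
probs = [0.9, 0.5, 0.01]  # confident and right, unsure, confident and wrong

brier_values = [brier(y_true, p) for p in probs]
logloss_values = [log_loss(y_true, p) for p in probs]

for p, b, ll in zip(probs, brier_values, logloss_values):
    print(f"p={p:.2f}  brier={b:.4f}  log-loss={ll:.4f}")
```

The Brier score can never exceed 1, so the confident mistake scores about 0.98. Log-loss is unbounded and scores the same mistake at about 4.6, roughly seven times the value for the unsure prediction.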

Setup

This guide uses the HELOC dataset again, but with a stratified random split. Unlike the credit-risk guide, the two groups here come from the same overall population.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import samesame as ss

fico = fetch_openml(data_id=45554, as_frame=True)
X, y = fico.data, fico.target

# Encode the target so that 1 means "Bad" (the outcome the model predicts)
y_binary = (y == "Bad").astype(int).values

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y_binary,
    test_size=0.30,
    stratify=y_binary,
    random_state=12345,
)

Step 1 - Fit the model and get honest predictions

Use out-of-bag (OOB) predictions for the training set. Each training row is scored only by the trees that did not see it during fitting, so the training-side errors are not artificially optimistic.

rf = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,
    random_state=12345,
    min_samples_leaf=10,
)
rf.fit(X_train, y_train)

# Column 1 holds the probability of class 1 ("Bad")
train_prob = rf.oob_decision_function_[:, 1]
test_prob = rf.predict_proba(X_test)[:, 1]
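One caveat worth guarding against: scikit-learn documents that oob_decision_function_ can contain NaN when a row was never left out of a bootstrap sample. With 500 trees this is very unlikely, but with a small forest it can happen. The sketch below shows one way to drop such rows before computing errors; drop_unscored is a hypothetical helper, and the demo arrays stand in for y_train and train_prob.

```python
import numpy as np


def drop_unscored(y_true, prob):
    """Keep only rows with a finite probability (hypothetical helper).

    Returns the filtered labels, the filtered probabilities, and the
    number of rows that were dropped.
    """
    y_true = np.asarray(y_true)
    prob = np.asarray(prob, dtype=float)
    mask = np.isfinite(prob)
    return y_true[mask], prob[mask], int((~mask).sum())


# Stand-ins for y_train and train_prob; the second row has no OOB score
y_demo = np.array([0, 1, 1, 0])
prob_demo = np.array([0.2, np.nan, 0.8, 0.4])

y_kept, prob_kept, n_dropped = drop_unscored(y_demo, prob_demo)
print(f"dropped {n_dropped} row(s); {len(y_kept)} remain")
```

If any rows are dropped, filter the labels and the probabilities together so they stay aligned.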

Step 2 - Turn predictions into error signals

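# Per-observation Brier score: squared error of the predicted probability
brier_train = (y_train - train_prob) ** 2
brier_test = (y_test - test_prob) ** 2

# Clip probabilities away from 0 and 1 so the logarithms below stay finite
eps = 1e-10
train_prob_clipped = np.clip(train_prob, eps, 1 - eps)
test_prob_clipped = np.clip(test_prob, eps, 1 - eps)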

logloss_train = -(
    y_train * np.log(train_prob_clipped)
    + (1 - y_train) * np.log(1 - train_prob_clipped)
)
logloss_test = -(
    y_test * np.log(test_prob_clipped)
    + (1 - y_test) * np.log(1 - test_prob_clipped)
)
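The clipping step matters because a forest can return a probability of exactly 0 or 1. The following sketch, on made-up values rather than the HELOC predictions, shows what the same log-loss formula produces without clipping and with it. The function name per_sample_logloss is illustrative.

```python
import numpy as np


def per_sample_logloss(y, p):
    """Per-observation log-loss, using the same formula as in Step 2."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


y_demo = np.array([1, 1])
prob_demo = np.array([0.0, 1.0])  # first is confidently wrong, second is right

# Without clipping: log(0) yields -inf, and 0 * -inf yields NaN
with np.errstate(divide="ignore", invalid="ignore"):
    raw = per_sample_logloss(y_demo, prob_demo)

eps = 1e-10
clipped = per_sample_logloss(y_demo, np.clip(prob_demo, eps, 1 - eps))

print("raw:    ", raw)
print("clipped:", clipped)
```

Unclipped, the wrong prediction yields inf and even the correct one yields NaN. After clipping, both values are finite, and the largest possible per-observation loss is -log(eps), about 23.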

Step 3 - Test whether errors are worse on the test set

harm_brier = ss.shift.detect_harm(
    source=brier_train,
    target=brier_test,
    direction="higher-is-worse",
)

harm_logloss = ss.shift.detect_harm(
    source=logloss_train,
    target=logloss_test,
    direction="higher-is-worse",
)

print(f"Brier p-value:    {harm_brier.pvalue:.4f}")
print(f"Log-loss p-value: {harm_logloss.pvalue:.4f}")

On this stratified random split, the p-values should not be especially small. That is the expected result when training and test come from the same population.

Interpreting the outcome

  • Small p-values mean the test set contains a disproportionate share of higher-error predictions.
  • Large p-values mean there is not enough evidence that the model performs worse on the test set.

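Brier score and log-loss usually tell the same story here. Per observation, both are increasing functions of the absolute error |y - p|: the Brier score is |y - p|^2 and the log-loss is -log(1 - |y - p|). They therefore order observations identically, apart from ties created by clipping. Because ss.shift.detect_harm(...) is rank-based, the two tests should give near-identical results.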
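The ordering argument can be checked on a handful of made-up predictions. This sketch only compares the ranks of the two error signals. It does not call samesame, and the ranks helper is illustrative.

```python
import numpy as np


def ranks(values):
    """Zero-based rank of each value (assumes no ties)."""
    return np.argsort(np.argsort(values))


# Made-up labels and predicted probabilities, for illustration only
y_demo = np.array([1, 0, 1, 0, 1, 0])
p_demo = np.array([0.9, 0.2, 0.4, 0.7, 0.05, 0.55])

brier_demo = (y_demo - p_demo) ** 2
logloss_demo = -(
    y_demo * np.log(p_demo) + (1 - y_demo) * np.log(1 - p_demo)
)

print("Brier ranks:   ", ranks(brier_demo))
print("Log-loss ranks:", ranks(logloss_demo))
print("Same ordering: ", np.array_equal(ranks(brier_demo), ranks(logloss_demo)))
```

The two signals differ in scale but not in order, so any test that depends only on ranks sees the same input from both.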

Use this guide when labels are available. If they are not, use the "Monitor predicted credit risk" or "Monitor model confidence" guide instead.