samesame

Same, same but different ...

samesame compares a reference group with a new group and tells you whether the new group looks different, and whether it moved in a worse direction.

In the package, the reference group is called source and the new group is called target. That could mean training vs production data, a baseline batch vs a fresh batch, or one segment vs another.

The package is built around two practical questions:

Did anything change?
Did the change point in a worse direction?

You answer those questions with the signal that matches your use case: predicted risk, model confidence, prediction error, or a classifier score used to compare two datasets.

Start here

Start with "Detect a distribution shift" if you want to know whether two datasets differ at all.
Continue to "Check whether a shift is harmful" when you know what "worse" means for your signal.
Use "Adjust for covariate shift with importance weights" when source and target have different feature coverage and you want to focus on their overlap.

Quick example

import numpy as np
import samesame as ss

rng = np.random.default_rng(123_456)
source_scores = rng.normal(size=600)
target_scores = rng.normal(size=600)

shift = ss.shift.detect_shift(source_scores, target_scores)
harm = ss.shift.detect_harm(
    source_scores,
    target_scores,
    direction="higher-is-worse",
)

print(f"Shift p-value: {shift.pvalue:.4f}")
print(f"Harm  p-value: {harm.pvalue:.4f}")

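A small p-value from detect_shift(...) is evidence that the groups differ. A small p-value from detect_harm(...) is evidence that the target group also moved in the declared worse direction. In this example both samples are drawn from the same standard normal distribution, so neither p-value is expected to be small.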
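One way to act on the two p-values is to compare each against a significance level. The sketch below is only an illustration: the helper name summarize and the 0.05 cutoff are assumptions, not part of samesame.

```python
def summarize(shift_pvalue: float, harm_pvalue: float, alpha: float = 0.05) -> str:
    """Turn two p-values into a short verdict.

    `alpha` is an assumed significance level; pick one that suits your workflow.
    """
    if harm_pvalue < alpha:
        # The target moved in the declared worse direction.
        return "harmful shift"
    if shift_pvalue < alpha:
        # The groups differ, but not clearly in the worse direction.
        return "shift, no evidence of harm"
    return "no evidence of shift"


verdict = summarize(shift_pvalue=0.01, harm_pvalue=0.60)
print(verdict)
```

You could pass shift.pvalue and harm.pvalue from the quick example straight into such a helper.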

Common signals

Choose the signal that matches the decision you need to make:

Predicted risk when higher values already mean higher business risk.
Prediction error when labels are available and you want to measure accuracy directly.
Confidence score when you want to monitor certainty rather than business impact.
Domain-classifier score when your goal is to detect distribution shift between datasets.

The package does not force one interpretation on you. It gives you a small set of tests you can reuse across these settings.
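As a rough illustration of what building a signal can look like, the sketch below derives two of the signals listed above using only the standard library. The helper names are hypothetical, and both signals are oriented so that higher values mean worse, which matches the direction used in the quick example.

```python
def prediction_errors(y_true, y_pred):
    """Absolute error per row; higher means a worse prediction."""
    return [abs(t - p) for t, p in zip(y_true, y_pred)]


def uncertainty_scores(class_probabilities):
    """One minus the top class probability; higher means a less confident model."""
    return [1.0 - max(row) for row in class_probabilities]


# Toy inputs, made up for illustration only.
errors = prediction_errors([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])
uncertainty = uncertainty_scores([[0.9, 0.1], [0.6, 0.4]])
print(errors)
print(uncertainty)
```

Each helper returns one number per row, which is the shape the tests expect for both the source and the target group.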

Why it works well in practice

samesame is statistically grounded, but the working model is simple:

Build a numeric signal for source and target.
Test for any change with ss.shift.detect_shift(...).
Test for directional harm with ss.shift.detect_harm(...) when direction matters.
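To give a feel for what a permutation-based test does in the last two steps, here is a minimal sketch that uses a difference in means as the test statistic. The statistic, the function name, and the defaults are assumptions chosen for illustration; the statistics samesame actually uses may differ.

```python
import random
from statistics import fmean


def permutation_pvalue(source, target, one_sided=False, n_permutations=2000, seed=0):
    """Permutation p-value for a difference in means (illustrative only).

    one_sided=False asks "did anything change?";
    one_sided=True asks "did target move higher than source?".
    """
    rng = random.Random(seed)
    pooled = list(source) + list(target)
    n_source = len(source)

    def statistic(values):
        diff = fmean(values[n_source:]) - fmean(values[:n_source])
        return diff if one_sided else abs(diff)

    observed = statistic(pooled)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any link between values and group labels
        if statistic(pooled) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement, so the p-value is never 0.
    return (hits + 1) / (n_permutations + 1)


data_rng = random.Random(1)
source = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
target = [data_rng.gauss(1.0, 1.0) for _ in range(200)]  # shifted upward

shift_p = permutation_pvalue(source, target)
harm_p = permutation_pvalue(source, target, one_sided=True)
print(f"Shift p-value: {shift_p:.4f}")
print(f"Harm  p-value: {harm_p:.4f}")
```

Shuffling the pooled values simulates a world where group labels carry no information, so the p-value is the share of shuffles that look at least as extreme as the observed data.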

Both tests are permutation-based, which keeps the assumptions light. When source and target differ in feature support, ss.weights.from_domain_probabilities(...) lets you focus the test on the region where the two groups are genuinely comparable.
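A common way to turn domain-classifier probabilities into importance weights is the density-ratio trick, sketched below with the standard library. This page does not say whether ss.weights.from_domain_probabilities(...) computes exactly this, so treat the function name, the clipping constant, and the formula as assumptions.

```python
def density_ratio_weights(target_probabilities, n_source, n_target, eps=1e-6):
    """Weight each row by p(target | x) / p(source | x), rescaled for group sizes.

    `target_probabilities` are a domain classifier's estimates that a row
    belongs to the target group. `eps` is an assumed clipping constant that
    avoids division by zero.
    """
    weights = []
    for p in target_probabilities:
        p = min(max(p, eps), 1.0 - eps)
        weights.append((p / (1.0 - p)) * (n_source / n_target))
    return weights


# Toy probabilities, made up for illustration only.
weights = density_ratio_weights([0.5, 0.9, 0.1], n_source=100, n_target=100)
print(weights)
```

Rows that the classifier thinks are unlikely to appear in the target group get weights near zero, which is how weighting can reduce the influence of source outliers that have no counterpart in the target.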

Pick a guide

Monitor predicted credit risk for a label-free business-risk workflow.
Monitor model confidence when confidence matters more than the raw prediction.
Monitor prediction errors once labels arrive for direct accuracy monitoring.
Focus harmful-shift testing on shared support when source contains outliers that are irrelevant for deployment.
Restrict testing to common support on both sides when both groups contain low-overlap outliers.

Installation

python -m pip install samesame