samesame¶
Same, same but different ...
samesame helps you compare a source sample with a target sample.
The source is your reference — typically training data or an earlier time period.
The target is what you're comparing against — typically production data or a later period.
It answers two practical questions:
- Did anything change? Use `test_shift(...)`.
- Did things get worse? Use `test_adverse_shift(...)`.
Use it for model monitoring, data validation, drift assessment, or any workflow where you need to compare two groups and determine whether the difference is practically important.
Who is this for?¶
samesame is useful whenever you need to compare a source group and a target group, for example:
- Model monitoring — Does production data still look like training data?
- Data validation — Does this new batch look like the data I expect?
- Drift detection — Did something change between last month and this month?
- Group comparison — Do two customer groups, regions, or experiments look meaningfully different?
Installation¶
```shell
python -m pip install samesame
```
Quick Start¶
Suppose you already have one score per row for a source sample and a target sample. Larger scores should indicate worse or more unusual outcomes. The score usually comes from a (pre-trained) model: for example, you might train a classifier to distinguish the source data from the target data and use its predicted probabilities as scores, or you might use a model's confidence or its prediction errors. The choice of score depends on your application and on what kind of shift you want to detect.
```python
import numpy as np
from samesame import test_adverse_shift, test_shift

rng = np.random.default_rng(123_456)
source_scores = rng.normal(size=600)
target_scores = rng.normal(size=600)

shift = test_shift(source=source_scores, target=target_scores)
print(f"Did anything change? p-value = {shift.pvalue:.4f}")

harm = test_adverse_shift(
    source=source_scores,
    target=target_scores,
    direction="higher-is-worse",
)
print(f"Did things get worse? p-value = {harm.pvalue:.4f}")
```
How to read this: a small p-value from `test_shift(...)` indicates evidence that the target sample differs from the source sample.
A small p-value from `test_adverse_shift(...)` indicates evidence that the target has also shifted in a worse direction.
If the first is small and the second is large, the data changed but not in a clearly harmful way.
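That decision logic can be sketched as a small helper. This function is hypothetical and not part of the samesame API; the significance level `alpha` is a conventional choice you should set for your own application:

```python
# Hypothetical helper (not part of samesame) that combines the two
# p-values into a plain-language verdict at significance level alpha.
def interpret(p_shift: float, p_adverse: float, alpha: float = 0.05) -> str:
    if p_shift >= alpha:
        return "no detectable change"
    if p_adverse < alpha:
        return "changed for the worse"
    return "changed, but not clearly worse"

print(interpret(p_shift=0.001, p_adverse=0.0005))  # changed for the worse
print(interpret(p_shift=0.60, p_adverse=0.70))     # no detectable change
```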
How it works¶
samesame does not compare raw tables directly. The usual workflow is:
- Turn each row into one score — typically from a classifier trained to distinguish the two groups.
- Compare those scores with `test_shift(...)` (did anything change?) and `test_adverse_shift(...)` (did it get worse?).
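The two steps above can be sketched with scikit-learn. The synthetic data, classifier choice, and cross-validation setup here are illustrative assumptions, not part of samesame:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X_source = rng.normal(size=(300, 5))           # e.g. training data
X_target = rng.normal(loc=0.3, size=(300, 5))  # e.g. production data

# Step 1: label rows by group and train a classifier to tell them apart.
X = np.vstack([X_source, X_target])
y = np.concatenate([np.zeros(300), np.ones(300)])

# Out-of-fold probabilities avoid scoring rows the model was trained on.
proba = cross_val_predict(
    LogisticRegression(), X, y, cv=5, method="predict_proba"
)[:, 1]

source_scores = proba[y == 0]
target_scores = proba[y == 1]
# Step 2: feed these scores to test_shift(...) / test_adverse_shift(...).
```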
Both tests are permutation-based, so no distributional assumptions are required.
When you know that source and target have different feature distributions — covariate shift — you can supply sample importance weights to focus the test on the region where both groups overlap. See Adjust for covariate shift with importance weights.
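One common way to estimate such weights (the propensity-model approach below is an illustrative sketch, not part of samesame itself) is to fit a classifier that predicts group membership from the features and take the resulting density ratio; see the how-to guide above for supplying the weights to the tests:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_source = rng.normal(size=(400, 3))
X_target = rng.normal(loc=0.5, size=(400, 3))

# Propensity model: how likely is each row to come from the target group?
X = np.vstack([X_source, X_target])
y = np.concatenate([np.zeros(400), np.ones(400)])
clf = LogisticRegression().fit(X, y)

# Density-ratio estimate p(target | x) / p(source | x) for the source rows;
# rows far from the target's feature distribution get weights near zero.
p_target = clf.predict_proba(X_source)[:, 1]
weights = p_target / (1.0 - p_target)
```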
Where to go next¶
Step-by-step examples are available in the documentation:
Tutorials
How-to guides
- Monitor a credit risk model
- Monitor prediction errors when labels are available
- Monitor model confidence
Dependencies¶
samesame has minimal dependencies. It is built on top of, and fully compatible with,
scikit-learn and numpy.