Testing functions

Use this page for the two primary user-facing tests: test_shift(...) and test_adverse_shift(...). Start here if you are new to the package or want the simplest API surface.

What you get back

  • test_shift(...) returns ShiftDetails with .statistic, .pvalue, .statistic_name, and .null_distribution
  • test_adverse_shift(...) returns AdverseShiftDetails with .statistic, .pvalue, .direction, and .null_distribution

For Bayesian output or advanced controls, see the advanced page.
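
A minimal sketch of the basic flow, assuming test_shift and test_adverse_shift are exported from the top-level samesame package (the source lives in src/samesame/_api.py) and that your outlier scores are plain NumPy arrays:

import numpy as np
from samesame import test_shift, test_adverse_shift  # assumed top-level exports

rng = np.random.default_rng(0)
source = rng.normal(size=500)           # baseline outlier scores
target = rng.normal(loc=0.3, size=500)  # new outlier scores to compare

res = test_shift(source=source, target=target)
print(res.statistic_name, res.statistic, res.pvalue)

adv = test_adverse_shift(source=source, target=target, direction="higher-is-worse")
print(adv.direction, adv.pvalue)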

Task-first API for hypothesis tests over outlier scores.

The primary API exposes:

  • test_shift — test whether two outlier score distributions differ
  • test_adverse_shift — test for harmful shifts with explicit direction
  • adverse_shift_posterior — Bayesian evidence layer on top of an adverse-shift result

All test functions return a full result including the null distribution.
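
Because the full null distribution is returned, you can work with it directly. A small sketch, continuing the res object from the example above, comparing the observed statistic against an empirical permutation quantile:

q95 = float(np.quantile(res.null_distribution, 0.95))
print(res.statistic, q95, res.statistic > q95)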

AdverseShiftDetails dataclass

Bases: TestResult

Result of an adverse-shift test, including the full null distribution.

Source code in src/samesame/_types.py
@dataclass(frozen=True)
class AdverseShiftDetails(TestResult):
    """Result of an adverse-shift test, including the full null distribution."""

    direction: Direction
    null_distribution: NDArray[np.float64]

BayesianEvidence dataclass

Bayesian evidence layer computed on top of an adverse-shift result.

Source code in src/samesame/_types.py
@dataclass(frozen=True)
class BayesianEvidence:
    """Bayesian evidence layer computed on top of an adverse-shift result."""

    posterior: NDArray[np.float64]
    bayes_factor: float

ContextualWeights dataclass

Importance weights for source and target groups, used to correct for covariate shift between source and target during a shift test.

Attributes:

  • source (NDArray[np.float64]): Importance weights for source samples, normalized to sum to len(source).
  • target (NDArray[np.float64]): Importance weights for target samples, normalized to sum to len(target).

Source code in src/samesame/weights.py
@dataclass(frozen=True)
class ContextualWeights:
    """Importance weights for source and target groups, used to
    correct for covariate shift between source and target during a shift test.

    Attributes
    ----------
    source : NDArray[np.float64]
        Importance weights for source samples, normalized to sum to
        ``len(source)``.
    target : NDArray[np.float64]
        Importance weights for target samples, normalized to sum to
        ``len(target)``.
    """

    source: NDArray[np.float64]
    target: NDArray[np.float64]
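
A minimal sketch of constructing these weights directly, using uniform weights, which trivially satisfy the normalization convention above (each array sums to its group's size). In practice you would usually build them from domain probabilities with samesame.weights.contextual_weights instead (its signature is not shown on this page):

import numpy as np
from samesame.weights import ContextualWeights

n_source, n_target = 500, 400
weights = ContextualWeights(
    source=np.ones(n_source),  # sums to len(source), per the convention above
    target=np.ones(n_target),  # sums to len(target)
)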

ShiftDetails dataclass

Bases: TestResult

Result of a shift test, including the full null distribution.

Source code in src/samesame/_types.py
@dataclass(frozen=True)
class ShiftDetails(TestResult):
    """Result of a shift test, including the full null distribution."""

    statistic_name: str
    null_distribution: NDArray[np.float64]

TestResult dataclass

Shared fields for all test results.

Source code in src/samesame/_types.py
@dataclass(frozen=True)
class TestResult:
    """Shared fields for all test results."""

    statistic: float
    pvalue: float

adverse_shift_posterior(*, source, target, direction, n_resamples=9999, rng=None, weights=None, threshold=1 / 12)

Compute Bayesian evidence for adverse shift using a bootstrap posterior.

Provides a Bayesian evidence layer on top of the adverse-shift test: runs a Bayesian bootstrap over the WAUC metric and returns posterior draws together with a Bayes factor against a reference threshold.

Parameters:

  • source (ArrayLike, required): Baseline outlier scores, typically from training or reference data.
  • target (ArrayLike, required): New outlier scores to compare against source, typically from production or deployment data.
  • direction ({'higher-is-worse', 'higher-is-better'}, required): Whether higher outlier scores indicate worse outcomes ('higher-is-worse') or better outcomes ('higher-is-better'). Required to determine the direction of adverse shift.
  • n_resamples (int, default 9999): Number of Bayesian bootstrap resamples.
  • rng (numpy.random.Generator or None, default None): Random number generator for reproducibility. None creates a fresh one.
  • weights (ContextualWeights or None, default None): Importance weights to correct for covariate shift and related concerns between source and target. Build from domain probabilities using samesame.weights.contextual_weights, or construct ContextualWeights(source=..., target=...) directly. Pass None (default) to run an unweighted test.
  • threshold (float, default 1/12): WAUC value used as the null reference for the Bayes factor; 1/12 is the asymptotic expected WAUC under the null hypothesis that source and target are from the same distribution.

Returns:

  • BayesianEvidence: Immutable result with posterior draws and bayes_factor.

See Also

test_adverse_shift : Run the permutation test for adverse shift.

Source code in src/samesame/_api.py
def adverse_shift_posterior(
    *,
    source: ArrayLike,
    target: ArrayLike,
    direction: Direction,
    n_resamples: int = 9999,
    rng: np.random.Generator | None = None,
    weights: ContextualWeights | None = None,
    threshold: float = 1 / 12,
) -> BayesianEvidence:
    """Compute Bayesian evidence for adverse shift using a bootstrap posterior.

    Provides a Bayesian evidence layer on top of the adverse-shift test:
    runs a Bayesian bootstrap over the WAUC metric and returns posterior
    draws together with a Bayes factor against a reference threshold.

    Parameters
    ----------
    source : ArrayLike
        Baseline outlier scores, typically from training or reference data.
    target : ArrayLike
        New outlier scores to compare against ``source``, typically from
        production or deployment data.
    direction : {'higher-is-worse', 'higher-is-better'}
        Whether higher outlier scores indicate worse outcomes
        (``'higher-is-worse'``) or better outcomes (``'higher-is-better'``).
        Required to determine the direction of adverse shift.
    n_resamples : int, optional
        Number of Bayesian bootstrap resamples, by default ``9999``.
    rng : numpy.random.Generator or None, optional
        Random number generator for reproducibility. ``None`` creates a
        fresh one.
    weights : ContextualWeights or None, optional
        Importance weights to correct for covariate shift and related concerns
        between source and target. Build from domain probabilities using
        :func:`~samesame.weights.contextual_weights`, or construct
        ``ContextualWeights(source=..., target=...)`` directly.
        Pass ``None`` (default) to run an unweighted test.
    threshold : float, optional
        WAUC value used as the null reference for the Bayes factor.
        Defaults to ``1/12``, the asymptotic expected WAUC under the null
        hypothesis that source and target are from the same distribution.

    Returns
    -------
    BayesianEvidence
        Immutable result with ``posterior`` draws and ``bayes_factor``.

    See Also
    --------
    test_adverse_shift : Run the permutation test for adverse shift.
    """
    dataset = build_two_sample_dataset(source, target)
    actual, predicted = dataset.labels, dataset.scores
    validated_direction = validate_direction(direction)
    if validated_direction == "higher-is-better":
        predicted = -predicted
    effective_weight = _resolve_weights(weights, dataset.n_source, dataset.n_target)
    posterior = np.asarray(
        bayesian_posterior(
            actual,
            predicted,
            wauc,
            n_resamples=n_resamples,
            rng=rng,
            base_weight=effective_weight,
        ),
        dtype=np.float64,
    )
    bayes_factor_val = float(_bayes_factor(posterior, threshold))
    return BayesianEvidence(
        posterior=posterior,
        bayes_factor=bayes_factor_val,
    )
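
A sketch of calling this function and summarizing its output; it assumes adverse_shift_posterior is exported from the top-level package and reuses the source and target arrays from the first example:

from samesame import adverse_shift_posterior  # assumed top-level export

ev = adverse_shift_posterior(
    source=source,
    target=target,
    direction="higher-is-worse",
    rng=np.random.default_rng(42),
)
lo, hi = np.quantile(ev.posterior, [0.025, 0.975])  # 95% credible interval for the WAUC
print(ev.posterior.mean(), (lo, hi), ev.bayes_factor)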

test_adverse_shift(*, source, target, direction, n_resamples=9999, batch=None, rng=None, weights=None)

Test whether the target sample is harmfully shifted.

Parameters:

  • source (ArrayLike, required): Baseline outlier scores, typically from training or reference data.
  • target (ArrayLike, required): New outlier scores to compare against source, typically from production or deployment data.
  • direction ({'higher-is-worse', 'higher-is-better'}, required): Whether higher outlier scores indicate worse outcomes ('higher-is-worse') or better outcomes ('higher-is-better'). Required to determine the direction of adverse shift.
  • n_resamples (int, default 9999): Number of permutation resamples.
  • batch (int or None, default None): Number of resamples to process per batch. None uses a single batch.
  • rng (numpy.random.Generator or None, default None): Random number generator for reproducibility. None creates a fresh one.
  • weights (ContextualWeights or None, default None): Importance weights to correct for covariate shift and related concerns between source and target. Build from domain probabilities using samesame.weights.contextual_weights, or construct ContextualWeights(source=..., target=...) directly. Pass None (default) to run an unweighted test.

Returns:

  • AdverseShiftDetails: Immutable result with statistic, pvalue, direction, and null_distribution.

See Also

adverse_shift_posterior : Compute Bayesian evidence on top of this result.

Source code in src/samesame/_api.py
def test_adverse_shift(
    *,
    source: ArrayLike,
    target: ArrayLike,
    direction: Direction,
    n_resamples: int = 9999,
    batch: int | None = None,
    rng: np.random.Generator | None = None,
    weights: ContextualWeights | None = None,
) -> AdverseShiftDetails:
    """Test whether the target sample is harmfully shifted.

    Parameters
    ----------
    source : ArrayLike
        Baseline outlier scores, typically from training or reference data.
    target : ArrayLike
        New outlier scores to compare against ``source``, typically from
        production or deployment data.
    direction : {'higher-is-worse', 'higher-is-better'}
        Whether higher outlier scores indicate worse outcomes
        (``'higher-is-worse'``) or better outcomes (``'higher-is-better'``).
        Required to determine the direction of adverse shift.
    n_resamples : int, optional
        Number of permutation resamples, by default ``9999``.
    batch : int or None, optional
        Number of resamples to process per batch. ``None`` uses a single
        batch.
    rng : numpy.random.Generator or None, optional
        Random number generator for reproducibility. ``None`` creates a
        fresh one.
    weights : ContextualWeights or None, optional
        Importance weights to correct for covariate shift and related concerns
        between source and target. Build from domain probabilities using
        :func:`~samesame.weights.contextual_weights`, or construct
        ``ContextualWeights(source=..., target=...)`` directly.
        Pass ``None`` (default) to run an unweighted test.

    Returns
    -------
    AdverseShiftDetails
        Immutable result with ``statistic``, ``pvalue``, ``direction``,
        and ``null_distribution``.

    See Also
    --------
    adverse_shift_posterior : Compute Bayesian evidence on top of this result.
    """
    dataset = build_two_sample_dataset(source, target)
    actual, predicted = dataset.labels, dataset.scores
    validated_direction = validate_direction(direction)
    if validated_direction == "higher-is-better":
        predicted = -predicted
    effective_weight = _resolve_weights(weights, dataset.n_source, dataset.n_target)
    result = _run_permutation_test(
        actual,
        predicted,
        wauc,
        n_resamples=n_resamples,
        alternative="greater",
        sample_weight=effective_weight,
        rng=rng,
        batch=batch,
    )
    return AdverseShiftDetails(
        statistic=float(result.statistic),
        pvalue=float(result.pvalue),
        direction=validated_direction,
        null_distribution=np.asarray(result.null_distribution, dtype=np.float64),
    )
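
A sketch combining the pieces above: a reproducible adverse-shift test with explicit importance weights, reusing source and target from the first example. The weights here are uniform, which should behave like the unweighted default; this is illustration only:

from samesame import test_adverse_shift  # assumed top-level export
from samesame.weights import ContextualWeights

w = ContextualWeights(
    source=np.ones(len(source)),
    target=np.ones(len(target)),
)
adv = test_adverse_shift(
    source=source,
    target=target,
    direction="higher-is-worse",
    rng=np.random.default_rng(7),  # fixed seed for a reproducible permutation null
    weights=w,
)
print(adv.statistic, adv.pvalue, adv.direction)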

test_shift(*, source, target, statistic='roc_auc', alternative='two-sided', n_resamples=9999, batch=None, rng=None, weights=None)

Test whether the source and target outlier score distributions differ.

Parameters:

  • source (ArrayLike, required): Baseline outlier scores, typically from training or reference data.
  • target (ArrayLike, required): New outlier scores to compare against source, typically from production or deployment data.
  • statistic ({'roc_auc', 'balanced_accuracy', 'matthews_corrcoef'}, default 'roc_auc'): Named built-in statistic used inside the permutation test.
  • alternative ({'two-sided', 'less', 'greater'}, default 'two-sided'): Alternative hypothesis for the permutation test.
  • n_resamples (int, default 9999): Number of permutation resamples.
  • batch (int or None, default None): Number of resamples to process per batch. None uses a single batch.
  • rng (numpy.random.Generator or None, default None): Random number generator for reproducibility. None creates a fresh one.
  • weights (ContextualWeights or None, default None): Importance weights to correct for covariate shift and related concerns between source and target. Build from domain probabilities using samesame.weights.contextual_weights, or construct ContextualWeights(source=..., target=...) directly. Pass None (default) to run an unweighted test.

Returns:

  • ShiftDetails: Immutable result with statistic, pvalue, statistic_name, and null_distribution.

Source code in src/samesame/_api.py
def test_shift(
    *,
    source: ArrayLike,
    target: ArrayLike,
    statistic: ShiftStatistic = "roc_auc",
    alternative: Literal["less", "greater", "two-sided"] = "two-sided",
    n_resamples: int = 9999,
    batch: int | None = None,
    rng: np.random.Generator | None = None,
    weights: ContextualWeights | None = None,
) -> ShiftDetails:
    """Test whether the source and target outlier score distributions differ.

    Parameters
    ----------
    source : ArrayLike
        Baseline outlier scores, typically from training or reference data.
    target : ArrayLike
        New outlier scores to compare against ``source``, typically from
        production or deployment data.
    statistic : {'roc_auc', 'balanced_accuracy', 'matthews_corrcoef'}, optional
        Named built-in statistic used inside the permutation test.
    alternative : {'two-sided', 'less', 'greater'}, optional
        Alternative hypothesis for the permutation test, by default
        ``'two-sided'``.
    n_resamples : int, optional
        Number of permutation resamples, by default ``9999``.
    batch : int or None, optional
        Number of resamples to process per batch. ``None`` uses a single
        batch.
    rng : numpy.random.Generator or None, optional
        Random number generator for reproducibility. ``None`` creates a
        fresh one.
    weights : ContextualWeights or None, optional
        Importance weights to correct for covariate shift and related concerns
        between source and target. Build from domain probabilities using
        :func:`~samesame.weights.contextual_weights`, or construct
        ``ContextualWeights(source=..., target=...)`` directly.
        Pass ``None`` (default) to run an unweighted test.

    Returns
    -------
    ShiftDetails
        Immutable result with ``statistic``, ``pvalue``, ``statistic_name``,
        and ``null_distribution``.
    """
    dataset = build_two_sample_dataset(source, target)
    actual, predicted = dataset.labels, dataset.scores
    statistic_name, metric = get_shift_metric(statistic)
    _validate_shift_scores(statistic_name, predicted)
    effective_weight = _resolve_weights(weights, dataset.n_source, dataset.n_target)
    result = _run_permutation_test(
        actual,
        predicted,
        metric,
        n_resamples=n_resamples,
        alternative=alternative,
        sample_weight=effective_weight,
        rng=rng,
        batch=batch,
    )
    return ShiftDetails(
        statistic=float(result.statistic),
        pvalue=float(result.pvalue),
        statistic_name=statistic_name,
        null_distribution=np.asarray(result.null_distribution, dtype=np.float64),
    )
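
A sketch of a one-sided variant, reusing source and target from the first example; it keeps the default roc_auc statistic and asks whether the observed statistic is larger than expected under the permutation null:

res_greater = test_shift(
    source=source,
    target=target,
    statistic="roc_auc",
    alternative="greater",  # one-sided permutation p-value
    rng=np.random.default_rng(3),
)
print(res_greater.statistic_name, res_greater.pvalue)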