Test for no adverse shift with outlier scores. Like goodness-of-fit testing, this two-sample comparison takes the training set, x_train, as the reference. The method checks whether the test set, x_test, is worse off relative to this reference set. The function scorer assigns an outlier score to each instance/observation in both the training and test sets.
Usage:

  at_oob(x_train, x_test, scorer)
Arguments:

  x_train: Training (reference/validation) sample.

  x_test: Test sample.

  scorer: Function which returns a named list with outlier scores from the training and test sample. The first argument to scorer must be x_train; the second, x_test. The returned named list contains two elements, train and test, each of which is a vector of (outlier) scores. See notes for more information.
Value:

  A named list of class outlier.test containing:

  statistic: observed WAUC statistic
  seq_mct: sequential Monte Carlo test, when applicable
  p_value: p-value
  outlier_scores: outlier scores from the training and test set
Notes:

  Li and Fine (2010) derive the asymptotic null distribution of the weighted AUC (WAUC), the test statistic. This approach does not use permutations and can, as a result, be much faster because it sidesteps the need to refit the scoring function scorer. It works well for large samples. The prefix at stands for asymptotic test, to tell it apart from the prefix pt, the permutation test.

  The scoring function, scorer, predicts out-of-bag scores to mimic out-of-sample behaviour. The suffix oob stands for out-of-bag to highlight this point. This out-of-bag variant avoids refitting the underlying algorithm from scorer at every permutation and can, as a result, be computationally appealing.
References:

  Gandy, A. (2009). Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488), 1504-1511.

  Kamulete, V. M. (2022). Test for non-negligible adverse shifts. In The 38th Conference on Uncertainty in Artificial Intelligence. PMLR.

  Li, J. and Fine, J. P. (2010). Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C, 59(4), 673-692.
See also:

  pt_oob() for (faster) p-value approximation via out-of-bag predictions. pt_refit() for (slower) p-value approximation via refitting.

  Other asymptotic-test: at_from_os()
Examples:

# \donttest{
library(dsos)
set.seed(12345)
data(iris)
setosa <- iris[1:50, 1:4] # Training sample: Species == 'setosa'
versicolor <- iris[51:100, 1:4] # Test sample: Species == 'versicolor'
# Use a fake scoring function that assigns random outlier scores
scorer <- function(tr, te) list(train = runif(nrow(tr)), test = runif(nrow(te)))
oob_test <- at_oob(setosa, versicolor, scorer = scorer)
oob_test
#> Frequentist test for no adverse shift
#>
#> p-value = 0.17518, test statistic (weighted AUC/WAUC) = 0.101
#>
#> Alternative hypothesis: Pr(WAUC >= 0.101)
#> => the test set is worse off than training.
#> Sample sizes: 50 in training and 50 in test set.
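# Components of the result can be extracted directly from the returned
# list, e.g. oob_test$p_value and oob_test$statistic.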
# }