Beyond statistical tests of equal distribution and of mean differences for dataset shift.
Sometimes the pertinent question is, are we worse off? Annie, what we really want to know is, are you OK? We compare the way things were (the happy past) to the way they are (the tumultuous present). But wait, your statistical mind makes the leap: is this not just a two-sample comparison (test)? It is. But the trouble is that not all statistical tests are well suited to this task. The main point of this post is to argue that some widely used statistical tests may come close but often miss the mark when it comes to answering this question. The Question © – that is, once more so you don’t forget, are we worse off? – is what I mean when I say testing for harmful or adverse shift.
But first off, you may wonder, why does it matter? One reason to care in machine learning is dataset shift. Over time, things change: data drift, and predictive models, built as they are on past data, suffer as a consequence. Model performance deteriorates, sometimes quite drastically. Chip Huyen has a great primer on dataset shift here - it is a fantastic resource (go read it!). It turns out that to detect dataset shift, we often rely on statistical tests. The most common ones are tests of equal distribution and of mean differences. And I posit that both let you down in perhaps surprising ways when it comes down to the Question ©. Which is? You’re still with me, right?
In a previous post, I praised modern tests of equal distribution. But I did warn you that I had some beef too. It is time to settle the score. These tests fail us when testing for adverse shift.1 They fail because “not all changes in distribution are a cause for concern - some changes are benign” (Kamulete 2022). A simple example can convince you of this. Suppose your original sample was contaminated with a few outliers but your new sample is not. You’re clearly better off. Well, tests of equal distribution would still tell you that the two samples are different, i.e., you would reject the null of no difference. This is a false alarm. These tests don’t answer the Question ©; they tell us whether the two samples can be said to be drawn from the same distribution, not whether we are worse off. As we know all too well, different does not mean inferior. We need a better test.
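To see the false alarm in action, here is a minimal sketch in base R. The sample sizes, the roughly 5% contamination rate and the choice of a Kolmogorov-Smirnov test are mine, purely for illustration.

```r
# A false alarm from a test of equal distribution.
# The old sample is contaminated (about 5% outliers); the new sample is clean.
set.seed(123)
n <- 3000
old_sample <- c(rnorm(n), rnorm(150, mean = 8))  # contaminated reference sample
new_sample <- rnorm(n)                           # clean, outlier-free sample

# Kolmogorov-Smirnov test of equal distribution: small p-value, null rejected...
ks.test(old_sample, new_sample)

# ...even though, by any sensible standard, we are better off now.
```

A test for adverse shift should stay quiet here: the change is real, but it is a change for the better.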
What about statistical tests of mean differences? Lindon, Sanden, and Shirikian (2022) explain the problem:
[..] Not all bugs or performance regressions can be captured by differences in the mean alone […]. Consider PlayDelay, an important key indicator of the streaming quality at Netflix, which measures the time taken for a title to start once the user has hit the play button. It is possible for the mean […] to be the same, but for the tails of the distribution to be heavier […], resulting in an increased occurrence of extreme values. Large values of PlayDelay, even if infrequent, are unacceptable and considered a severe performance regression by our engineering teams.
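In the same spirit, here is a quick sketch with synthetic numbers (nothing to do with Netflix’s actual telemetry): two samples with the same centre, one with much heavier tails. A t-test of mean differences sees nothing; the extreme quantiles tell a very different story.

```r
# Same mean, heavier tails: a test of mean differences has nothing to find.
set.seed(456)
n <- 5000
baseline <- rnorm(n)          # light tails
heavier  <- rt(n, df = 2)     # same centre (zero), much heavier tails

t.test(baseline, heavier)     # typically a large p-value: the means look alike

# But the extremes, the part users actually feel, have blown up.
quantile(baseline, 0.999)
quantile(heavier, 0.999)
```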
Again, we need a better test. Incidentally, if you were not already convinced that the Question © is important, I know you are now: it is your Netflix bingeing at stake here. In short, both tests of equal distribution and of mean differences are lacking when it comes to the Question ©. That’s the bad news.
What’s the good news? If these widely used statistical tests are not good enough, what are the alternatives? I referred to two already (Kamulete 2022; Lindon, Sanden, and Shirikian 2022).2 Recent works (Podkopaev and Ramdas 2021; Luo et al. 2022; Vovk et al. 2021; Harel et al. 2014) also attempt to tackle the Question © more rigorously (precisely).3 When discussing my own work and the dsos package at useR! 2022, I chose the title “Call me when it hurts” because it gets at the core of the Question ©. Data scientists want to know when there is a real problem: raising too many false alarms is one of the fastest and surest ways to lose credibility.
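To make that one-sided, call-me-when-it-hurts logic concrete, here is a toy sketch in base R. To be clear, this is neither the dsos interface nor its test statistic; the package works with outlier scores in a more careful way. The sketch only conveys the idea: given outlier scores for both samples (in practice they would come from a scoring model; here I simply simulate them), ask, one-sidedly, whether the new sample scores worse than resampling can explain. A benign change, say fewer outliers, raises no alarm.

```r
# Toy one-sided test for adverse shift via outlier scores.
# NOT the dsos API or its statistic; only the "call me when it hurts" idea.
# Convention: higher score = more outlying. Flag only if the NEW sample is worse.
set.seed(789)
os_old <- rexp(500)              # outlier scores on the old (reference) sample
os_new <- rexp(500, rate = 0.8)  # new sample scores tend to be a bit higher

# Observed statistic: how much higher are the new scores, on average?
obs <- mean(os_new) - mean(os_old)

# One-sided permutation test: only a shift towards higher scores counts.
n_perm <- 2000
pooled <- c(os_old, os_new)
perm_stats <- replicate(n_perm, {
  idx <- sample(length(pooled), length(os_new))
  mean(pooled[idx]) - mean(pooled[-idx])
})
p_value <- (sum(perm_stats >= obs) + 1) / (n_perm + 1)
p_value  # small only if the new sample looks worse, not merely different
```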
Let me conclude. If you’ve stuck with me this far, first, I am sorry, not sorry 😉, for abusing the Question © to mean testing for adverse shift. It is, we agree, a silly trope not intended to be taken too seriously. Second, while we undoubtedly need better statistical tests than those of equal distribution and of mean differences for detecting adverse shift, we also happen to have a growing number of such tests at our disposal. We should use them. Alright, this is long enough. I need to get back to work on the next one (Kamulete 2023).
It isn’t their fault; it’s ours, the users. They simply do what they were designed to do, no more, no less, as infuriating as that may be.
One paradigm that, to my mind, does not get its due in ML/AI circles is statistical process control or statistical quality control (Gandy and Kvaløy 2017; Flores et al. 2021). I am, unhappily I assure you, complicit in that here. This community also cares about the Question ©. In fact, some may rightfully say they were pioneers in this area.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/vathymut/vathymut.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kamulete (2023, Jan. 3). Vathy M. Kamulete: Are you OK? Test for harmful (adverse) shift. Retrieved from https://vathymut.org/posts/2023-01-03-are-you-ok/
BibTeX citation
@misc{kamulete2023are,
  author = {Kamulete, Vathy M.},
  title = {Vathy M. Kamulete: Are you OK? Test for harmful (adverse) shift},
  url = {https://vathymut.org/posts/2023-01-03-are-you-ok/},
  year = {2023}
}