Beyond statistical tests of equal distribution and of mean differences for dataset shift.
Sometimes the pertinent question is, are we worse off? Annie, what we really want to know is, are you OK? We compare the way things were (the happy past) to the way they are (the tumultuous present). But wait, your statistical mind makes the leap: is this not just a two-sample comparison (test)? It is. But the trouble is that not all statistical tests are well suited to this task. The main point of this post is to argue that some widely used statistical tests may come close but often miss the mark when it comes to answering this question. The Question © – that is, once more so you don’t forget, are we worse off? – is what I mean when I say testing for harmful or adverse shift.
But first off, you may wonder, why does it matter? One reason to care in machine learning is dataset shift. Over time, things change: data drift, and predictive models, built as they are on past data, suffer as a consequence. Model performance deteriorates, sometimes quite drastically. Chip Huyen has a great primer on dataset shift here - it is a fantastic resource (go read it!). It turns out that to detect dataset shift, we often rely on statistical tests. The most common ones are tests of equal distribution and of mean differences. And I posit that both let you down in perhaps surprising ways when it comes down to the Question ©. Which is? You’re still with me, right?
In a previous post, I praised modern tests of equal distribution. But I did warn you that I had some beef too. It is time to settle the score. These tests fail us when testing for adverse shift.1 They fail because “not all changes in distribution are a cause for concern - some changes are benign” (Kamulete 2022). A simple example can convince you of this. Suppose your original sample was contaminated with a few outliers but your new sample is not. You’re clearly better off. Well, tests of equal distribution would still tell you that the two samples are different, i.e., you would reject the null of no difference. This is a false alarm. These tests don’t answer the Question ©; they tell us whether the two samples can be said to be drawn from the same distribution, not whether we are worse off. As we know all too well, different does not mean inferior. We need a better test.
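To see the false alarm in action, here is a minimal sketch in base R. The sample sizes, the roughly 5% contamination rate and the choice of a Kolmogorov-Smirnov test are mine, purely for illustration.

```r
# A false alarm from a test of equal distribution.
# The old sample is contaminated (about 5% outliers); the new sample is clean.
set.seed(123)
n <- 3000
old_sample <- c(rnorm(n), rnorm(150, mean = 8))  # contaminated reference sample
new_sample <- rnorm(n)                           # clean, outlier-free sample

# Kolmogorov-Smirnov test of equal distribution: small p-value, null rejected...
ks.test(old_sample, new_sample)

# ...even though, by any sensible standard, we are better off now.
```

A test for adverse shift should stay quiet here: the change is real, but it is a change for the better.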
What about statistical tests of mean differences? Lindon, Sanden, and Shirikian (2022) explain the problem:
[..] Not all bugs or performance regressions can be captured by differences in the mean alone […]. Consider PlayDelay, an important key indicator of the streaming quality at Netflix, which measures the time taken for a title to start once the user has hit the play button. It is possible for the mean […] to be the same, but for the tails of the distribution to be heavier […], resulting in an increased occurrence of extreme values. Large values of PlayDelay, even if infrequent, are unacceptable and considered a severe performance regression by our engineering teams.
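In the same spirit, here is a quick sketch with synthetic numbers (nothing to do with Netflix’s actual telemetry): two samples with the same centre, one with much heavier tails. A t-test of mean differences sees nothing; the extreme quantiles tell a very different story.

```r
# Same mean, heavier tails: a test of mean differences has nothing to find.
set.seed(456)
n <- 5000
baseline <- rnorm(n)          # light tails
heavier  <- rt(n, df = 2)     # same centre (zero), much heavier tails

t.test(baseline, heavier)     # typically a large p-value: the means look alike

# But the extremes, the part users actually feel, have blown up.
quantile(baseline, 0.999)
quantile(heavier, 0.999)
```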
Again, we need a better test. Incidentally, if you were not already convinced that the Question © is important, I know you are now: it is your Netflix bingeing at stake here. In short, both tests of equal distribution and of mean differences are lacking when it comes to the Question ©. That’s the bad news.
What’s the good news? If these widely used statistical tests are not good enough, what are the alternatives? I referred to two already (Kamulete 2022; Lindon, Sanden, and Shirikian 2022).2 Recent works (Podkopaev and Ramdas 2021; Luo et al. 2022; Vovk et al. 2021; Harel et al. 2014) also attempt to tackle the Question © more rigorously (precisely).3 When discussing my own work and the dsos package at useR! 2022, I chose the title “Call me when it hurts” because it gets at the core of the Question ©. Data scientists want to know when there is a real problem: raising too many false alarms is one of the fastest and surest ways to lose credibility.
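To make that one-sided, call-me-when-it-hurts logic concrete, here is a toy sketch in base R. To be clear, this is neither the dsos interface nor its test statistic; the package works with outlier scores in a more careful way. The sketch only conveys the idea: given outlier scores for both samples (in practice they would come from a scoring model; here I simply simulate them), ask, one-sidedly, whether the new sample scores worse than resampling can explain. A benign change, say fewer outliers, raises no alarm.

```r
# Toy one-sided test for adverse shift via outlier scores.
# NOT the dsos API or its statistic; only the "call me when it hurts" idea.
# Convention: higher score = more outlying. Flag only if the NEW sample is worse.
set.seed(789)
os_old <- rexp(500)              # outlier scores on the old (reference) sample
os_new <- rexp(500, rate = 0.8)  # new sample scores tend to be a bit higher

# Observed statistic: how much higher are the new scores, on average?
obs <- mean(os_new) - mean(os_old)

# One-sided permutation test: only a shift towards higher scores counts.
n_perm <- 2000
pooled <- c(os_old, os_new)
perm_stats <- replicate(n_perm, {
  idx <- sample(length(pooled), length(os_new))
  mean(pooled[idx]) - mean(pooled[-idx])
})
p_value <- (sum(perm_stats >= obs) + 1) / (n_perm + 1)
p_value  # small only if the new sample looks worse, not merely different
```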
Let me conclude. If you’ve stuck with me this far, first, I am sorry, not sorry 😉, for abusing the Question © to mean testing for adverse shift. It is, we agree, a silly trope not intended to be taken too seriously. Second, while we undoubtedly need better statistical tests than those of equal distribution and of mean differences for detecting adverse shift, we also happen to have a growing number of such tests at our disposal. We should use them. Alright, this is long enough. I need to get back to work on the next one (Kamulete 2023).
It isn’t their fault; it’s ours, the users. They simply do what they were designed to do, no more, no less, as infuriating as that may be.
One paradigm that, to my mind, does not get its due in ML/AI circles is statistical process control or statistical quality control (Gandy and Kvaløy 2017; Flores et al. 2021). I am, unhappily I assure you, complicit in that here. This community also cares about the Question ©. In fact, some may rightfully say they were pioneers in this area.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/vathymut/vathymut.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kamulete (2023, Jan. 3). Vathy M. Kamulete: Are you OK? Test for harmful (adverse) shift. Retrieved from https://vathymut.org/posts/2023-01-03-are-you-ok/
BibTeX citation
@misc{kamulete2023are,
  author = {Kamulete, Vathy M.},
  title = {Vathy M. Kamulete: Are you OK? Test for harmful (adverse) shift},
  url = {https://vathymut.org/posts/2023-01-03-are-you-ok/},
  year = {2023}
}