Two-sample statistical tests based on modern high-capacity classifiers are powerful and still underappreciated.

The late great statistician, Sir David Cox, has a talk titled *In gentle praise of significance tests*. It is, of course, only *gentle praise* because, inevitably, there are some issues. I too have a bone to pick. But we will get to the beef much later. First, the *praise*.

Remember the old two-sample tests? They go by different names: test of equal distribution, of homogeneity, of goodness-of-fit, etc. The gist of it is, you have two samples. And so, you wonder, are they similar enough? Rings a bell, right? The Kolmogorov–Smirnov (K-S) test is a classic for this sort of thing. But the K-S test is old news: it predates the advent of *big* and *high dimensional* data 😱.
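
For old times' sake, here is the classic recipe in action. This is a minimal sketch using `scipy`; the Gaussian samples are made up for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
sample_1 = rng.normal(loc=0.0, scale=1.0, size=500)  # toy data
sample_2 = rng.normal(loc=0.3, scale=1.0, size=500)  # slightly shifted

# Kolmogorov-Smirnov two-sample test: compares the empirical CDFs
stat, p_value = ks_2samp(sample_1, sample_2)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.4f}")
```

The catch, of course, is that the K-S test is one-dimensional. That is part of why it creaks in the *high dimensional* regime.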

Since then, the field has been churning out powerful two-sample tests for the
Big Data era. Every year, exciting works are published on the subject. I
recently messed about with the Ball divergence approach (Pan et al. 2018).
You, no doubt, chose your own adventure. There has been tremendous
progress. Let Uncle Larry
explain^{1}.
And now you are, more or less, caught up with the state-of-the-art in
two-sample tests. The End.

OK, not so fast. You see, Uncle Larry left out something important. He touches on three ideas: kernel, energy, and cross-match tests. Trust me when I say, he wants you to know about the classifier tests too (Kim et al. 2021). Here is how they work. You assign your samples to different classes (you give them labels): Sample 1 is the positive class, and Sample 2, the negative one, or vice versa. Then, you fit your favorite classifier to see if it can reliably predict the label. If it can, the two samples are probably different. If it cannot, the two samples are similar enough. It seems obvious in hindsight, does it not? It is not even *deceptively* simple: it is *actually* simple.
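
Simple enough, in fact, to fit in a few lines. Here is a minimal sketch with scikit-learn; the toy Gaussian samples and the choice of random forest are my own placeholders, so swap in your data and your favorite classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy samples: two 10-dimensional Gaussians with a small mean shift
sample_1 = rng.normal(0.0, 1.0, size=(500, 10))
sample_2 = rng.normal(0.1, 1.0, size=(500, 10))

# Step 1: label the samples -- Sample 1 is the positive class
X = np.vstack([sample_1, sample_2])
y = np.concatenate([np.ones(len(sample_1)), np.zeros(len(sample_2))])

# Step 2: fit your favorite classifier on a training split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Step 3: can it predict the labels better than chance?
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"Held-out accuracy: {acc:.3f}")  # near 0.5 => samples look alike
```

Held-out accuracy well above 50% is evidence that the two samples differ. That is the whole trick.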

And yet, most people do not seem to know about two-sample classifier tests. In comparison, kernel tests seem ubiquitous (if you are looking at academic journals, not if you’re hanging out on Twitter). Even the energy tests, one of the innovations that Uncle Larry discusses, are in a fundamental sense equivalent to the kernel tests (Sejdinovic et al. 2012; Shen and Vogelstein 2020). It is kernels all the way down. On the theoretical front, the mathematical maturity to wield this awesome kernel power may require arcane incantations to RKHS voodoo magic. But no matter, word to Eric B. & Rakim, we won’t sweat the technique.

Still, this all raises the question: how do the humble classifier tests stack up against the more celebrated kernel tests? No spoilers from me. Let Lopez-Paz and Oquab (2017) do it:

Our take-home message is that modern binary classifiers can be easily turned into powerful two-sample tests. We have shown that these classifier two-sample tests set a new state-of-the-art in performance, and enjoy unique attractive properties: they are easy to implement, learn a representation of the data on the fly, have simple asymptotic distributions, and allow different ways to interpret how the two samples under study differ.

They do not often tell you this: you can, in fact, think of a kernel test as a classifier test. And you know what, data scientists are really good at supervised classification. Say \(p\)-values, and eyes glaze over; no one is listening. Say 50% accuracy instead, and suddenly everyone understands and is nodding vigorously. No RKHS magic needed – this is yuge.
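
And if you do want a \(p\)-value to go with that accuracy, it comes cheap; this is the "simple asymptotic distributions" part of the quote above. Under the null, each held-out prediction is a coin flip, so the number of correct predictions is binomial. A sketch, continuing from the earlier snippet (an exact binomial test; Lopez-Paz and Oquab (2017) use the normal approximation, which agrees for large samples):

```python
from scipy.stats import binomtest

# Under H0 (same distribution), held-out accuracy is a fair coin,
# so the count of correct predictions is Binomial(n, 0.5)
n_correct = int((clf.predict(X_te) == y_te).sum())
result = binomtest(n_correct, n=len(y_te), p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.4f}")
```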

Here is Cai, Goggin, and Jiang (2020), more recently, to hammer the point home:

We propose a test for two-sample problem based on estimates of classification probabilities obtained from a consistent classification algorithm. […] Our test is more powerful and efficient than many other tests.
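
Their statistic is built from estimated classification probabilities rather than hard accuracy. What follows is not their exact construction, but one way to sketch the idea: score every point out-of-fold, then ask whether the scores of the two groups differ, here with a rank test. It continues with `X` and `y` from the earlier snippet:

```python
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Out-of-fold probabilities: each point is scored by a model
# that never saw it during training
clf = RandomForestClassifier(random_state=0)
scores = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

# Under the null, the scores of the two groups are exchangeable
stat, p_value = mannwhitneyu(scores[y == 1], scores[y == 0])
print(f"Mann-Whitney U p-value: {p_value:.4f}")
```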

Practicing data scientists cannot afford to ignore two-sample classifier tests. They are powerful, easy to implement, and easy to explain: the elusive trifecta^{2}. Plenty of works advocate for the classifier tests (Friedman 2004; Vayatis, Depecker, and Clémençon 2009; Liu, Li, and Póczos 2018; Hediger, Michel, and Näf 2019). Still, they remain for the most part underappreciated. No one (anyone you know?) brags about them the way they would about, say, the newest kernel test on the block. I suspect it is because the theory seems *boring* in comparison. In practice, these guys pack a punch. This is worth praising. The two-sample classifier tests need more love. They deserve it.

Cai, Haiyan, Bryan Goggin, and Qingtang Jiang. 2020. “Two-Sample Test Based on Classification Probability.” *Statistical Analysis and Data Mining: The ASA Data Science Journal* 13 (1): 5–13.

Friedman, Jerome. 2004. “On Multivariate Goodness-of-Fit and Two-Sample Testing.” Stanford Linear Accelerator Center, Menlo Park, CA (US).

Hediger, Simon, Loris Michel, and Jeffrey Näf. 2019. “On the Use of Random Forest for Two-Sample Testing.” *arXiv Preprint arXiv:1903.06287*.

Kim, Ilmun, Aaditya Ramdas, Aarti Singh, and Larry Wasserman. 2021. “Classification Accuracy as a Proxy for Two-Sample Testing.” *The Annals of Statistics* 49 (1): 411–34.

Liu, Yusha, Chun-Liang Li, and Barnabás Póczos. 2018. “Classifier Two Sample Test for Video Anomaly Detections.” In *BMVC*, 71.

Lopez-Paz, David, and Maxime Oquab. 2017. “Revisiting Classifier Two-Sample Tests.” In *International Conference on Learning Representations*.

Pan, Wenliang, Yuan Tian, Xueqin Wang, and Heping Zhang. 2018. “Ball Divergence: Nonparametric Two Sample Test.” *Annals of Statistics* 46 (3): 1109–37.

Sejdinovic, Dino, Arthur Gretton, Bharath Sriperumbudur, and Kenji Fukumizu. 2012. “Hypothesis Testing Using Pairwise Distances and Associated Kernels.” In *29th International Conference on Machine Learning, ICML 2012*, 1111–18.

Shen, Cencheng, and Joshua T. Vogelstein. 2020. “The Exact Equivalence of Distance and Kernel Methods in Hypothesis Testing.” *AStA Advances in Statistical Analysis*, 1–19.

Vayatis, Nicolas, Marine Depecker, and Stéphan Clémençon. 2009. “AUC Optimization and the Two-Sample Problem.” *Advances in Neural Information Processing Systems* 22: 360–68.
