In gentle praise of classifier tests

Two-sample statistical tests based on modern high-capacity classifiers are powerful yet still underappreciated.

Vathy M. Kamulete

The late, great statistician Sir David Cox gave a talk titled “In gentle praise of significance tests.” It is, of course, only gentle praise because, inevitably, there are some issues. I too have a bone to pick. But we will get to the beef much later. First, the praise.

Remember the old two-sample tests? They go by different names: test of equal distribution, of homogeneity, of goodness-of-fit, etc. The gist of it is you have two samples. And so, you wonder, are they similar enough? Rings a bell, right? The Kolmogorov–Smirnov test is a classic for this sort of thing. But the K-S test is old news: it dates from well before the advent of big, high-dimensional data 😱.

Since then, the field has been churning out powerful two-sample tests for the Big Data era. Every year, exciting work is published on the subject. I recently messed about with the Ball divergence approach (Pan et al. 2018). You, no doubt, chose your own adventure. There has been tremendous progress. Let Uncle Larry explain[1]. And now you are, more or less, caught up with the state of the art in two-sample tests. The End.

OK, not so fast. You see, Uncle Larry left out something important. He touches on three ideas: kernel, energy, and cross-match tests. Trust me when I say, he wants you to know about the classifier tests too (Kim et al. 2021). Here is how they work. You assign your samples to different classes (you give them labels). Sample 1 is the positive class and Sample 2 the negative one, or vice versa. Then, you fit your favorite classifier to see if it can reliably predict the label on data it has not seen. If it can, it means the two samples are probably different. If it cannot, the two samples are similar enough. It seems obvious in hindsight, does it not? It is not even deceptively simple: it is actually simple.
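The recipe above can be sketched in a few lines of Python. Treat this as a minimal sketch, not the implementation from any of the cited papers: the function names `knn_predict` and `classifier_two_sample_test` are hypothetical, a k-nearest-neighbor vote stands in for "your favorite classifier", and the binomial \(p\)-value assumes equal-sized samples, so that under the null each held-out guess is roughly a coin flip.

```python
import math
import random


def knn_predict(train_x, train_y, point, k=5):
    """Predict a 0/1 label by majority vote among the k nearest training points."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], point),
    )[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def classifier_two_sample_test(sample_1, sample_2, k=5, seed=0):
    """Return (held-out accuracy, one-sided p-value) for H0: same distribution.

    Each sample is a list of equal-length tuples of floats.
    """
    if len(sample_1) != len(sample_2):
        # Assumption of this sketch: with unequal sizes, chance accuracy is not 1/2.
        raise ValueError("this sketch assumes equal-sized samples")
    rng = random.Random(seed)
    # Step 1: label the samples (1 for sample_1, 0 for sample_2) and shuffle.
    data = [(x, 1) for x in sample_1] + [(x, 0) for x in sample_2]
    rng.shuffle(data)
    # Step 2: split in half; "fit" on one half, evaluate on the other.
    half = len(data) // 2
    train, test = data[:half], data[half:]
    train_x = [x for x, _ in train]
    train_y = [y for _, y in train]
    # Step 3: count correct predictions on the held-out half.
    n_test = len(test)
    n_correct = sum(knn_predict(train_x, train_y, x, k) == y for x, y in test)
    # Step 4: under H0 the labels carry no information, so the number of correct
    # guesses is approximately Binomial(n_test, 1/2); take the upper tail.
    tail = sum(math.comb(n_test, j) for j in range(n_correct, n_test + 1))
    p_value = tail / 2**n_test
    return n_correct / n_test, p_value


if __name__ == "__main__":
    rng = random.Random(1)
    baseline = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
    shifted = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(100)]
    accuracy, p_value = classifier_two_sample_test(baseline, shifted)
    print(f"held-out accuracy = {accuracy:.2f}, p-value = {p_value:.3g}")
```

Accuracy near 50% says the classifier cannot tell the samples apart; accuracy well above 50% on held-out data is evidence that they differ. Swapping in a stronger classifier only changes `knn_predict`; the rest of the recipe stays the same.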

And yet, most people do not seem to know about two-sample classifier tests. In comparison, kernel tests seem ubiquitous (if you are looking at academic journals, not if you’re hanging out on Twitter). Even the energy tests, one of the innovations that Uncle Larry discusses, are in a fundamental sense equivalent to the kernel tests (Sejdinovic et al. 2012; Shen and Vogelstein 2020). It is kernels all the way down. On the theoretical front, wielding this awesome kernel power may require the mathematical maturity for arcane incantations of RKHS (reproducing kernel Hilbert space) voodoo magic. But no matter, word to Eric B. & Rakim, we won’t sweat the technique.

Still, this all raises the question. How do the humble classifier tests stack up against the more celebrated kernel tests? No spoiler from me. Let Lopez-Paz and Oquab (2017) do it:

Our take-home message is that modern binary classifiers can be easily turned into powerful two-sample tests. We have shown that these classifier two-sample tests set a new state-of-the-art in performance, and enjoy unique attractive properties: they are easy to implement, learn a representation of the data on the fly, have simple asymptotic distributions, and allow different ways to interpret how the two samples under study differ.

They do not often tell you this. You can in fact think of a kernel test as a classifier test. And you know what, data scientists are really good at supervised classification. Say \(p\)-values, eyes glaze over and no one is listening. Say 50% accuracy instead, suddenly everyone understands and is nodding vigorously. No RKHS magic needed – this is yuge.
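For the skeptics who still want a \(p\)-value, here is a sketch of how held-out accuracy turns into one, following the asymptotic argument in Lopez-Paz and Oquab (2017). The symbols are notation assumed here for illustration: \(n_{te}\) is the number of held-out points and \(\hat{t}\) is the held-out accuracy. The approximation assumes equal-sized samples. Under the null hypothesis that the two samples come from the same distribution, each held-out prediction is a coin flip, so

\[
n_{te}\,\hat{t} \sim \mathrm{Binomial}\left(n_{te}, \tfrac{1}{2}\right),
\qquad
\hat{t} \approx \mathcal{N}\left(\tfrac{1}{2}, \tfrac{1}{4\,n_{te}}\right).
\]

As a worked example with made-up numbers, 60% accuracy on 200 held-out points gives

\[
z = \frac{0.60 - 0.50}{\sqrt{1/(4 \cdot 200)}} \approx 2.83,
\]

which corresponds to a one-sided \(p\)-value of about 0.002. In other words, "60% accuracy on 200 fresh points" and "\(p \approx 0.002\)" are the same statement; the first is just easier to say out loud.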


More recently, Cai, Goggin, and Jiang (2020) hammer the point home:

We propose a test for two-sample problem based on estimates of classification probabilities obtained from a consistent classification algorithm. […] Our test is more powerful and efficient than many other tests.

Practicing data scientists cannot afford to ignore two-sample classifier tests. They are powerful, easy to implement, and easy to explain: the elusive trifecta[2]. Plenty of papers advocate for the classifier tests (Friedman 2004; Vayatis, Depecker, and Clémençon 2009; Liu, Li, and Póczos 2018; Hediger, Michel, and Näf 2019). Still, they remain for the most part underappreciated. No one (anyone you know?) brags about them the way they would about, say, the newest kernel test on the block. I suspect it is because the theory seems boring in comparison. In practice, these guys pack a punch. This is worth praising. The two-sample classifier tests need more love. They deserve it.

Cai, Haiyan, Bryan Goggin, and Qingtang Jiang. 2020. “Two-Sample Test Based on Classification Probability.” Statistical Analysis and Data Mining: The ASA Data Science Journal 13 (1): 5–13.
Friedman, Jerome. 2004. “On Multivariate Goodness-of-Fit and Two-Sample Testing.” Stanford Linear Accelerator Center, Menlo Park, CA (US).
Hediger, Simon, Loris Michel, and Jeffrey Näf. 2019. “On the Use of Random Forest for Two-Sample Testing.” arXiv Preprint arXiv:1903.06287.
Kim, Ilmun, Aaditya Ramdas, Aarti Singh, and Larry Wasserman. 2021. “Classification Accuracy as a Proxy for Two-Sample Testing.” The Annals of Statistics 49 (1): 411–34.
Liu, Yusha, Chun-Liang Li, and Barnabás Póczos. 2018. “Classifier Two Sample Test for Video Anomaly Detections.” In BMVC, 71.
Lopez-Paz, David, and Maxime Oquab. 2017. “Revisiting Classifier Two-Sample Tests.” In International Conference on Learning Representations.
Pan, Wenliang, Yuan Tian, Xueqin Wang, and Heping Zhang. 2018. “Ball Divergence: Nonparametric Two Sample Test.” The Annals of Statistics 46 (3): 1109.
Sejdinovic, Dino, Arthur Gretton, Bharath Sriperumbudur, and Kenji Fukumizu. 2012. “Hypothesis Testing Using Pairwise Distances and Associated Kernels.” In 29th International Conference on Machine Learning, ICML 2012, 1111–18.
Shen, Cencheng, and Joshua T Vogelstein. 2020. “The Exact Equivalence of Distance and Kernel Methods in Hypothesis Testing.” AStA Advances in Statistical Analysis, 1–19.
Vayatis, Nicolas, Marine Depecker, and Stéphan Clémençon. 2009. “AUC Optimization and the Two-Sample Problem.” Advances in Neural Information Processing Systems 22: 360–68.

  1. Click on that link. I won’t be offended; I will, in fact, happily wait.

  2. Think the son, the father and the holy spirit, the divine trinity.





Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".


For attribution, please cite this work as

Kamulete (2022, Jan. 22). Vathy M. Kamulete: In gentle praise of classifier tests. Retrieved from

BibTeX citation

  author = {Kamulete, Vathy M.},
  title = {Vathy M. Kamulete: In gentle praise of classifier tests},
  url = {},
  year = {2022}