AutoML Two-Sample Test

Authors: Jonas M. Kübler, Vincent Stimper, Simon Buchholz, Krikamol Muandet, Bernhard Schölkopf

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively study the empirical performance of our approach, first by considering the two low-dimensional datasets Blob and Higgs, followed by running a large benchmark on a variety of distribution shifts on MNIST and CIFAR10 data. We observe very competitive performance without any manual adjustment of hyperparameters. Our experiments also show that a continuous witness outperforms commonly used binary classifiers (Section 5).
Researcher Affiliation | Academia | 1 Max Planck Institute for Intelligent Systems, Tübingen, Germany; 2 CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Pseudocode | Yes | Figure 1: AutoML two-sample test: S_P, S_Q denote the available data from P and Q, which is first split into two parts of equal size. A witness h : X → R is trained using a (weighted) squared loss, Eq. (6), denoted by MSE, using AutoML to maximize predictive performance. Users can easily control important properties, for example the maximal runtime t_max. The test statistic τ is the difference in means on the test sets. Permuting the data and recomputing τ allows the estimation of the p-value. The null hypothesis P = Q is rejected if p ≤ α. (A minimal code sketch of this pipeline is given after the table.)
Open Source Code | Yes | We provide the Python package autotst implementing our testing pipeline.
Open Datasets | Yes | We repeat their experiments by considering the datasets MNIST [LeCun et al., 2010] and CIFAR10 [Krizhevsky, 2009].
Dataset Splits | Yes | The data is split into two equally sized parts, since this is the standard approach [Lopez-Paz and Oquab, 2017; Liu et al., 2020]. We label data from P with 1, data from Q with 0, and fit a least-squares regression with AutoGluon's TabularPredictor [Erickson et al., 2020].
Hardware Specification | No | All experiments were done on servers having only CPUs, and we spent around 100k CPU hours on all the experiments reported in the paper, mainly because we ran various configurations and many repetitions for all the test cases we consider.
Software Dependencies | No | The paper mentions using AutoGluon's TabularPredictor and the Python package autotst but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We use the configuration presets="best_quality" and by default optimize with a five-minute time limit. For more details, we refer to the AutoGluon documentation. We run all experiments with significance level α = 5%. (A configuration sketch follows the pipeline sketch below.)
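
The pipeline quoted in the Pseudocode row maps onto a few lines of code. The following is a minimal sketch, not the autotst implementation: it follows the quoted steps (equal split, squared-loss witness on 0/1 labels, difference-in-means statistic, permutation p-value), but substitutes scikit-learn's GradientBoostingRegressor for the AutoML fit; the function and variable names (automl_two_sample_test, S_P, S_Q) are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def automl_two_sample_test(S_P, S_Q, n_permutations=1000, alpha=0.05, seed=0):
    """Sketch of the AutoML two-sample test pipeline (Figure 1).

    S_P, S_Q: arrays of shape (n, d) with samples from P and Q.
    Returns (p_value, reject).
    """
    rng = np.random.default_rng(seed)

    # 1. Split each sample into two equally sized halves (train / test).
    P_tr, P_te = np.array_split(S_P, 2)
    Q_tr, Q_te = np.array_split(S_Q, 2)

    # 2. Train the witness h: label P with 1 and Q with 0, then fit a
    #    regressor minimizing the squared loss. (The paper uses AutoGluon;
    #    a gradient-boosted regressor stands in here.)
    X_tr = np.vstack([P_tr, Q_tr])
    y_tr = np.concatenate([np.ones(len(P_tr)), np.zeros(len(Q_tr))])
    witness = GradientBoostingRegressor().fit(X_tr, y_tr)

    # 3. Test statistic tau: difference of mean witness values on the
    #    held-out test halves.
    h_P = witness.predict(P_te)
    h_Q = witness.predict(Q_te)
    tau = h_P.mean() - h_Q.mean()

    # 4. Permutation null: since the witness is fixed after training,
    #    permuting the pooled test points and recomputing tau reduces to
    #    permuting the pooled witness values.
    pooled = np.concatenate([h_P, h_Q])
    n_P = len(h_P)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(pooled)
        if perm[:n_P].mean() - perm[n_P:].mean() >= tau:
            count += 1
    p_value = (count + 1) / (n_permutations + 1)

    # 5. Reject the null hypothesis P = Q if p <= alpha.
    return p_value, p_value <= alpha
```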
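For the witness fit itself, the quoted setup corresponds to AutoGluon's public TabularPredictor API. The sketch below reflects the reported configuration (regression on 0/1 labels, presets="best_quality", five-minute time limit); the synthetic data, column names, and sample sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np
import pandas as pd
from autogluon.tabular import TabularPredictor

rng = np.random.default_rng(0)

# Illustrative training half: features from P labeled 1, from Q labeled 0.
P_tr = rng.normal(0.0, 1.0, size=(500, 5))
Q_tr = rng.normal(0.2, 1.0, size=(500, 5))
train_df = pd.DataFrame(np.vstack([P_tr, Q_tr]),
                        columns=[f"x{i}" for i in range(5)])
train_df["label"] = np.concatenate([np.ones(len(P_tr)), np.zeros(len(Q_tr))])

# Least-squares regression witness with the settings reported in the paper:
# presets="best_quality" and a five-minute (300 s) time limit.
predictor = TabularPredictor(label="label", problem_type="regression").fit(
    train_df,
    presets="best_quality",
    time_limit=300,
)

# The fitted predictor serves as the witness h; evaluating it on held-out
# data from P and Q yields the test statistic as in the pipeline sketch above.
```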