Comparing distributions: $\ell_1$ geometry improves kernel two-sample testing

Authors: Meyer Scetbon, Gaël Varoquaux

NeurIPS 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments on artificial and real-world problems demonstrate improved power/time tradeoff than the state of the art, based on $\ell_2$ norms, and in some cases, better outright power than even the most expensive quadratic-time tests. |
| Researcher Affiliation | Academia | Meyer Scetbon: CREST, ENSAE & Inria, Université Paris-Saclay. Gaël Varoquaux: Inria, Université Paris-Saclay. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/meyerscetbon/l1_two_sample_test. |
| Open Datasets | Yes | The code is available at https://github.com/meyerscetbon/l1_two_sample_test. ... Real Data 1, Higgs: The first real problem is the Higgs dataset [21] described in [2]... Real Data 2, Fastfood: We use a Kaggle dataset listing locations of over 10,000 fast food restaurants across America (footnote 4: www.kaggle.com/datafiniti/fast-food-restaurants). ... Real Data 3, text: For a high-dimension problem, we consider the problem of distinguishing the newsgroups text dataset [18] |
| Dataset Splits | No | For the second and third real problem (Fast food and text datasets), samples are split randomly into train and test sets in each trial. ... We call L1-opt-ME and L1-opt-SCF the tests based respectively on mean embeddings and smooth characteristic functions proposed in this paper when optimizing test locations and the Gaussian width σ on a separate training set of the same size as the test set. The paper mentions training and test sets but does not specify validation splits or exact percentages/counts for the splits for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) used to replicate the experiment. |
| Experiment Setup | Yes | We set $\alpha = 0.01$. The regularization parameter is set to $\gamma_{N_1,N_2} = 10^{-5}$. ... For the ME-based tests, we initialize the test locations with realizations from two multivariate normal distributions fitted to samples from P and Q, and for the initialization of the SCF-based tests, we use the standard normal distribution. |
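
The split and initialization steps quoted in the Dataset Splits and Experiment Setup rows can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`split_half`, `init_me_locations`, `init_scf_locations`), the choice to draw half of the locations from each fitted Gaussian, and the small diagonal jitter added to the covariance are all hypothetical. Only the constants 0.01 and 10^-5 and the general recipe come from the quoted text.

```python
import numpy as np

# Values quoted in the Experiment Setup row (names are illustrative).
ALPHA = 0.01   # test level
GAMMA = 1e-5   # regularization parameter

def split_half(Z, rng):
    """Randomly split a sample into train and test parts of equal size."""
    idx = rng.permutation(len(Z))
    half = len(Z) // 2
    return Z[idx[:half]], Z[idx[half:2 * half]]

def init_me_locations(X_tr, Y_tr, n_locations, rng):
    """Draw initial ME test locations from Gaussians fitted to each sample.

    Assumption: roughly half of the locations come from each fitted Gaussian.
    """
    d = X_tr.shape[1]
    n_x = n_locations // 2
    n_y = n_locations - n_x
    parts = []
    for Z, k in ((X_tr, n_x), (Y_tr, n_y)):
        mean = Z.mean(axis=0)
        # Assumption: tiny jitter keeps the fitted covariance positive definite.
        cov = np.atleast_2d(np.cov(Z, rowvar=False)) + 1e-8 * np.eye(d)
        parts.append(rng.multivariate_normal(mean, cov, size=k))
    return np.vstack(parts)

def init_scf_locations(n_locations, d, rng):
    """Draw initial SCF test frequencies from the standard normal distribution."""
    return rng.standard_normal((n_locations, d))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))            # synthetic sample standing in for P
    Y = rng.normal(loc=1.0, size=(200, 3))   # synthetic sample standing in for Q
    X_tr, X_te = split_half(X, rng)
    Y_tr, Y_te = split_half(Y, rng)
    T_me = init_me_locations(X_tr, Y_tr, n_locations=5, rng=rng)
    T_scf = init_scf_locations(5, X.shape[1], rng)
    print(T_me.shape, T_scf.shape)
```

The initial locations would then be refined on the training half, with the test itself run on the held-out half at level `ALPHA`. That optimization step is not shown here because the quoted text does not describe it in enough detail.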