Comparing distributions: $\ell_1$ geometry improves kernel two-sample testing
Authors: Meyer Scetbon, Gaël Varoquaux
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on artificial and real-world problems demonstrate improved power/time tradeoff than the state of the art, based on $\ell_2$ norms, and in some cases, better outright power than even the most expensive quadratic-time tests. |
| Researcher Affiliation | Academia | Meyer Scetbon (CREST, ENSAE & Inria, Université Paris-Saclay); Gaël Varoquaux (Inria, Université Paris-Saclay) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/meyerscetbon/l1_two_sample_test. |
| Open Datasets | Yes | The code is available at https://github.com/meyerscetbon/l1_two_sample_test. ... Real Data 1, Higgs: The first real problem is the Higgs dataset [21] described in [2]... Real Data 2, Fastfood: We use a Kaggle dataset listing locations of over 10,000 fast food restaurants across America (footnote 4: www.kaggle.com/datafiniti/fast-food-restaurants). ... Real Data 3, text: For a high-dimension problem, we consider the problem of distinguishing the newsgroups text dataset [18]. |
| Dataset Splits | No | For the second and third real problem (Fast food and text datasets), samples are split randomly into train and test sets in each trial. ... We call L1-opt-ME and L1-opt-SCF the tests based respectively on mean embeddings and smooth characteristic functions proposed in this paper when optimizing test locations and the Gaussian width σ on a separate training set of the same size as the test set. The paper mentions training and test sets but does not specify validation splits or exact percentages/counts for the splits for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) used to replicate the experiment. |
| Experiment Setup | Yes | We set $\alpha = 0.01$. The regularization parameter is set to $\gamma_{N_1,N_2} = 10^{-5}$. ... For the ME-based tests, we initialize the test locations with realizations from two multivariate normal distributions fitted to samples from P and Q, and for the initialization of the SCF-based tests, we use the standard normal distribution. |
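
The split and initialization steps quoted in the table can be illustrated with a minimal NumPy sketch. It is not taken from the authors' repository. The function names, the even division of test locations between the two fitted Gaussians, and the 50/50 train/test split are assumptions made for illustration; the quoted text only says that the training set has the same size as the test set and that the ME locations come from normal distributions fitted to samples from P and Q.

```python
import numpy as np


def split_train_test(X, Y, rng):
    """Randomly split each sample into train and test halves of equal size.

    Assumption: a 50/50 split; the quoted text gives no exact counts.
    """
    def halves(Z):
        idx = rng.permutation(len(Z))
        half = len(Z) // 2
        return Z[idx[:half]], Z[idx[half:2 * half]]

    X_tr, X_te = halves(X)
    Y_tr, Y_te = halves(Y)
    return (X_tr, Y_tr), (X_te, Y_te)


def init_locations_me(X, Y, n_locations, rng):
    """Initial ME test locations drawn from Gaussians fitted to X and to Y.

    Assumption: locations are shared as evenly as possible between the
    two fitted distributions.
    """
    n_from_x = n_locations // 2
    n_from_y = n_locations - n_from_x
    draws = []
    for Z, n in ((X, n_from_x), (Y, n_from_y)):
        if n == 0:
            continue
        mean = Z.mean(axis=0)
        # atleast_2d keeps the covariance a matrix when the data is 1-D.
        cov = np.atleast_2d(np.cov(Z, rowvar=False))
        draws.append(rng.multivariate_normal(mean, cov, size=n))
    return np.vstack(draws)


def init_locations_scf(dim, n_locations, rng):
    """Initial SCF test locations drawn from the standard normal distribution."""
    return rng.standard_normal((n_locations, dim))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # stand-in for samples from P
Y = rng.normal(loc=1.0, size=(200, 3))   # stand-in for samples from Q
(X_tr, Y_tr), (X_te, Y_te) = split_train_test(X, Y, rng)
T_me = init_locations_me(X_tr, Y_tr, 5, rng)
T_scf = init_locations_scf(X.shape[1], 5, rng)
```

In this sketch the locations would then be optimized on the training halves only, and the test statistic computed on the held-out halves, matching the description of the L1-opt-ME and L1-opt-SCF variants in the Dataset Splits row.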