Informative Features for Model Comparison
Authors: Wittawat Jitkrittum, Heishiro Kanagawa, Patsorn Sangkloy, James Hays, Bernhard Schölkopf, Arthur Gretton
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the two proposed tests on both toy and real problems. We start with an illustration of the behaviors of Rel-UME's and Rel-FSSD's power criteria using simple one-dimensional problems. In the second experiment, we examine the test powers of the two proposed tests using three toy problems. In the third experiment, we compare two hypothetical generative models on the CIFAR-10 dataset [Krizhevsky and Hinton, 2009] and demonstrate that the learned test locations (images) can clearly indicate the types of images that are better modeled by one of the two candidate models. In the last two experiments, we consider the problem of determining the relative goodness of fit of two given Generative Adversarial Networks (GANs) [Goodfellow et al., 2014]. |
| Researcher Affiliation | Academia | Wittawat Jitkrittum, Max Planck Institute for Intelligent Systems (wittawat@tuebingen.mpg.de); Heishiro Kanagawa, Gatsby Unit, UCL (heishirok@gatsby.ucl.ac.uk); Patsorn Sangkloy, Georgia Institute of Technology (patsorn_sangkloy@gatech.edu); James Hays, Georgia Institute of Technology (hays@gatech.edu); Bernhard Schölkopf, Max Planck Institute for Intelligent Systems (bernhard.schoelkopf@tuebingen.mpg.de); Arthur Gretton, Gatsby Unit, UCL (arthur.gretton@gmail.com) |
| Pseudocode | No | The paper describes the proposed methods mathematically and in text but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to reproduce all the results is available at https://github.com/wittawatj/kernel-mod. |
| Open Datasets | Yes | In the third experiment, we compare two hypothetical generative models on the CIFAR-10 dataset [Krizhevsky and Hinton, 2009]... We consider the CelebA dataset [Liu et al., 2015]... We set P and Q to be LSGAN models after 15 epochs and 17 epochs of training, respectively. Details regarding the network architecture, training, and the kernel (chosen to be a Gaussian kernel on features extracted from a convolutional network) can be found in Section D. Samples from P and Q are shown in Figures 5a and 5b (see Figure 8 in the appendix for more samples). |
| Dataset Splits | Yes | To control the false rejection rate, the maximization is carried out on held-out training data which are independent of the data used for testing. In the experiments (Section 4), we hold out 20% of the data for the optimization. A set of 30000 images from each subset is held out for subsequent relative goodness-of-fit testing, and the rest are used for training two GAN models. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments, such as GPU models, CPU types, or cloud computing instances. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments, such as programming languages or machine learning frameworks. |
| Experiment Setup | Yes | The sample size is n = 2000, and each problem variation is repeated for 10 trials for FID (due to its high complexity) and 100 trials for other methods. We set the test locations V to be the set Vi containing J = 40 randomly selected real images of digit i, for i ∈ {0, ..., 9}. We then draw n = 2000 points from P, Q and the real data (R), and use V = Vi to compute the power criterion for i ∈ {0, ..., 9}. The procedure is repeated for 200 trials where V and the samples are redrawn each time. |
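The quoted setup revolves around plug-in estimates of the UME² discrepancy between each model and the reference data at J test locations, with the relative test built on the difference of the two discrepancies. The following is a minimal NumPy sketch of that core computation; the function names, the fixed Gaussian bandwidth, and the unstudentized difference statistic are illustrative assumptions, not the authors' kernel-mod API (which optimizes the test locations on the held-out 20% split and normalizes the statistic before thresholding).

```python
# Minimal sketch of the Rel-UME relative goodness-of-fit statistic:
# compare samples from two models P and Q against reference data R
# at J test locations V, as described in the experiment setup above.
import numpy as np

def gauss_kernel(X, V, sigma2):
    """Gaussian kernel matrix k(x, v) between samples X (n, d) and locations V (J, d)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(V**2, 1)[None, :] - 2.0 * X @ V.T
    return np.exp(-sq / (2.0 * sigma2))

def ume2(X, R, V, sigma2):
    """Plug-in estimate of UME^2(P, R): squared mean-embedding witness between
    model samples X and reference samples R, averaged over the J locations."""
    witness = gauss_kernel(X, V, sigma2).mean(0) - gauss_kernel(R, V, sigma2).mean(0)
    return np.mean(witness**2)

def rel_ume_stat(X_p, X_q, R, V, sigma2):
    """Difference statistic for H0: UME^2(P, R) <= UME^2(Q, R);
    a large positive value is evidence that Q fits R better than P."""
    return ume2(X_p, R, V, sigma2) - ume2(X_q, R, V, sigma2)

rng = np.random.default_rng(0)
n, d, J = 2000, 1, 40                  # sample size and J test locations as in the paper
R   = rng.normal(0.0, 1.0, (n, d))     # reference data
X_p = rng.normal(1.0, 1.0, (n, d))     # samples from model P (worse fit, for illustration)
X_q = rng.normal(0.2, 1.0, (n, d))     # samples from model Q (better fit)
V   = rng.normal(0.0, 2.0, (J, d))     # random test locations; the paper instead
                                       # optimizes V on the held-out 20% split
print(rel_ume_stat(X_p, X_q, R, V, sigma2=1.0))  # positive => Q fits R better
```

In this toy run the statistic comes out positive because Q's samples are closer to the reference distribution than P's; the paper turns the analogous (studentized) quantity into a one-sided test against a Gaussian threshold, and the per-location contribution to the statistic is what drives the power criterion used to select informative test locations such as the digit sets Vi quoted above.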