Informative Features for Model Comparison
Authors: Wittawat Jitkrittum, Heishiro Kanagawa, Patsorn Sangkloy, James Hays, Bernhard Schölkopf, Arthur Gretton
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the two proposed tests on both toy and real problems. We start with an illustration of the behaviors of Rel-UME's and Rel-FSSD's power criteria using simple one-dimensional problems. In the second experiment, we examine the test powers of the two proposed tests using three toy problems. In the third experiment, we compare two hypothetical generative models on the CIFAR-10 dataset [Krizhevsky and Hinton, 2009] and demonstrate that the learned test locations (images) can clearly indicate the types of images that are better modeled by one of the two candidate models. In the last two experiments, we consider the problem of determining the relative goodness of fit of two given Generative Adversarial Networks (GANs) [Goodfellow et al., 2014]. |
| Researcher Affiliation | Academia | Wittawat Jitkrittum, Max Planck Institute for Intelligent Systems (wittawat@tuebingen.mpg.de); Heishiro Kanagawa, Gatsby Unit, UCL (heishirok@gatsby.ucl.ac.uk); Patsorn Sangkloy, Georgia Institute of Technology (patsorn_sangkloy@gatech.edu); James Hays, Georgia Institute of Technology (hays@gatech.edu); Bernhard Schölkopf, Max Planck Institute for Intelligent Systems (bernhard.schoelkopf@tuebingen.mpg.de); Arthur Gretton, Gatsby Unit, UCL (arthur.gretton@gmail.com) |
| Pseudocode | No | The paper describes the proposed methods mathematically and in text but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to reproduce all the results is available at https://github.com/wittawatj/kernel-mod. |
| Open Datasets | Yes | In the third experiment, we compare two hypothetical generative models on the CIFAR-10 dataset [Krizhevsky and Hinton, 2009]... We consider the CelebA dataset [Liu et al., 2015]... We set P and Q to be LSGAN models after 15 epochs and 17 epochs of training, respectively. Details regarding the network architecture, training, and the kernel (chosen to be a Gaussian kernel on features extracted from a convolutional network) can be found in Section D. Samples from P and Q are shown in Figures 5a and 5b (see Figure 8 in the appendix for more samples). |
| Dataset Splits | Yes | To control the false rejection rate, the maximization is carried out on held-out training data which are independent of the data used for testing. In the experiments (Section 4), we hold out 20% of the data for the optimization. A set of 30000 images from each subset is held out for subsequent relative goodness-of-fit testing, and the rest are used for training two GAN models. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments, such as GPU models, CPU types, or cloud computing instances. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments, such as programming languages or machine learning frameworks. |
| Experiment Setup | Yes | The sample size is n = 2000, and each problem variation is repeated for 10 trials for FID (due to its high complexity) and 100 trials for other methods. We set the test locations V to be the set Vi containing J = 40 randomly selected real images of digit i, for i ∈ {0, ..., 9}. We then draw n = 2000 points from P, Q and the real data (R), and use V = Vi to compute the power criterion for i ∈ {0, ..., 9}. The procedure is repeated for 200 trials where V and the samples are redrawn each time. |
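The quoted setup revolves around plug-in estimates of the UME² discrepancy between each model and the reference data at J test locations, with the relative test built on the difference of the two discrepancies. The following is a minimal NumPy sketch of that core computation; the function names, the fixed Gaussian bandwidth, and the unstudentized difference statistic are illustrative assumptions, not the authors' kernel-mod API (which optimizes the test locations on the held-out 20% split and normalizes the statistic before thresholding).

```python
# Minimal sketch of the Rel-UME relative goodness-of-fit statistic:
# compare samples from two models P and Q against reference data R
# at J test locations V, as described in the experiment setup above.
import numpy as np

def gauss_kernel(X, V, sigma2):
    """Gaussian kernel matrix k(x, v) between samples X (n, d) and locations V (J, d)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(V**2, 1)[None, :] - 2.0 * X @ V.T
    return np.exp(-sq / (2.0 * sigma2))

def ume2(X, R, V, sigma2):
    """Plug-in estimate of UME^2(P, R): squared mean-embedding witness between
    model samples X and reference samples R, averaged over the J locations."""
    witness = gauss_kernel(X, V, sigma2).mean(0) - gauss_kernel(R, V, sigma2).mean(0)
    return np.mean(witness**2)

def rel_ume_stat(X_p, X_q, R, V, sigma2):
    """Difference statistic for H0: UME^2(P, R) <= UME^2(Q, R);
    a large positive value is evidence that Q fits R better than P."""
    return ume2(X_p, R, V, sigma2) - ume2(X_q, R, V, sigma2)

rng = np.random.default_rng(0)
n, d, J = 2000, 1, 40                  # sample size and J test locations as in the paper
R   = rng.normal(0.0, 1.0, (n, d))     # reference data
X_p = rng.normal(1.0, 1.0, (n, d))     # samples from model P (worse fit, for illustration)
X_q = rng.normal(0.2, 1.0, (n, d))     # samples from model Q (better fit)
V   = rng.normal(0.0, 2.0, (J, d))     # random test locations; the paper instead
                                       # optimizes V on the held-out 20% split
print(rel_ume_stat(X_p, X_q, R, V, sigma2=1.0))  # positive => Q fits R better
```

In this toy run the statistic comes out positive because Q's samples are closer to the reference distribution than P's; the paper turns the analogous (studentized) quantity into a one-sided test against a Gaussian threshold, and the per-location contribution to the statistic is what drives the power criterion used to select informative test locations such as the digit sets Vi quoted above.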