A note on the evaluation of generative models
Authors: Lucas Theis, Aäron van den Oord, Matthias Bethge
ICLR 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In particular, we show that three of the currently most commonly used criteria (average log-likelihood, Parzen window estimates, and visual fidelity of samples) are largely independent of each other when the data is high-dimensional. Good performance with respect to one criterion therefore need not imply good performance with respect to the others. Our results show that extrapolation from one criterion to another is not warranted, and generative models need to be evaluated directly with respect to the application(s) they were intended for. In addition, we provide examples demonstrating that Parzen window estimates should generally be avoided. [A minimal Parzen window sketch appears after the table.] |
| Researcher Affiliation | Collaboration | Lucas Theis, University of Tübingen, 72072 Tübingen, Germany, lucas@bethgelab.org; Aäron van den Oord, Ghent University, 9000 Ghent, Belgium, aaron.vandenoord@ugent.be; Matthias Bethge, University of Tübingen, 72072 Tübingen, Germany, matthias@bethgelab.org. These authors contributed equally to this work. Now at Google DeepMind. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the authors are releasing the source code for the methodology described in the paper. |
| Open Datasets | Yes | Figure 2: A: Two examples demonstrating that small changes of an image can lead to large changes in Euclidean distance, affecting the choice of nearest neighbor. The images shown represent the query image shifted by between 1 and 4 pixels (left column, top to bottom) and the corresponding nearest neighbor from the training set (right column). The gray lines indicate the Euclidean distance of the query image to 100 randomly picked images from the training set. B: Fraction of query images assigned to the correct training image. The average was estimated from 1,000 images. Dashed lines indicate a 90% confidence interval. ... of the CIFAR-10 dataset. ... Table 1: Using Parzen window estimates to evaluate various models trained on MNIST, samples from the true distribution perform worse than samples from a simple model trained with k-means. [A toy version of the shift experiment is sketched after the table.] |
| Dataset Splits | No | The paper mentions training data but does not specify details about validation splits, percentages, or methodology for data partitioning beyond stating the use of standard datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | Figure 1: An isotropic Gaussian distribution was fit to data drawn from a mixture of Gaussians by either minimizing Kullback-Leibler divergence (KLD), maximum mean discrepancy (MMD), or Jensen-Shannon divergence (JSD). The different fits demonstrate different tradeoffs made by the three measures of distance between distributions. ... Parameters were initialized at the maximum likelihood solution in all cases, but the same optimum was consistently found using random initializations. ... In Figure 3 we plot Parzen window estimates for a multivariate Gaussian distribution fit to small CIFAR-10 image patches (of size 6 by 6). We added uniform noise to the data (as explained in Section 3.1) and rescaled between 0 and 1. ... To illustrate this, we fitted 10,000 centroids to the training data using k-means. We then generated 10,000 independent samples by sampling centroids with replacement. [Sketches of the divergence fit and the k-means sampler follow the table.] |
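To make the Parzen window criterion discussed above concrete, here is a minimal sketch of the estimator: a kernel density estimate built from model samples, scored as an average log-likelihood on held-out data. The paper does not publish code, so the function name, array shapes, and the isotropic Gaussian kernel with a single bandwidth `sigma` (which would typically be tuned on a validation set) are all assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(samples, test_data, sigma):
    """Average log-likelihood of test_data under an isotropic Gaussian
    Parzen window (kernel density) estimate built from model samples.

    samples:   (n, d) samples drawn from the model under evaluation
    test_data: (m, d) held-out data points
    sigma:     kernel bandwidth, typically tuned on a validation set
    """
    n, d = samples.shape
    # Squared Euclidean distance between every test point and every sample
    diffs = test_data[:, None, :] - samples[None, :, :]  # (m, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)               # (m, n)
    # log p(x) = logsumexp_i[ -||x - s_i||^2 / (2 sigma^2) ] - log n
    #            - (d/2) log(2 pi sigma^2)
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return np.mean(logsumexp(-sq_dists / (2 * sigma ** 2), axis=1) - log_norm)
```

For realistic sample counts the (m, n, d) intermediate should be computed in batches; it is written out in one step here for readability.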
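Figure 2's observation, that Euclidean nearest neighbors can flip under shifts of only a few pixels, can be probed with a toy reconstruction. This sketch is hypothetical: the paper shifts query images by 1 to 4 pixels, but the wrap-around border handling and the flattened image layout below are assumptions.

```python
import numpy as np

def nearest_neighbor_after_shift(query, train_images, shift):
    """Shift a query image horizontally and return the index of its
    Euclidean nearest neighbor in the (flattened) training set.

    query:        (H, W) image
    train_images: (n, H * W) flattened training images
    shift:        horizontal shift in pixels (wrap-around for simplicity)
    """
    shifted = np.roll(query, shift, axis=1).ravel()
    sq_dists = np.sum((train_images - shifted) ** 2, axis=1)
    return int(np.argmin(sq_dists))

# Mirror Figure 2B: fraction of shifted queries still matched to their own
# training image. On i.i.d. noise almost every match breaks; on natural
# images the failure is partial, which is the paper's point.
rng = np.random.default_rng(0)
train = rng.random((1000, 32 * 32))
hits = sum(
    nearest_neighbor_after_shift(train[i].reshape(32, 32), train, shift=2) == i
    for i in range(100)
)
print(f"correct matches after a 2-pixel shift: {hits}/100")
```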
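Figure 1's experiment, fitting an isotropic Gaussian to a mixture of Gaussians by minimizing KLD, MMD, or JSD, can be sketched for the MMD case. Only the initialization at the maximum likelihood solution follows the paper; the Gaussian kernel and its bandwidth, the reparameterized samples with fixed noise, and the Nelder-Mead optimizer are implementation assumptions. (The KLD fit needs no optimizer at all: for a Gaussian it reduces to the moment-matching initialization below.)

```python
import numpy as np
from scipy.optimize import minimize

def mmd2(x, y, bw=1.0):
    """Biased estimate of squared maximum mean discrepancy between two
    samples, using a Gaussian kernel with bandwidth bw."""
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * bw ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def fit_isotropic_gaussian_mmd(data, n_model_samples=500, bw=1.0, seed=0):
    """Fit the mean and (log) scale of an isotropic Gaussian by minimizing
    MMD between the data and reparameterized model samples."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    # Fixed noise makes the objective deterministic in the parameters
    eps = rng.standard_normal((n_model_samples, d))

    def objective(params):
        mu, log_s = params[:d], params[d]
        return mmd2(data, mu + np.exp(log_s) * eps, bw)

    # Initialize at the maximum likelihood (moment-matching) solution,
    # as in the paper
    x0 = np.concatenate([data.mean(axis=0), [np.log(data.std())]])
    res = minimize(objective, x0, method="Nelder-Mead")
    return res.x[:d], np.exp(res.x[d])
```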
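Finally, the adversarial baseline behind Table 1, "sampling" by drawing fitted k-means centroids with replacement, takes only a few lines. This assumes scikit-learn's KMeans; the function name and the reduced `n_init` are choices made here, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_sampler(train_data, n_centroids=10_000, n_samples=10_000, seed=0):
    """Fit k-means centroids to the training data, then 'sample' by drawing
    centroids with replacement -- a trivial lookup-table model."""
    km = KMeans(n_clusters=n_centroids, n_init=1, random_state=seed)
    km.fit(train_data)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n_centroids, size=n_samples)
    return km.cluster_centers_[idx]
```

Scoring these pseudo-samples with `parzen_log_likelihood` above reproduces the qualitative point of Table 1: a model that merely memorizes training points can outscore samples from the true distribution under Parzen window estimates.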