A note on the evaluation of generative models
Authors: Lucas Theis, Aäron van den Oord, Matthias Bethge
ICLR 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In particular, we show that three of the currently most commonly used criteria (average log-likelihood, Parzen window estimates, and visual fidelity of samples) are largely independent of each other when the data is high-dimensional. Good performance with respect to one criterion therefore need not imply good performance with respect to the others. Our results show that extrapolation from one criterion to another is not warranted, and generative models need to be evaluated directly with respect to the application(s) they were intended for. In addition, we provide examples demonstrating that Parzen window estimates should generally be avoided. [A minimal Parzen window sketch appears after the table.] |
| Researcher Affiliation | Collaboration | Lucas Theis, University of Tübingen, 72072 Tübingen, Germany, lucas@bethgelab.org; Aäron van den Oord, Ghent University, 9000 Ghent, Belgium, aaron.vandenoord@ugent.be; Matthias Bethge, University of Tübingen, 72072 Tübingen, Germany, matthias@bethgelab.org. These authors contributed equally to this work. Now at Google DeepMind. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the authors are releasing the source code for the methodology described in the paper. |
| Open Datasets | Yes | Figure 2: A: Two examples demonstrating that small changes of an image can lead to large changes in Euclidean distance, affecting the choice of nearest neighbor. The images shown represent the query image shifted by between 1 and 4 pixels (left column, top to bottom) and the corresponding nearest neighbor from the training set (right column). The gray lines indicate the Euclidean distance of the query image to 100 randomly picked images from the training set. B: Fraction of query images assigned to the correct training image. The average was estimated from 1,000 images. Dashed lines indicate a 90% confidence interval. ... of the CIFAR-10 dataset. ... Table 1: Using Parzen window estimates to evaluate various models trained on MNIST, samples from the true distribution perform worse than samples from a simple model trained with k-means. [A toy version of the shift experiment is sketched after the table.] |
| Dataset Splits | No | The paper mentions training data but does not specify details about validation splits, percentages, or methodology for data partitioning beyond stating the use of standard datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | Figure 1: An isotropic Gaussian distribution was fit to data drawn from a mixture of Gaussians by either minimizing Kullback-Leibler divergence (KLD), maximum mean discrepancy (MMD), or Jensen-Shannon divergence (JSD). The different fits demonstrate different tradeoffs made by the three measures of distance between distributions. ... Parameters were initialized at the maximum likelihood solution in all cases, but the same optimum was consistently found using random initializations. ... In Figure 3 we plot Parzen window estimates for a multivariate Gaussian distribution fit to small CIFAR-10 image patches (of size 6 by 6). We added uniform noise to the data (as explained in Section 3.1) and rescaled between 0 and 1. ... To illustrate this, we fitted 10,000 centroids to the training data using k-means. We then generated 10,000 independent samples by sampling centroids with replacement. [Sketches of the divergence fit and the k-means sampler follow the table.] |
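To make the Parzen window criterion discussed above concrete, here is a minimal sketch of the estimator: a kernel density estimate built from model samples, scored as an average log-likelihood on held-out data. The paper does not publish code, so the function name, array shapes, and the isotropic Gaussian kernel with a single bandwidth `sigma` (which would typically be tuned on a validation set) are all assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(samples, test_data, sigma):
    """Average log-likelihood of test_data under an isotropic Gaussian
    Parzen window (kernel density) estimate built from model samples.

    samples:   (n, d) samples drawn from the model under evaluation
    test_data: (m, d) held-out data points
    sigma:     kernel bandwidth, typically tuned on a validation set
    """
    n, d = samples.shape
    # Squared Euclidean distance between every test point and every sample
    diffs = test_data[:, None, :] - samples[None, :, :]  # (m, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)               # (m, n)
    # log p(x) = logsumexp_i[ -||x - s_i||^2 / (2 sigma^2) ] - log n
    #            - (d/2) log(2 pi sigma^2)
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return np.mean(logsumexp(-sq_dists / (2 * sigma ** 2), axis=1) - log_norm)
```

For realistic sample counts the (m, n, d) intermediate should be computed in batches; it is written out in one step here for readability.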
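Figure 2's observation, that Euclidean nearest neighbors can flip under shifts of only a few pixels, can be probed with a toy reconstruction. This sketch is hypothetical: the paper shifts query images by 1 to 4 pixels, but the wrap-around border handling and the flattened image layout below are assumptions.

```python
import numpy as np

def nearest_neighbor_after_shift(query, train_images, shift):
    """Shift a query image horizontally and return the index of its
    Euclidean nearest neighbor in the (flattened) training set.

    query:        (H, W) image
    train_images: (n, H * W) flattened training images
    shift:        horizontal shift in pixels (wrap-around for simplicity)
    """
    shifted = np.roll(query, shift, axis=1).ravel()
    sq_dists = np.sum((train_images - shifted) ** 2, axis=1)
    return int(np.argmin(sq_dists))

# Mirror Figure 2B: fraction of shifted queries still matched to their own
# training image. On i.i.d. noise almost every match breaks; on natural
# images the failure is partial, which is the paper's point.
rng = np.random.default_rng(0)
train = rng.random((1000, 32 * 32))
hits = sum(
    nearest_neighbor_after_shift(train[i].reshape(32, 32), train, shift=2) == i
    for i in range(100)
)
print(f"correct matches after a 2-pixel shift: {hits}/100")
```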
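Figure 1's experiment, fitting an isotropic Gaussian to a mixture of Gaussians by minimizing KLD, MMD, or JSD, can be sketched for the MMD case. Only the initialization at the maximum likelihood solution follows the paper; the Gaussian kernel and its bandwidth, the reparameterized samples with fixed noise, and the Nelder-Mead optimizer are implementation assumptions. (The KLD fit needs no optimizer at all: for a Gaussian it reduces to the moment-matching initialization below.)

```python
import numpy as np
from scipy.optimize import minimize

def mmd2(x, y, bw=1.0):
    """Biased estimate of squared maximum mean discrepancy between two
    samples, using a Gaussian kernel with bandwidth bw."""
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * bw ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def fit_isotropic_gaussian_mmd(data, n_model_samples=500, bw=1.0, seed=0):
    """Fit the mean and (log) scale of an isotropic Gaussian by minimizing
    MMD between the data and reparameterized model samples."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    # Fixed noise makes the objective deterministic in the parameters
    eps = rng.standard_normal((n_model_samples, d))

    def objective(params):
        mu, log_s = params[:d], params[d]
        return mmd2(data, mu + np.exp(log_s) * eps, bw)

    # Initialize at the maximum likelihood (moment-matching) solution,
    # as in the paper
    x0 = np.concatenate([data.mean(axis=0), [np.log(data.std())]])
    res = minimize(objective, x0, method="Nelder-Mead")
    return res.x[:d], np.exp(res.x[d])
```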
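Finally, the adversarial baseline behind Table 1, "sampling" by drawing fitted k-means centroids with replacement, takes only a few lines. This assumes scikit-learn's KMeans; the function name and the reduced `n_init` are choices made here, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_sampler(train_data, n_centroids=10_000, n_samples=10_000, seed=0):
    """Fit k-means centroids to the training data, then 'sample' by drawing
    centroids with replacement -- a trivial lookup-table model."""
    km = KMeans(n_clusters=n_centroids, n_init=1, random_state=seed)
    km.fit(train_data)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n_centroids, size=n_samples)
    return km.cluster_centers_[idx]
```

Scoring these pseudo-samples with `parzen_log_likelihood` above reproduces the qualitative point of Table 1: a model that merely memorizes training points can outscore samples from the true distribution under Parzen window estimates.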