Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples

Authors: Marco Jiralerspong, Joey Bose, Ian Gemp, Chongli Qin, Yoram Bachrach, Gauthier Gidel

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically demonstrate the ability of FLD to identify overfitting problem cases, even when previously proposed metrics fail. We also extensively evaluate FLD on various image datasets and model classes, demonstrating its ability to match intuitions of previous metrics like FID while offering a more comprehensive evaluation of generative models.
Researcher Affiliation	Collaboration	Marco Jiralerspong (Université de Montréal and Mila); Avishek (Joey) Bose (McGill University and Mila); Ian Gemp (Google DeepMind); Chongli Qin (Google DeepMind); Yoram Bachrach (Google DeepMind); Gauthier Gidel (Université de Montréal and Mila)
Pseudocode	Yes	Algorithm 1: Fitting MoGs for FLD
Open Source Code	Yes	Code is available at https://github.com/marcojira/fld.
Open Datasets	Yes	natural image benchmarks in CIFAR10 [Krizhevsky et al., 2014], FFHQ [Karras et al., 2019] and ImageNet [Deng et al., 2009].
Dataset Splits	No	The paper discusses the use of 'test set as reference (10k samples)' and mentions standard FID computations using '50k generated samples and 50k training samples', but it does not provide explicit percentages, absolute counts, or detailed methodology for how the training, validation, and test splits were prepared for their own experiments in a reproducible manner. It implies the use of standard dataset splits but does not specify them.
Hardware Specification	Yes	Time taken (on 1x RTX8000) for different metrics as we vary the number of train samples.
Software Dependencies	No	The paper mentions 'torchvision [maintainers and contributors, 2016]' but does not provide specific version numbers for it or any other software libraries or dependencies used in their experiments.
Experiment Setup	Yes	10000 generated samples; 50 epochs; lr = 0.5; initial value for the variance vector: 0
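
To make the hyperparameters in the Experiment Setup row concrete, the following is a minimal sketch of fitting per-component variances of a mixture of Gaussians (MoG) centered on generated-sample features. It is not the authors' implementation (see the repository linked above for that) and rests on several assumptions: the "variance vector" is taken to be a vector of log-variances (an initial variance of exactly 0 would be degenerate), components are isotropic and equally weighted, the optimizer is plain full-batch gradient ascent, and the feature arrays are small synthetic stand-ins rather than the 10000 generated samples listed in the table. The function names `mog_log_likelihood` and `fit_log_variances` are hypothetical.

```python
import numpy as np

def mog_log_likelihood(x, centers, log_var):
    """Mean log-likelihood of x under an equally weighted isotropic MoG.

    Also returns the responsibilities and squared distances for the gradient.
    """
    d = x.shape[1]
    # Squared distance from every point to every component center: shape (n, m).
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    # Log-density of each point under each isotropic Gaussian component.
    log_comp = -0.5 * d * (np.log(2 * np.pi) + log_var) - 0.5 * sq_dist / np.exp(log_var)
    # Log-sum-exp over components, stabilized by subtracting the row maximum.
    row_max = log_comp.max(axis=1, keepdims=True)
    weights = np.exp(log_comp - row_max)
    log_mix = row_max[:, 0] + np.log(weights.sum(axis=1)) - np.log(len(centers))
    resp = weights / weights.sum(axis=1, keepdims=True)
    return log_mix.mean(), resp, sq_dist

def fit_log_variances(x, centers, epochs=50, lr=0.5, init=0.0):
    """Gradient ascent on the mean log-likelihood w.r.t. per-component log-variances."""
    d = x.shape[1]
    log_var = np.full(len(centers), init)  # assumed: the init value 0 is a log-variance
    for _ in range(epochs):
        _, resp, sq_dist = mog_log_likelihood(x, centers, log_var)
        # d/d(log_var_j) of the mean log-likelihood.
        grad = (resp * (-0.5 * d + 0.5 * sq_dist / np.exp(log_var))).mean(axis=0)
        log_var = log_var + lr * grad
    return log_var

# Toy stand-ins for feature vectors (synthetic, two-dimensional).
rng = np.random.default_rng(0)
generated = rng.normal(scale=3.0, size=(50, 2))   # component centers
reference = rng.normal(scale=3.0, size=(200, 2))  # points used to fit the variances

before, _, _ = mog_log_likelihood(reference, generated, np.zeros(50))
log_var = fit_log_variances(reference, generated)
after, _, _ = mog_log_likelihood(reference, generated, log_var)
print(f"mean log-likelihood before: {before:.4f}, after: {after:.4f}")
```

Parameterizing by log-variance keeps every variance positive without constraints, which is one plausible reading of an initial value of 0. The table row does not name the optimizer, so the gradient-ascent update here is only a placeholder.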