Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Controlling the Fidelity and Diversity of Deep Generative Models via Pseudo Density

Authors: Shuangqi Li, Chen Liu, Tong Zhang, Hieu Le, Sabine Süsstrunk, Mathieu Salzmann

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive experiments across diverse datasets and various generative models, including different GANs and diffusion models, demonstrate the effectiveness and generality of our proposed approach in controlling the trade-off between fidelity and diversity. Additionally, our fine-tuning method can be employed to improve the Fréchet Inception Distance (FID) (Heusel et al., 2017) scores by fine-tuning pre-trained models and adjusting their fidelity-diversity trade-off. Aside from the Inception Score (IS) (Salimans et al., 2016) and the Fréchet Inception Distance (FID), as a step towards a more comprehensive evaluation, we employ precision and recall (Kynkäänniemi et al., 2019) as disentangled metrics to separately assess the fidelity and diversity of generative models.
Researcher Affiliation Academia Shuangqi Li EMAIL School of Computer and Communication Sciences, EPFL; Chen Liu EMAIL Department of Computer Science, City University of Hong Kong; Tong Zhang EMAIL School of Computer and Communication Sciences, EPFL; Hieu Le EMAIL School of Computer and Communication Sciences, EPFL; Sabine Süsstrunk EMAIL School of Computer and Communication Sciences, EPFL; Mathieu Salzmann EMAIL School of Computer and Communication Sciences, EPFL
Pseudocode Yes Algorithm 1 Per-Sample Perturbation Based on Pseudo Density; Algorithm 2 Importance Sampling during Inference; Algorithm 3 Fine-tuning GANs with Importance Sampling
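The per-sample perturbation of Algorithm 1 is reported elsewhere in this record as a PGD-style procedure with K steps, step size α, and adversarial budget ϵ. The following is a minimal generic sketch, not the paper's algorithm: the density-based objective is abstracted behind a hypothetical `grad_fn`, and an L∞ projection is assumed.

```python
import numpy as np

def pgd_perturb(x, grad_fn, K=10, alpha=0.025, eps=0.1):
    """Generic PGD-style perturbation (illustrative stand-in).

    Takes K signed-gradient ascent steps of size alpha on a scalar
    objective whose gradient is grad_fn(x), projecting back into an
    L-infinity ball of radius eps around the original sample.
    """
    x0 = x.copy()
    for _ in range(K):
        x = x + alpha * np.sign(grad_fn(x))  # signed gradient step
        x = np.clip(x, x0 - eps, x0 + eps)   # project into eps-ball
    return x
```

With the paper's GAN hyper-parameters (K = 10, α = 0.025, ϵ = 0.1), the cumulative step size exceeds the budget, so the projection is what actually bounds the perturbation.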
Open Source Code No The paper does not provide concrete access information to source code for the methodology described. It does not contain a specific repository link, an explicit code release statement, or indicate code in supplementary materials.
Open Datasets Yes The FFHQ (Flickr-Faces-HQ) dataset (Karras et al., 2019) is a high-quality collection of human face images consisting of 70k images with the resolution of 1024 × 1024. The LSUN-Church and LSUN-Bedroom are subsets of the LSUN (Large-scale Scene Understanding) dataset (Yu et al., 2015). ...utilizing an image feature extractor such as a Vision Transformer (Dosovitskiy et al., 2020) trained on ImageNet (Deng et al., 2009).
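The pseudo density referenced in this record is computed in the feature space of such an extractor, using K nearest neighbors (K = 10 or 20 per the experiment-setup row). A minimal illustrative sketch, assuming features are already extracted and using the inverse mean k-NN distance as a stand-in estimator (the paper's exact formula may differ):

```python
import numpy as np

def pseudo_density(sample_feats, real_feats, K=10):
    """Toy k-NN density estimate in feature space (illustrative only).

    Scores each sample by the inverse of its mean distance to its K
    nearest real-image features: higher values indicate samples lying
    in denser regions of the real-data feature distribution.
    """
    # Pairwise Euclidean distances, shape (n_samples, n_real)
    d = np.linalg.norm(sample_feats[:, None, :] - real_feats[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, :K]          # K nearest real neighbors per sample
    return 1.0 / (knn.mean(axis=1) + 1e-8)   # inverse mean k-NN distance
```

Samples close to clusters of real features receive high scores; outliers receive low ones, which is the property the fidelity/diversity control relies on.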
Dataset Splits No The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning for training, validation, or testing in its own experiments. It mentions evaluating 'against the training dataset' but does not define splits for its own experimental methodology.
Hardware Specification No The paper mentions using '2 GPUs' for GANs and 'a single GPU' for diffusion models, but it does not specify the exact GPU model, CPU model, memory, or other detailed computer specifications used for running its experiments.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9, or specific frameworks with versions) needed to replicate the experiment.
Experiment Setup Yes Training details: We follow the same training hyper-parameters published in the original papers, except for the number of GPUs, while keeping the same batch size per GPU. For the GAN models in our experiments we used 2 GPUs, and a single GPU for diffusion models. For StyleGAN2/3 models, we fine-tuned for 80 thousand images, measured in thousands of real images fed to the discriminator. For Projected GAN, we fine-tuned for 40 thousand images. To improve FIDs, we fine-tuned all GAN models for only 12 thousand images. As for the diffusion models in our experiments, we fine-tuned them for 128 × 20,000 = 2,560k images. Regarding the computation of pseudo density, we use n = 1 for all datasets, and K = 10 for all datasets except LSUN-Bedroom, where we set K = 20.
Per-sample perturbation hyper-parameters: We employed the PGD attack (Madry et al., 2018), whose hyper-parameters are the number of steps K, the step size α, and the adversarial budget ϵ. For all GAN models in our experiments, we adopted K = 10, α = 0.025, and ϵ = 0.1. For diffusion models, we adopted K = 5, α = 0.0025, and ϵ = 0.0125.
Density-based importance sampling hyper-parameters: The two relevant hyper-parameters are the density threshold τ and the importance weight w of above-threshold samples; their optimal values may vary across datasets and models. We swept τ over the {20, 50, 80}th percentiles of the estimated densities of real images, and swept w over {0.01, 0.03, 0.1, 10, 33.0, 100}. The optimal values found are shown in Table 2. When aiming to improve FIDs, we kept the density threshold at the 50th percentile and performed importance sampling only on the real data, sweeping w over {0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.05, 1.1, 1.3, 1.5, 1.7, 2.0} for GAN models and only {0.5, 2.0} for diffusion models.
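The density-based importance sampling described in the setup row, with a threshold τ taken at a percentile of real-image densities and a weight w applied to above-threshold samples, can be sketched as follows. The resampling scheme below is an illustrative guess, not the paper's exact procedure; `densities` and `real_densities` are assumed precomputed.

```python
import numpy as np

def importance_sample(densities, real_densities, tau_pct=50, w=10.0,
                      n=None, seed=0):
    """Resample indices, reweighting samples above a density threshold.

    tau_pct: percentile of real-image densities used as threshold tau.
    w: relative weight of above-threshold samples; w > 1 favors
       high-density (high-fidelity) samples, w < 1 favors diversity.
    """
    tau = np.percentile(real_densities, tau_pct)
    weights = np.where(densities >= tau, w, 1.0)
    p = weights / weights.sum()                  # normalized sampling probs
    n = len(densities) if n is None else n
    rng = np.random.default_rng(seed)
    return rng.choice(len(densities), size=n, replace=True, p=p)
```

The reported sweep then amounts to evaluating precision/recall or FID for each (τ, w) pair, e.g. τ ∈ {20, 50, 80}th percentiles and w ∈ {0.01, 0.03, 0.1, 10, 33.0, 100}.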