Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Is Your Diffusion Model Actually Denoising?

Authors: Daniel Pfrommer, Zehao Dou, Christopher Scarvelis, Max Simchowitz, Ali Jadbabaie

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments broadly demonstrate that Schedule Deviation is prevalent across all datasets. We visualize the total Schedule Deviation over z ∈ Z in Figure 3 and Figure 4 for each of our datasets with varying subsets of the training data. ... We evaluate the schedule deviation of trained neural networks in two distinct settings and 3 datasets...
Researcher Affiliation	Academia	Daniel Pfrommer (MIT, Cambridge, MA 02139); Zehao Dou (Yale University, New Haven, CT 06520); Christopher Scarvelis (MIT, Cambridge, MA 02139); Max Simchowitz (CMU, Pittsburgh, PA 15213); Ali Jadbabaie (MIT, Cambridge, MA 02139)
Pseudocode	Yes	Algorithm 1 Schedule Deviation
Open Source Code	Yes	Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We include the full code from running sweeps and measuring the schedule deviation in the supplementary material.
Open Datasets	Yes	We principally consider three datasets: conditional MNIST [LeCun et al., 1998] (left), conditional Fashion-MNIST [Xiao et al., 2017] (middle), and endpoint-conditional maze path generation (right). ... We additionally consider an ablation on the CelebA [Liu et al., 2015] dataset...
Dataset Splits	No	For t-SNE-conditional MNIST generation, we evaluate the Schedule Deviation and empirical 1-Wasserstein Distance between DDPM/DDIM samples, ablated over the training dataset size N ∈ {10000, 30000, 60000}. ... We use training datasets of size N ∈ {50000, 100000, 160000} for our experiments.
Hardware Specification	Yes	All experiments were performed using a cluster of 4 NVIDIA A100 GPUs and took approximately 100 GPU/hrs of compute to train and evaluate all visualized experiments.
Software Dependencies	No	Computations were performed using the Python Optimal Transport toolbox. We used the exact LP-based solution, as opposed to e.g. entropic Optimal Transport using Sinkhorn. ... For both MNIST/Fashion-MNIST we train the model using AdamW (with weight decay 1 × 10⁻⁴) and a cosine decay schedule...
Experiment Setup	Yes	We use a U-Net architecture similar to Dhariwal and Nichol [2021] for all experiments. For full experiment details, see Appendix C. ... For all experiments we used the ϵ-parameterization introduced in Ho et al. [2020] and a "variance-exploding" setup for the Diffusion Schedule as detailed in Appendix B. In particular, we use a log-linear noise schedule where σ(s) = c₁ exp(c₂ s), with 512 training timesteps (and 64 sampling timesteps) ranging from σ = 5 × 10⁻⁴ to σ = 5 for the experiments in Section 3 and σ = 0.01 to σ = 35 for the CelebA experiments. ... For both MNIST/Fashion-MNIST we train the model using AdamW (with weight decay 1 × 10⁻⁴) and a cosine decay schedule with an initial learning rate of 3 × 10⁻⁴ over 300,000 total training iterations and a batch size of 256 samples.
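
The log-linear noise schedule quoted in the Experiment Setup row can be illustrated with a short sketch. It is not the authors' code. It assumes that s is normalized to [0, 1], that both endpoints are included, and that timesteps are evenly spaced in s; under those assumptions c₁ = σ_min and c₂ = ln(σ_max / σ_min). The function name `log_linear_sigmas` is hypothetical.

```python
import math


def log_linear_sigmas(sigma_min, sigma_max, num_steps):
    """Noise levels sigma(s) = c1 * exp(c2 * s) on an even grid of s in [0, 1].

    Assumption (not stated in the quoted text): s is normalized so that
    sigma(0) = sigma_min and sigma(1) = sigma_max.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if not 0 < sigma_min < sigma_max:
        raise ValueError("require 0 < sigma_min < sigma_max")
    c1 = sigma_min                         # from sigma(0) = sigma_min
    c2 = math.log(sigma_max / sigma_min)   # from sigma(1) = sigma_max
    return [c1 * math.exp(c2 * i / (num_steps - 1)) for i in range(num_steps)]


# Values quoted for the Section 3 experiments: sigma from 5e-4 to 5,
# with 512 training timesteps and 64 sampling timesteps.
train_sigmas = log_linear_sigmas(5e-4, 5.0, 512)
sample_sigmas = log_linear_sigmas(5e-4, 5.0, 64)
```

Because log σ is linear in s, consecutive noise levels differ by a constant ratio, which is a quick sanity check on any implementation of such a schedule.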