Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Locality in Image Diffusion Models Emerges from Data Statistics

Authors: Artem Lukoianov, Chenyang Yuan, Justin M Solomon, Vincent Sitzmann

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We rigorously benchmark recent analytical models by how well they match generations by a trained deep diffusion model. Surprisingly, we find that a simple Wiener filter outperforms all recent analytical methods based on modifications of the optimal denoiser. Integrating our analytically-derived sensitivity fields into the model of Kamb and Ganguli [12], however, yields the best-performing analytical diffusion model to date across multiple datasets, including CIFAR10 [16], AFHQv2 [4], and CelebA-HQ [13]. In this section, we perform extensive validation to support our claims.
Researcher Affiliation	Collaboration	Artem Lukoianov (Massachusetts Institute of Technology); Chenyang Yuan (Toyota Research Institute); Justin Solomon (Massachusetts Institute of Technology); Vincent Sitzmann (Massachusetts Institute of Technology)
Pseudocode	Yes	Algorithm 1 Single denoising step of the proposed analytical model.
Open Source Code	Yes	5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will include the code required to reproduce all of our experiments in the supplementary materials.
Open Datasets	Yes	Integrating our analytically-derived sensitivity fields into the model of Kamb and Ganguli [12], however, yields the best-performing analytical diffusion model to date across multiple datasets, including CIFAR10 [16], AFHQv2 [4], and CelebA-HQ [13]. For comparison, we chose five datasets with diverse sets of statistics: CIFAR10 [16], a dataset of diverse 32 x 32 natural images; CelebA-HQ [13] and AFHQv2 [3], datasets of centered faces and animals in 64 x 64; and MNIST [6] and Fashion MNIST [32], datasets of binary centered images in 28 x 28 resolution.
Dataset Splits	No	The paper does not explicitly provide specific details about training/test/validation dataset splits, such as percentages, absolute counts, or references to specific split files for their experiments. While it uses well-known datasets like CIFAR10, CelebA-HQ, MNIST, AFHQv2, and Fashion MNIST, which commonly have standard splits, the paper does not specify how the data was partitioned for its own experimental setup.
Hardware Specification	Yes	All the experiments were performed on a server machine with Ubuntu 20.04. The machine has 1008GB RAM, 128 CPU cores and 8 NVIDIA RTX A6000 GPUs with 49140MB VRAM.
Software Dependencies	No	In all of the generations in this paper, we are using diffusers [9] implementation of the DDIM [27] sampler with 10 sampling steps. We train a Denoising Diffusion Probabilistic Model (DDPM) U-Net using a third-party pytorch implementation [35]. The paper mentions software tools like 'diffusers' and 'pytorch' but does not provide specific version numbers for these or any other software components (e.g., Python, CUDA) required to replicate the experiment.
Experiment Setup	Yes	The number of residual blocks per level is fixed to 2, with no self-attention modules included. Dropout is set to 0.15 throughout the network. The model is trained for 200 epochs with a batch size of 32. We use the Adam optimizer with a learning rate of 10^-4 over 1000 diffusion steps. Training and evaluation use fixed random seeds for reproducibility.
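
The setup values quoted in the table can be collected into a small configuration record, which may help when trying to replicate the run. The sketch below is illustrative only: the names ReportedSetup and evenly_spaced_timesteps are hypothetical and do not come from the paper or from any library. It also shows one plausible way a 10-step sampler could pick timesteps out of 1000 diffusion steps, assuming a fixed integer stride starting at 0; the table does not say which spacing the authors actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; the field names are hypothetical."""
    res_blocks_per_level: int = 2
    use_self_attention: bool = False
    dropout: float = 0.15
    epochs: int = 200
    batch_size: int = 32
    optimizer: str = "Adam"
    learning_rate: float = 1e-4
    diffusion_steps: int = 1000
    sampling_steps: int = 10  # DDIM steps, from the Software Dependencies row


def evenly_spaced_timesteps(train_steps: int, sampling_steps: int) -> list[int]:
    """Return a descending, evenly spaced subsequence of training timesteps.

    Assumption: a fixed integer stride starting from timestep 0. The source
    does not state the spacing used by its sampler.
    """
    if not 0 < sampling_steps <= train_steps:
        raise ValueError("sampling_steps must be between 1 and train_steps")
    stride = train_steps // sampling_steps
    return [i * stride for i in reversed(range(sampling_steps))]


setup = ReportedSetup()
timesteps = evenly_spaced_timesteps(setup.diffusion_steps, setup.sampling_steps)
print(timesteps)  # [900, 800, 700, 600, 500, 400, 300, 200, 100, 0]
```

Because the Software Dependencies row reports no version numbers, a replication attempt would still need to pin library versions separately; this record only captures the hyperparameters that the table does state.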