Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Test-Time Scaling of Diffusion Models via Noise Trajectory Search

Authors: Vignav Ramesh, Morteza Mardani

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on EDM and Stable Diffusion reveal state-of-the-art scores for class-conditioned/text-to-image generation, exceeding baselines by up to 164% and matching/exceeding MCTS performance.
Researcher Affiliation	Collaboration	Vignav Ramesh, Harvard University; Morteza Mardani, NVIDIA
Pseudocode	Yes	Algorithm 1: ϵ-greedy noise search; Algorithm 2: EDM sampling algorithm; Algorithm 3: DDIM sampling algorithm; Algorithm 4: MCTS noise search; Algorithm 5: rejection sampling; Algorithm 6: beam search; Algorithm 7: zero-order search
Open Source Code	Yes	Code will also be uploaded by the supplementary material deadline, providing full reproducibility. Code and data are made publicly available at this GitHub link.
Open Datasets	Yes	We evaluate our proposed test-time scaling approach on the Elucidated Diffusion Model (EDM) [Karras et al., 2022] for class-conditional image generation (on ImageNet) and Stable Diffusion [Rombach et al., 2022] for text-to-image generation.
Dataset Splits	No	The paper uses pre-trained models (EDM, Stable Diffusion) and evaluates them on generated images. It specifies the number of generated images for evaluation (e.g., "generating 36 images given random ImageNet class labels") but does not provide explicit train/test/validation splits for the data used in its own experiments.
Hardware Specification	Yes	Generating each sample took <1 second for naive sampling (the lowest-compute method) and <1 minute for MCTS (the highest-compute method) on a single A100 (40GB).
Software Dependencies	No	The paper mentions specific sampling algorithms like "EDM sampler (Alg. 2)" and "DDIM sampler (Alg. 3)" and the use of a "neural network Dθ". However, it does not provide specific version numbers for software libraries (e.g., PyTorch, TensorFlow) or programming languages.
Experiment Setup	Yes	The number of denoising steps is fixed to 18. Unless otherwise specified, we use a classifier-free guidance scale of 1.0, focusing on the simple conditional generation task without guidance [Ho and Salimans, 2022]. We set λ = 0.15 and ϵ = 0.4. Regarding notation, N is the branching factor, B is the beam width, S is the number of MCTS simulations, and K is the number of local search iterations. We enable classifier-free guidance (default scale 7.5) and set T = 50 here, as is typical with DDIM sampling.
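The setup row combines three mechanisms: an 18-step EDM noise schedule, classifier-free guidance (CFG) where a scale of 1.0 means plain conditional generation, and ϵ-greedy selection of the per-step noise with ϵ = 0.4. The minimal Python sketch below illustrates them on a one-dimensional toy problem. It is not the authors' Algorithm 1. The schedule constants σ_min = 0.002, σ_max = 80, ρ = 7 are the EDM paper's defaults and are assumed here, the CFG formula uses the "guidance scale" convention in which 1.0 disables guidance, and `toy_denoiser`, `toy_reward`, the simplified re-noising step, and the candidate count are hypothetical stand-ins for Dθ, the paper's reward, Algorithm 2, and N.

```python
import random
from typing import Callable, List, Sequence


def edm_sigmas(num_steps: int = 18, sigma_min: float = 0.002,
               sigma_max: float = 80.0, rho: float = 7.0) -> List[float]:
    """EDM (Karras et al., 2022) noise levels in descending order, plus a final 0.

    The default constants are the EDM paper's and are assumed here.
    """
    inv = 1.0 / rho
    sigmas = [
        (sigma_max ** inv + i / (num_steps - 1) * (sigma_min ** inv - sigma_max ** inv)) ** rho
        for i in range(num_steps)
    ]
    return sigmas + [0.0]


def cfg_combine(eps_uncond: float, eps_cond: float, scale: float) -> float:
    """Classifier-free guidance in the 'guidance scale' convention (assumed).

    scale = 1.0 returns the conditional prediction (no guidance);
    scale = 7.5 extrapolates away from the unconditional prediction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)


def epsilon_greedy_pick(candidates: Sequence[float],
                        reward: Callable[[float], float],
                        epsilon: float, rng: random.Random) -> float:
    """With probability epsilon explore uniformly; otherwise exploit the best reward."""
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=reward)


def toy_denoiser(x: float, sigma: float) -> float:
    # Hypothetical stand-in for D_theta: the exact posterior mean E[x0 | x]
    # when x0 ~ N(0, 1) and x = x0 + sigma * n with n ~ N(0, 1).
    return x / (1.0 + sigma ** 2)


def toy_reward(x0: float, target: float = 1.5) -> float:
    # Hypothetical reward: prefers clean samples close to `target`.
    return -(x0 - target) ** 2


def eps_greedy_noise_search(num_steps: int = 18, n_candidates: int = 4,
                            epsilon: float = 0.4, seed: int = 0) -> float:
    rng = random.Random(seed)
    sigmas = edm_sigmas(num_steps)
    x = sigmas[0] * rng.gauss(0.0, 1.0)  # start from pure noise at sigma_max
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = toy_denoiser(x, sigma)
        if sigma_next == 0.0:
            x = x0_hat  # last step: return the denoised estimate, no noise injected
            break
        # Draw N candidate noises for this step of the trajectory.
        candidates = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]

        def lookahead(z: float) -> float:
            # Score a candidate by the reward of the denoised estimate it leads to.
            return toy_reward(toy_denoiser(x0_hat + sigma_next * z, sigma_next))

        z = epsilon_greedy_pick(candidates, lookahead, epsilon, rng)
        # Simplified re-noising step (an assumption, not the paper's Algorithm 2).
        x = x0_hat + sigma_next * z
    return x
```

For example, `eps_greedy_noise_search(num_steps=18, n_candidates=4, epsilon=0.4)` runs the toy search with the step count and ϵ from the setup row. Setting `epsilon=0.0` gives purely greedy selection, while `epsilon=1.0` picks a uniformly random candidate each step, which in this toy is distributed like naive sampling. λ is left out because its role is not described on this page.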