Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Diffusion Beats Autoregressive in Data-Constrained Settings

Authors: Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we systematically study masked diffusion models in data-constrained settings where training involves repeated passes over limited data and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. We train a total of 200 models (100 diffusion models and 100 autoregressive models) across varying unique data sizes, model scales, and epoch counts. We present the empirical results in Section 3.1.
Researcher Affiliation	Collaboration	Mihir Prabhudesai (Carnegie Mellon University), Mengning Wu (Carnegie Mellon University), Amir Zadeh (Lambda), Katerina Fragkiadaki (Carnegie Mellon University), Deepak Pathak (Carnegie Mellon University). Correspondence to EMAIL.
Pseudocode	Yes	Algorithm 1 Generating a Random Order List with Predefined Permutations. Algorithm 2 Shuffling Tokens Using Predefined Order Lists.
Open Source Code	Yes	Our code is available at: https://diffusion-scaling.github.io.
Open Datasets	Yes	We use the English C4 corpus [29], tokenized with the GPT-2 BPE vocabulary and truncated or padded to 2048 tokens per sequence. [29] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
Dataset Splits	No	The paper mentions 'validation loss' multiple times, implying a validation set is used, but it does not explicitly provide specific percentages, sample counts, or a detailed methodology for how the training, validation, and test splits were created from the C4 corpus or other datasets. It only mentions 'unique-token budgets of U ∈ {25, 50, 100}M' for training.
Hardware Specification	Yes	The results and models presented in this work also used compute resources from the National AI Research Resource Pilot, with support from NVIDIA, including NVIDIA's DGX Cloud product and the NVIDIA AI Enterprise Software Platform.
Software Dependencies	No	The paper states 'We adopt the Megatron-DeepSpeed framework as the foundation of our implementation' but does not provide specific version numbers for Megatron-DeepSpeed or any other software libraries or programming languages used.
Experiment Setup	Yes	We use the following hyperparameters: batch size of 256 sequences, AdamW optimizer with β1=0.9, β2=0.95, ϵ=10^-8, a learning rate schedule with peak 2e-4, minimum 2e-5, 1% warm-up, cosine decay, weight decay 0.1, and gradient clipping of 1.0.
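
The learning rate schedule quoted in the Experiment Setup row can be illustrated with a minimal sketch. The peak (2e-4), minimum (2e-5), and 1% warm-up fraction come from the quoted text. Everything else is an assumption made for illustration: the warm-up is taken to be linear, the cosine decay is taken to reach the minimum at the final step, and the function name `learning_rate` and the `total_steps` argument are hypothetical. This is not the authors' Megatron-DeepSpeed configuration.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 2e-4
MIN_LR = 2e-5
WARMUP_FRACTION = 0.01  # "1% warm-up"


def learning_rate(step: int, total_steps: int) -> float:
    """Return the learning rate at a given step.

    Assumptions (not stated in the source): the warm-up is linear, and the
    cosine decay reaches MIN_LR at the final training step.
    """
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches PEAK_LR on the last warm-up step.
        return PEAK_LR * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine interpolation from PEAK_LR (progress 0) to MIN_LR (progress 1).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    total = 10_000  # hypothetical run length, chosen only for the demo
    for s in (0, 99, 100, 5_000, 10_000):
        print(s, learning_rate(s, total))
```

The run length is the one quantity this schedule depends on that the quoted setup does not fix, which is why it is left as a parameter here.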