Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Masked Diffusion Models as Energy Minimization

Authors: Sitong Chen, Shen Nie, Jiacheng Sun, Zijin Feng, Zhenguo Li, Ji-Rong Wen, Chongxuan Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on synthetic and real-world benchmarks demonstrate that our energy-inspired schedules outperform hand-crafted baselines, particularly in low-step sampling settings.
Researcher Affiliation	Collaboration	1 Gaoling School of Artificial Intelligence, Renmin University of China; 4 Huawei Noah's Ark Lab
Pseudocode	No	The paper describes methods and derivations using mathematical formulations and descriptive text, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code	No	While the paper mentions leveraging open-source pretrained weights and evaluation toolkits (LLaDA [43]), it does not explicitly state that the authors' own implementation code for their methodology is open-sourced or provide a direct link to a code repository.
Open Datasets	Yes	We select six representative tasks: MBPP [7], HumanEval [9], BBH [28], GSM8K [10], Hendrycks Math [11] and Minerva Math [18].
Dataset Splits	Yes	To validate this hypothesis, we conducted the following experiments demonstrating that randomly chosen small subsets (50-150 instances) of test data suffice for reliable schedule selection. Specifically, we compare schedule performance between small test subsets and full evaluations in Table 1. Task: GSM8K (length=128, steps=32): Random subset 1 (n=132): 44.70, 43.94, 31.06, 0.00; Random subset 2 (n=132): 46.97, 41.67, 40.91, 0.00; Full test set (n=1319): 38.06, 34.80, 29.04, 0.08. Task: HumanEval (length=256, steps=64): Random subset 1 (n=82): 8.54, 20.73, 26.83, 1.22; Random subset 2 (n=82): 18.29, 24.39, 30.49, 2.44; Full test set (n=164): 11.59, 22.56, 24.39, 1.83.
Hardware Specification	Yes	All experiments can be efficiently conducted on a single NVIDIA A800 GPU.
Software Dependencies	No	The paper mentions leveraging an 'open-source pretrained weights and evaluation toolkit from LLaDA [43]' but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup	Yes	For details on how to identify a task-favorable schedule by tuning the beta parameters, please refer to Appendix E.1. ... All experiments fix generation length at 256 and higher values indicate better sampling quality. ... We recommend conducting initial grid searches using small random test data subsets (about 50-150 instances) across parameters a, b ∈ {0.1, 0.2, ..., 1.0} to find a set of well-performed schedules.
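
The schedule-selection procedure quoted in the Experiment Setup row (a grid search over a, b ∈ {0.1, 0.2, ..., 1.0}, scored on a small random subset of the test data) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the names `beta_grid`, `select_schedules`, and the `score_fn` callback are hypothetical, and how a schedule is actually built from (a, b) and evaluated is left to the caller.

```python
import itertools
import random
from typing import Callable, Sequence


def beta_grid(step: float = 0.1, count: int = 10) -> list[tuple[float, float]]:
    """All (a, b) pairs with a, b in {0.1, 0.2, ..., 1.0}."""
    # Rounding avoids floating-point drift such as 0.30000000000000004.
    values = [round(step * i, 10) for i in range(1, count + 1)]
    return list(itertools.product(values, values))


def select_schedules(
    test_set: Sequence,
    score_fn: Callable[[float, float, Sequence], float],
    subset_size: int = 100,  # assumption: within the quoted 50-150 range
    top_k: int = 5,
    seed: int = 0,
) -> list[tuple[tuple[float, float], float]]:
    """Rank (a, b) pairs by score on one random subset of the test data.

    score_fn is a hypothetical, user-supplied callback that runs sampling
    with the schedule defined by (a, b) and returns a higher-is-better score.
    """
    rng = random.Random(seed)
    items = list(test_set)
    subset = rng.sample(items, min(subset_size, len(items)))
    scored = [((a, b), score_fn(a, b, subset)) for a, b in beta_grid()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Fixing the seed keeps the subset identical across all 100 grid points, so differences in score reflect the schedule rather than the sample. The top-ranked pairs can then be re-evaluated on the full test set.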