Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Theoretical Benefit and Limitation of Diffusion Language Model

Authors: Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To fully validate our theoretical findings, we conduct synthetic experiments and examine MDMs trained on formal languages, including n-gram languages and Hidden Markov Models (HMMs), systematically analyzing the relationship between performance and efficiency under both TER and SER metrics. All empirical results align with our theoretical predictions.
Researcher Affiliation	Collaboration	1 State Key Laboratory of General Artificial Intelligence, Peking University 2 School of Mathematical Sciences, Peking University 3 Ant Group 4 Center for Machine Learning Research, Peking University
Pseudocode	Yes	Algorithm 1 Generate n-gram Language Model; Algorithm 2 Generate Hidden Markov Model
Open Source Code	No	We will open the code base and data when the paper is published.
Open Datasets	No	We evaluated MDMs on several formal languages: n-gram languages (with n ∈ {2, 3, 4}) and HMMs. For each language type, parameters (e.g., transition matrices, observation matrices, initial distributions) were randomly sampled. A detailed description of this generation process and examples of resulting sequences are available in Appendix F.1. These formal languages were used to generate datasets of 1,000,000 samples each, with 990,000 for training and 10,000 for validation.
Dataset Splits	Yes	These formal languages were used to generate datasets of 1,000,000 samples each, with 990,000 for training and 10,000 for validation. Datasets were generated with sequence lengths L ∈ {512, 1024, 2048}.
Hardware Specification	Yes	efficiency is defined by the inverse of the execution time measured on 8 Nvidia RTX 4090 GPUs with Huggingface's transformers library; In our experiments of formal languages, all training was conducted on NVIDIA A100 GPUs.
Software Dependencies	No	The paper only mentions software names without version numbers, specifically 'Huggingface's transformers library' without a version.
Experiment Setup	Yes	Detailed architectural specifications, including layer counts, hidden dimensions, and positional encoding schemes, are provided in Table 7 (Appendix F.2). The training procedure largely followed the framework of Sahoo et al. (2024), with specific training configurations detailed in Table 8. Models were trained for 20 epochs, with convergence monitored on the validation set using perplexity.
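
The Open Datasets and Dataset Splits rows describe synthetic data drawn from randomly parameterized formal languages and split into training and validation sets. The following is a minimal sketch of that idea for an HMM, not the paper's Algorithm 2 (which is given in its Appendix F.1 and not reproduced here). All function names, the state and vocabulary sizes, and the uniform-weight sampling of probabilities are assumptions made for illustration, and the sizes are scaled down so the example runs quickly while keeping the 99:1 split ratio.

```python
import random


def random_stochastic_matrix(rows, cols, rng):
    """Return a rows x cols matrix whose rows are random probability vectors.

    Assumption: weights are drawn uniformly and normalized; the paper's
    actual sampling scheme may differ.
    """
    matrix = []
    for _ in range(rows):
        weights = [rng.random() + 1e-9 for _ in range(cols)]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


def sample_hmm_sequence(initial, transition, emission, length, rng):
    """Sample one observation sequence of the given length from an HMM."""
    n_states = len(initial)
    n_symbols = len(emission[0])
    state = rng.choices(range(n_states), weights=initial)[0]
    sequence = []
    for _ in range(length):
        # Emit a symbol from the current hidden state, then move on.
        symbol = rng.choices(range(n_symbols), weights=emission[state])[0]
        sequence.append(symbol)
        state = rng.choices(range(n_states), weights=transition[state])[0]
    return sequence


def make_dataset(n_samples, n_val, length, n_states=4, n_symbols=8, seed=0):
    """Sample one random HMM, draw sequences from it, and split train/val."""
    rng = random.Random(seed)
    initial = random_stochastic_matrix(1, n_states, rng)[0]
    transition = random_stochastic_matrix(n_states, n_states, rng)
    emission = random_stochastic_matrix(n_states, n_symbols, rng)
    samples = [
        sample_hmm_sequence(initial, transition, emission, length, rng)
        for _ in range(n_samples)
    ]
    n_train = n_samples - n_val
    return samples[:n_train], samples[n_train:]


# Scaled-down stand-in for the reported 990,000 / 10,000 split.
train, val = make_dataset(n_samples=1000, n_val=10, length=16)
print(len(train), len(val))
```

Fixing the seed makes the sampled language and the resulting split repeatable, which is the property the Open Datasets row notes is missing when neither the data nor the generation code is released.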