Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Large Language Diffusion Models

Authors: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines.
Researcher Affiliation	Collaboration	Shen Nie1,2,3 Fengqi Zhu1,2,3 Zebin You1,2,3 Xiaolu Zhang4 Jingyang Ou1,2,3 Jun Hu4 Jun Zhou4 Yankai Lin1,2,3 Ji-Rong Wen1,2,3 Chongxuan Li1,2,3 1 Gaoling School of Artificial Intelligence, Renmin University of China 2 Beijing Key Laboratory of Research on Large Models and Intelligent Governance 3 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE 4 Ant Group EMAIL
Pseudocode	Yes	A.3 Algorithms In this section, we present the training and inference algorithms. Specifically, we introduce the pre-training and supervised fine-tuning algorithms in Algorithm 1 and Algorithm 2, respectively. In addition, the likelihood evaluation algorithm is provided in Algorithm 3. Finally, we present the reverse generation process in Algorithm 4 and Algorithm 5, which correspond to the random remasking and the low-confidence [23] remasking strategy, respectively.
Open Source Code	No	In addition, we will release the model weights and evaluation code upon acceptance.
Open Datasets	Yes	We evaluate the scalability, instruction-following, and in-context learning capabilities of LLaDA on standard benchmarks, followed by analyses and case studies to provide a comprehensive assessment.
Dataset Splits	Yes	For all the aforementioned benchmarks, we follow the widely adopted evaluation process [125] used in LLM assessments, primarily employing conditional likelihood estimation and conditional generation.
Hardware Specification	Yes	LLaDA 8B was pre-trained from scratch on 2.3 trillion tokens using 0.13 million H800 GPU hours, followed by SFT on 4.5 million pairs.
Software Dependencies	No	The paper mentions several techniques and models like Warmup-Stable-Decay [28], AdamW optimizer [29], Transformer [7], RMSNorm [105], SwiGLU [106], and RoPE [107], but does not provide specific version numbers for these or any underlying software libraries.
Experiment Setup	Yes	Specifically, we linearly increased the learning rate from 0 to 4 × 10⁻⁴ over the first 2000 iterations and maintained it at 4 × 10⁻⁴. After processing 1.2T tokens, we decayed the learning rate to 1 × 10⁻⁴ and held it constant for the next 0.8T tokens to ensure stable training. Finally, we linearly reduced the learning rate from 1 × 10⁻⁴ to 1 × 10⁻⁵ for the last 0.3T tokens. Furthermore, we utilized the AdamW optimizer [29] with a weight decay of 0.1, a batch size of 1280, and a local batch size of 4 per GPU.
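
The learning-rate schedule quoted in the Experiment Setup row can be sketched as a small function. This is an illustrative reading of the excerpt, not the authors' code: the function name `lr_schedule` and its arguments are hypothetical, the drop to the intermediate rate after 1.2T tokens is assumed to be immediate (the excerpt does not say how the decay was applied), and the caller is assumed to track both the optimizer step and the number of tokens processed.

```python
# Sketch of the quoted warmup / stable / hold / linear-decay schedule.
# All names below are hypothetical; the values come from the quoted excerpt.

PEAK_LR = 4e-4          # rate reached after warmup
MID_LR = 1e-4           # rate held after the first 1.2T tokens
FINAL_LR = 1e-5         # rate at the end of training
WARMUP_STEPS = 2000     # linear warmup length, in iterations
STABLE_TOKENS = 1.2e12  # tokens processed before dropping to MID_LR
HOLD_TOKENS = 0.8e12    # tokens trained at MID_LR
DECAY_TOKENS = 0.3e12   # tokens over which the rate falls linearly to FINAL_LR


def lr_schedule(step: int, tokens_seen: float) -> float:
    """Return the learning rate for a given optimizer step and token count."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to PEAK_LR over the first 2000 iterations.
        return PEAK_LR * step / WARMUP_STEPS
    if tokens_seen < STABLE_TOKENS:
        return PEAK_LR
    if tokens_seen < STABLE_TOKENS + HOLD_TOKENS:
        # Assumption: the decay to MID_LR is an immediate step, not a ramp.
        return MID_LR
    # Linear decay from MID_LR to FINAL_LR over the last 0.3T tokens.
    progress = (tokens_seen - STABLE_TOKENS - HOLD_TOKENS) / DECAY_TOKENS
    progress = min(progress, 1.0)
    return MID_LR + (FINAL_LR - MID_LR) * progress
```

The three token phases sum to 2.3T, which matches the pre-training token count quoted in the Hardware Specification row.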