Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Scaling up Masked Diffusion Models on Text

Authors: Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, Chongxuan Li

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective unsupervised classifier-free guidance that effectively exploits large-scale unpaired data, boosting performance for conditional inference.
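The "unsupervised classifier-free guidance" quoted above is described only at a high level here. As an illustration only (not the paper's exact formulation), standard classifier-free guidance combines conditional and unconditional predictions in logit space, extrapolating away from the unconditional model; the function name and scale convention below are assumptions:

```python
def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance in logit space (illustrative sketch):
    extrapolate the conditional prediction away from the unconditional one.
    scale = 0 recovers the purely conditional logits."""
    return [(1.0 + scale) * c - scale * u
            for c, u in zip(cond_logits, uncond_logits)]
```

The quoted setup reports using a guidance scale of 0.1 at inference; in this sketch that corresponds to `scale=0.1`.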
Researcher Affiliation Collaboration 1Gaoling School of Artificial Intelligence, Renmin University of China 2Beijing Key Laboratory of Big Data Management and Analysis Methods 3Sea AI Lab, Singapore 4Singapore University of Technology and Design EMAIL; EMAIL; EMAIL; EMAIL; EMAIL
Pseudocode Yes Algorithm 1 Greedy sampling method of MDMs
Open Source Code Yes Our code is available at https://github.com/ML-GSAI/SMDM.
Open Datasets Yes We employ the open-source SlimPajama dataset (Soboleva et al., 2023)... For simplicity and fairness, we employ the Llama-2 tokenizer (Touvron et al., 2023b)... We finetune MDM on the augmented training data (Deng et al., 2023) and test on the GSM8K (Cobbe et al., 2021a) dataset... we fine-tune both models on the ShareGPT dataset... We evaluate MDMs on the same reversal curse dataset used by Berglund et al. (2023)... we train both ARMs and MDMs on the SlimPajama dataset (Soboleva et al., 2023)... and test them on the FineWeb dataset (Penedo et al., 2024)
Dataset Splits Yes We set the context length to 2048. Further implementation details are provided in Appendix B.2. To address this issue, we propose two mitigation strategies: (1) allocate a portion of training data with variable sequence lengths L ∼ U[1, 2048], where U[·] denotes the uniform distribution; (2) pad sentences with mask tokens to reach 2048 tokens during evaluation. We extract the first 0.5 billion tokens from each period for evaluation.
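The two mitigation strategies quoted above can be sketched in a few lines; the mask-token id and helper names below are hypothetical, with only the 2048 context length and the uniform length distribution taken from the quoted text:

```python
import random

MASK_ID = 0          # hypothetical mask-token id
MAX_LEN = 2048       # context length from the quoted setup

def sample_training_length(rng=random):
    # Strategy (1): draw a sequence length L ~ U[1, 2048] for training.
    return rng.randint(1, MAX_LEN)

def pad_with_masks(tokens):
    # Strategy (2): pad a sentence with mask tokens to 2048 for evaluation.
    return tokens + [MASK_ID] * (MAX_LEN - len(tokens))
```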
Hardware Specification Yes All experiments in Table 4 are conducted on a single NVIDIA A100-40GB GPU.
Software Dependencies No The paper mentions various frameworks and optimizers used (Tiny Llama codebase, lm-eval, fast-chat, AdamW optimizer) but does not provide specific version numbers for any software components.
Experiment Setup Yes Consistent with TinyLlama (Zhang et al., 2024), we utilize the AdamW optimizer (Loshchilov, 2017), setting β1 = 0.9, β2 = 0.95, and a weight decay of 0.1. Additionally, we apply a cosine learning rate schedule with a maximum learning rate of 4 × 10⁻⁴ and a minimum learning rate of 4 × 10⁻⁵, with 1% of the tokens for linear warmup. Notably, if the number of warmup steps is less than 100, it is set to 100. The batch size is set to 256. ... Specifically, we set the batch size to 384 and 1024 for the models trained with 1.6 × 10²¹ and 3.3 × 10²¹ FLOPs, respectively... we fine-tune the MDM on the augmented training data (Deng et al., 2023) for 40 epochs... we set the sampling steps to 256 and apply an unsupervised CFG scale of 0.1. ... we set the sequence length to 1024... train for 3 epochs...
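The learning-rate schedule in the quoted setup (linear warmup over ~1% of steps with a 100-step floor, then cosine decay from 4e-4 to 4e-5) can be sketched as a pure function of the step index; the function name and the token-fraction-to-step-fraction mapping are assumptions:

```python
import math

def learning_rate(step, total_steps, max_lr=4e-4, min_lr=4e-5, warmup_frac=0.01):
    """Cosine schedule with linear warmup, per the quoted setup:
    warmup covers ~1% of steps, with a floor of 100 warmup steps."""
    warmup_steps = max(int(warmup_frac * total_steps), 100)
    if step < warmup_steps:
        # Linear warmup from (near) zero up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The quoted AdamW settings (β1 = 0.9, β2 = 0.95, weight decay 0.1) would then be passed to the optimizer separately; they do not affect the schedule itself.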