Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Scaling up Masked Diffusion Models on Text

Authors: Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, Chongxuan Li

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective unsupervised classifier-free guidance that effectively exploits large-scale unpaired data, boosting performance for conditional inference.
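The "unsupervised classifier-free guidance" quoted above is described only at a high level here. As an illustration only (not the paper's exact formulation), standard classifier-free guidance combines conditional and unconditional predictions in logit space, extrapolating away from the unconditional model; the function name and scale convention below are assumptions:

```python
def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance in logit space (illustrative sketch):
    extrapolate the conditional prediction away from the unconditional one.
    scale = 0 recovers the purely conditional logits."""
    return [(1.0 + scale) * c - scale * u
            for c, u in zip(cond_logits, uncond_logits)]
```

The quoted setup reports using a guidance scale of 0.1 at inference; in this sketch that corresponds to `scale=0.1`.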
Researcher Affiliation Collaboration 1Gaoling School of Artificial Intelligence, Renmin University of China 2Beijing Key Laboratory of Big Data Management and Analysis Methods 3Sea AI Lab, Singapore 4Singapore University of Technology and Design EMAIL; EMAIL; EMAIL; EMAIL; EMAIL
Pseudocode Yes Algorithm 1 Greedy sampling method of MDMs
Open Source Code Yes Our code is available at https://github.com/ML-GSAI/SMDM.
Open Datasets Yes We employ the open-source SlimPajama dataset (Soboleva et al., 2023)... For simplicity and fairness, we employ the Llama-2 tokenizer (Touvron et al., 2023b)... We finetune MDM on the augmented training data (Deng et al., 2023) and test on the GSM8K (Cobbe et al., 2021a) dataset... we fine-tune both models on the ShareGPT dataset... We evaluate MDMs on the same reversal curse dataset used by Berglund et al. (2023)... we train both ARMs and MDMs on the SlimPajama dataset (Soboleva et al., 2023)... and test them on the FineWeb dataset (Penedo et al., 2024)
Dataset Splits Yes We set the context length to 2048. Further implementation details are provided in Appendix B.2. To address this issue, we propose two mitigation strategies: (1) allocate a portion of training data with variable sequence lengths L ∼ U[1, 2048], where U[·] denotes the uniform distribution; (2) pad sentences with mask tokens to reach 2048 tokens during evaluation. We extract the first 0.5 billion tokens from each period for evaluation.
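The two mitigation strategies quoted above can be sketched in a few lines; the mask-token id and helper names below are hypothetical, with only the 2048 context length and the uniform length distribution taken from the quoted text:

```python
import random

MASK_ID = 0          # hypothetical mask-token id
MAX_LEN = 2048       # context length from the quoted setup

def sample_training_length(rng=random):
    # Strategy (1): draw a sequence length L ~ U[1, 2048] for training.
    return rng.randint(1, MAX_LEN)

def pad_with_masks(tokens):
    # Strategy (2): pad a sentence with mask tokens to 2048 for evaluation.
    return tokens + [MASK_ID] * (MAX_LEN - len(tokens))
```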
Hardware Specification Yes All experiments in Table 4 are conducted on a single NVIDIA A100-40GB GPU.
Software Dependencies No The paper mentions various frameworks and optimizers used (Tiny Llama codebase, lm-eval, fast-chat, AdamW optimizer) but does not provide specific version numbers for any software components.
Experiment Setup Yes Consistent with TinyLlama (Zhang et al., 2024), we utilize the AdamW optimizer (Loshchilov, 2017), setting β1 = 0.9, β2 = 0.95, and a weight decay of 0.1. Additionally, we apply a cosine learning rate schedule with a maximum learning rate of 4 × 10⁻⁴ and a minimum learning rate of 4 × 10⁻⁵, with 1% of the tokens for linear warmup. Notably, if the number of warmup steps is less than 100, it is set to 100. The batch size is set to 256. ... Specifically, we set the batch size to 384 and 1024 for the models trained with 1.6 × 10²¹ and 3.3 × 10²¹ FLOPs, respectively... we fine-tune the MDM on the augmented training data (Deng et al., 2023) for 40 epochs... we set the sampling steps to 256 and apply an unsupervised CFG scale of 0.1. ... we set the sequence length to 1024... train for 3 epochs...
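The learning-rate schedule in the quoted setup (linear warmup over ~1% of steps with a 100-step floor, then cosine decay from 4e-4 to 4e-5) can be sketched as a pure function of the step index; the function name and the token-fraction-to-step-fraction mapping are assumptions:

```python
import math

def learning_rate(step, total_steps, max_lr=4e-4, min_lr=4e-5, warmup_frac=0.01):
    """Cosine schedule with linear warmup, per the quoted setup:
    warmup covers ~1% of steps, with a floor of 100 warmup steps."""
    warmup_steps = max(int(warmup_frac * total_steps), 100)
    if step < warmup_steps:
        # Linear warmup from (near) zero up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The quoted AdamW settings (β1 = 0.9, β2 = 0.95, weight decay 0.1) would then be passed to the optimizer separately; they do not affect the schedule itself.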