Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SPMDM: Enhancing Masked Diffusion Models through Simplifying Sampling Path

Authors: Yichen Zhu, Weiyu Chen, James Kwok, Zhou Zhao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on synthetic data and tasks like Countdown and Sudoku show SPMDM captures structural rules effectively, significantly outperforming existing MDMs and ARMs, with competitive results on broader reasoning benchmarks.
Researcher Affiliation	Academia	Yichen Zhu (1,3,*), Weiyu Chen (2,*), James Kwok (2), Zhou Zhao (1,3). Affiliations: (1) Zhejiang University, (2) HKUST, (3) Shanghai Artificial Intelligence Laboratory.
Pseudocode	Yes	Algorithm 1 Sampling 1: Input: network p_θ, subsequence length L, time [0, 1], sampling steps T, oracle function F 2: Initialize x_1 ← {m}^N, t ← 1, Δt ← 1/T 3: for n = 0 to T do 4: ∀k: count unmasked tokens n_k 5: ∀k: t_k ← 1 − n_k/L 6: ∀k: x̂_0^k ∼ p_θ(· | x^k_{t_k}, x k t, t_k) 7: if using adaptive sampling strategy then 8: sample a set of masked token indices S = F(θ, x_t) 9: ∀(i, ℓ) ∈ S: x_t^{i,ℓ} = x̂_0^{i,ℓ} 10: else 11: ∀k: s_k ← max(t_k − Δt, 0) 12: ∀k: for all masked tokens, with probability s_k/t_k, x̂_0^{k,ℓ} ← m 13: update x_t ← x̂_0 14: end if 15: end for 16: return x_t
Open Source Code	No	Answer: [No] Justification: Due to certain constraints, we are unable to release the code during the review period.
Open Datasets	Yes	Dataset. Countdown [1] is a mathematical reasoning task... Sudoku [2] is a classic logic-based number placement puzzle... We evaluate our model on a suite of challenging benchmarks spanning language understanding and reasoning. For common sense reasoning, we include four multiple-choice datasets: HellaSwag (HSwag) [50], Social IQA (SIQA) [38], Physical IQA (PIQA) [9], and Winogrande (Wino.) [37]... Additionally, we test on GSM8K [11]... we also use the advanced FineWeb2 corpus [31], which is derived from Common Crawl, as the training dataset for both MDLM and SPMDM.
Dataset Splits	Yes	Table 6: Dataset Details. Intra and Inter refer to toy datasets designed for intra- and inter-subsequence modeling, respectively. CD is an abbreviation for Countdown. Train entries: Intra 50k, Inter 50k, CD3 500k, CD4 500k, CD5 500k, Sudoku 100k. Test entries: Intra 1k, Inter 1k, CD3 1k, CD4 1k, CD5 1k, Sudoku 1k. ... For intra- and inter-subsequence modeling, we randomly generate 50,000 samples for training and 1000 samples for testing, respectively.
Hardware Specification	Yes	We conduct all toy example experiments using four RTX 4090 GPUs. ... We conduct all experiments related to problem-solving tasks using eight RTX 4090 GPUs. ... Training and sampling are conducted on eight A100 GPUs with 40GB of memory.
Software Dependencies	No	MDLM [36], BDLM [4], and SPMDM are all implemented using a tiny model with 6M parameters... Both ARMs and MDMs are implemented based on the GPT-2 architecture.
Experiment Setup	Yes	We use a learning rate of 1 × 10^-3 and a batch size of 1024. All models are trained for 10 epochs on the training set. Additionally, the number of sampling steps is fixed to 32 for all models. ... Across all datasets, we use a learning rate of 1 × 10^-3 for the 6M-parameter tiny models and 3 × 10^-4 for the 85M-parameter models. The batch size is set to 512. For the countdown task, we train for 150 epochs, and for the sudoku task, we train for 100 epochs. ... For models with 127M and 355M parameters, we use a learning rate of 3 × 10^-4 with a cosine scheduler. The batch size is set to 512, and training is performed for a total of 400K iterations. During inference, the number of sampling steps is fixed to 256.
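
Because the paper's code was not released, the following is only a minimal Python sketch of the non-adaptive branch of Algorithm 1 as quoted in the Pseudocode row (steps 4, 5, and 11 to 13). It is not the authors' implementation. The names `MASK`, `subsequence_times`, and `remask_step` are hypothetical, the network prediction is passed in as a plain list `x0_hat` rather than sampled from a model, and the sketch assumes that already-unmasked tokens are carried over unchanged and that the sequence length is a multiple of the subsequence length L.

```python
import random

MASK = -1  # hypothetical mask token id (assumption, not from the paper)


def subsequence_times(x, L):
    """Steps 4-5: per-subsequence time t_k = 1 - n_k / L,
    where n_k is the number of unmasked tokens in subsequence k."""
    assert len(x) % L == 0, "sketch assumes len(x) is a multiple of L"
    times = []
    for start in range(0, len(x), L):
        n_k = sum(tok != MASK for tok in x[start:start + L])
        times.append(1.0 - n_k / L)
    return times


def remask_step(x, x0_hat, L, dt, rng):
    """Steps 11-13 (non-adaptive branch): each masked token stays masked
    with probability s_k / t_k, otherwise it takes the predicted value."""
    new_x = list(x)
    for k, t_k in enumerate(subsequence_times(x, L)):
        if t_k == 0.0:
            continue  # subsequence k has no masked tokens left
        s_k = max(t_k - dt, 0.0)
        for i in range(k * L, (k + 1) * L):
            if x[i] == MASK:
                keep_masked = rng.random() < s_k / t_k
                new_x[i] = MASK if keep_masked else x0_hat[i]
    return new_x


if __name__ == "__main__":
    x = [MASK, 5, MASK, MASK, 7, 8, MASK, 9]
    x0_hat = [1, 5, 2, 3, 7, 8, 4, 9]  # stand-in for a network prediction
    print(subsequence_times(x, L=4))
    print(remask_step(x, x0_hat, L=4, dt=0.25, rng=random.Random(0)))
```

Note that each subsequence's time is recomputed from its own count of unmasked tokens rather than read from a global clock, so a subsequence that is nearly complete (small t_k) is fully revealed once t_k falls to Δt or below.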