Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

Authors: Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, Guo-Jun Qi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, +19.5% on GSM8K, MATH, MBPP, and HumanEval.
Researcher Affiliation | Collaboration | 1 Zhejiang University; 2 MAPLE Lab, Westlake University; 3 Matterwave Intelligence; 4 Institute of Advanced Technology, Westlake Institute for Advanced Study
Pseudocode | Yes | Algorithm 1: A General Framework for Training DCoLT
Open Source Code | Yes | https://github.com/maple-research-lab/LLaDOU
Open Datasets | Yes | Table 11: Reference assets and their licenses. Asset (License; Utility): SEDD [24] (MIT; Code & Model), GSM8K-Aug [10] (no license listed; Data), LLaDA [27] (MIT; Code & Model), MATH [16] (MIT; Data), GSM8K [8] (MIT; Data), KodCode [41] (CC BY-NC 4.0; Data).
Dataset Splits | Yes | For GSM8K, there are 7.5K questions for training and 1.32K questions for testing. For MATH, there are 7.5K questions for training and 5K questions for testing.
Hardware Specification | Yes | Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for key software components such as deep learning frameworks (e.g., PyTorch, TensorFlow) or programming language versions.
Experiment Setup | Yes | The model is trained with 64 prompts in a batch, each generating 16 completions to form a group for advantage calculation. We take an AdamW optimizer with a learning rate of 5e-6, and (β1, β2) = (0.9, 0.999). We do not apply the KL penalty by default, as it provides marginal benefits in our experiments. The whole training lasts for 140 iterations on 16 H800 GPUs, which takes about 63 GPU days (i.e., about 4 days on wall clock with 16 GPUs).
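
The figures quoted in the Experiment Setup row can be sanity-checked with a small sketch. The excerpt only says that the 16 completions per prompt "form a group for advantage calculation"; it does not give the formula. The function below therefore assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation), which may differ from what the paper actually does. The names group_advantages, completions_per_iteration, and wall_clock_days, the eps constant, and the example rewards are all hypothetical and introduced here for illustration only.

```python
from statistics import mean, pstdev

# Values quoted in the Experiment Setup row above.
PROMPTS_PER_BATCH = 64
COMPLETIONS_PER_PROMPT = 16
ITERATIONS = 140
NUM_GPUS = 16
GPU_DAYS = 63


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of completions.

    ASSUMPTION: GRPO-style (reward - group mean) / (group std + eps).
    The excerpt does not confirm this exact formula.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def completions_per_iteration():
    """Completions sampled in one training iteration."""
    return PROMPTS_PER_BATCH * COMPLETIONS_PER_PROMPT


def wall_clock_days():
    """Wall-clock days implied by the reported GPU days and GPU count."""
    return GPU_DAYS / NUM_GPUS


if __name__ == "__main__":
    # Hypothetical binary rewards for one prompt: 4 correct, 12 incorrect.
    rewards = [1.0] * 4 + [0.0] * (COMPLETIONS_PER_PROMPT - 4)
    advantages = group_advantages(rewards)
    print("first/last advantage:", advantages[0], advantages[-1])
    print("completions per iteration:", completions_per_iteration())
    print("completions over training:", completions_per_iteration() * ITERATIONS)
    print("wall-clock days:", wall_clock_days())
```

Under these numbers, each iteration samples 64 × 16 = 1024 completions, and 63 GPU days spread over 16 GPUs comes to just under 4 days, consistent with the "about 4 days on wall clock" stated in the row.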