Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
Authors: Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, Guo-Jun Qi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, +19.5% on GSM8K, MATH, MBPP, and HumanEval. |
| Researcher Affiliation | Collaboration | 1 Zhejiang University; 2 MAPLE Lab, Westlake University; 3 Matterwave Intelligence; 4 Institute of Advanced Technology, Westlake Institute for Advanced Study. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 A General Framework for Training DCoLT |
| Open Source Code | Yes | https://github.com/maple-research-lab/LLaDOU |
| Open Datasets | Yes | Table 11: Reference assets and their licenses (Asset: License, Utility). SEDD [24]: MIT, Code & Model; GSM8K-Aug [10]: license not listed, Data; LLaDA [27]: MIT, Code & Model; MATH [16]: MIT, Data; GSM8K [8]: MIT, Data; KodCode [41]: CC BY-NC 4.0, Data. |
| Dataset Splits | Yes | For GSM8K, there are 7.5K questions for training and 1.32K questions for testing. For MATH, there are 7.5K questions for training and 5K questions for testing. |
| Hardware Specification | Yes | Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for key software components like deep learning frameworks (e.g., PyTorch, TensorFlow) or programming language versions. |
| Experiment Setup | Yes | The model is trained with 64 prompts in a batch, each generating 16 completions to form a group for advantage calculation. We take an AdamW optimizer with a learning rate of 5e-6, and (β1, β2) = (0.9, 0.999). We do not apply the KL penalty by default, as it provides marginal benefits in our experiments. The whole training lasts for 140 iterations on 16 H800 GPUs, which takes about 63 GPU days (i.e., about 4 days on wall clock with 16 GPUs). |
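
The Experiment Setup row can be restated as a small configuration sketch, which also makes the reported arithmetic easy to check. The hyperparameter values below are taken from the table. The `TrainConfig` class, the `group_advantages` helper, and the example rewards are hypothetical names and data introduced here for illustration. The table does not state the advantage formula, so the mean-centered, standard-deviation-normalized form below is an assumption (a common GRPO-style choice), not necessarily the authors' exact method.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class TrainConfig:
    # Values as reported in the Experiment Setup row.
    prompts_per_batch: int = 64
    completions_per_prompt: int = 16
    learning_rate: float = 5e-6
    adam_betas: tuple = (0.9, 0.999)
    iterations: int = 140
    num_gpus: int = 16
    gpu_days: float = 63.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of completions.

    ASSUMPTION: (reward - group mean) / (group std + eps). The source only
    says that the 16 completions "form a group for advantage calculation".
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


cfg = TrainConfig()

# Completions sampled per training iteration: 64 prompts x 16 completions.
completions_per_iteration = cfg.prompts_per_batch * cfg.completions_per_prompt

# Wall-clock estimate implied by the reported GPU days on 16 GPUs.
wall_clock_days = cfg.gpu_days / cfg.num_gpus

# Hypothetical binary rewards for one group: 4 correct, 12 incorrect.
example_rewards = [1.0] * 4 + [0.0] * 12
example_advantages = group_advantages(example_rewards)
```

Under these numbers each iteration samples 1024 completions, and 63 GPU days spread over 16 GPUs comes to roughly 3.9 days, consistent with the "about 4 days on wall clock" stated in the row.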