Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization
Authors: Chenglong Wang, Yang Gan, Hang Zhou, Chi Hu, Yongyu Mu, Kai Song, MuRun Yang, Bei Li, Chunliang Zhang, Tongran Liu, JingBo Zhu, Zhengtao Yu, Tong Xiao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant denoising speedups while maintaining high performance across reasoning tasks. We evaluate the effectiveness of our multi-reward optimization (MRO) approach with various optimization algorithms, including test-time scaling, rejection sampling, and reinforcement learning. We conducted our experiments using the LLaDA-8B-Instruct model. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science and Engineering, Northeastern University, Shenyang, China; (2) ByteDance; (3) NiuTrans Research, Shenyang, China; (4) CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China; (5) Kunming University of Science and Technology |
| Pseudocode | Yes | Algorithm 1 Simplified Grouped Reward Optimization (SGRO) |
| Open Source Code | Yes | Our codebase could be found at https://github.com/wangclnlp/MRO. |
| Open Datasets | Yes | We considered five reasoning benchmarks across three categories: (1) Mathematical reasoning: GSM8K and MATH500; (2) Scientific reasoning: GPQA, which focuses on biology, physics, and chemistry reasoning; (3) Logical reasoning: 4x4 Sudoku and the Countdown task with 3 numbers. More experimental details can be found in Appendix C. For both rejection sampling and reinforcement learning, we utilized DeepScaleR [55] in conjunction with the 10k Countdown and Sudoku datasets. These datasets were randomly shuffled to ensure a well-balanced data distribution. |
| Dataset Splits | No | For GSM8K and MATH500, we use 4-shot, while for GPQA, Countdown, and Sudoku, we use 5-shot. More specifically, during the reinforcement learning training, we use the GSM8K and MATH500 training sets to perform the MRO, respectively. During training, we performed model validation every 50 steps and selected the best model based on performance on the validation set as our final model. |
| Hardware Specification | No | The models in this paper are all public, and only the inference is needed. The computer resources for running these models are well known. |
| Software Dependencies | No | Here, we use lmppl to implement it. |
| Experiment Setup | Yes | For training LLaDA-s1, we used a pre-trained version of LLaDA. The learning rate was set to 2e-5. We trained this model on the s1 dataset for 3 epochs. For rejection sampling and reinforcement learning, we set the learning rate to 2e-6. During training, we performed model validation every 50 steps and selected the best model based on performance on the validation set as our final model. For computing $R^{\mathrm{tv}}_t$, we sampled one token from the predicted masked tokens at each denoising step. For $R^{\mathrm{ppl}}_t$, we set $C_{\mathrm{ppl}}$ and $F_{\mathrm{ppl}}$ to 100 and 100, respectively. The temperature was set to 0.25. |
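
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration of the reported values, not the authors' code: the class names, field names, and the helpers `should_validate` and `select_best_checkpoint` are hypothetical, and the excerpt does not say which stage the temperature of 0.25 applies to, so it is kept in its own class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTConfig:
    """Training LLaDA-s1 on the s1 dataset (values quoted in the table)."""
    learning_rate: float = 2e-5
    num_epochs: int = 3


@dataclass(frozen=True)
class MROConfig:
    """Rejection sampling / reinforcement learning (values quoted in the table)."""
    learning_rate: float = 2e-6
    validation_interval_steps: int = 50
    tokens_sampled_per_denoising_step: int = 1  # used for the R^tv_t reward
    c_ppl: float = 100.0                        # C_ppl for the R^ppl_t reward
    f_ppl: float = 100.0                        # F_ppl for the R^ppl_t reward


@dataclass(frozen=True)
class SamplingConfig:
    """Assumption: the excerpt does not state which stage this applies to."""
    temperature: float = 0.25


def should_validate(step: int, cfg: MROConfig) -> bool:
    """Hypothetical helper: True on every validation_interval_steps-th step."""
    return step > 0 and step % cfg.validation_interval_steps == 0


def select_best_checkpoint(val_scores: dict[int, float]) -> int:
    """Hypothetical helper: return the step with the highest validation score."""
    if not val_scores:
        raise ValueError("no validation scores recorded")
    return max(val_scores, key=val_scores.get)
```

For example, with made-up validation scores `{50: 0.61, 100: 0.66, 150: 0.64}`, `select_best_checkpoint` returns step 100, mirroring the "best model on the validation set" selection described in the row.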