Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
ReDit: Reward Dithering for Improved LLM Policy Optimization
Authors: Chenxing Wei, Jiarui Yu, Ying He, Hande Dong, Yao SHU, Fei Richard Yu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across diverse tasks and different LLMs demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% the training steps, and furthermore, still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages. 5 Empirical Results |
| Researcher Affiliation | Collaboration | Guangdong Lab of AI and Digital Economy (SZ), China; College of Computer Science and Software Engineering, Shenzhen University, China; Hong Kong University of Science and Technology (Guangzhou), China; Tencent, Shenzhen, China; School of Information Technology, Carleton University, Canada |
| Pseudocode | Yes | Algorithm 1: ReDit within one optimization step |
| Open Source Code | Yes | Answer: [Yes] Justification: We provide the instruction and code in supplemental material. |
| Open Datasets | Yes | Our dataset selection and setup largely follow the methodology of [15], primarily to assess the mathematical reasoning capabilities of the models. This encompasses mathematical problem-solving datasets such as GSM8K [32] and MATH [42], as well as the multimodal geometric reasoning dataset Geometry3K [43]. |
| Dataset Splits | Yes | Table 3: Number of samples in the train, validation, and test splits of each dataset. GSM8K: train 7473, test 1319. MATH: train 7506, test 5003. Geometry3K: train 2100, validation 300, test 601. |
| Hardware Specification | Yes | All experiments were executed on one NVIDIA H20 GPU. |
| Software Dependencies | No | Our implementation leverages the official GRPO implementation within the TRL library [47]. Specific configurations for LoRA and GRPO parameters are detailed in the Appendix D.3. |
| Experiment Setup | Yes | For parameter-efficient fine-tuning, we employed Low-Rank Adaptation (LoRA) [46]. Our implementation leverages the official GRPO implementation within the TRL library [47]. Specific configurations for LoRA and GRPO parameters are detailed in the Appendix D.3. Table 4: LoRA Parameters. LoRA Target: q & v Proj; LoRA Rank: 8; LoRA Alpha: 64; LoRA Dropout: 0.05. Table 5: GRPO Parameters (Learning Rate, Num Generations, Epochs). Table 6: DAPO Parameters (Clip Ratio Low, Clip Ratio Low, Clip Ratio C, Num Generations Max): 0.2, 0.28, 10.0, 10. |
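
The table names reward dithering and GRPO but does not show how they interact. The sketch below is one possible illustration, not the authors' Algorithm 1. It assumes that dithering means adding zero-mean Gaussian noise to each discrete reward before GRPO-style group normalization. The helper names `dither_rewards` and `group_advantages`, the noise scale `sigma=0.05`, and the group size of 10 are all hypothetical choices made for this example.

```python
import random
import statistics


def dither_rewards(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to each reward.

    Hypothetical helper: the noise family and sigma are assumptions,
    not values taken from the paper.
    """
    rng = rng or random.Random()
    return [r + rng.gauss(0.0, sigma) for r in rewards]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# A group in which every completion earns the same discrete reward.
raw = [1.0] * 10

# Without dithering, the group has zero variance, so every advantage is zero
# and this group contributes no gradient signal.
plain = group_advantages(raw)

# With dithering, the rewards differ slightly, so advantages are non-zero.
dithered = group_advantages(dither_rewards(raw, sigma=0.05, rng=rng))

print(plain)
print(dithered)
```

The example only shows the degenerate case where all rewards in a group coincide; how the noise scale should be chosen or scheduled is left to the paper.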