Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

ReDit: Reward Dithering for Improved LLM Policy Optimization

Authors: Chenxing Wei, Jiarui Yu, Ying He, Hande Dong, Yao SHU, Fei Richard Yu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across diverse tasks and different LLMs demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% of the training steps, and furthermore, still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages. (Section 5: Empirical Results)
Researcher Affiliation | Collaboration | Guangdong Lab of AI and Digital Economy (SZ), China; College of Computer Science and Software Engineering, Shenzhen University, China; Hong Kong University of Science and Technology (Guangzhou), China; Tencent, Shenzhen, China; School of Information Technology, Carleton University, Canada
Pseudocode | Yes | Algorithm 1: ReDit within one optimization step
Open Source Code | Yes | Answer: [Yes]. Justification: We provide the instructions and code in the supplemental material.
Open Datasets | Yes | Our dataset selection and setup largely follow the methodology of [15], primarily to assess the mathematical reasoning capabilities of the models. This encompasses mathematical problem-solving datasets such as GSM8K [32] and MATH [42], as well as the multimodal geometric reasoning dataset Geometry3K [43].
Dataset Splits | Yes | Table 3: Number of samples in the train, validation, and test splits of each dataset. GSM8K: 7473 train, 1319 test; MATH: 7506 train, 5003 test; Geometry3K: 2100 train, 300 validation, 601 test.
Hardware Specification | Yes | All experiments were executed on one NVIDIA H20 GPU.
Software Dependencies | No | Our implementation leverages the official GRPO implementation within the TRL library [47]. Specific configurations for LoRA and GRPO parameters are detailed in Appendix D.3.
Experiment Setup | Yes | For parameter-efficient fine-tuning, we employed Low-Rank Adaptation (LoRA) [46]. Our implementation leverages the official GRPO implementation within the TRL library [47]. Specific configurations for LoRA and GRPO parameters are detailed in Appendix D.3. Table 4 (LoRA parameters): LoRA Target = q & v Proj; LoRA Rank = 8; LoRA Alpha = 64; LoRA Dropout = 0.05. Table 5 (GRPO parameters; column headers only): Learning Rate, Num Generations, Epochs. Table 6 (DAPO parameters): Clip Ratio Low = 0.2; Clip Ratio High = 0.28; Clip Ratio C = 10.0; Num Generations Max = 10.
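
The LoRA and DAPO values quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration, not the authors' configuration file: the dictionary names are hypothetical, the module names `q_proj` and `v_proj` are an assumed reading of "q & v Proj", and the second clip column is assumed to be the upper bound (0.28). The keys of `lora_kwargs` follow the argument names of PEFT's `LoraConfig`, so they could plausibly be passed on as `LoraConfig(**lora_kwargs)`. The GRPO values of Table 5 are not given above and are therefore left out.

```python
# Values transcribed from Tables 4 and 6 as quoted above; names are illustrative.
lora_kwargs = {
    "r": 8,                                  # LoRA Rank
    "lora_alpha": 64,                        # LoRA Alpha
    "lora_dropout": 0.05,                    # LoRA Dropout
    "target_modules": ["q_proj", "v_proj"],  # assumed names for "q & v Proj"
}

dapo_params = {
    "clip_ratio_low": 0.2,
    "clip_ratio_high": 0.28,   # assumed: the second clip column is the upper bound
    "clip_ratio_c": 10.0,
    "num_generations_max": 10,
}

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

# Asymmetric clipping keeps the policy ratio inside [1 - low, 1 + high].
clip_range = (
    1 - dapo_params["clip_ratio_low"],
    1 + dapo_params["clip_ratio_high"],
)


def clip_ratio(ratio, low=dapo_params["clip_ratio_low"],
               high=dapo_params["clip_ratio_high"]):
    """Clamp a policy probability ratio to the asymmetric clip range."""
    return max(1 - low, min(1 + high, ratio))
```

With these numbers the LoRA update is scaled by 64 / 8 = 8, and the clipped policy ratio stays roughly within [0.8, 1.28].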
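
The table records that the paper contains pseudocode (Algorithm 1) but does not reproduce it. As a rough illustration of what reward dithering might look like, the sketch below assumes that dithering means adding zero-mean Gaussian noise to each reward before the usual GRPO group normalization. The function names, the noise distribution, and the value of `sigma` are assumptions made for this example and are not taken from the paper.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dither_rewards(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to each reward (assumed form of dithering)."""
    rng = rng or random.Random()
    return [r + rng.gauss(0.0, sigma) for r in rewards]


# A group of sampled completions that all received the same discrete reward.
rewards = [1.0, 1.0, 1.0, 1.0]

# Without noise, every advantage is zero, so the group gives no gradient signal.
plain = group_advantages(rewards)

# With noise, the rewards differ slightly and the advantages become non-zero.
rng = random.Random(0)  # fixed seed so the example is repeatable
dithered = group_advantages(dither_rewards(rewards, sigma=0.05, rng=rng))
```

In this toy case the undithered group yields all-zero advantages, while the dithered group yields small non-zero advantages that still sum to approximately zero. Whether this matches the "gradient issues" the paper visualizes is an interpretation, not something the table states.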