Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Diversity-Aware Policy Optimization for Large Language Model Reasoning
Authors: Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, KC Tan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and Potential@k (a novel metric quantifying an LLM's reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Integrated into the R1-zero training framework, our method achieves a 3.5% average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions. To summarize, our key contributions are: We evaluate our method on four mathematical reasoning benchmarks, each comprising at least 500 problems with stable evaluation metrics. Our method achieves a 3.5% average improvement over standard R1-zero training and consistently produces more diverse solutions. 5 Experiments |
| Researcher Affiliation | Academia | Jian Yao1, Ran Cheng1,2,3, Xingyu Wu1, Jibin Wu1,2, Kay Chen Tan1 1 Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University 2 Department of Computing, The Hong Kong Polytechnic University 3 The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, China EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and formulas but does not include a clearly labeled pseudocode or algorithm block with structured steps. |
| Open Source Code | Yes | The code is available at https://github.com/nigelyaoj/R1_zero_Div. |
| Open Datasets | Yes | Benchmarks: We selected 4 mathematical benchmarks to evaluate the models' reasoning abilities: GSM8K [7], MATH500 [16], Olympiad Bench [14], and College Math [42]. Each contains at least 500 data points for testing. The data we use for the experiments are all from open-access datasets. |
| Dataset Splits | Yes | We train the base model on the GSM8K training set and then evaluate on the 4 benchmarks. |
| Hardware Specification | Yes | all deployed on 8 NVIDIA A6000 GPUs. |
| Software Dependencies | Yes | For training R1-zero and R1-zero-Div, the codebase runs on Python 3.11, utilizing TRL 0.16.0 [46] with PyTorch 2.5.1. We employ DeepSpeed [39] for distributed training and incorporate vLLM 0.7.2 [27] for efficient rollout, all deployed on 8 NVIDIA A6000 GPUs. |
| Experiment Setup | Yes | We provide the system prompt in Figure 2 and other detailed hyperparameter settings in Table 6. Due to computational resource constraints, we train on the simpler dataset (GSM8K), which allows for a shorter maximum response length, and use a well-designed prompt to obtain a stronger initial checkpoint. The experiment settings for R1-zero and R1-zero-Div are the same except for λ = 0 in R1-zero and λ = 0.01 in R1-zero-Div. |
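
The Software Dependencies row pins specific versions, so anyone attempting a rerun may want to confirm that a local environment matches before training. The following is a minimal sketch, not part of the paper or its repository: the helper names (`installed_version`, `matches`, `check_environment`) are hypothetical, and the distribution names `trl`, `torch`, and `vllm` are assumed to be the PyPI names of the libraries quoted above. DeepSpeed is left out because the table states no version for it.

```python
import sys
from importlib import metadata

# Versions quoted in the Software Dependencies row.
# Assumption: these keys are the PyPI distribution names of the quoted libraries.
EXPECTED = {"trl": "0.16.0", "torch": "2.5.1", "vllm": "0.7.2"}
EXPECTED_PYTHON = (3, 11)


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, expected):
    """True when versions agree, ignoring a local build suffix such as '+cu121'."""
    return installed is not None and installed.split("+")[0] == expected


def check_environment(expected=EXPECTED, lookup=installed_version):
    """Map each distribution name to a tuple (installed, expected, ok)."""
    report = {}
    for name, want in expected.items():
        have = lookup(name)
        report[name] = (have, want, matches(have, want))
    return report


if __name__ == "__main__":
    py_ok = sys.version_info[:2] == EXPECTED_PYTHON
    status = "ok" if py_ok else "mismatch"
    print(f"python {sys.version_info.major}.{sys.version_info.minor}: {status}")
    for name, (have, want, ok) in check_environment().items():
        status = "ok" if ok else "mismatch"
        print(f"{name}: installed={have} expected={want} {status}")
```

The `lookup` parameter exists only so the check can be exercised without the packages installed; in normal use the default reads the active environment.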