Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization
Authors: Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various tasks including mathematical reasoning and free-form natural reasoning are conducted to validate the proposed method. Our contributions are summarized as follows: Experiments on both math reasoning tasks with deterministic golden answers and free-form natural reasoning tasks are conducted to validate the efficacy and versatility of EMPO. |
| Researcher Affiliation | Collaboration | Qingyang Zhang Tianjin University Haitao Wu Tianjin University Changqing Zhang Tianjin University Peilin Zhao Shanghai Jiao Tong University & Tencent AI Lab Yatao Bian National University of Singapore & Tencent AI Lab |
| Pseudocode | Yes | Algorithm 1: Semantic Clustering... Algorithm 2: Implementation of verifier for mathematical reasoning tasks. ... Algorithm 3: Implementation of verifier for natural reasoning tasks. ... Algorithm 4: Python code of data filtering in a huggingface-like style. |
| Open Source Code | Yes | Code is publicly available at https://github.com/QingyangZhang/EMPO. |
| Open Datasets | Yes | For mathematical reasoning, following the common practice [30, 8, 31], we adopt 20,000 prompts randomly selected from NuminaMath-CoT dataset [32] for training without additional data engineering. For free-form natural reasoning tasks, we adopt the prompts from NaturalReasoning, a large-scale dataset consisting of diverse reasoning questions from multiple domains (e.g., Physics, Computer Science, Economics, Social Sciences and more). For training efficiency, we filter out the questions with over-long prompt or reference answer. ... Evaluation. For mathematical reasoning, the performance is evaluated on a diverse suite of benchmarks including Minerva Math, MATH, AMC23, OlympiadBench and AIME24. For free-form natural reasoning, we evaluate on MMLU-Pro [36] and GPQA [37] benchmarks. |
| Dataset Splits | Yes | For mathematical reasoning, following the common practice [30, 8, 31], we adopt 20,000 prompts randomly selected from NuminaMath-CoT dataset [32] for training without additional data engineering. ... The final training subset is consisted of 18,000 questions. |
| Hardware Specification | Yes | This adjustment was made to fit the limited GPU memory of one single 8×A100 80G machine. |
| Software Dependencies | No | We implement GRPO via verl [39]. We train models by supervised finetuning via Open-Instruct [38]... For more general free-form natural reasoning, we leverage General-Verifier (a compact small language model with 1.5B parameters) ... `from math_verify import parse, verify` ... `from datasets import load_dataset` ... `from_pretrained(...)` |
| Experiment Setup | Yes | SFT: We train models by supervised finetuning via Open-Instruct [38] with a fixed learning rate of $1 \times 10^{-6}$, a global batch size of 128 and train for 1 epoch with a max length of 2048. GRPO: We sample 16 and 12 responses for each prompt for mathematical and natural reasoning tasks respectively. For Qwen2.5-Math model series, we train the model for 300 steps with a maximum generation length of 3096. For OctoThinker model series, we train the model for 100 steps with a maximum generation length of 16K. We adopt a train prompt batch size of 256 and mini-batch size of 32. ... EMPO: Most hyper-parameters of our method, e.g., number of generations, max generation length, batch size, learning rate are the same with GRPO. ... We provide a brief summary of our training recipes in Table 5. |
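
The GRPO figures quoted in the Experiment Setup row imply a per-step rollout budget, which the sketch below works out. It is an illustration only: the class `GRPORecipe` and its field names are hypothetical and not taken from the paper or from verl, and pairing the Qwen2.5-Math series with 16 responses per prompt assumes that series is the one trained on the mathematical reasoning task. It also assumes the mini-batch size is counted in prompts, like the prompt batch size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPORecipe:
    """Hyperparameters quoted in the Experiment Setup row.

    Field names are hypothetical; they are not verl configuration keys.
    """

    responses_per_prompt: int
    train_steps: int
    max_generation_length: int
    prompt_batch_size: int = 256  # "train prompt batch size of 256"
    mini_batch_size: int = 32     # "mini-batch size of 32"

    def responses_per_step(self) -> int:
        # Every prompt in the batch is sampled responses_per_prompt times.
        return self.prompt_batch_size * self.responses_per_prompt

    def mini_batches_per_step(self) -> int:
        # Assumes the mini-batch size is measured in prompts.
        return self.prompt_batch_size // self.mini_batch_size


# Assumption: Qwen2.5-Math is trained on the math task (16 responses per prompt).
qwen_math = GRPORecipe(
    responses_per_prompt=16,
    train_steps=300,
    max_generation_length=3096,
)

print(qwen_math.responses_per_step())     # sampled responses per training step
print(qwen_math.mini_batches_per_step())  # mini-batches per rollout batch
```

Under these assumptions, one training step samples 256 × 16 = 4096 responses and splits the prompt batch into 8 mini-batches.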