Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
GPO: Learning from Critical Steps to Improve LLM Reasoning
Authors: Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process. |
| Researcher Affiliation | Collaboration | Jiahao Yu, Department of Computer Science, Northwestern University, EMAIL; Zelei Cheng, AI Foundations, Capital One, Department of Computer Science, Northwestern University, EMAIL; Xian Wu, Meta AI, EMAIL; Xinyu Xing, Department of Computer Science, Northwestern University, EMAIL |
| Pseudocode | Yes | Algorithm 1: GPO Optimization Framework |
| Open Source Code | Yes | To improve transparency and inspire future research, we also release the code and data (footnote 3) to facilitate reproducibility and further research. Footnote 3: https://github.com/sherdencooper/GPO |
| Open Datasets | Yes | We conduct extensive experiments on 7 diverse datasets, including general reasoning, mathematical problem solving, and STEM tasks. For mathematical problem solving, we use GSM8K [57], MATH-500 [58], AIME-2024 [59], and AIME-2025 [60]. For general reasoning, we utilize BIG-Bench Hard (BBH) [61]. For STEM problem solving, we employ MMLU [62] and MMLU-Pro [63]. |
| Dataset Splits | Yes | Standard train/test splits are used for GSM8K, MATH, MMLU, and MMLU-Pro. Following prior work [19, 64, 65], the AIME training set consists of problems from 1983-2023. For BBH, we randomly split the dataset into training set and test set by sub-task. Further dataset statistics can be found in E.1. ... Table 2: Dataset Statistics: The table presents the number of training and test samples for each dataset, along with the source of the dataset. |
| Hardware Specification | Yes | The experiments are conducted on a server equipped with four AMD EPYC 7702 64-Core CPU Processors and eight NVIDIA H100 80GB GPUs. |
| Software Dependencies | No | The paper mentions 'Our training process primarily utilizes the LLaMA-Factory framework [72].' but does not specify its version or provide other software dependencies with version numbers. |
| Experiment Setup | Yes | We list the main hyper-parameters for the experiments in Table 3. The GPO-enhanced methods are trained with the same hyper-parameters as the baseline methods. For those unmentioned hyper-parameters, we use the default values provided by the LLaMA-Factory framework [72]. Table 3: Hyper-parameters for Different Datasets and Methods. |