Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GPO: Learning from Critical Steps to Improve LLM Reasoning

Authors: Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.
Researcher Affiliation	Collaboration	Jiahao Yu (Department of Computer Science, Northwestern University); Zelei Cheng (AI Foundations, Capital One; Department of Computer Science, Northwestern University); Xian Wu (Meta AI); Xinyu Xing (Department of Computer Science, Northwestern University)
Pseudocode	Yes	Algorithm 1: GPO Optimization Framework
Open Source Code	Yes	To improve transparency and inspire future research, we also release the code and data³ to facilitate reproducibility and further research. ³ https://github.com/sherdencooper/GPO
Open Datasets	Yes	We conduct extensive experiments on 7 diverse datasets, including general reasoning, mathematical problem solving, and STEM tasks. For mathematical problem solving, we use GSM8K [57], MATH-500 [58], AIME-2024 [59], and AIME-2025 [60]. For general reasoning, we utilize BIG-Bench Hard (BBH) [61]. For STEM problem solving, we employ MMLU [62] and MMLU-Pro [63].
Dataset Splits	Yes	Standard train/test splits are used for GSM8K, MATH, MMLU, and MMLU-Pro. Following prior work [19, 64, 65], the AIME training set consists of problems from 1983-2023. For BBH, we randomly split the dataset into training set and test set by sub-task. Further dataset statistics can be found in E.1. ... Table 2: Dataset Statistics: The table presents the number of training and test samples for each dataset, along with the source of the dataset.
Hardware Specification	Yes	The experiments are conducted on a server equipped with four AMD EPYC 7702 64-Core CPU Processors and eight NVIDIA H100 80GB GPUs.
Software Dependencies	No	The paper mentions 'Our training process primarily utilizes the LLaMA-Factory framework [72].' but does not specify its version or provide other software dependencies with version numbers.
Experiment Setup	Yes	We list the main hyper-parameters for the experiments in Table 3. The GPO-enhanced methods are trained with the same hyper-parameters as the baseline methods. For those unmentioned hyperparameters, we use the default values provided by the LLaMA-Factory framework [72]. Table 3: Hyper-parameters for Different Datasets and Methods.