Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Authors: Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
Researcher Affiliation	Collaboration	Jiaru Zou¹, Ling Yang²,⁴, Jingwen Gu³, Jiahao Qiu², Ke Shen⁴, Jingrui He¹, Mengdi Wang²; ¹UIUC, ²Princeton University, ³Cornell University, ⁴ByteDance Seed
Pseudocode	No	The paper describes its methodology using textual explanations and mathematical equations (e.g., Eq. 1, 2, 4-14) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	Code: ReasonFlux-PRM-Code, Models: ReasonFlux-PRM-1.5B/7B. Our code implementation is submitted along with the manuscript.
Open Datasets	Yes	We evaluate ReasonFlux-PRM on four representative and challenging reasoning benchmarks, including MATH500 [13], a diverse set of 500 mathematical problems of varying difficulty; AIME24 [38], consisting of 30 problems from the 2024 American Invitational Mathematics Examination (AIME); AIME25, which includes 15 problems from the 2025 AIME [37]; and GPQA-Diamond [46], a benchmark of 198 PhD-level science questions to assess advanced scientific reasoning. The training data is primarily sourced from the public trajectory-response reasoning traces such as OpenThoughts-114K [53].
Dataset Splits	No	For offline data selection and subsequent supervised fine-tuning... We then rank all samples based on their aggregated reward scores and select the top 1,000 examples to serve as the training set for downstream fine-tuning. For online policy optimization, we use a training dataset comprising 10k competition-level mathematical reasoning problems collected from MATH [13] and the DAPO [74] training set. The paper describes the selection of training data from larger pools and the use of specific benchmarks for evaluation, but it does not provide explicit train/validation/test splits for these benchmarks themselves. For example, it lists the sizes of the evaluation benchmarks (e.g., MATH500 with 500 problems), implying they are used as test sets, but does not detail how they were split from a larger dataset or if standard splits were used with specific proportions or numbers for train/val/test.
Hardware Specification	Yes	All experiments are conducted on 8 A100 GPUs. All experiments are conducted on a server node with 8 A100-80G GPUs.
Software Dependencies	No	The paper mentions using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct models, as well as the Hugging Face GRPO Trainer [57], but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA, or other libraries.
Experiment Setup	Yes	We fine-tune the model for 5 epochs using a learning rate of 1e-5, weight decay of 1e-4, and a maximum sequence length of 32,768. We train with a batch size of 32, generating 6 samples per prompt, and run training for 3 epochs. To train our reward model, we use a learning rate of 1e-5 and train for 3 epochs. For the Best-of-N test-time scaling experiments...nucleus sampling with temperature T = 0.3, where N ∈ {2, 4, 8, 16}.
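
The table rows above mention two selection steps: ranking samples by aggregated reward score and keeping the top 1,000 for fine-tuning, and Best-of-N selection at test time. The following is a minimal sketch of those two steps only, not the authors' implementation. The function names `select_top_k` and `best_of_n`, the tie-breaking rule, and the caller-supplied `score_fn` are assumptions made for illustration; how ReasonFlux-PRM actually computes and aggregates rewards is not described in this record.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_k(samples: Sequence[T], scores: Sequence[float], k: int = 1000) -> List[T]:
    """Return the k samples with the highest scores (hypothetical helper).

    Ties are broken by original position, an assumption made here so the
    result is deterministic.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    # Sort indices by descending score, then by ascending original index.
    order = sorted(range(len(samples)), key=lambda i: (-scores[i], i))
    return [samples[i] for i in order[:k]]


def best_of_n(candidates: Sequence[T], score_fn: Callable[[T], float]) -> T:
    """Return the candidate that score_fn rates highest (hypothetical helper).

    score_fn stands in for a reward model; on ties the earliest candidate wins.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=score_fn)


if __name__ == "__main__":
    # Toy data only; real scores would come from a reward model.
    pool = ["a", "b", "c", "d"]
    pool_scores = [0.1, 0.9, 0.5, 0.9]
    print(select_top_k(pool, pool_scores, k=2))
    print(best_of_n(["x", "yyy", "zz"], score_fn=len))
```

In this sketch, the N in Best-of-N is simply the length of `candidates`, so the reported settings N ∈ {2, 4, 8, 16} would correspond to passing 2, 4, 8, or 16 sampled responses.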