Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Authors: Yang Xiao, Jiashuo WANG, Ruifeng Yuan, Chunpu Xu, Kaishuai Xu, Wenjie Li, Pengfei Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints. Code and dataset are available at the LIMOPro.
Researcher Affiliation	Academia	(1) The Hong Kong Polytechnic University; (2) Shanghai Jiao Tong University
Pseudocode	Yes	Algorithm 1 outlines the complete pipeline of our Perplexity-based Importance Refinement (PIR) framework.
Open Source Code	Yes	Code and dataset are available at the LIMOPro.
Open Datasets	Yes	Code and dataset are available at the LIMOPro.
Dataset Splits	No	The paper refers to 'Training Dataset' and 'Benchmark Datasets' but does not provide specific details on the train/validation/test splits for the datasets (S1K, LIMO, LIMO-V2) used for fine-tuning, nor does it cite predefined splits for these particular training data applications. It mentions 'Benchmark Datasets' for evaluation, implying they are test sets, but no explicit splits for the training data are given.
Hardware Specification	No	The paper mentions using models like Qwen2.5-32B-Instruct for perplexity calculation and fine-tuning but does not specify the hardware (e.g., GPU models, CPU types, memory) on which these operations were performed. The NeurIPS checklist for item 8 also points to a GitHub link for these settings, suggesting they are not in the paper itself.
Software Dependencies	No	The paper mentions specific models like 'Claude 3.7 Sonnet' and 'Qwen2.5-32B-Instruct' and 'Qwen2.5-Math evaluators' for tasks like segmentation, classification, and evaluation. However, it does not provide a list of ancillary software dependencies (e.g., programming languages, libraries, or frameworks) with specific version numbers that would be required to reproduce the experiments.
Experiment Setup	Yes	For each problem in our benchmark, we sample eight responses from the model and calculate ACC under the Zero-shot Chain-of-Thought (CoT) setting with the instruction of: Please reason step by step, and put your final answer within \boxed{}. We utilize Qwen2.5-Math evaluators [35] to systematically assess solution correctness across all solutions, with each sampling conducted at a temperature setting of 0.7 to balance deterministic reasoning with exploration of solution paths.
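
The evaluation protocol quoted in the Experiment Setup row (eight sampled responses per problem, scored for correctness) can be illustrated with a minimal sketch. It assumes that ACC is the per-problem fraction of correct samples, averaged over all problems; the quoted text does not spell out the aggregation, so this is one plausible reading rather than the authors' implementation. The function name `benchmark_accuracy` and the boolean input format are hypothetical, and the correctness judgments are assumed to have already been produced by an external evaluator.

```python
from statistics import mean

NUM_SAMPLES = 8     # responses sampled per problem, as quoted above
TEMPERATURE = 0.7   # sampling temperature, as quoted above (not used in scoring)


def benchmark_accuracy(correctness, num_samples=NUM_SAMPLES):
    """Average accuracy over problems (hypothetical helper).

    correctness: one list per problem, holding one boolean per sampled
    response (True if the evaluator judged the answer correct).
    Assumption: ACC = mean over problems of the per-problem fraction correct.
    """
    if not correctness:
        raise ValueError("no problems to score")
    per_problem = []
    for samples in correctness:
        if len(samples) != num_samples:
            raise ValueError(
                f"expected {num_samples} samples, got {len(samples)}"
            )
        # Fraction of the sampled responses judged correct for this problem.
        per_problem.append(sum(1 for ok in samples if ok) / num_samples)
    return mean(per_problem)


if __name__ == "__main__":
    # Three toy problems: all correct, none correct, half correct.
    toy = [[True] * 8, [False] * 8, [True] * 4 + [False] * 4]
    print(f"ACC = {benchmark_accuracy(toy):.3f}")
```

Averaging over several samples per problem reduces the variance introduced by sampling at a non-zero temperature, which matters on small benchmarks where a single problem shifts accuracy by several points.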