Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Provable Scaling Laws for the Test-Time Compute of Large Language Models

Authors: Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments with diverse models and datasets, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.
Researcher Affiliation	Industry	Yanxi Chen (Alibaba Group), Xuchen Pan (Alibaba Group), Yaliang Li (Alibaba Group), Bolin Ding (Alibaba Group), Jingren Zhou (Alibaba Group)
Pseudocode	Yes	See Algorithm 1 for a summary of this method.
Open Source Code	Yes	Our implementations can be found at https://github.com/pan-x-c/AgentScope/tree/feature/pxc/paper_provable/examples/paper_provable_scaling_law
Open Datasets	Yes	We use three datasets for our experiments: GPQA [33], MMLU-Pro [42] and MATH-500 [26].
Dataset Splits	No	Due to limited computational resources, we use a randomly sampled subset of 100 questions for each category of MMLU-Pro in our experiments, which leads to a total of 1400 questions; we refer to this subset as MMLU-Pro-S throughout this work. MATH-500 is a subset of 500 problems from the MATH dataset introduced in [22]. The paper describes how custom subsets were created for MMLU-Pro-S and MATH-500, but does not provide specific training/test/validation splits for these subsets or the GPQA dataset in its experiments.
Hardware Specification	No	This work involves a large number of experiments that were executed on different days and possibly on different machines, which makes it difficult to track the computer resources for each of them. We have provided detailed information about the datasets, LLMs and hyperparameters (e.g., N and K) for our experiments, which can be useful for estimating the amount of computer resources needed to reproduce the experiments.
Software Dependencies	No	The paper mentions using AgentScope [9] but does not provide specific version numbers for it or for any other key software dependencies, such as programming languages or libraries.
Experiment Setup	Yes	Throughout our experiments, the temperature for LLM decoding is set to 0.5 for the generation stage, and 0.1 for pairwise comparisons during the aggregation stage. Unless specified otherwise, for the knockout-style algorithm, we fix K = 4 for Llama3.1/Qwen2.5/Mixed, and K = 2 for GPT-4o/QwQ-32B; for the league-style algorithm, we consider a round-robin [46] version of it, with K = 4 comparisons conducted between each of the (N choose 2) pairs of initial candidates. We leverage zero-shot chain-of-thought prompting [18] for both generation and aggregation stages of the proposed algorithms.
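
The experiment setup quoted above names two aggregation schemes, knockout-style and round-robin league-style, each using K pairwise comparisons per pair of candidates. The following is a minimal Python sketch of how such schemes might be structured; it is not the authors' implementation. The function names, the `prefers_first` comparator (a stand-in for an LLM pairwise judgment), the tie-breaking rule, the handling of an odd candidate via a bye, and the selection of the league winner by total win count are all assumptions made for illustration.

```python
import itertools
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Decoding temperatures quoted in the experiment setup.
GENERATION_TEMPERATURE = 0.5
COMPARISON_TEMPERATURE = 0.1

# Hypothetical comparator: returns True if the first candidate is preferred.
# In practice this would wrap an LLM call; here it is left abstract.
Comparator = Callable[[T, T], bool]


def majority_winner(a: T, b: T, prefers_first: Comparator, k: int) -> T:
    """Compare a and b k times; return the candidate with more wins.

    Assumption: ties (possible when k is even) are broken in favor of a.
    """
    wins_a = sum(1 for _ in range(k) if prefers_first(a, b))
    return a if 2 * wins_a >= k else b


def knockout(candidates: List[T], prefers_first: Comparator, k: int) -> T:
    """Pair up candidates, keep each pair's majority winner, repeat."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = [
            majority_winner(pool[i], pool[i + 1], prefers_first, k)
            for i in range(0, len(pool) - 1, 2)
        ]
        if len(pool) % 2 == 1:
            # Assumption: an unpaired candidate advances without a match.
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def round_robin(candidates: List[T], prefers_first: Comparator, k: int) -> T:
    """Compare every pair k times; return the candidate with the most wins."""
    if not candidates:
        raise ValueError("need at least one candidate")
    wins = [0] * len(candidates)
    for i, j in itertools.combinations(range(len(candidates)), 2):
        for _ in range(k):
            if prefers_first(candidates[i], candidates[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    best = max(range(len(candidates)), key=lambda idx: wins[idx])
    return candidates[best]


def round_robin_comparisons(n: int, k: int) -> int:
    """Total comparisons: k for each of the n * (n - 1) / 2 pairs."""
    return k * n * (n - 1) // 2
```

Under these assumptions, a round-robin over N = 8 candidates with K = 4 would require 4 × 28 = 112 pairwise comparisons, which gives a rough sense of how comparison cost grows with N in the league-style setting.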