Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Provable Scaling Laws for the Test-Time Compute of Large Language Models
Authors: Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments with diverse models and datasets, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms. |
| Researcher Affiliation | Industry | Yanxi Chen Alibaba Group EMAIL Xuchen Pan Alibaba Group EMAIL Yaliang Li Alibaba Group EMAIL Bolin Ding Alibaba Group EMAIL Jingren Zhou Alibaba Group EMAIL |
| Pseudocode | Yes | See Algorithm 1 for a summary of this method. |
| Open Source Code | Yes | Our implementations can be found at https://github.com/pan-x-c/AgentScope/tree/feature/pxc/paper_provable/examples/paper_provable_scaling_law |
| Open Datasets | Yes | We use three datasets for our experiments: GPQA [33], MMLU-Pro [42] and MATH-500 [26]. |
| Dataset Splits | No | Due to limited computational resources, we use a randomly sampled subset of 100 questions for each category of MMLU-Pro in our experiments, which leads to a total of 1400 questions; we refer to this subset as MMLU-Pro-S throughout this work. MATH-500 is a subset of 500 problems from the MATH dataset introduced in [22]. The paper describes how custom subsets were created for MMLU-Pro-S and MATH-500, but does not provide specific training/test/validation splits for these subsets or the GPQA dataset in its experiments. |
| Hardware Specification | No | This work involves a large number of experiments that were executed on different days and possibly on different machines, which makes it difficult to track the computer resources for each of them. We have provided detailed information about the datasets, LLMs and hyperparameters (e.g., N and K) for our experiments, which can be useful for estimating the amount of computer resources needed to reproduce the experiments. |
| Software Dependencies | No | The paper mentions using AgentScope [9] but does not provide specific version numbers for it or for any other key software dependencies, such as programming languages or libraries. |
| Experiment Setup | Yes | Throughout our experiments, the temperature for LLM decoding is set to 0.5 for the generation stage, and 0.1 for pairwise comparisons during the aggregation stage. Unless specified otherwise, for the knockout-style algorithm, we fix K = 4 for Llama3.1/Qwen2.5/Mixed, and K = 2 for GPT-4o/QwQ-32B; for the league-style algorithm, we consider a round-robin [46] version of it, with K = 4 comparisons conducted between each of $\binom{N}{2}$ pairs of initial candidates. We leverage zero-shot chain-of-thought prompting [18] for both generation and aggregation stages of the proposed algorithms. |
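
The Experiment Setup row refers to a knockout-style and a round-robin league-style aggregation over N candidate solutions, with K pairwise comparisons per pair. The Python sketch below only illustrates that general shape and is not the authors' implementation. The `compare` callable is a hypothetical stand-in for an LLM pairwise judgment, and the tie-breaking rule, the handling of an odd number of candidates, and the total-wins scoring in the league are all assumptions made for the example.

```python
import itertools
from typing import Callable, List

# Hypothetical comparator: returns True if `a` is judged better than `b`.
# In the paper's setting this would be an LLM call; here it is any callable.
Compare = Callable[[str, str], bool]


def pairwise_winner(a: str, b: str, compare: Compare, k: int) -> str:
    """Compare a and b k times and return the majority winner.

    Assumption: ties are broken in favor of `a`.
    """
    wins_a = sum(1 for _ in range(k) if compare(a, b))
    return a if 2 * wins_a >= k else b


def knockout(candidates: List[str], compare: Compare, k: int) -> str:
    """Single-elimination tournament over the candidates."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = [
            pairwise_winner(pool[i], pool[i + 1], compare, k)
            for i in range(0, len(pool) - 1, 2)
        ]
        if len(pool) % 2 == 1:
            # Assumption: an unpaired candidate advances without a match.
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def round_robin_league(candidates: List[str], compare: Compare, k: int) -> str:
    """Every pair of candidates is compared k times; most total wins is returned."""
    wins = [0] * len(candidates)
    for i, j in itertools.combinations(range(len(candidates)), 2):
        for _ in range(k):
            if compare(candidates[i], candidates[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    best = max(range(len(candidates)), key=lambda idx: wins[idx])
    return candidates[best]
```

With N candidates, the round-robin version issues K comparisons for each of the N(N-1)/2 pairs, so N = 4 and K = 4 gives 24 comparator calls, while the knockout version needs only N - 1 matches of K comparisons each.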