Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Kinetics: Rethinking Test-Time Scaling Law
Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Beidi Chen
NeurIPS 2025 | Venue PDF | LLM Run Details | Input Tokens: 34,085 | Output Tokens: 3,193 (including reasoning/thinking tokens)
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over 60 point gains in low-cost regimes and over 5 point gains in high-cost regimes for problem-solving accuracy on AIME and LiveCodeBench. |
| Researcher Affiliation | Academia | Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen. Carnegie Mellon University, Pittsburgh, PA |
| Pseudocode | Yes | Algorithm 1: Best-of-N optimal resource allocation under cost C. Algorithm 2: Long-CoTs optimal resource allocation under cost C. |
| Open Source Code | Yes | Answer: [Yes]. Justification: We include our code in the supplementary material and plan to publicize the code in the future. |
| Open Datasets | Yes | We focus on three challenging reasoning benchmarks: AIME24 [59], AIME25 [60], math datasets spanning algebra, combinatorics, and geometry, and LiveCodeBench [39], which includes complex programming problems from recent coding competitions. |
| Dataset Splits | Yes | For LiveCodeBench, we sample 50 problems from the v5 subset (24 hard, 16 medium, 10 easy). |
| Hardware Specification | Yes | We utilize the specs from the latest and most powerful Nvidia B200 as the basis of our theoretical studies. This approach achieves up to a 25× wallclock speedup on H200 GPUs. We illustrate the benefit of block top-k attention across different model sizes on 8 H200 machines with an extremely large batch size of 4096. |
| Software Dependencies | No | We build our inference backend on Flashinfer [94], incorporating support for paged attention [46] and continuous batching [95]. Justification: The paper mentions using Flashinfer, paged attention, and continuous batching but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For Long-CoTs, we fix N = 1 in Equation (10) and vary n. For Best-of-N, we fix n = 32,768, and estimate the solving rate (Pass@K) following the methodology of Brown et al. [4]. Empirically, we sweep over KV budgets {32, 64, 128, 256, 512, 1024}; reasoning trials {1, 2, 4, 8, 16, 32} (with a reduced upper limit for the 14B and 32B models to save computation time); and generation lengths {2k, 4k, 6k, 8k, 10k, 12k, 14k, 16k, 18k, 20k, 22k, 24k, 26k, 28k, 30k, 32k}. |
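
To make the sweep described in the Experiment Setup row concrete, the following is a minimal sketch of how such a grid could be enumerated and how a Pass@K value could be estimated. It is not the authors' code. The names `sweep_grid` and `pass_at_k` are hypothetical, "k" in the generation lengths is assumed to mean 1024 tokens, and the estimator shown is the commonly used unbiased combinatorial form, which is assumed here rather than confirmed as the exact procedure of Brown et al. [4].

```python
from itertools import product
from math import comb

# Values quoted in the Experiment Setup row.
KV_BUDGETS = [32, 64, 128, 256, 512, 1024]
REASONING_TRIALS = [1, 2, 4, 8, 16, 32]
# Assumption: "2k ... 32k" means multiples of 1024 tokens, in steps of 2k.
GEN_LENGTHS = [k * 1024 for k in range(2, 33, 2)]


def sweep_grid():
    """Return every (kv_budget, trials, gen_length) combination in the sweep."""
    return list(product(KV_BUDGETS, REASONING_TRIALS, GEN_LENGTHS))


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased Pass@K estimate from num_samples draws, num_correct of them correct.

    Assumed estimator: 1 - C(num_samples - num_correct, k) / C(num_samples, k).
    """
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must lie in [0, num_samples]")
    if not 1 <= k <= num_samples:
        raise ValueError("k must lie in [1, num_samples]")
    if num_samples - num_correct < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


if __name__ == "__main__":
    grid = sweep_grid()
    print(len(grid), "configurations in the full grid")
    print(pass_at_k(num_samples=32, num_correct=4, k=8))
```

The full grid is the Cartesian product of the three lists; the source notes that the 14B and 32B models use a reduced upper limit on reasoning trials, which this sketch does not model.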