Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Kinetics: Rethinking Test-Time Scaling Law

Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Beidi Chen

NeurIPS 2025 | Venue PDF | LLM Run Details | Input Tokens: 34,085 (total tokens sent to the LLM as input for this paper's analysis) | Output Tokens: 3,193 (total tokens produced by the LLM, including reasoning/thinking tokens, for this paper's analysis)

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over 60 point gains in low-cost regimes and over 5 point gains in high-cost regimes for problem-solving accuracy on AIME and LiveCodeBench.
Researcher Affiliation	Academia	Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen; Carnegie Mellon University, Pittsburgh, PA; EMAIL
Pseudocode	Yes	Algorithm 1: Best-of-N optimal resource allocation under cost C; Algorithm 2: Long-CoTs optimal resource allocation under cost C
Open Source Code	Yes	Answer: [Yes]. Justification: We include our code in the supplementary material and plan to publicize the code in the future.
Open Datasets	Yes	We focus on three challenging reasoning benchmarks: AIME24 [59], AIME25 [60], math datasets spanning algebra, combinatorics, and geometry, and LiveCodeBench [39], which includes complex programming problems from recent coding competitions.
Dataset Splits	Yes	For LiveCodeBench, we sample 50 problems from the v5 subset (24 hard, 16 medium, 10 easy).
Hardware Specification	Yes	We utilize the specs from the latest and most powerful Nvidia B200 as the basis of our theoretical studies. This approach achieves up to a 25 wallclock speedup on H200 GPUs. We illustrate the benefit of block top-k attention across different model sizes on 8 H200 machines with an extremely large batch size of 4096.
Software Dependencies	No	We build our inference backend on FlashInfer [94], incorporating support for paged attention [46] and continuous batching [95]. Justification: The paper mentions using FlashInfer, paged attention, and continuous batching but does not provide specific version numbers for these software components.
Experiment Setup	Yes	For Long-CoTs, we fix N = 1 in Equation (10) and vary n. For Best-of-N, we fix n = 32,768, and estimate the solving rate (Pass@K) following the methodology of Brown et al. [4]. Empirically, we sweep over KV budgets {32, 64, 128, 256, 512, 1024}; reasoning trials {1, 2, 4, 8, 16, 32} (with a reduced upper limit for the 14B and 32B models to save computation time); and generation lengths {2k, 4k, 6k, 8k, 10k, 12k, 14k, 16k, 18k, 20k, 22k, 24k, 26k, 28k, 30k, 32k}.
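
The sweep described in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function names `sweep_grid` and `pass_at_k` are hypothetical, "1k" is assumed to mean 1,024 tokens (consistent with n = 32,768), and the Pass@K estimator is assumed to be the standard unbiased combinatorial form, 1 - C(n-c, k) / C(n, k), which the row does not spell out.

```python
from itertools import product
from math import comb

# Sweep values copied from the Experiment Setup row.
KV_BUDGETS = [32, 64, 128, 256, 512, 1024]
REASONING_TRIALS = [1, 2, 4, 8, 16, 32]
# Assumption: "2k ... 32k" means multiples of 1,024 tokens.
GENERATION_LENGTHS = [k * 1024 for k in range(2, 33, 2)]


def sweep_grid(max_trials=32):
    """Enumerate (kv_budget, trials, generation_length) configurations.

    max_trials models the reduced upper limit mentioned for the 14B and
    32B models; the actual reduced limit is not stated in the row.
    """
    trials = [t for t in REASONING_TRIALS if t <= max_trials]
    return list(product(KV_BUDGETS, trials, GENERATION_LENGTHS))


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Assumed estimator: 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(len(sweep_grid()), "configurations in the full grid")
    print("pass@1 with 1 of 4 correct:", pass_at_k(4, 1, 1))
```

The full grid has 6 x 6 x 16 configurations per model; passing a smaller `max_trials` shrinks only the reasoning-trials axis.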