Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Efficiently Scaling LLM Reasoning Programs with Certaindex

Authors: Yichao Fu, Junda Chen, Siqi Zhu, Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana S Rosing, Ion Stoica, Hao Helen Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To quantify real-world benefits, we built Certaindex as a scheduler into Dynasor, our reasoning-aware LLM serving system, and demonstrate up to 50% compute savings and 3.3× higher throughput in real workloads with no accuracy drop. Our evaluations on various datasets, LLMs, and reasoning algorithms show that in batch inference, it saves up to 50% compute to reach the same overall accuracy; and in online serving, it sustains up to 3.3× more queries or achieves 4.7× tighter latency SLOs at the same attainment rates.
Researcher Affiliation Collaboration 1 UCSD, 2 Carnegie Mellon University, 3 Snowflake, 4 UC Berkeley
Pseudocode No No explicit pseudocode or algorithm blocks labeled 'Pseudocode' or 'Algorithm' were found in the paper. The paper describes workflows and architectures with figures, but not in a structured pseudocode format.
Open Source Code Yes Our code is available at https://github.com/hao-ai-lab/Dynasor.git.
Open Datasets Yes Our evaluations on various datasets, LLMs, and reasoning algorithms show that in batch inference, it saves up to 50% compute to reach the same overall accuracy; and in online serving, it sustains up to 3.3× more queries or achieves 4.7× tighter latency SLOs at the same attainment rates. Our code is available at https://github.com/hao-ai-lab/Dynasor.git. ... on diverse datasets [35; 36; 37; 28; 20; 38].
Dataset Splits No The paper mentions evaluating on various datasets (e.g., MATH500, AMC23, AIME24, GSM8K, LiveCodeBench, ASDiv) and details configurations such as maximum token budgets and resource caps per problem, but it does not explicitly state the train/validation/test splits for these datasets within the provided text.
Hardware Specification Yes All experiments run on a GPU cluster (Runpod) equipped with A100 (80GB) GPUs.
Software Dependencies Yes We build Dynasor on top of SGLang (version 0.3.3.post1).
Experiment Setup Yes For brevity, we provide detailed experimental setup in Appendix G. ... Table 1: Offline workload Configurations. ... Table 2: Online workload Configurations ... Table 3: Hyperparameter configurations for certaindex. ... For each interval, we varied the early termination parameter N (the required number of consecutive consistent answers), generating different points along each line.
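The early termination parameter N quoted above (the required number of consecutive consistent answers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `early_stop` and the idea of feeding it a plain list of intermediate answers are assumptions made for illustration only, and the sketch ignores how Dynasor actually probes the model or schedules requests.

```python
from typing import Iterable, Optional, Tuple


def early_stop(answers: Iterable[str], n: int) -> Tuple[Optional[str], int]:
    """Hypothetical helper: stop once the last n intermediate answers agree.

    Returns (answer, steps_used). The answer is None if the stream of
    intermediate answers ends before n consecutive answers match.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    last: Optional[str] = None  # most recent answer seen
    streak = 0                  # length of the current run of identical answers
    used = 0                    # number of intermediate answers consumed

    for ans in answers:
        used += 1
        if ans == last:
            streak += 1
        else:
            last, streak = ans, 1
        if streak >= n:
            # n consecutive consistent answers: terminate early.
            return last, used

    return None, used


# Illustrative inputs only (not data from the paper).
trace = ["12", "15", "15", "15", "15"]
print(early_stop(trace, n=2))
print(early_stop(trace, n=3))
```

In this sketch, a smaller N stops sooner and uses less compute but accepts an answer on weaker evidence, which is the trade-off that varying N explores.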