Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

Authors: Sangwoo Park, Matteo Zecchin, Osvaldo Simeone

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, for prompt design in LLMs, and for test-time reasoning budget allocation in LLMs confirm the reliability and efficiency of R-AutoEval+. (Abstract) For experimental validation, we consider three model selection applications: 1) selecting the lightest quantized LLM with guaranteed performance drop as compared to the baseline model on the TriviaQA data set [28]; 2) selecting the shortest prompt template for an LLM with guaranteed accuracy on the Instruct-Induction task [27]; and 3) test-time reasoning budget allocation with guaranteed performance enhancement on the GSM8K data set [13]. |
| Researcher Affiliation | Academia | Sangwoo Park, Matteo Zecchin, Osvaldo Simeone; Department of Engineering, King's College London, London, United Kingdom; EMAIL |
| Pseudocode | Yes | The overall procedure of R-AutoEval+ is summarized in Algorithm 1 (Appendix B). |
| Open Source Code | Yes | Code is available at https://github.com/kclip/R_AutoEval_plus. (Section 4 footnote) |
| Open Datasets | Yes | TriviaQA data set [28]; Instruct-Induction task [27]; GSM8K data set [13]; and CoQA data set [39] |
| Dataset Splits | Yes | We set S = 10 with ρ_s being uniformly spaced in the range [0, 1] and choose initial weights as w_{s,0} = 1/S. We set δ = 0.1, α = 0.1, n = 150, and r = 5 for Fig. 1 while varying n from 100 to 300 with r = 3 for Fig. 4. We set δ = 0.1, n = 200, r = 9, with α chosen as the minimum value in the set {0.05, 0.1, ..., 0.95} for which R-AutoEval [20] finds at least one reliable prompt template with the strongest autoevaluator. We set δ = 0.1, n = 1000, r = 4, and α = 0.03. |
| Hardware Specification | Yes | All the results in this section are reported after averaging over 100 independent experiments, and 2 H100 GPUs are used for LLM executions. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers (e.g., Python, PyTorch versions). It refers to LLMs as models being evaluated or autoevaluators, and mentions statistical methods, but not software libraries/frameworks with versions used for implementation. |
| Experiment Setup | Yes | We set S = 10 with ρ_s being uniformly spaced in the range [0, 1] and choose initial weights as w_{s,0} = 1/S. We set δ = 0.1, α = 0.1, n = 150, and r = 5 for Fig. 1 while varying n from 100 to 300 with r = 3 for Fig. 4. We set δ = 0.1, n = 200, r = 9, with α chosen as the minimum value in the set {0.05, 0.1, ..., 0.95} for which R-AutoEval [20] finds at least one reliable prompt template with the strongest autoevaluator. We set δ = 0.1, n = 1000, r = 4, and α = 0.03. |
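
The grid quantities quoted in the Experiment Setup entry can be written out concretely. The following is a minimal sketch, not the authors' code: the variable names are hypothetical, and it assumes that "uniformly spaced in the range [0, 1]" includes both endpoints, which the quoted text does not state.

```python
# Hypothetical names; only the numeric values come from the quoted setup.
S = 10  # number of candidate values rho_s

# rho_s uniformly spaced in [0, 1]; including both endpoints is an assumption.
rho_grid = [s / (S - 1) for s in range(S)]

# Uniform initial weights w_{s,0} = 1/S.
initial_weights = [1.0 / S] * S

# Candidate alpha values {0.05, 0.1, ..., 0.95} used in the prompt-template setting.
alpha_grid = [round(0.05 * k, 2) for k in range(1, 20)]

print(rho_grid)
print(initial_weights)
print(alpha_grid)
```

The initial weights sum to one by construction, and the α grid contains 19 candidate values.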