Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Active Evaluation Acquisition for Efficient LLM Benchmarking
Authors: Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, Graham Horwood
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required while maintaining accurate performance estimates compared to previous methods. |
| Researcher Affiliation | Industry | Yang Li¹, Jie Ma¹, Miguel Ballesteros¹, Yassine Benajiba¹, Graham Horwood¹. ¹Amazon Web Services. Correspondence to: Yang Li <EMAIL>. |
| Pseudocode | Yes | Please refer to Algorithm 1 for pseudo-code of the active acquisition process. (Also references Algorithm 2 in Appendix B and Algorithm 3 in Appendix C). |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository for the authors' implementation. |
| Open Datasets | Yes | We conduct experiments on five popular LLM benchmarks: Hugging Face Open LLM Leaderboard (Beeching et al., 2023), MMLU (Hendrycks et al., 2020), HELM-Lite (Liang et al., 2022), Alpaca Eval 2.0 (Li et al., 2023), and Chatbot Arena (Zheng et al., 2024). |
| Dataset Splits | Yes | For each benchmark, we divide its leaderboard into training and testing splits based on models rather than prompts. When evaluation dates are available (e.g., for Open LLM Leaderboard), we use chronological splitting to ensure testing is performed on newer models, better simulating real-world scenarios where we evaluate newly released LLMs. For benchmarks without temporal information, we use random splitting. For Alpaca Eval, we collect evaluation scores for 130 models and randomly select 70% for training. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments. It only generally states that the models 'can be executed efficiently on modest hardware with minimal computational resources'. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or particular library versions) needed to replicate the experiments. |
| Experiment Setup | Yes | Appendix B.4, titled 'Hyperparameters', provides Table B.1 which summarizes the hyperparameters used for the neural process model for each dataset. Appendix D also includes Table D.1, 'Hyperparameters for RL-based acquisition policy'. |
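The model-based splitting scheme quoted in the Dataset Splits row (chronological when evaluation dates are available, random otherwise, with 70% of models used for training) can be sketched as follows. The `split_models` helper and its arguments are illustrative assumptions, not code from the paper:

```python
import random

def split_models(models, train_frac=0.7, date_key=None, seed=0):
    """Split a leaderboard's models (not prompts) into train/test sets.

    Illustrative sketch: if `date_key` is given, split chronologically so
    that newer models land in the test set (simulating evaluation of
    newly released LLMs); otherwise split randomly.
    """
    models = list(models)
    if date_key is not None:
        # Chronological: older models go to training, newer to testing.
        models.sort(key=date_key)
    else:
        random.Random(seed).shuffle(models)
    cut = int(len(models) * train_frac)
    return models[:cut], models[cut:]

# e.g., 130 models with 70% selected for training, as for Alpaca Eval
models = [f"model-{i:03d}" for i in range(130)]
train, test = split_models(models, train_frac=0.7)
print(len(train), len(test))  # 91 39
```

With evaluation dates available (as for the Open LLM Leaderboard), passing `date_key=lambda m: m["eval_date"]` would yield the chronological split described in the quote.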